Received: by 2002:a25:e74b:0:0:0:0:0 with SMTP id e72csp1913018ybh; Tue, 14 Jul 2020 10:28:08 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwIpQfeQdpK5r09O37xOX0d3kzAxi2YGElbpqTAnUvqg+6UZZWpFqOdXRqJQTTq8O9MPD0a X-Received: by 2002:a17:906:cd19:: with SMTP id oz25mr5455520ejb.36.1594747688414; Tue, 14 Jul 2020 10:28:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1594747688; cv=none; d=google.com; s=arc-20160816; b=oayZRm3Sp4CMZunpw3dYXj7y9vaWEi/SrJdELe4LQZIzf65XojeFD+Jfms2s8MR98P UMWzyA8YPfYxUJCok/j2NY4AeUn7hzyWL18zhxD3U2zu6emq7bk4n3wzQPPZRhbwAV8I lUXI7f0p6b3d5VoyZ8Wo4KQ3KLMGMkXBQFI3xTyXblX4tKC+aeZ5jjO7MyuqZNrv4xII StHd/SA8F47v6H36Xjj413Pybp4/w4yVdqC4PaHsDwOlIyolmUZWlJ7zOrkZPZprZVlz 5PWvqRQ1Ud8jCN6cV9DL1ysqxONd5/r7XA6rEDLrakUIMzCA0XXt8Z8wu5tyqvYEZI8W Dnlw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=1xTgHLpduq9c9JN/ikTOg527pQJsOTf1rQHTQ4WlTZg=; b=Zz7JrA6zm2+SdTdWURvoFiMGSN3J8Zb6lcVZ1uztZReI24P8n4WaZpm3NoRI2NwGxo NyNanx26/laZu6ee+950NbkMv/AJokVtXOZq45QqyoZwDfO1kGulCqQk5do+1C7vP0EF 0HX2zWeB451inc+BmxA5ITDM3HUVjbD6kFT47UZy74dCPrcRznEzi1Ja3dgXy1VB28fK OEzbv7uuZDZlbU4kcXrEK2Xf64tGHYM0DiDtBGGn/xjpfBCea6GzCKGGLMHeb2T5FTGA gLOSgxBfrBofg/tAJQHKBYCfz7oS2FQ53Bz3K4soWqfKLl3m7GPI1xOgWkcJQ6o5bWA6 YlVw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@posk.io header.s=google header.b=MyewALBa; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id k19si10405808eji.419.2020.07.14.10.27.43; Tue, 14 Jul 2020 10:28:08 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@posk.io header.s=google header.b=MyewALBa; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728177AbgGNRY6 (ORCPT + 99 others); Tue, 14 Jul 2020 13:24:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53244 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726364AbgGNRY5 (ORCPT ); Tue, 14 Jul 2020 13:24:57 -0400 Received: from mail-vs1-xe44.google.com (mail-vs1-xe44.google.com [IPv6:2607:f8b0:4864:20::e44]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 82CD1C061755 for ; Tue, 14 Jul 2020 10:24:57 -0700 (PDT) Received: by mail-vs1-xe44.google.com with SMTP id q15so8906811vso.9 for ; Tue, 14 Jul 2020 10:24:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=posk.io; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=1xTgHLpduq9c9JN/ikTOg527pQJsOTf1rQHTQ4WlTZg=; b=MyewALBadXpPPljrz1yTNnZ5p2YE7I3mogDQDHYcHhbaIkJYMuInIwUfKqL1hPINCt Q4NiEXOTJ77+8e3rowUdC3ahYRy4JF3gV2pZWSN2AfyMON0FliW6wSeEvVACsSyeZNBm PNyoiOOk+NgRLNo/eRJVH0861frqubO1JboItfjc/0Ad1OigW2O4IxVNs3LYAVfsmRy2 XDHUs9bidaZBFZE2oj6I+6jFzIsH9fWEGt2EN5vH08MAZ1fMNVej4NT8bFZVV4jGXSC5 kd12ibATFlT9NAzdT+GC5hAnGKp93Y5w0cg7+KlOAPF3TUlLvJSzFwjebOGidjVWKU0i uKVg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=1xTgHLpduq9c9JN/ikTOg527pQJsOTf1rQHTQ4WlTZg=; b=F1fy/cuPhD0SSa3tbmRaLJ8J1knWni5IK4ZG703SWbljkfyNZazQnSQwRn1q/i4758 ErOwUNJAbkszJrmVYD9TvmPLr/G4V0dmfbw8jkazQHMFVc7wb/N84TdhKUm4IPlbz6r1 750JGcth+T/RsaWLQMKkBkPstuXd2tM1kP7K8gGsEGzqvWOU2qQ96itwiIFauBeq1SIM XBOCun14lCLPUpH32sjAQQ/cTzXqHBm3sAmW1sTKRc7nUL4bBIRznxfaJ+OW02GigluV 5hvR3cuJb8MbEVUD62ggoOlm27YHAxhALbfOfqP9EaOJ/AIw/5VGPPK+Huo5JTTJf/Mx LQ5A== X-Gm-Message-State: AOAM532i9fQRkB+rVoYxnI22zfuXSGzbhFSxR6to/onBGKUNCwZrz+lL OVhtQesOmPkgpE8vU9UxH0j0lyKDGvkxtFRdtPzyCg== X-Received: by 2002:a67:c90c:: with SMTP id w12mr4227101vsk.86.1594747495295; Tue, 14 Jul 2020 10:24:55 -0700 (PDT) MIME-Version: 1.0 References: <20200714030348.6214-1-mathieu.desnoyers@efficios.com> <20200714030348.6214-3-mathieu.desnoyers@efficios.com> In-Reply-To: <20200714030348.6214-3-mathieu.desnoyers@efficios.com> From: Peter Oskolkov Date: Tue, 14 Jul 2020 10:24:43 -0700 Message-ID: Subject: Re: [RFC PATCH 2/4] rseq: Allow extending struct rseq To: Mathieu Desnoyers Cc: Peter Zijlstra , Linux Kernel Mailing List , Thomas Gleixner , "Paul E . McKenney" , Boqun Feng , "H . Peter Anvin" , Paul Turner , linux-api@vger.kernel.org, Christian Brauner , Florian Weimer , carlos@redhat.com, Peter Oskolkov Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org At Google, we actually extended struct rseq (I will post the patches here once they are fully deployed and we have specific benefits/improvements to report). We did this by adding several fields below __u32 flags (the last field currently), and correspondingly increasing rseq_len in rseq() syscall. If the kernel does not know of this extension, it will return -EINVAL due to an unexpected rseq_len; then the application can either fall-back to the standard/upstream rseq, or bail. If the kernel does know of this extension, it accepts it. If the application passes the old rseq_len (32), the kernel knows that this is an old application and treats it as such. I looked through the archives, but I did not find specifically why the pretty standard approach described above is considered inferior to the one taken in this patch (freeze rseq_len at 32, add additional length fields to struct rseq). Can these be summarized? Thanks, Peter On Mon, Jul 13, 2020 at 8:04 PM Mathieu Desnoyers wrote: > > Add a __rseq_abi.flags "RSEQ_TLS_FLAG_SIZE", which indicates support for > extending struct rseq. This adds two new fields to struct rseq: > user_size and kernel_size. > > The user_size field allows the size of the __rseq_abi definition (which > can be overridden by symbol interposition either by a preloaded library > or by the application) to be handed over to the kernel at registration. > This registration can be performed by a library, e.g. glibc, which does > not know there is interposition taking place. > > The kernel_size is populated by the kernel when the "RSEQ_TLS_FLAG_SIZE" > flag is set in __rseq_abi.flags to the minimum between user_size and > the offset of the "end" field of struct rseq as known by the kernel. > This allows user-space to query which fields are effectively populated > by the kernel. > > A rseq_size field is added to the task struct to keep track of the > "kernel_size" effective for each thread. > > Signed-off-by: Mathieu Desnoyers > --- > include/linux/sched.h | 4 ++++ > include/uapi/linux/rseq.h | 37 ++++++++++++++++++++++++++++++++-- > kernel/rseq.c | 42 +++++++++++++++++++++++++++++++++------ > 3 files changed, 75 insertions(+), 8 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 692e327d7455..5d61a3197987 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1147,6 +1147,7 @@ struct task_struct { > #ifdef CONFIG_RSEQ > struct rseq __user *rseq; > u32 rseq_sig; > + u32 rseq_size; > /* > * RmW on rseq_event_mask must be performed atomically > * with respect to preemption. > @@ -1976,10 +1977,12 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) > if (clone_flags & CLONE_VM) { > t->rseq = NULL; > t->rseq_sig = 0; > + t->rseq_size = 0; > t->rseq_event_mask = 0; > } else { > t->rseq = current->rseq; > t->rseq_sig = current->rseq_sig; > + t->rseq_size = current->rseq_size; > t->rseq_event_mask = current->rseq_event_mask; > } > } > @@ -1988,6 +1991,7 @@ static inline void rseq_execve(struct task_struct *t) > { > t->rseq = NULL; > t->rseq_sig = 0; > + t->rseq_size = 0; > t->rseq_event_mask = 0; > } > > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h > index e11d9df5e564..03c0b5e9a859 100644 > --- a/include/uapi/linux/rseq.h > +++ b/include/uapi/linux/rseq.h > @@ -37,6 +37,15 @@ enum rseq_cs_flags { > (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), > }; > > +enum rseq_tls_flags_bit { > + /* enum rseq_cs_flags reserves bits 0-2. */ > + RSEQ_TLS_FLAG_SIZE_BIT = 3, > +}; > + > +enum rseq_tls_flags { > + RSEQ_TLS_FLAG_SIZE = (1U << RSEQ_TLS_FLAG_SIZE_BIT), > +}; > + > /* The rseq_len expected by rseq registration is always 32 bytes. */ > enum rseq_len_expected { > RSEQ_LEN_EXPECTED = 32, > @@ -133,8 +142,9 @@ struct rseq { > * > * This field should only be updated by the thread which > * registered this data structure. Read by the kernel. > - * Mainly used for single-stepping through rseq critical sections > - * with debuggers. > + * > + * The RSEQ_CS flags are mainly used for single-stepping through rseq > + * critical sections with debuggers. > * > * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT > * Inhibit instruction sequence block restart on preemption > @@ -145,8 +155,31 @@ struct rseq { > * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE > * Inhibit instruction sequence block restart on migration for > * this thread. > + * > + * - RSEQ_TLS_FLAG_SIZE > + * Extensible struct rseq ABI. This flag should be statically > + * initialized. > */ > __u32 flags; > + /* > + * With __rseq_abi.flags RSEQ_TLS_FLAG_SIZE set, user_size should be > + * statically initialized to offsetof(struct rseq, end). > + */ > + __u16 user_size; > + /* > + * With __rseq_abi.flags RSEQ_TLS_FLAG_SIZE set, if the kernel supports > + * extensible struct rseq ABI, the kernel_size field is populated by > + * the kernel to the minimum between user_size and the offset of the > + * "end" field within the struct rseq supported by the kernel on > + * successful registration. Should be initialized to 0. > + */ > + __u16 kernel_size; > + > + /* > + * Very last field of the structure, to calculate size excluding padding > + * with offsetof(). > + */ > + char end[]; > } __attribute__((aligned(4 * sizeof(__u64)))); > > #endif /* _UAPI_LINUX_RSEQ_H */ > diff --git a/kernel/rseq.c b/kernel/rseq.c > index a4f86a9d6937..bbc57fc18573 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -96,6 +96,7 @@ static int rseq_update_cpu_id(struct task_struct *t) > static int rseq_reset_rseq_cpu_id(struct task_struct *t) > { > u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED; > + u16 kernel_size = 0; > > /* > * Reset cpu_id_start to its initial state (0). > @@ -109,6 +110,11 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t) > */ > if (put_user(cpu_id, &t->rseq->cpu_id)) > return -EFAULT; > + /* > + * Reset kernel_size to its initial state (0). > + */ > + if (put_user(kernel_size, &t->rseq->kernel_size)) > + return -EFAULT; > return 0; > } > > @@ -266,7 +272,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) > > if (unlikely(t->flags & PF_EXITING)) > return; > - if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq)))) > + if (unlikely(!access_ok(t->rseq, t->rseq_size))) > goto error; > ret = rseq_ip_fixup(regs); > if (unlikely(ret < 0)) > @@ -294,7 +300,7 @@ void rseq_syscall(struct pt_regs *regs) > > if (!t->rseq) > return; > - if (!access_ok(t->rseq, sizeof(*t->rseq)) || > + if (!access_ok(t->rseq, t->rseq_size) || > rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs)) > force_sig(SIGSEGV); > } > @@ -308,6 +314,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, > int, flags, u32, sig) > { > int ret; > + u32 tls_flags; > > if (flags & RSEQ_FLAG_UNREGISTER) { > if (flags & ~RSEQ_FLAG_UNREGISTER) > @@ -315,7 +322,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, > /* Unregister rseq for current thread. */ > if (current->rseq != rseq || !current->rseq) > return -EINVAL; > - if (rseq_len != sizeof(*rseq)) > + if (rseq_len != RSEQ_LEN_EXPECTED) > return -EINVAL; > if (current->rseq_sig != sig) > return -EPERM; > @@ -323,6 +330,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, > if (ret) > return ret; > current->rseq = NULL; > + current->rseq_size = 0; > current->rseq_sig = 0; > return 0; > } > @@ -336,7 +344,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, > * the provided address differs from the prior > * one. > */ > - if (current->rseq != rseq || rseq_len != sizeof(*rseq)) > + if (current->rseq != rseq || rseq_len != RSEQ_LEN_EXPECTED) > return -EINVAL; > if (current->rseq_sig != sig) > return -EPERM; > @@ -349,10 +357,32 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, > * ensure the provided rseq is properly aligned and valid. > */ > if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) || > - rseq_len != sizeof(*rseq)) > + rseq_len != RSEQ_LEN_EXPECTED) > return -EINVAL; > - if (!access_ok(rseq, rseq_len)) > + if (!access_ok(rseq, RSEQ_LEN_EXPECTED)) > return -EFAULT; > + > + /* Handle extensible struct rseq ABI. */ > + ret = get_user(tls_flags, &rseq->flags); > + if (ret) > + return ret; > + if (tls_flags & RSEQ_TLS_FLAG_SIZE) { > + u16 user_size, kernel_size; > + > + ret = get_user(user_size, &rseq->user_size); > + if (ret) > + return ret; > + if (user_size < offsetof(struct rseq, kernel_size) + sizeof(u16)) > + return -EINVAL; > + kernel_size = min_t(u16, user_size, offsetof(struct rseq, end)); > + ret = put_user(kernel_size, &rseq->kernel_size); > + if (ret) > + return ret; > + current->rseq_size = kernel_size; > + } else { > + current->rseq_size = offsetof(struct rseq, user_size); > + } > + > current->rseq = rseq; > current->rseq_sig = sig; > /* > -- > 2.17.1 >