Received: by 2002:a05:6a10:a852:0:0:0:0 with SMTP id d18csp2247063pxy; Sun, 2 May 2021 16:25:47 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzGprd02ID5Xc9EmUKVALkpVk02pRItH3rZuJb8LIkyjoMLsh8JRo93kEtg2x1kkYxSHzqk X-Received: by 2002:a65:618c:: with SMTP id c12mr15724592pgv.296.1619997947468; Sun, 02 May 2021 16:25:47 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1619997947; cv=none; d=google.com; s=arc-20160816; b=A0qBdrCzuccpBeOlKxRPEwKgj4szdqtPqcH7PxlSJReSRlBsCUSK+DBPyY709wJ17k lsz0XOwjD68Qnr/so46G+LQuXM45SrzHRKRZS+JzrHI7FDMa4D/0LW4W3kGbjzfcL9Pl PYajvT/VnIgfn0nD7f2oQvjJIESM1BJmeJ+00vKQZZOBu50nRUqgTrS6zb8Uq1KT5xep jx9LRj7RFFi9+1GgKUcKUpPZE1DuM5JuIBlnCJ4TU7Trb4cXM3ulx6vmB+CmcfTox/sF tHUxByJpCtR1ApcEP8zFDEE+WaG8HhOrlpGazkVoss9Nc87Iwk5xnWz/JR2XMxj0Yc25 gBbQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=gpU1FdN4A+ZeCEzhaaXZjubBq7cdBa/A9lcIlGBGNKs=; b=ZrK7xxf/rcsplLdTEEr9mbZHoFG5/WM+lr3lFwWgvhlllbAbPE6EBzq6UEDGCfds0n 1b01hLImvaZvkAB76laQqtPiXb5YAI8cDnUk98U7zDGYJJXO8rvEiwgue/Fm33hYefAB H8C0IxwShtWogxYLDtFzLzIxSuWMyZhYJStobhtgt23YjkVxmW22xXWqyC1sAcrFuYJF YkB0pgbufUSXQhjCMqj//rCTbmziGmX//b4gHGYBGmcFEeGfYVrK6UYEkqYAhyeXv1Xw Lg8ZKc+YQMLtlOFZaiZLSUvL7DmMzhc5FwNIe/uJgIhRsqC9EZ6JbrH+eFtBDZp4drNI Ooig== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=Xvk3lamc; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id p11si11134663pjj.44.2021.05.02.16.25.26; Sun, 02 May 2021 16:25:47 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=Xvk3lamc; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232457AbhEBXYz (ORCPT + 99 others); Sun, 2 May 2021 19:24:55 -0400 Received: from mail.kernel.org ([198.145.29.99]:58754 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232415AbhEBXYy (ORCPT ); Sun, 2 May 2021 19:24:54 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id 563E7613AC for ; Sun, 2 May 2021 23:24:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1619997842; bh=9XrmVhh3x/hu1xiO6wy8hQM4EVEmoIsidC0CNOUDewU=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=Xvk3lamcKONXtdY5uj063AqT9gflYjk1fOHzc1MCkTApqwbDua24lTAWDMxpEsZsX lAHSS6SKBlozDYzK1NU/OaZPq2Mof8xS7ujGJWSw0nlJG0xl3kFyMYCujZo+PPmrUY seqosSZHewIS5RM1c2DbHSz0RiFwGMZcDKbkd/+EOOMGkvIRyVYIG6qZh3qC5VVheT vV07VCHhxySchqQrOBISbGaR7WQkUiDWSaRmQ1iiERMnQV/MayYQjhRAnN6E6dC4iz 3vmcvEYa6o8H4+MeSdO9zYAmLqPb4nSovMpsPmtPJ8PiEpTC3BsLZlmxO/wJ0+uvE0 HFsxdg81u2cxQ== Received: by mail-ej1-f44.google.com with SMTP id t4so5283968ejo.0 for ; Sun, 02 May 2021 16:24:02 -0700 (PDT) X-Gm-Message-State: AOAM532ht0WqKPcoXbhzbzjnnwWq0PcCNZF8PWlHWqQ4uh5W5mw81NVt +ZmO9QD8VHBqfrYNvgNwZM0scOqNEQm10fwgiUS3gQ== X-Received: by 2002:a17:906:ccc9:: with SMTP id ot9mr1244151ejb.253.1619997840467; Sun, 02 May 2021 16:24:00 -0700 (PDT) MIME-Version: 1.0 References: <20210427204315.24153-1-yu-cheng.yu@intel.com> <20210427204315.24153-26-yu-cheng.yu@intel.com> <8fd86049-930d-c9b7-379c-56c02a12cd77@intel.com> In-Reply-To: From: Andy Lutomirski Date: Sun, 2 May 2021 16:23:49 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: extending ucontext (Re: [PATCH v26 25/30] x86/cet/shstk: Handle signals for shadow stack) To: Andy Lutomirski Cc: "Yu, Yu-cheng" , linux-arch , X86 ML , "H. Peter Anvin" , Thomas Gleixner , Ingo Molnar , LKML , "open list:DOCUMENTATION" , Linux-MM , Linux API , Arnd Bergmann , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H.J. Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V. Shankar" , Vedvyas Shanbhogue , Dave Martin , Weijiang Yang , Pengfei Xu , Haitao Huang Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski wrote: > > On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng wrote: > > > > On 4/28/2021 4:03 PM, Andy Lutomirski wrote: > > > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu wrote: > > >> > > >> When shadow stack is enabled, a task's shadow stack states must be saved > > >> along with the signal context and later restored in sigreturn. However, > > >> currently there is no systematic facility for extending a signal context. > > >> There is some space left in the ucontext, but changing ucontext is likely > > >> to create compatibility issues and there is not enough space for further > > >> extensions. > > >> > > >> Introduce a signal context extension struct 'sc_ext', which is used to save > > >> shadow stack restore token address. The extension is located above the fpu > > >> states, plus alignment. The struct can be extended (such as the ibt's > > >> wait_endbr status to be introduced later), and sc_ext.total_size field > > >> keeps track of total size. > > > > > > I still don't like this. > > > > > > Here's how the signal layout works, for better or for worse: > > > > > > The kernel has: > > > > > > struct rt_sigframe { > > > char __user *pretcode; > > > struct ucontext uc; > > > struct siginfo info; > > > /* fp state follows here */ > > > }; > > > > > > This is roughly the actual signal frame. But userspace does not have > > > this struct declared, and user code does not know the sizes of the > > > fields. So it's accessed in a nonsensical way. The signal handler > > > function is passed a pointer to the whole sigframe implicitly in RSP, > > > a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX. > > > User code can *find* the fp state by following a pointer from > > > mcontext, which is, in turn, found via uc: > > > > > > struct ucontext { > > > unsigned long uc_flags; > > > struct ucontext *uc_link; > > > stack_t uc_stack; > > > struct sigcontext uc_mcontext; <-- fp pointer is in here > > > sigset_t uc_sigmask; /* mask last for extensibility */ > > > }; > > > > > > The kernel, in sigreturn, works a bit differently. The sigreturn > > > variants know the base address of the frame but don't have the benefit > > > of receiving pointers to the fields. So instead the kernel takes > > > advantage of the fact that it knows the offset to uc and parses uc > > > accordingly. And the kernel follows the pointer in mcontext to find > > > the fp state. The latter bit is quite important later. The kernel > > > does not parse info at all. > > > > > > The fp state is its own mess. When XSAVE happened, Intel kindly (?) > > > gave us a software defined area between the "legacy" x87 region and > > > the modern supposedly extensible part. Linux sticks the following > > > structure in that hole: > > > > > > struct _fpx_sw_bytes { > > > /* > > > * If set to FP_XSTATE_MAGIC1 then this is an xstate context. > > > * 0 if a legacy frame. > > > */ > > > __u32 magic1; > > > > > > /* > > > * Total size of the fpstate area: > > > * > > > * - if magic1 == 0 then it's sizeof(struct _fpstate) > > > * - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate) > > > * plus extensions (if any) > > > */ > > > __u32 extended_size; > > > > > > /* > > > * Feature bit mask (including FP/SSE/extended state) that is present > > > * in the memory layout: > > > */ > > > __u64 xfeatures; > > > > > > /* > > > * Actual XSAVE state size, based on the xfeatures saved in the layout. > > > * 'extended_size' is greater than 'xstate_size': > > > */ > > > __u32 xstate_size; > > > > > > /* For future use: */ > > > __u32 padding[7]; > > > }; > > > > > > > > > That's where we are right now upstream. The kernel has a parser for > > > the FPU state that is bugs piled upon bugs and is going to have to be > > > rewritten sometime soon. On top of all this, we have two upcoming > > > features, both of which require different kinds of extensions: > > > > > > 1. AVX-512. (Yeah, you thought this story was over a few years ago, > > > but no. And AMX makes it worse.) To make a long story short, we > > > promised user code many years ago that a signal frame fit in 2048 > > > bytes with some room to spare. With AVX-512 this is false. With AMX > > > it's so wrong it's not even funny. The only way out of the mess > > > anyone has come up with involves making the length of the FPU state > > > vary depending on which features are INIT, i.e. making it more compact > > > than "compact" mode is. This has a side effect: it's no longer > > > possible to modify the state in place, because enabling a feature with > > > no space allocated will make the structure bigger, and the stack won't > > > have room. Fortunately, one can relocate the entire FPU state, update > > > the pointer in mcontext, and the kernel will happily follow the > > > pointer. So new code on a new kernel using a super-compact state > > > could expand the state by allocating new memory (on the heap? very > > > awkwardly on the stack?) and changing the pointer. For all we know, > > > some code already fiddles with the pointer. This is great, except > > > that your patch sticks more data at the end of the FPU block that no > > > one is expecting, and your sigreturn code follows that pointer, and > > > will read off into lala land. > > > > > > > Then, what about we don't do that at all. Is it possible from now on we > > don't stick more data at the end, and take the relocating-fpu approach? > > > > > 2. CET. CET wants us to find a few more bytes somewhere, and those > > > bytes logically belong in ucontext, and here we are. > > > > > > > Fortunately, we can spare CET the need of ucontext extension. When the > > kernel handles sigreturn, the user-mode shadow stack pointer is right at > > the restore token. There is no need to put that in ucontext. > > That seems entirely reasonable. This might also avoid needing to > teach CRIU about CET at all. Wait, what's the actual shadow stack token format? And is the token on the new stack or the old stack when sigaltstack is in use? For that matter, is there any support for an alternate shadow stack for signals? --Andy