Received: by 2002:a05:6a10:a852:0:0:0:0 with SMTP id d18csp2437884pxy; Sun, 2 May 2021 23:07:02 -0700 (PDT) X-Google-Smtp-Source: ABdhPJznpSKbeAfX/VIoSmntpjwJq8x2Qi84kPE5oO2QUKhZQTja9h/X38HvaS/OkobmF/2qRT/9 X-Received: by 2002:a17:90a:8410:: with SMTP id j16mr19379992pjn.120.1620022022578; Sun, 02 May 2021 23:07:02 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1620022022; cv=none; d=google.com; s=arc-20160816; b=kCzKMOavATj/AJdbefH+6mr8uZD1KJACZKKaFYVJB5U0pq/6VuzyclrAmVEh33kK26 +Svk3L8NcYH+bJXqUXBQHmCIcm4rCOrVTKi4qZdTKASwPtWCBLwxBaua02YZ7K8gLvjp frq+F6GOtQUNJSaupkExQ7O/FZwhHynGm7jSY1l4FpB7Dw9Ef6OjsqdmFzs0vXuv5c6B Vh4T0wy31fc8THsmFBzPlRSnkkvwJKrMCASvctLR5ARk/sf0F9g2vjbAEOduDhVZ79VW EBOqzmIAd/BQTrPU2oZA8g0wKjTNobiG6ZX2FFLPfDHpHgcGPcUpbrAT09LlATNkCmHV A1Vw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:content-transfer-encoding :mime-version:references:in-reply-to:user-agent:date:message-id:from; bh=fUDwGBar9UKx3ExfIVjyXtQ975xduZv8GS/7O9Jqz4w=; b=lFJF9gr0fMn109w2OxSTQ8djnx7tcL3CuBUHHpCKR7zIYMYLxMUcSukm2HqwJmmW5P lLdBBcngxJjHMrzHW2j7pWCD31wA9LQVtFeK5Joz6LpSXa91DeHzWRcz7ZB5pTQWbZE7 ltp3yvzD1hCpjuQ8sfhkQzVF2pbRmK+9X/xja0TQXL10JnteGhVj05TcB9oq3hcptS8X LZ4yvCRMLVbXDd0L0YVs36wryXFtWdrGzGJYOscmbnzYWzVYHLUEIOoz8BNeyxgBxwTs ymDINCIyt+uzjfRYqaX7trfGNFJpFFQr/LpSVXCTZfvRZoqXsESYbe2KyGFzPmIDK44v OzJg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=zytor.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id g12si12886698pfi.75.2021.05.02.23.06.50; Sun, 02 May 2021 23:07:02 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=zytor.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232939AbhECGFu convert rfc822-to-8bit (ORCPT + 99 others); Mon, 3 May 2021 02:05:50 -0400 Received: from terminus.zytor.com ([198.137.202.136]:59289 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232831AbhECGFf (ORCPT ); Mon, 3 May 2021 02:05:35 -0400 Received: from [IPv6:2601:646:8602:8be1:c08d:3145:af85:af1c] ([IPv6:2601:646:8602:8be1:c08d:3145:af85:af1c]) (authenticated bits=0) by mail.zytor.com (8.16.1/8.15.2) with ESMTPSA id 14363Tu2908670 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Sun, 2 May 2021 23:03:30 -0700 Authentication-Results: mail.zytor.com; dkim=permerror (bad message/signature format) From: "H. Peter Anvin" Message-Id: <202105030603.14363Tu2908670@mail.zytor.com> Date: Sun, 02 May 2021 23:03:06 -0700 User-Agent: K-9 Mail for Android In-Reply-To: References: <20210427204315.24153-1-yu-cheng.yu@intel.com> <20210427204315.24153-26-yu-cheng.yu@intel.com> <8fd86049-930d-c9b7-379c-56c02a12cd77@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8BIT Subject: Re: extending ucontext (Re: [PATCH v26 25/30] x86/cet/shstk: Handle signals for shadow stack) To: Andy Lutomirski CC: "Yu, Yu-cheng" , linux-arch , X86 ML , Thomas Gleixner , Ingo Molnar , LKML , "open list:DOCUMENTATION" , Linux-MM , Linux API , Arnd Bergmann , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H.J. Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V. Shankar" , Vedvyas.Shanbhogue@zytor.com Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org ,Dave Martin ,Weijiang Yang ,Pengfei Xu ,Haitao Huang From: "H. Peter Anvin" Message-ID: <2360408E-9967-469C-96AF-E8AC40044021@zytor.com> There are a few words at the end of the FXSAVE area for software use for this reason: so we can extend the signal frame – originally to enable saving the XSAVE state but that doesn't use up the whole software available area. On May 2, 2021 4:23:49 PM PDT, Andy Lutomirski wrote: >On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski >wrote: >> >> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng >wrote: >> > >> > On 4/28/2021 4:03 PM, Andy Lutomirski wrote: >> > > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu > wrote: >> > >> >> > >> When shadow stack is enabled, a task's shadow stack states must >be saved >> > >> along with the signal context and later restored in sigreturn. >However, >> > >> currently there is no systematic facility for extending a signal >context. >> > >> There is some space left in the ucontext, but changing ucontext >is likely >> > >> to create compatibility issues and there is not enough space for >further >> > >> extensions. >> > >> >> > >> Introduce a signal context extension struct 'sc_ext', which is >used to save >> > >> shadow stack restore token address. The extension is located >above the fpu >> > >> states, plus alignment. The struct can be extended (such as the >ibt's >> > >> wait_endbr status to be introduced later), and sc_ext.total_size >field >> > >> keeps track of total size. >> > > >> > > I still don't like this. >> > > >> > > Here's how the signal layout works, for better or for worse: >> > > >> > > The kernel has: >> > > >> > > struct rt_sigframe { >> > > char __user *pretcode; >> > > struct ucontext uc; >> > > struct siginfo info; >> > > /* fp state follows here */ >> > > }; >> > > >> > > This is roughly the actual signal frame. But userspace does not >have >> > > this struct declared, and user code does not know the sizes of >the >> > > fields. So it's accessed in a nonsensical way. The signal >handler >> > > function is passed a pointer to the whole sigframe implicitly in >RSP, >> > > a pointer to &frame->info in RSI, anda pointer to &frame->uc in >RDX. >> > > User code can *find* the fp state by following a pointer from >> > > mcontext, which is, in turn, found via uc: >> > > >> > > struct ucontext { >> > > unsigned long uc_flags; >> > > struct ucontext *uc_link; >> > > stack_t uc_stack; >> > > struct sigcontext uc_mcontext; <-- fp pointer is in here >> > > sigset_t uc_sigmask; /* mask last for extensibility >*/ >> > > }; >> > > >> > > The kernel, in sigreturn, works a bit differently. The sigreturn >> > > variants know the base address of the frame but don't have the >benefit >> > > of receiving pointers to the fields. So instead the kernel takes >> > > advantage of the fact that it knows the offset to uc and parses >uc >> > > accordingly. And the kernel follows the pointer in mcontext to >find >> > > the fp state. The latter bit is quite important later. The >kernel >> > > does not parse info at all. >> > > >> > > The fp state is its own mess. When XSAVE happened, Intel kindly >(?) >> > > gave us a software defined area between the "legacy" x87 region >and >> > > the modern supposedly extensible part. Linux sticks the >following >> > > structure in that hole: >> > > >> > > struct _fpx_sw_bytes { >> > > /* >> > > * If set to FP_XSTATE_MAGIC1 then this is an xstate >context. >> > > * 0 if a legacy frame. >> > > */ >> > > __u32 magic1; >> > > >> > > /* >> > > * Total size of the fpstate area: >> > > * >> > > * - if magic1 == 0 then it's sizeof(struct _fpstate) >> > > * - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct >_xstate) >> > > * plus extensions (if any) >> > > */ >> > > __u32 extended_size; >> > > >> > > /* >> > > * Feature bit mask (including FP/SSE/extended state) that >is present >> > > * in the memory layout: >> > > */ >> > > __u64 xfeatures; >> > > >> > > /* >> > > * Actual XSAVE state size, based on the xfeatures saved in >the layout. >> > > * 'extended_size' is greater than 'xstate_size': >> > > */ >> > > __u32 xstate_size; >> > > >> > > /* For future use: */ >> > > __u32 padding[7]; >> > > }; >> > > >> > > >> > > That's where we are right now upstream. The kernel has a parser >for >> > > the FPU state that is bugs piled upon bugs and is going to have >to be >> > > rewritten sometime soon. On top of all this, we have two >upcoming >> > > features, both of which require different kinds of extensions: >> > > >> > > 1. AVX-512. (Yeah, you thought this story was over a few years >ago, >> > > but no. And AMX makes it worse.) To make a long story short, we >> > > promised user code many years ago that a signal frame fit in 2048 >> > > bytes with some room to spare. With AVX-512 this is false. With >AMX >> > > it's so wrong it's not even funny. The only way out of the mess >> > > anyone has come up with involves making the length of the FPU >state >> > > vary depending on which features are INIT, i.e. making it more >compact >> > > than "compact" mode is. This has a side effect: it's no longer >> > > possible to modify the state in place, because enabling a feature >with >> > > no space allocated will make the structure bigger, and the stack >won't >> > > have room. Fortunately, one can relocate the entire FPU state, >update >> > > the pointer in mcontext, and the kernel will happily follow the >> > > pointer. So new code on a new kernel using a super-compact state >> > > could expand the state by allocating new memory (on the heap? >very >> > > awkwardly on the stack?) and changing the pointer. For all we >know, >> > > some code already fiddles with the pointer. This is great, >except >> > > that your patch sticks more data at the end of the FPU block that >no >> > > one is expecting, and your sigreturn code follows that pointer, >and >> > > will read off into lala land. >> > > >> > >> > Then, what about we don't do that at all. Is it possible from now >on we >> > don't stick more data at the end, and take the relocating-fpu >approach? >> > >> > > 2. CET. CET wants us to find a few more bytes somewhere, and >those >> > > bytes logically belong in ucontext, and here we are. >> > > >> > >> > Fortunately, we can spare CET the need of ucontext extension. When >the >> > kernel handles sigreturn, the user-mode shadow stack pointer is >right at >> > the restore token. There is no need to put that in ucontext. >> >> That seems entirely reasonable. This might also avoid needing to >> teach CRIU about CET at all. > >Wait, what's the actual shadow stack token format? And is the token >on the new stack or the old stack when sigaltstack is in use? For >that matter, is there any support for an alternate shadow stack for >signals? > >--Andy -- Sent from my Android device with K-9 Mail. Please excuse my brevity.