Received: by 2002:a05:6a10:1a4d:0:0:0:0 with SMTP id nk13csp1706021pxb; Wed, 9 Feb 2022 02:42:51 -0800 (PST) X-Google-Smtp-Source: ABdhPJy5jpnbb6/oZI7fcVFsI0RKx8RAfqp6UgksKTQ4Vq4Fpcd0aGmy9xxwOkBemmQFlTi3yfI0 X-Received: by 2002:a05:6a00:124d:: with SMTP id u13mr1540519pfi.69.1644403371555; Wed, 09 Feb 2022 02:42:51 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1644403371; cv=none; d=google.com; s=arc-20160816; b=R1Dy0s+zwzieyZqZK6U4m80D1T4l0g1RpcbK7+eoGO0EEj5VMhtm+Go/GZPFLtxETe xiDaVYDag4LwdnC/P4iAusxpKc4o7DPLntaJHgoBwh2W/kaFW8p7EJ2pfmbOc2fwOYEB Qo7oKDQKOc6KImScsCjFUzbdOUOPm9zgIFTW6JXhFJiIUcgO1lcjVYoUoI26PswNvB1z M1lFjO/uqQS2hecAv3aAhwXlavLQVyiaQ+p713sHm7cmJkfurxGx8xuy9QLbD3XpsbZR B1YbIwg5fuO4tkMWlLFqZMPStsSY9IoSddqW0t5g9vpBpvvf+7ywsgv11kgPCJJT9JC0 1iYg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:subject :from:references:cc:to:content-language:user-agent:mime-version:date :message-id:dkim-signature; bh=2hLDWi1m2xvWTK7d9eXJPAL/qcJUp3y2s2K6YkfwRco=; b=T4XyVx1tQ2G5mIZUoC5bjUp0B5bsW1VkxPg8T3yK9oPsb5G1QM6PqzMHVaMnUb3HjT NEQ8Icp9E1p8FeAOet6t/OrIzV0268C6hjWwEoF2958ZFvwXlhr7TqSQGILguHxiPVfz D0Mg4gz48KA5x2NxcxTxDa+zVery2iby/BcENFJSSFt3c5aCZeXl1kqWi8fdFqasQghW 1WP3pS6rMpDLRNbvGEQgcaURZqXkEddowRuaQQDD8gK8QysGq70gyMRmOuPwhiiyJqTc 0kFeYPTmgR2TfK7MtAFiI4+uBuNvvBzk1bI8u1g+Ynk4Lw8fn/XVaH1FqqEUWBBjQX6F 4EMg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=YLpuo49U; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [2620:137:e000::1:18]) by mx.google.com with ESMTPS id oo13si5586485pjb.149.2022.02.09.02.42.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 09 Feb 2022 02:42:51 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) client-ip=2620:137:e000::1:18; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=YLpuo49U; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id A1CAEE0AFAAB; Wed, 9 Feb 2022 01:25:04 -0800 (PST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1344279AbiBHBNO (ORCPT + 99 others); Mon, 7 Feb 2022 20:13:14 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37120 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S245032AbiBHBH5 (ORCPT ); Mon, 7 Feb 2022 20:07:57 -0500 Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 649BDC03FEC1; Mon, 7 Feb 2022 17:05:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1644282356; x=1675818356; h=message-id:date:mime-version:to:cc:references:from: subject:in-reply-to:content-transfer-encoding; bh=RmcEBt/Tc2ZNpDmtkpC+8Teoj0FX4XdvKx/tePya/LE=; b=YLpuo49U9T6Ttv5z0xswjk4u7YO65Spj6T72/14AXFJOV96rsF8h4OOq r1C2V9NBZe7L0KmxMZXISIhXwz88ob1Pz32j9gWVswLrP7FqvpaLl5AE5 ws/t9L10lUV8MMj1N1NnwcIxiNHkoF8oAFvbsNtugzAXlhbUv3BOcs1m6 6enJYR5citXhk4B77AOG9hm6/uV+436wO9YVDxlyagVfYmNb8tj+izWxC y/w/ho/7SmQKOPzJFA8kW26aMUywsTtZBShR6z2SeDdQCXhGq5QwdpJnm sJx2tOhRN4CapO+FdQw8LGHZey+F1Q717MzQrndiZJNHYnlacp8CqrGkp g==; X-IronPort-AV: E=McAfee;i="6200,9189,10251"; a="249053192" X-IronPort-AV: E=Sophos;i="5.88,351,1635231600"; d="scan'208";a="249053192" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by fmsmga103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Feb 2022 17:05:54 -0800 X-IronPort-AV: E=Sophos;i="5.88,351,1635231600"; d="scan'208";a="525348020" Received: from hgrunes-mobl1.amr.corp.intel.com (HELO [10.251.3.57]) ([10.251.3.57]) by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Feb 2022 17:05:53 -0800 Message-ID: <9b6878b0-433e-b52a-caa1-aa306595c51a@intel.com> Date: Mon, 7 Feb 2022 17:05:50 -0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0 Content-Language: en-US To: Rick Edgecombe , x86@kernel.org, "H . Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H . J . Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V . Shankar" , Dave Martin , Weijiang Yang , "Kirill A . Shutemov" , joao.moreira@intel.com, John Allen , kcc@google.com, eranian@google.com Cc: Yu-cheng Yu References: <20220130211838.8382-1-rick.p.edgecombe@intel.com> <20220130211838.8382-10-rick.p.edgecombe@intel.com> From: Dave Hansen Subject: Re: [PATCH 09/35] x86/mm: Introduce _PAGE_COW In-Reply-To: <20220130211838.8382-10-rick.p.edgecombe@intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,NICE_REPLY_A,RDNS_NONE,SPF_HELO_NONE, T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 1/30/22 13:18, Rick Edgecombe wrote: > From: Yu-cheng Yu > > There is essentially no room left in the x86 hardware PTEs on some OSes > (not Linux). That left the hardware architects looking for a way to > represent a new memory type (shadow stack) within the existing bits. > They chose to repurpose a lightly-used state: Write=0, Dirty=1. > > The reason it's lightly used is that Dirty=1 is normally set by hardware > and cannot normally be set by hardware on a Write=0 PTE. Software must > normally be involved to create one of these PTEs, so software can simply > opt to not create them. This is kinda skipping over something important: The reason it's lightly used is that Dirty=1 is normally set _before_ a write. A write with a Write=0 PTE would typically only generate a fault, not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow stacks will no longer exhibit this oddity. > In places where Linux normally creates Write=0, Dirty=1, it can use the > software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other > words, whenever Linux needs to create Write=0, Dirty=1, it instead creates > Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1. This > clearly separates shadow stack from other data, and results in the > following: Following _what_... What are these? I think they're PTE states. Best to say that. > (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) Could you give an example of this? Would this be a typical anonymous page which was Write=1,Dirty=1, then historically made Write=0,Dirty=1 at fork()? > (b) A R/O page that has been COW'ed: (Write=0, Cow=1) > The user page is in a R/O VMA, and get_user_pages() needs a writable > copy. The page fault handler creates a copy of the page and sets > the new copy's PTE as Write=0 and Cow=1. > (c) A shadow stack PTE: (Write=0, Dirty=1) > (d) A shared shadow stack PTE: (Write=0, Cow=1) > When a shadow stack page is being shared among processes (this happens > at fork()), its PTE is made Dirty=0, so the next shadow stack access > causes a fault, and the page is duplicated and Dirty=1 is set again. > This is the COW equivalent for shadow stack pages, even though it's > copy-on-access rather than copy-on-write. Just like code, it's also nice to format these in a way which allows them to be visually compared, trivially. So, let's expand all the bits and vertically align everything. To break this down a bit, we have two old states: [a] (Write=0, Dirty=0, Cow=1) [b] (Write=0, Dirty=0, Cow=1) And two new ones: [c] (Write=0, Dirty=1, Cow=0) [d] (Write=0, Dirty=0, Cow=1) That makes me wonder what the difference is between [a] and [b] and why they are separate. Is their handling different? How are those two states differentiated? > (e) A page where the processor observed a Write=1 PTE, started a write, set > Dirty=1, but then observed a Write=0 PTE. That's possible today, but > will not happen on processors that support shadow stack. This left me wondering how you are going to detangle the mess where PTEs look like shadow-stack PTEs on non-shadow-stack hardware. Could you cover that here? You can shorten that above bullet to this to help make the space: (e) (Write=0, Dirty=1, Cow=0) PTE created when a processor without shadow stack support set Dirty=1. > Define _PAGE_COW and update pte_*() helpers and apply the same changes to > pmd and pud. > > After this, there are six free bits left in the 64-bit PTE, and no more > free bits in the 32-bit PTE (except for PAE) and Shadow Stack is not > implemented for the 32-bit kernel. Just say: There are six bits left available to software in the 64-bit PTE after consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels because shadow stacks are not enabled there. There's no need to rub it in that 32-bit is out of space. > -static inline int pte_dirty(pte_t pte) > +static inline bool pte_dirty(pte_t pte) > { > - return pte_flags(pte) & _PAGE_DIRTY; > + /* > + * A dirty PTE has Dirty=1 or Cow=1. > + */ I don't really like that comment because "Cow" isn't anywhere to be found. > + return pte_flags(pte) & _PAGE_DIRTY_BITS; > +} > + > +static inline bool pte_shstk(pte_t pte) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return false; > + > + return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY; > } > > static inline int pte_young(pte_t pte) > @@ -133,9 +144,20 @@ static inline int pte_young(pte_t pte) > return pte_flags(pte) & _PAGE_ACCESSED; > } > > -static inline int pmd_dirty(pmd_t pmd) > +static inline bool pmd_dirty(pmd_t pmd) > { > - return pmd_flags(pmd) & _PAGE_DIRTY; > + /* > + * A dirty PMD has Dirty=1 or Cow=1. > + */ > + return pmd_flags(pmd) & _PAGE_DIRTY_BITS; > +} > + > +static inline bool pmd_shstk(pmd_t pmd) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return false; > + > + return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY; > } > > static inline int pmd_young(pmd_t pmd) > @@ -143,9 +165,12 @@ static inline int pmd_young(pmd_t pmd) > return pmd_flags(pmd) & _PAGE_ACCESSED; > } > > -static inline int pud_dirty(pud_t pud) > +static inline bool pud_dirty(pud_t pud) > { > - return pud_flags(pud) & _PAGE_DIRTY; > + /* > + * A dirty PUD has Dirty=1 or Cow=1. > + */ > + return pud_flags(pud) & _PAGE_DIRTY_BITS; > } > > static inline int pud_young(pud_t pud) > @@ -155,13 +180,23 @@ static inline int pud_young(pud_t pud) > > static inline int pte_write(pte_t pte) > { > - return pte_flags(pte) & _PAGE_RW; > + /* > + * Shadow stack pages are always writable - but not by normal > + * instructions, and only by shadow stack operations. Therefore, > + * the W=0,D=1 test with pte_shstk(). > + */ I think that comment is off a bit. It's not really connected to the code. We don't, for instance need to know what the bit combination is inside pte_shstk(). Further, it's a bit mean to talk about "W" in the comment and _PAGE_RW in the code. How about: /* * Shadow stack pages are logically writable, but do not have * _PAGE_RW. Check for them separately from _PAGE_RW itself. */ > + return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte); > } > > #define pmd_write pmd_write > static inline int pmd_write(pmd_t pmd) > { > - return pmd_flags(pmd) & _PAGE_RW; > + /* > + * Shadow stack pages are always writable - but not by normal > + * instructions, and only by shadow stack operations. Therefore, > + * the W=0,D=1 test with pmd_shstk(). > + */ > + return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd); > } Ditto on the comment. Please copy the pte_write() one here too. > > #define pud_write pud_write > @@ -299,6 +334,24 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear) > return native_make_pte(v & ~clear); > } > > +static inline pte_t pte_mkcow(pte_t pte) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return pte; > + > + pte = pte_clear_flags(pte, _PAGE_DIRTY); > + return pte_set_flags(pte, _PAGE_COW); > +} > + > +static inline pte_t pte_clear_cow(pte_t pte) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return pte; > + > + pte = pte_set_flags(pte, _PAGE_DIRTY); > + return pte_clear_flags(pte, _PAGE_COW); > +} I think we need to say *SOMETHING* about the X86_FEATURE_SHSTK and _PAGE_COW connection here. Otherwise they look like two random features that are interacting in an unknown way. Maybe even something this simple: /* * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels. * See the _PAGE_COW definition for more details. */ Also, the manipulation of _PAGE_DIRTY is not clear here. It's obvious why we have to: pte_clear_flags(pte, _PAGE_COW); in a function called pte_clear_cow() but, again, how does _PAGE_DIRTY fit? > #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP > static inline int pte_uffd_wp(pte_t pte) > { > @@ -318,7 +371,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte) > > static inline pte_t pte_mkclean(pte_t pte) > { > - return pte_clear_flags(pte, _PAGE_DIRTY); > + return pte_clear_flags(pte, _PAGE_DIRTY_BITS); > } > > static inline pte_t pte_mkold(pte_t pte) > @@ -328,7 +381,16 @@ static inline pte_t pte_mkold(pte_t pte) > > static inline pte_t pte_wrprotect(pte_t pte) > { > - return pte_clear_flags(pte, _PAGE_RW); > + pte = pte_clear_flags(pte, _PAGE_RW); > + > + /* > + * Blindly clearing _PAGE_RW might accidentally create > + * a shadow stack PTE (RW=0, Dirty=1). Move the hardware Could you grep this series and try to be consistent about the formatting here? (Not that I've been perfect in this regard either). I think we have at least: Write=X,Dirty=Y W=X,D=Y RW=X,Dirty=Y > + * dirty value to the software bit. > + */ > + if (pte_dirty(pte)) > + pte = pte_mkcow(pte); > + return pte; > } One of my logical checks for this is "does it all go away when this is compiled out". Because of this: +static inline pte_t pte_mkcow(pte_t pte) +{ + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) + return pte; ... the answer is yes! So, this looks good to me. Just thought I'd share a bit of my thought process. > static inline pte_t pte_mkexec(pte_t pte) > @@ -338,7 +400,18 @@ static inline pte_t pte_mkexec(pte_t pte) > > static inline pte_t pte_mkdirty(pte_t pte) > { > - return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); > + pteval_t dirty = _PAGE_DIRTY; > + > + /* Avoid creating (HW)Dirty=1, Write=0 PTEs */ The "(HW)" thing doesn't make a lot of sense any longer. I think we had a set of HWDirty and SWDirty bits, but SWDirty ended up being morphed over to _PAGE_COW. > + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte)) > + dirty = _PAGE_COW; > + > + return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY); > +} > + > +static inline pte_t pte_mkwrite_shstk(pte_t pte) > +{ > + return pte_clear_cow(pte); > } This one is a bit of black magic. This is taking a PTE from (presumably) states [c]->[d] from earlier in the changelog. Write=0,Dirty=0,Cow=1 to Write=0,Dirty=1,Cow=0 It's hard to wrap my head around how clearing a software bit (from the naming) will make this PTE writable. There's either something wrong with the naming, or something wrong with my mental model of what "COW clearing" is. > static inline pte_t pte_mkyoung(pte_t pte) > @@ -348,7 +421,12 @@ static inline pte_t pte_mkyoung(pte_t pte) > > static inline pte_t pte_mkwrite(pte_t pte) > { > - return pte_set_flags(pte, _PAGE_RW); > + pte = pte_set_flags(pte, _PAGE_RW); > + > + if (pte_dirty(pte)) > + pte = pte_clear_cow(pte); > + > + return pte; > } Along the same lines as the last few comments, this leaves me wondering why a pte_dirty() can't also be a "COW PTE". ... ... > #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h > index 3781a79b6388..1bfab70ff9ac 100644 > --- a/arch/x86/include/asm/pgtable_types.h > +++ b/arch/x86/include/asm/pgtable_types.h > @@ -21,7 +21,8 @@ > #define _PAGE_BIT_SOFTW2 10 /* " */ > #define _PAGE_BIT_SOFTW3 11 /* " */ > #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ > -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ > +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ > +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ > #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ > #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ > #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ > @@ -34,6 +35,15 @@ > #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ > #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 > > +/* > + * Indicates a copy-on-write page. > + */ > +#ifdef CONFIG_X86_SHADOW_STACK > +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ > +#else > +#define _PAGE_BIT_COW 0 > +#endif > + > /* If _PAGE_BIT_PRESENT is clear, we use these: */ > /* - if the user mapped it with PROT_NONE; pte_present gives true */ > #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL > @@ -115,6 +125,36 @@ > #define _PAGE_DEVMAP (_AT(pteval_t, 0)) > #endif > > +/* > + * The hardware requires shadow stack to be read-only and Dirty. > + * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs > + * from shadow stack PTEs: > + * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) > + * (b) A R/O page that has been COW'ed: (Write=0, Cow=1) > + * The user page is in a R/O VMA, and get_user_pages() needs a > + * writable copy. The page fault handler creates a copy of the page > + * and sets the new copy's PTE as Write=0, Cow=1. > + * (c) A shadow stack PTE: (Write=0, Dirty=1) > + * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1) > + * When a shadow stack page is being shared among processes (this > + * happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next > + * shadow stack access causes a fault, and the page is duplicated and > + * _PAGE_DIRTY is set again. This is the COW equivalent for shadow > + * stack pages, even though it's copy-on-access rather than > + * copy-on-write. > + * (e) A page where the processor observed a Write=1 PTE, started a write, > + * set Dirty=1, but then observed a Write=0 PTE (changed by another > + * thread). That's possible today, but will not happen on processors > + * that support shadow stack. This info, again, is great. Let's keep it, but please do reformat it like the changelog version to make the bit states easier to grok.