Received: by 2002:a05:6a10:f347:0:0:0:0 with SMTP id d7csp3381880pxu; Tue, 8 Dec 2020 10:28:38 -0800 (PST) X-Google-Smtp-Source: ABdhPJwTm9OTWRv7zxYBob1LFwj3vV93Wmgh2e9z2sVcgDtsu1t/KYx765lRiDUJiw59VTlblsIs X-Received: by 2002:a50:ed04:: with SMTP id j4mr26635706eds.84.1607452118362; Tue, 08 Dec 2020 10:28:38 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1607452118; cv=none; d=google.com; s=arc-20160816; b=OJus1z9b987dGN7n7CKadvSk/M4wlyaPTyb15oWetmB74Q6MlzNYdvxK27vyhcMw64 7RBFxN9BpMu9iYtoofjX09swwlJ2nYU/e23YfAjggMZIk8zWXk+l0ycvbi06dkn7Nj0z 9ZE11ssbSbucVoofbK7wJ3kYUrc/I5JDfPscXH2GG3gN8grYpGYTdVllhzUyR2Hnt2XQ KmC0nIX0PZgYIecx9Osy5wpDJOysYmAjkdrfD9J/9wRoY/XX+Be5RcPj6i7b8oqsHelh AHYAGpCLRPTjym/CIdM39o8x2RxOjN+j/gSedHigNZ4kjHhq7IQ/53BDtEJlHSAcKcVA UnBg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:content-language :in-reply-to:mime-version:user-agent:date:message-id:from:references :cc:to:subject:ironport-sdr:ironport-sdr; bh=uIvWOPXImL9oTOnPD3YgKtWLlxS4hSGTg8xM6ezdcNw=; b=EZTlE3HdyeOD82yP4dgXVjMKwvfY/ngwtSUvssCs+5pyS9zH+KH7nk62P+ge83cMex U3yoapwSHuirfZlzMUiSYQZCclVVfuJqybLIAeO2FoK4EkTYdlHe5YU0dac4yMI026G4 qN/4yAJgODks6215mK0u2CCo5vmeXQeWgNGPYCfbeg3Sebwo6HBbfHLFfCPdV4+Jw/79 0L2p3L57ImwsakSINbYIcsuXUR0r5R1AHSUcpO1d7otMHzwx2KBxsi8o/+Kfc1Bj5Vr+ q6VdpprOZuBJLc+NtCfqmt48tyjr6MHxa2nIfKEAR0BP0JyUfJ3/kTNw2hmy8HWAM+P2 PFng== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id w10si4226869ejv.730.2020.12.08.10.28.14; Tue, 08 Dec 2020 10:28:38 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730994AbgLHS0K (ORCPT + 99 others); Tue, 8 Dec 2020 13:26:10 -0500 Received: from mga02.intel.com ([134.134.136.20]:29455 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730894AbgLHS0K (ORCPT ); Tue, 8 Dec 2020 13:26:10 -0500 IronPort-SDR: /yg7CeF3YvlmAQnd/x01t6+NXRWrpogDAn9HQAxozVRZT4CkfnTv3k2FaxfHf7o4iwsfhbAMex r18DfVZEjBDw== X-IronPort-AV: E=McAfee;i="6000,8403,9829"; a="160995662" X-IronPort-AV: E=Sophos;i="5.78,403,1599548400"; d="scan'208";a="160995662" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2020 10:25:29 -0800 IronPort-SDR: MVXY+ufzzjWpRkO/fcrw7djp1AEmJjNq9PnlBlwIbD8V1ZktkEzklCAxWd0oBRp2C8pAxGfKhi xQw8ZyCXIxMA== X-IronPort-AV: E=Sophos;i="5.78,403,1599548400"; d="scan'208";a="317917122" Received: from yyu32-mobl1.amr.corp.intel.com (HELO [10.209.139.184]) ([10.209.139.184]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2020 10:25:27 -0800 Subject: Re: [PATCH v15 08/26] x86/mm: Introduce _PAGE_COW To: Borislav Petkov Cc: x86@kernel.org, "H. Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H.J. Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V. Shankar" , Vedvyas Shanbhogue , Dave Martin , Weijiang Yang , Pengfei Xu References: <20201110162211.9207-1-yu-cheng.yu@intel.com> <20201110162211.9207-9-yu-cheng.yu@intel.com> <20201208175014.GD27920@zn.tnic> From: "Yu, Yu-cheng" Message-ID: <218503f6-eec1-94b0-8404-6f92c55799e3@intel.com> Date: Tue, 8 Dec 2020 10:25:15 -0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.5.1 MIME-Version: 1.0 In-Reply-To: <20201208175014.GD27920@zn.tnic> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/8/2020 9:50 AM, Borislav Petkov wrote: > On Tue, Nov 10, 2020 at 08:21:53AM -0800, Yu-cheng Yu wrote: >> There is essentially no room left in the x86 hardware PTEs on some OSes >> (not Linux). That left the hardware architects looking for a way to >> represent a new memory type (shadow stack) within the existing bits. >> They chose to repurpose a lightly-used state: Write=0,Dirty=1. > > It is not clear to me what the definition and semantics of that bit is. > > +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ > > Is it set by hw or by sw and hw uses it to know it is a shadow stack > page, and so on. > > I think you should lead with its definition. Ok. ... >> Write=0,Dirty=1 PTEs. In places where we do create them, we need to find >> an alternative way to represent them _without_ using the same hardware bit >> combination. Thus, enter _PAGE_COW. This results in the following: >> >> (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW) >> (b) A R/O page that has been COW'ed: (R/O + _PAGE_COW) > > Both are "R/O + _PAGE_COW". Where's the difference? The dirty bit? The PTEs are the same for both (a) and (b), but come from different routes. >> The user page is in a R/O VMA, and get_user_pages() needs a writable >> copy. The page fault handler creates a copy of the page and sets >> the new copy's PTE as R/O and _PAGE_COW. >> (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW) > > So W=0, D=1 ? Yes. >> (d) A shared shadow stack PTE: (R/O + _PAGE_COW) >> When a shadow stack page is being shared among processes (this happens >> at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the next shadow >> stack access causes a fault, and the page is duplicated and >> _PAGE_DIRTY_HW is set again. This is the COW equivalent for shadow >> stack pages, even though it's copy-on-access rather than copy-on-write. >> (e) A page where the processor observed a Write=1 PTE, started a write, set >> Dirty=1, but then observed a Write=0 PTE. > > How does that happen? Something changed the PTE's W bit to 0 in-between? Yes. ... >> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h >> index b23697658b28..c88c7ccf0318 100644 >> --- a/arch/x86/include/asm/pgtable.h >> +++ b/arch/x86/include/asm/pgtable.h >> @@ -121,9 +121,9 @@ extern pmdval_t early_pmd_flags; >> * The following only work if pte_present() is true. >> * Undefined behaviour if not.. >> */ >> -static inline int pte_dirty(pte_t pte) >> +static inline bool pte_dirty(pte_t pte) >> { >> - return pte_flags(pte) & _PAGE_DIRTY_HW; >> + return pte_flags(pte) & _PAGE_DIRTY_BITS; > > Why? > > Does _PAGE_COW mean dirty too? Yes. Basically [read-only & dirty] is created by software. Now the software uses a different bit. >> @@ -343,6 +349,17 @@ static inline pte_t pte_mkold(pte_t pte) >> >> static inline pte_t pte_wrprotect(pte_t pte) >> { >> + /* >> + * Blindly clearing _PAGE_RW might accidentally create >> + * a shadow stack PTE (RW=0,Dirty=1). Move the hardware >> + * dirty value to the software bit. >> + */ >> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { >> + pte.pte |= (pte.pte & _PAGE_DIRTY_HW) >> >> + _PAGE_BIT_DIRTY_HW << _PAGE_BIT_COW; > > Let that line stick out. And that shifting is not grokkable at a quick > glance, at least not to me. Simplify? Ok. >> static inline pmd_t pmd_wrprotect(pmd_t pmd) >> { >> + /* >> + * Blindly clearing _PAGE_RW might accidentally create >> + * a shadow stack PMD (RW=0,Dirty=1). Move the hardware >> + * dirty value to the software bit. > > This whole carefully sidestepping the possiblity of creating a shadow > stack pXd is kinda sucky... > >> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h >> index 7462a574fc93..5f764d8d9bae 100644 >> --- a/arch/x86/include/asm/pgtable_types.h >> +++ b/arch/x86/include/asm/pgtable_types.h >> @@ -23,7 +23,8 @@ >> #define _PAGE_BIT_SOFTW2 10 /* " */ >> #define _PAGE_BIT_SOFTW3 11 /* " */ >> #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ >> -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ >> +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ >> +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ >> #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ >> #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ >> #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ >> @@ -36,6 +37,16 @@ >> #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ >> #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 >> >> +/* >> + * This bit indicates a copy-on-write page, and is different from >> + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to. >> + */ >> +#ifdef CONFIG_X86_64 > > CONFIG_X86_64 ? Do all x86 machines out there support CET? > > If anything, CONFIG_X86_CET... Ok. -- Yu-cheng