Received: by 2002:a05:6a10:413:0:0:0:0 with SMTP id 19csp2861592pxp; Tue, 8 Mar 2022 03:27:50 -0800 (PST) X-Google-Smtp-Source: ABdhPJxTmAnI8x/P7qVK3L+ot0Ad+iE2HEJRSznRFcEP0kRnjomncLdFhsYcBHyX3U9/0ezkBj04 X-Received: by 2002:aa7:d494:0:b0:415:a309:7815 with SMTP id b20-20020aa7d494000000b00415a3097815mr15493231edr.340.1646738870632; Tue, 08 Mar 2022 03:27:50 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1646738870; cv=none; d=google.com; s=arc-20160816; b=ObHPyakti7zRNnTYNxVM9InOwUpLj2LiVWsmq91KB707bKtphn9b1p47mkoR9NJHjx VAO2Mvdp1FDcw2ibK6qMmSmUiUTATI8vXqdVg9E4QrWc2cfOq7RJnxSn896oKEV75Bw7 TZvgF9re8fxUrMRBdfzXLXm3TXA2zLLuAKuLdSQ9A4fm8kfADmGbpTBtjpXLXN53vgLB 3kDde0M/226HduQrahEP5Up/TGyloDSBRX/+hkzzUtU4XplYVDzR1N7ZkKhd/Tllu95w sll0xtfkFq1tbZPBDQE0sFvna22LRqxtgv6vpRw3uDviARvOA6jlbBwUIp2FTcQUka24 syKg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=ghTbOWe0LDd6HXnjRCHukLklmmMXOneQF1/fOPd7iuo=; b=iWZGYkyt2vj9mAUMEl1wJpRV+pQf/jyl1uMOl6RqAwR+/Yz26sLFEoG3euvyDQUBP7 mF86e/X0G4mnb344MUtNhtPbUHVlrwVerRsPe/N84Y7Bf+ShrTmio+UCnHDYByO9Wkgs l3QVRPMzzIGRhGdhC1p758ZLEZnEL6em2/+BP6QlQ2Xn9LqEN3e4cjgSFvMyfsB+OMBR l2OF/a539AhyLBnshh/u4o6hGKSasQ77C1RjVDUAJ63w3mYcdkv7kdb5oMY+FXJcJdBs zopyEtFWQNDNhNc6UCY1KGuIEZuod2ZOgqrizJVQS6Yl7gV7MZBXqb9Qp1B1zKr7MUTH 2yaA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=KIyJ5Z4W; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x7-20020a170906148700b006d159679417si10267309ejc.958.2022.03.08.03.27.27; Tue, 08 Mar 2022 03:27:50 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=KIyJ5Z4W; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243239AbiCHJOf (ORCPT + 99 others); Tue, 8 Mar 2022 04:14:35 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42106 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344104AbiCHJOf (ORCPT ); Tue, 8 Mar 2022 04:14:35 -0500 Received: from sin.source.kernel.org (sin.source.kernel.org [145.40.73.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5AB6033E32; Tue, 8 Mar 2022 01:13:38 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sin.source.kernel.org (Postfix) with ESMTPS id 948FCCE1390; Tue, 8 Mar 2022 09:13:36 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 51FF4C340EC; Tue, 8 Mar 2022 09:13:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1646730814; bh=P2ow813GjxN7y90Nbi5mdcX3/Qx4pkrzBJMjlJuB0bI=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=KIyJ5Z4W+xpoAD4uEopMj0FyUy6TilJVBwQnKphQxiVjAS2oYNV3kOC9cFmXV4Ds5 QgqGPhJX4awF1yKDWV/IK6R2owmLHfLdXpPoJmRvt83sizBCQ/6hR969iatF/Oz/L0 YfIyLex4Fovm0OIRiU8g3yW/vSwrLVb4tkwMRHOtV3ydtGlqsZ1zBPldiPVr1ObBcE B6KfzSomzYQsALKvIzauTgMdqqc3BlgdX7lb3PWKcRIlmBenkIoXlZSq4hExloE9JF wHVDm6Fjl3kg3l72/8on1OtOQZ0vc+L4AUf0+SNUi6IqPK84tLw06Luq+robd0ShKk eE15ZSJbq+1gg== Date: Tue, 8 Mar 2022 11:12:54 +0200 From: Jarkko Sakkinen To: Reinette Chatre Cc: dave.hansen@linux.intel.com, tglx@linutronix.de, bp@alien8.de, luto@kernel.org, mingo@redhat.com, linux-sgx@vger.kernel.org, x86@kernel.org, seanjc@google.com, kai.huang@intel.com, cathy.zhang@intel.com, cedric.xing@intel.com, haitao.huang@intel.com, mark.shanahan@intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions Message-ID: References: <0555a4b4a5e8879eb8f879ab3d9908302000f11c.1644274683.git.reinette.chatre@intel.com> <774e906c-53fc-279c-1fb7-0922557a9da5@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-7.6 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote: > On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote: > > On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote: > > > Hi Jarkko, > > > > > > On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote: > > > > On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote: > > > >> === Summary === > > > >> > > > >> An SGX VMA can only be created if its permissions are the same or > > > >> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA > > > >> creation this same rule is again enforced by the page fault handler: > > > >> faulted enclave pages are required to have equal or more relaxed > > > >> EPCM permissions than the VMA permissions. > > > >> > > > >> On SGX1 systems the additional enforcement in the page fault handler > > > >> is redundant and on SGX2 systems it incorrectly prevents access. > > > >> On SGX1 systems it is unnecessary to repeat the enforcement of the > > > >> permission rule. The rule used during original VMA creation will > > > >> ensure that any access attempt will use correct permissions. > > > >> With SGX2 the EPCM permissions of a page can change after VMA > > > >> creation resulting in the VMA permissions potentially being more > > > >> relaxed than the EPCM permissions and the page fault handler > > > >> incorrectly blocking valid access attempts. > > > >> > > > >> Enable the VMA's pages to remain accessible while ensuring that > > > >> the PTEs are installed to match the EPCM permissions but not be > > > >> more relaxed than the VMA permissions. > > > >> > > > >> === Full Changelog === > > > >> > > > >> An SGX enclave is an area of memory where parts of an application > > > >> can reside. First an enclave is created and loaded (from > > > >> non-enclave memory) with the code and data of an application, > > > >> then user space can map (mmap()) the enclave memory to > > > >> be able to enter the enclave at its defined entry points for > > > >> execution within it. > > > >> > > > >> The hardware maintains a secure structure, the Enclave Page Cache Map > > > >> (EPCM), that tracks the contents of the enclave. Of interest here is > > > >> its tracking of the enclave page permissions. When a page is loaded > > > >> into the enclave its permissions are specified and recorded in the > > > >> EPCM. In parallel the kernel maintains permissions within the > > > >> page table entries (PTEs) and the rule is that PTE permissions > > > >> are not allowed to be more relaxed than the EPCM permissions. > > > >> > > > >> A new mapping (mmap()) of enclave memory can only succeed if the > > > >> mapping has the same or weaker permissions than the permissions that > > > >> were vetted during enclave creation. This is enforced by > > > >> sgx_encl_may_map() that is called on the mmap() as well as mprotect() > > > >> paths. This rule remains. > > > >> > > > >> One feature of SGX2 is to support the modification of EPCM permissions > > > >> after enclave initialization. Enclave pages may thus already be part > > > >> of a VMA at the time their EPCM permissions are changed resulting > > > >> in the VMA's permissions potentially being more relaxed than the EPCM > > > >> permissions. > > > >> > > > >> Allow permissions of existing VMAs to be more relaxed than EPCM > > > >> permissions in preparation for dynamic EPCM permission changes > > > >> made possible in SGX2. New VMAs that attempt to have more relaxed > > > >> permissions than EPCM permissions continue to be unsupported. > > > >> > > > >> Reasons why permissions of existing VMAs are allowed to be more relaxed > > > >> than EPCM permissions instead of dynamically changing VMA permissions > > > >> when EPCM permissions change are: > > > >> 1) Changing VMA permissions involve splitting VMAs which is an > > > >> operation that can fail. Additionally changing EPCM permissions of > > > >> a range of pages could also fail on any of the pages involved. > > > >> Handling these error cases causes problems. For example, if an > > > >> EPCM permission change fails and the VMA has already been split > > > >> then it is not possible to undo the VMA split nor possible to > > > >> undo the EPCM permission changes that did succeed before the > > > >> failure. > > > >> 2) The kernel has little insight into the user space where EPCM > > > >> permissions are controlled from. For example, a RW page may > > > >> be made RO just before it is made RX and splitting the VMAs > > > >> while the VMAs may change soon is unnecessary. > > > >> > > > >> Remove the extra permission check called on a page fault > > > >> (vm_operations_struct->fault) or during debugging > > > >> (vm_operations_struct->access) when loading the enclave page from swap > > > >> that ensures that the VMA permissions are not more relaxed than the > > > >> EPCM permissions. Since a VMA could only exist if it passed the > > > >> original permission checks during mmap() and a VMA may indeed > > > >> have more relaxed permissions than the EPCM permissions this extra > > > >> permission check is no longer appropriate. > > > >> > > > >> With the permission check removed, ensure that PTEs do > > > >> not blindly inherit the VMA permissions but instead the permissions > > > >> that the VMA and EPCM agree on. PTEs for writable pages (from VMA > > > >> and enclave perspective) are installed with the writable bit set, > > > >> reducing the need for this additional flow to the permission mismatch > > > >> cases handled next. > > > >> > > > >> Signed-off-by: Reinette Chatre > > > >> --- > > > >> Changes since V1: > > > >> - Reword commit message (Jarkko). > > > >> - Use "relax" instead of "exceed" when referring to permissions (Dave). > > > >> - Add snippet to Documentation/x86/sgx.rst that highlights the > > > >> relationship between VMA, EPCM, and PTE permissions on SGX > > > >> systems (Andy). > > > >> > > > >> Documentation/x86/sgx.rst | 10 +++++++++ > > > >> arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++---------------- > > > >> 2 files changed, 30 insertions(+), 18 deletions(-) > > > >> > > > >> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst > > > >> index 89ff924b1480..5659932728a5 100644 > > > >> --- a/Documentation/x86/sgx.rst > > > >> +++ b/Documentation/x86/sgx.rst > > > >> @@ -99,6 +99,16 @@ The relationships between the different permission masks are: > > > >> * PTEs are installed to match the EPCM permissions, but not be more > > > >> relaxed than the VMA permissions. > > > >> > > > >> +On systems supporting SGX2 EPCM permissions may change while the > > > >> +enclave page belongs to a VMA without impacting the VMA permissions. > > > >> +This means that a running VMA may appear to allow access to an enclave > > > >> +page that is not allowed by its EPCM permissions. For example, when an > > > >> +enclave page with RW EPCM permissions is mapped by a RW VMA but is > > > >> +subsequently changed to have read-only EPCM permissions. The kernel > > > >> +continues to maintain correct access to the enclave page through the > > > >> +PTE that will ensure that only access allowed by both the VMA > > > >> +and EPCM permissions are permitted. > > > >> + > > > >> Application interface > > > >> ===================== > > > >> > > > >> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > > > >> index 48afe96ae0f0..b6105d9e7c46 100644 > > > >> --- a/arch/x86/kernel/cpu/sgx/encl.c > > > >> +++ b/arch/x86/kernel/cpu/sgx/encl.c > > > >> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page, > > > >> } > > > >> > > > >> static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl, > > > >> - unsigned long addr, > > > >> - unsigned long vm_flags) > > > >> + unsigned long addr) > > > >> { > > > >> - unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC); > > > >> struct sgx_epc_page *epc_page; > > > >> struct sgx_encl_page *entry; > > > >> > > > >> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl, > > > >> if (!entry) > > > >> return ERR_PTR(-EFAULT); > > > >> > > > >> - /* > > > >> - * Verify that the faulted page has equal or higher build time > > > >> - * permissions than the VMA permissions (i.e. the subset of {VM_READ, > > > >> - * VM_WRITE, VM_EXECUTE} in vma->vm_flags). > > > >> - */ > > > >> - if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits) > > > >> - return ERR_PTR(-EFAULT); > > > >> - > > > >> /* Entry successfully located. */ > > > >> if (entry->epc_page) { > > > >> if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED) > > > >> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf) > > > >> { > > > >> unsigned long addr = (unsigned long)vmf->address; > > > >> struct vm_area_struct *vma = vmf->vma; > > > >> + unsigned long page_prot_bits; > > > >> struct sgx_encl_page *entry; > > > >> + unsigned long vm_prot_bits; > > > >> unsigned long phys_addr; > > > >> struct sgx_encl *encl; > > > >> vm_fault_t ret; > > > >> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf) > > > >> > > > >> mutex_lock(&encl->lock); > > > >> > > > >> - entry = sgx_encl_load_page(encl, addr, vma->vm_flags); > > > >> + entry = sgx_encl_load_page(encl, addr); > > > >> if (IS_ERR(entry)) { > > > >> mutex_unlock(&encl->lock); > > > > > > > >> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf) > > > >> > > > >> phys_addr = sgx_get_epc_phys_addr(entry->epc_page); > > > >> > > > >> - ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr)); > > > >> + /* > > > >> + * Insert PTE to match the EPCM page permissions ensured to not > > > >> + * exceed the VMA permissions. > > > >> + */ > > > >> + vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC); > > > >> + page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits; > > > >> + /* > > > >> + * Add VM_SHARED so that PTE is made writable right away if VMA > > > >> + * and EPCM are writable (no COW in SGX). > > > >> + */ > > > >> + page_prot_bits |= (vma->vm_flags & VM_SHARED); > > > >> + ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr), > > > >> + vm_get_page_prot(page_prot_bits)); > > > >> if (ret != VM_FAULT_NOPAGE) { > > > >> mutex_unlock(&encl->lock); > > > >> > > > >> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag > > > >> * Load an enclave page to EPC if required, and take encl->lock. > > > >> */ > > > >> static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl, > > > >> - unsigned long addr, > > > >> - unsigned long vm_flags) > > > >> + unsigned long addr) > > > >> { > > > >> struct sgx_encl_page *entry; > > > >> > > > >> for ( ; ; ) { > > > >> mutex_lock(&encl->lock); > > > >> > > > >> - entry = sgx_encl_load_page(encl, addr, vm_flags); > > > >> + entry = sgx_encl_load_page(encl, addr); > > > >> if (PTR_ERR(entry) != -EBUSY) > > > >> break; > > > >> > > > >> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr, > > > >> return -EFAULT; > > > >> > > > >> for (i = 0; i < len; i += cnt) { > > > >> - entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK, > > > >> - vma->vm_flags); > > > >> + entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK); > > > >> if (IS_ERR(entry)) { > > > >> ret = PTR_ERR(entry); > > > >> break; > > > >> -- > > > >> 2.25.1 > > > >> > > > > > > > > If you unconditionally set vm_max_prot_bits to RWX for dynamically created > > > > pags, you would not need to do this. > > > > > > > > These patches could be then safely dropped then: > > > > > > > > - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions > > > > - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes > > > > - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions > > > > > > > > And that would also keep full ABI compatibility without exceptions to the > > > > existing mainline code. > > > > > > > > > > Dropping these changes do not just impact dynamically created pages. Dropping > > > these patches would result in EPCM page permission restriction being supported > > > for all pages, those added before enclave initialization as well as dynamically > > > added pages, but their PTEs will not be impacted. > > > > > > For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and > > > then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux > > > would keep allowing and installing RW PTEs to this page. > > > > I think that would be perfectly fine, if someone wants to do that. There is > > no corrateral damage on doing that. Kernel does not get messed because of > > that. It's a use case that does not make sense in the first place, so it'd > > be stupid to build anything extensive around it to the kernel. > > > > Shooting yourself to the foot is something that kernel does and should not > > protect user space from unless there is a risk of messing the state of the > > kernel itself. > > > > Much worse is that we have e.g. completely artificial ioctl > > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to support this scheme, which could e.g. > > cause extra roundtrips for simple EMODPE. > > > > Also this means not having to include 06/32, which keeps 100% backwards > > compatibility in run-time behaviour to the mainline while not restricting > > at all dynamically created pages. And we get rid of complex book keeping > > of vm_run_prot_bits. > > > > And generally the whole model is then very easy to understand and explain. > > If I had to keep presentation of the current mess in the patch set in a > > conference, I can honestly say that I would be in serious trouble. It's > > not clean and clear security model, which is a risk by itself. > > I.e. > > 1. For EADD'd pages: stick what has been the invariant 1,5 years now. Do > not change it by any means (e.g. 06/32). > 2. For EAUG'd pages: set vm_max_prot_bits RWX, which essentially means do > what ever you want with PTE's and EPCM. > > It's a clear and understandable model that does nothing bad to the kernel, > and a run-time developer can surely find away to get things on going. For > user space, the most important thing is the clarity in kernel behaviour, > and this does deliver that clarity. It's not perfect but it does do the > job and anyone can get it. Also a quantitive argument for this is that by simplifying security model this way it is one ioctl less, which must be considered as +1. We do not want to add new ioctls unless it is something we absolutely cannnot live without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS. BR, Jarkko