Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp3202197imu; Mon, 17 Dec 2018 15:24:23 -0800 (PST) X-Google-Smtp-Source: AFSGD/WI1QK71L2tXWnohvRn8KFO83wNZ62tTp7NIxffGlrnCfV3YumDothZdthrgDYRJwCcRRV7 X-Received: by 2002:a63:f74f:: with SMTP id f15mr13921655pgk.190.1545089063262; Mon, 17 Dec 2018 15:24:23 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1545089063; cv=none; d=google.com; s=arc-20160816; b=JK8ghb0ElIqZ+Gn+huUf3YCkdxLFF8rzPUzCR/C9lgyYBQ4RiSqih5SakSOE0/N9lY yNNTpKhZkX2oJ01Q1tRmjzlTpS5c2wQTWjZqwVYhbF6x3tBnHXEwLhoI03bl43jlS8fz abwiuy32tKsg6kCxNJ9lOkYSbVTv0sb1wzwyVs7KuuCJWqB/0JVY4sqRnTVgS8mdaD0K PD9Hv8/0lnvjJ3SUGJfRLALrtX138I+Lc4rMaImaeB/P94y9vjuB5KnYfJbB+XFyK2pF wehLoR4j6nCHqIDliQXvUCadFy+vlIxTFmhjsEbLbWyM4MoiAOIIQh5bwGY8FV3evpBP wQpg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=3sIWCKudqBSLyPgSLmPQgtwb+a+bsugV0iJUaJbWwU0=; b=H97qIs8AbQEN05SDzV/F+eepi1RxQ1UwP5zlIpshplR0k9qFmgBbEJo1zM0k/N8Pj5 F9s+xsnUFFOADiedvBejC6J4hrHalt/NZukGRHRfEaibxPmhTssE5qmS1oAiG+geVE7E kjdaUKhydcwdEBJEfYRO/7KAEZrin5ByaF1A4iQ+gslMe9g7YsR7ZTajxSlqPUmONZxB iCFbLWPYU83vw9IUVyPSE8PksT2un/pVHTujZ3/cLvq/irIjAG4RKQ+0sj/3P8hfmuGr WZ6S21Z7l0JH3b24x2rzDhIR3Jlc63nPTiHYXroQAFNJC7UFa/EKL6mOWeIZY8WihNeU loqw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id x74si12801467pfe.23.2018.12.17.15.24.08; Mon, 17 Dec 2018 15:24:23 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730928AbeLQWUt (ORCPT + 99 others); Mon, 17 Dec 2018 17:20:49 -0500 Received: from mga18.intel.com ([134.134.136.126]:13641 "EHLO mga18.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726882AbeLQWUt (ORCPT ); Mon, 17 Dec 2018 17:20:49 -0500 X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga106.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 17 Dec 2018 14:20:48 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,366,1539673200"; d="scan'208";a="111269674" Received: from sjchrist-coffee.jf.intel.com (HELO linux.intel.com) ([10.54.74.154]) by orsmga003.jf.intel.com with ESMTP; 17 Dec 2018 14:20:48 -0800 Date: Mon, 17 Dec 2018 14:20:48 -0800 From: Sean Christopherson To: Andy Lutomirski Cc: Dave Hansen , Jarkko Sakkinen , X86 ML , Platform Driver , linux-sgx@vger.kernel.org, nhorman@redhat.com, npmccallum@redhat.com, "Ayoun, Serge" , shay.katz-zamir@intel.com, Haitao Huang , Andy Shevchenko , Thomas Gleixner , "Svahn, Kai" , mark.shanahan@intel.com, Suresh Siddha , Ingo Molnar , Borislav Petkov , "H. Peter Anvin" , Darren Hart , Andy Shevchenko , "open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)" Subject: Re: [PATCH v17 18/23] platform/x86: Intel SGX driver Message-ID: <20181217222047.GG12491@linux.intel.com> References: <20181116010412.23967-1-jarkko.sakkinen@linux.intel.com> <20181116010412.23967-19-jarkko.sakkinen@linux.intel.com> <7d5cde02-4649-546b-0f03-2d6414bb80b5@intel.com> <20181217180102.GA12560@linux.intel.com> <20181217183613.GD12491@linux.intel.com> <20181217184333.GA26920@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 17, 2018 at 11:12:21AM -0800, Andy Lutomirski wrote: > On Mon, Dec 17, 2018 at 10:47 AM Dave Hansen wrote: > > > > On 12/17/18 10:43 AM, Jarkko Sakkinen wrote: > > > On Mon, Dec 17, 2018 at 10:36:13AM -0800, Sean Christopherson wrote: > > >> I'm pretty sure doing mmget() would result in circular dependencies and > > >> a zombie enclave. In the do_exit() case where a task is abruptly killed: > > >> > > >> - __mmput() is never called because the enclave holds a ref > > >> - sgx_encl_release() is never be called because its VMAs hold refs > > >> - sgx_vma_close() is never called because __mmput()->exit_mmap() is > > >> blocked and the process itself is dead, i.e. won't unmap anything. > > > Right, it does, you are absolutely right. Tried it and removed the > > > commit already. > > > > > > Well, what we came up from your suggestion i.e. setting mm to NULL > > > and checking that is very subtle change and does not have any such > > > circular dependencies. We'll go with that. > > > > This all screams that you need to break out this code from the massive > > "18" patch and get the mm interactions reviewed more thoroughly. > > > > Also, no matter what method you go with, you have a bunch of commenting > > and changelogging to do here. > > I'm going to ask an obnoxious high-level question: why does an enclave > even refer to a specific mm? Primarily because that's what the code has "always" done. I can't speak for Jarkko, but I got involved with this joyful project long after the code was originally written. > If I were designing this thing, and if I hadn't started trying to > implement it, my first thought would be that an enclave tracks its > linear address range, which is just a pair of numbers, and also keeps > track of a whole bunch of physical EPC pages, data structures, etc. > And that mmap() gets rejected unless the requested virtual address > matches the linear address range that the enclave wants and, aside > from that, just creates a VMA that keeps a reference to the enclave. > (And, for convenience, I suppose that the first mmap() call done > before any actual enclave setup happens could choose any address and > then cause the enclave to lock itself to that address, although a > regular anonymous PROT_NONE MAP_NORESERVE mapping would do just fine, > too.) And the driver would explicitly allow multiple different mms to > have the same enclave mapped. More importantly, a daemon could set up > an enclave without necessarily mapping it at all and then SCM_RIGHTS > the enclave over to the process that plans to run it. Hmm, this could work, the obvious weirdness would be ensuring the linear range is available in the destination mm, but that'd be userspace's problem. I don't think we'd need to keep a reference to the enclave in the VMA. The enclave's ref could be held by the fd. Assuming the kernel is using its private mapping to access the enclave, that's all we'd need to be able to manipulate the enclave, e.g. reclaim EPC pages. Userspace would need to keep the fd alive in order to use the VMA, but that sort of goes without saying. The mm/VMA juggling today is for zapping/testing the correct PTEs, but as you pointed out in a different email we can use unmap_mapping_range(), with the enclave's fd being the source of the address space passed to unmap_mapping_range(). Removing a VMA simply means we don't need to zap it or test its age. > Now I'm sure this has all kinds of problems, such as the ISA possibly > making it rather obnoxious to add pages to the enclave without having > it mapped. But these operations could, in principle, be done by > having the enclave own a private mm that's used just for setup. While > this would be vaguely annoying, Nadav's still-pending-but-nearly-done > text_poke series adds all the infrastructure that's needed for the > kernel to manage little private mms. But some things get simpler -- > invalidating the enclave can presumably use the regular rmap APIs to > zap all the PTEs in all VMAs pointing into the enclave. We don't even need a private mm, we can (and already do) use the kernel's translations for ENCLS instructions. Hardware only enforces the linear address stuff when it's actually in enclave mode, i.e. executing the enclave. ENCLS instructions aren't subject to the ELRANGE checks and can use any VA->PA combination. > So I'm not saying that you shouldn't do it the way you are now, but I > do think that the changelog or at least some emails should explain > *why* the enclave needs to keep a pointer to the creating process's > mm. And, if you do keep the current model, it would be nice to > understand what happens if you do something awful like mremap()ing an > enclave, or calling madvise on it, or otherwise abusing the vma. Or > doing fork(), for that matter. > > I also find it suspicious that the various ioctl handlers > systematically ignore their "filep" parameters and instead use > find_vma() to find the relevant mm data structures. That seems > backwards. My brain is still sorting out the details, but I generally like the idea of allocating an anon inode when creating an enclave, and exposing the other ioctls() via the returned fd. This is essentially the approach used by KVM to manage multiple "layers" of ioctls across KVM itself, VMs and vCPUS. There are even similarities to accessing physical memory via multiple disparate domains, e.g. host kernel, host userspace and guest. The only potential hiccup I can see is the build flow. Currently, EADD+EEXTEND is done via a work queue to avoid major performance issues (10x regression) when userspace is building multiple enclaves in parallel using goroutines to wrap Cgo (the issue might apply to any M:N scheduler, but I've only confirmed the Golang case). The issue is that allocating an EPC page acts like a blocking syscall when the EPC is under pressure, i.e. an EPC page isn't immediately available. This causes Go's scheduler to thrash and tank performance[1]. That being said, I think we could simply do mmgrab()/mmdrop() for each page to be added, and then do mmget_not_zero()/mmput() when actually inserting into the mm's page tables. Conceptually that seems cleaner than implicitly relying on the mmu_notifier to guarantee the lifecycle of the mm. Alternatively, we could change the EADD+EEXTEND flow to not insert the added page's PFN into the owner's process space, i.e. force userspace to fault when it runs the enclave. But that only delays the issue because eventually we'll want to account EPC pages, i.e. add a cgroup, at which point we'll likely need current->mm anyways. [1] https://github.com/golang/go/issues/19574