Date: Mon, 25 May 2020 08:27:04 +0300
From: "Kirill A. Shutemov"
To: "Kirill A. Shutemov"
Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Paolo Bonzini,
    Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, David Rientjes, Andrea Arcangeli, Kees Cook,
    Will Drewry, "Edgecombe, Rick P", "Kleen, Andi", x86@kernel.org,
    kvm@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Mike Rapoport, Alexandre Chartre, Marius Hillenbrand
Subject: Re: [RFC 00/16] KVM protected memory extension
Message-ID: <20200525052704.phyk5olkykncj3bj@black.fi.intel.com>
References: <20200522125214.31348-1-kirill.shutemov@linux.intel.com>
In-Reply-To: <20200522125214.31348-1-kirill.shutemov@linux.intel.com>

On Fri, May 22, 2020 at 03:51:58PM +0300, Kirill A. Shutemov wrote:
> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.

CC'ing people who worked on the related patchsets.

> == What does this set mitigate? ==
>
>  - Host kernel "accidental" access to guest data (think speculation)
>
>  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
>  - Host userspace access to guest data (compromised qemu)
>
> == What does this set NOT mitigate? ==
>
>  - Full host kernel compromise. The kernel will just map the pages again.
>
>  - Hardware attacks
>
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
>
> We are looking for high-level feedback on the concept. Some open
> questions:
>
>  - This protects from some kernel and host userspace read-only attacks,
>    but does not place the host kernel outside the trust boundary. Is it
>    still valuable?
>
>  - Can this approach be used to avoid cache-coherency problems with
>    hardware encryption schemes that repurpose physical bits?
>
>  - The guest kernel must be modified for this to work. Is that a deal
>    breaker, especially for public clouds?
>
>  - Are the costs of removing pages from the direct map too high to be
>    feasible?
>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it. This has the
> side-effect of making the kernel direct map and the userspace mapping
> (QEMU et al) useless. But, this teaches us something very useful:
> neither the kernel nor userspace mappings are really necessary for
> normal guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows
> bad accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other
> means. On Intel platforms, (single-key) Total Memory Encryption (TME)
> provides mitigation against physical attacks, such as DIMM interposers
> sniffing memory bus traffic.
>
> The patchset modifies both the host and guest kernels. The guest OS must
> enable the feature via hypercall and mark any memory range that has to
> be shared with the host: DMA regions, bounce buffers, etc. SEV does this
> marking via a bit in the guest's page table, while this approach uses a
> hypercall.
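
To make the guest side of this concrete, here is a minimal sketch of the
enable/share flow. The hypercall numbers and helper names below are
placeholders for illustration, not the actual interface from the series:

	#include <linux/kvm_para.h>	/* kvm_hypercall0/2() */

	/* Placeholder hypercall numbers, not the real ABI. */
	#define KVM_HC_ENABLE_MEM_PROTECTED	13
	#define KVM_HC_MEM_SHARE		14

	/* Opt in early in guest boot: the host then converts the memory
	 * slots backing the guest to PROT_NONE/unmapped. */
	static long mem_protected_enable(void)
	{
		return kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED);
	}

	/* Mark a range as shared again, e.g. a DMA region or bounce
	 * buffer, so the host can map and access it. */
	static long mem_share(unsigned long gfn, unsigned long npages)
	{
		return kvm_hypercall2(KVM_HC_MEM_SHARE, gfn, npages);
	}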
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries are converted to PROT_NONE with
> mprotect(), and newly faulted-in pages get PROT_NONE from the updated
> vm_page_prot. The new VMA flag -- VM_KVM_PROTECTED -- indicates that the
> pages in the VMA must be treated in a special way in the GUP and fault
> paths. The flag allows GUP to return the page even though it is mapped
> with PROT_NONE, but only if the new GUP flag -- FOLL_KVM -- is
> specified. Any userspace access to the memory would result in SIGBUS.
> Any GUP access without FOLL_KVM would result in -EFAULT.
>
> Any anonymous page faulted into a VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages()
> only flushes the local TLB. I think it's a reasonable compromise between
> security and performance.
>
> Zapping the PTE would bring the page back to the direct mapping after
> clearing it. At least for now, we don't remove file-backed pages from
> the direct mapping. File-backed pages could be accessed via read/write
> syscalls, which adds complexity.
>
> Occasionally, the host kernel has to access guest memory that was not
> made shared by the guest. For instance, it happens for instruction
> emulation. Normally, it's done via copy_to/from_user(), which would now
> fail with -EFAULT. We introduced a new pair of helpers:
> copy_to/from_guest(). The new helpers acquire the page via GUP, map it
> into the kernel address space with a kmap_atomic()-style mechanism and
> only then copy the data.

(A rough sketch of this flow is in the P.S. below.)

> For some instruction emulation, copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.7-rc6 plus this patch:
>
>   https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
>
> == Open Issues ==
>
> Unmapping the pages from the direct mapping brings a few issues that
> have not been rectified yet:
>
>  - Touching the direct mapping leads to fragmentation. We need to be
>    able to recover from it. I have a buggy patch that aims at recovering
>    2M/1G pages. It has to be fixed and tested properly.
>
>  - Page migration and KSM are not supported yet.
>
>  - Live migration of a guest would require a new flow. Not sure yet what
>    it would look like.
>
>  - The feature interferes with NUMA balancing. Not sure yet if it's
>    possible to make them work together.
>
>  - Guests have no mechanism to ensure that even a well-behaving host has
>    unmapped their private data. With SEV, for instance, the guest only
>    has to trust the hardware to encrypt a page after the C bit is set in
>    a guest PTE. A mechanism for a guest to query the host mapping state,
>    or to constantly assert the intent for a page to be private, would be
>    valuable.

-- 
 Kirill A. Shutemov
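
P.S. For reference, a rough sketch of the copy_from_guest() flow
described in the overview. This is not the actual helper from the
series: error handling is trimmed and it assumes the copy does not cross
a page boundary. FOLL_KVM is the new GUP flag from the series;
everything else is an existing kernel interface (as of v5.7):

	#include <linux/mm.h>		/* get_user_pages_unlocked() */
	#include <linux/highmem.h>	/* kmap_atomic() */
	#include <linux/string.h>	/* memcpy() */

	static int copy_from_guest_sketch(void *dst, unsigned long hva,
					  size_t len)
	{
		struct page *page;
		void *vaddr;

		/* FOLL_KVM lets GUP return the page despite the
		 * PROT_NONE PTEs in the VM_KVM_PROTECTED VMA. */
		if (get_user_pages_unlocked(hva, 1, &page, FOLL_KVM) != 1)
			return -EFAULT;

		/* Map the page into the kernel, copy, unmap. */
		vaddr = kmap_atomic(page);
		memcpy(dst, vaddr + offset_in_page(hva), len);
		kunmap_atomic(vaddr);

		put_page(page);
		return 0;
	}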