Date: Mon, 19 Apr 2021 16:01:46 +0000
From: Sean Christopherson
To: "Kirill A. Shutemov"
Cc: "Kirill A. Shutemov", Dave Hansen, Andy Lutomirski, Peter Zijlstra,
    Jim Mattson, David Rientjes, "Edgecombe, Rick P", "Kleen, Andi",
    "Yamahata, Isaku", Erdem Aktas, Steve Rutherford, Peter Gonda,
    David Hildenbrand, x86@kernel.org, kvm@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
References: <20210416154106.23721-1-kirill.shutemov@linux.intel.com>
    <20210416154106.23721-14-kirill.shutemov@linux.intel.com>
    <20210419142602.khjbzktk5tk5l6lk@box.shutemov.name>
In-Reply-To: <20210419142602.khjbzktk5tk5l6lk@box.shutemov.name>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 19, 2021, Kirill A.
Shutemov wrote:
> On Fri, Apr 16, 2021 at 05:30:30PM +0000, Sean Christopherson wrote:
> > I like the idea of using "special" PTE value to denote guest private memory,
> > e.g. in this RFC, HWPOISON. But I strongly dislike having KVM involved in the
> > manipulation of the special flag/value.
> >
> > Today, userspace owns the gfn->hva translations and the kernel effectively owns
> > the hva->pfn translations (with input from userspace). KVM just connects the
> > dots.
> >
> > Having KVM own the shared/private transitions means KVM is now part owner of the
> > entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> > and a co-owner of the primary MMU. This creates locking madness, e.g. KVM taking
> > mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
> > from userspace. E.g. userspace strategy could be to use a separate backing/pool
> > for shared memory and change the gfn->hva translation (memslots) in reaction to
> > a shared/private conversion. Automatically swizzling things in KVM takes away
> > that option.
> >
> > IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
> > protects memory, userspace calls into the kernel to change state, and the kernel
> > manages the page tables to prevent bad actors. KVM simply does the plumbing for
> > the guest page tables.
>
> That's a new perspective for me. Very interesting.
>
> Let's see how it can look like:
>
>  - KVM only allows poisoned pages (or whatever flag we end up using for
>    protection) in the private mappings. SIGBUS otherwise.
>
>  - Poisoned pages must be tied to the KVM instance to be allowed in the
>    private mappings. Like kvm->id in the current prototype. SIGBUS
>    otherwise.
>
>  - Pages get poisoned on fault in if the VMA has a new vmflag set.
>
>  - Fault in of a poisoned page leads to hwpoison entry. Userspace cannot
>    access such pages.
>
>  - Poisoned pages produced this way get unpoisoned on free.
>
>  - The new VMA flag set by userspace. mprotect(2)?

Ya, or mmap(), though I'm not entirely sure a VMA flag would suffice. The
notion of the page being private is tied to the PFN, which would suggest "struct
page" needs to be involved.

But fundamentally the private pages are, well, private. They can't be shared
across processes, so I think we could (should?) require the VMA to always be
MAP_PRIVATE. Does that buy us enough to rely on the VMA alone? I.e. is that
enough to prevent userspace and unaware kernel code from acquiring a reference
to the underlying page?

>  - Add a new GUP flag to retrieve such pages from the userspace mapping.
>    Used only for private mapping population.
>
>  - Shared gfn ranges managed by userspace, based on hypercalls from the
>    guest.
>
>  - Shared mappings get populated via normal VMA. Any poisoned pages here
>    would lead to SIGBUS.
>
> So far it looks pretty straightforward.
>
> The only thing that I don't understand is at what point the page gets tied
> to the KVM instance. Currently we do it just before populating shadow
> entries, but it would not work with the new scheme: as we poison pages
> on fault in, they may never get inserted into shadow entries. That's not
> good as we rely on the info to unpoison the page on free.

Can you elaborate on what you mean by "unpoison"? If the page is never actually
mapped into the guest, then its poisoned status is nothing more than a software
flag, i.e. nothing extra needs to be done on free. If the page is mapped into
the guest, then KVM can be made responsible for reinitializing the page with
keyid=0 when the page is removed from the guest.

The TDX Module prevents mapping the same PFN into multiple guests, so the kernel
doesn't actually have to care _which_ KVM instance(s) is associated with a page,
it only needs to prevent installing valid PTEs in the host page tables.

> Maybe we should tie VMA to the KVM instance on setting the vmflags?
> I dunno.
>
> Any comments?
>
> --
>  Kirill A. Shutemov
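
For illustration only, the rules in the quoted list can be written down as a
toy userspace model. This is a sketch under loud assumptions, not kernel code:
every type and function name below (toy_vma, toy_page, toy_fault_in, and so on)
is hypothetical, the "poisoned" bit merely stands in for whatever special PTE
value ends up being used, and the tie between a page and a particular KVM
instance is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, none is a kernel API. */

enum toy_result { TOY_OK, TOY_SIGBUS };

struct toy_vma {
	bool guest_private;	/* stand-in for the proposed new VMA flag */
};

struct toy_page {
	bool present;		/* page has been faulted in */
	bool poisoned;		/* stand-in for the "special" PTE value */
};

/* Fault-in: the page is poisoned only if the VMA carries the new flag. */
static void toy_fault_in(const struct toy_vma *vma, struct toy_page *page)
{
	assert(vma && page);
	if (page->present)
		return;
	page->present = true;
	page->poisoned = vma->guest_private;
}

/* Host userspace access: poisoned pages cannot be touched. */
static enum toy_result toy_user_access(const struct toy_vma *vma,
				       struct toy_page *page)
{
	toy_fault_in(vma, page);
	return page->poisoned ? TOY_SIGBUS : TOY_OK;
}

/* Private guest mapping (the proposed GUP flag path): poisoned pages only. */
static enum toy_result toy_map_private(const struct toy_vma *vma,
				       struct toy_page *page)
{
	toy_fault_in(vma, page);
	return page->poisoned ? TOY_OK : TOY_SIGBUS;
}

/* Shared guest mapping via the normal VMA path: no poisoned pages. */
static enum toy_result toy_map_shared(const struct toy_vma *vma,
				      struct toy_page *page)
{
	toy_fault_in(vma, page);
	return page->poisoned ? TOY_SIGBUS : TOY_OK;
}

/* Free: a page poisoned this way is unpoisoned again. */
static void toy_free(struct toy_page *page)
{
	assert(page);
	page->present = false;
	page->poisoned = false;
}
```

In this model a page faulted in through a flagged VMA can back a private guest
mapping but yields TOY_SIGBUS for host userspace access and for shared
mappings, while a page from an unflagged VMA behaves the opposite way. Nothing
in the model needs to know which KVM instance a page belongs to.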