Date: Mon, 19 Apr 2021 17:26:02 +0300
From: "Kirill A. Shutemov"
To: Sean Christopherson
Cc: "Kirill A. Shutemov", Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Jim Mattson, David Rientjes, "Edgecombe, Rick P", "Kleen, Andi",
	"Yamahata, Isaku", Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86@kernel.org, kvm@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
Message-ID: <20210419142602.khjbzktk5tk5l6lk@box.shutemov.name>
References: <20210416154106.23721-1-kirill.shutemov@linux.intel.com>
 <20210416154106.23721-14-kirill.shutemov@linux.intel.com>

On Fri, Apr 16, 2021 at 05:30:30PM +0000, Sean Christopherson wrote:
> On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1b404e4d7dd8..f8183386abe7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >  		kvm_sched_yield(vcpu->kvm, a0);
> >  		ret = 0;
> >  		break;
> > +	case KVM_HC_ENABLE_MEM_PROTECTED:
> > +		ret = kvm_protect_memory(vcpu->kvm);
> > +		break;
> > +	case KVM_HC_MEM_SHARE:
> > +		ret = kvm_share_memory(vcpu->kvm, a0, a1);
>
> Can you take a look at a proposed hypercall interface for SEV live
> migration and holler if you (or anyone else) thinks it will have
> extensibility issues?
>
> https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.kalra@amd.com

Will look closer. Thanks.

> > @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> >  		flags |= FOLL_WRITE;
> >  	if (async)
> >  		flags |= FOLL_NOWAIT;
> > +	if (kvm->mem_protected)
> > +		flags |= FOLL_ALLOW_POISONED;
>
> This is unsafe, only the flows that are mapping the PFN into the guest
> should use ALLOW_POISONED, e.g. __kvm_map_gfn() should fail on a
> poisoned page.

That's true for TDX. I prototyped with pure KVM with minimal modification
to the guest. We had to be more permissive for that reason. It will go
away for TDX.
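If the flag stays around at all, restricting it the way you describe
could look roughly like this. A sketch only, not this series' code:
FOLL_ALLOW_POISONED and kvm->mem_protected are from the RFC, while
kvm_gup_flags() and the "private" argument are made up for illustration.

static unsigned int kvm_gup_flags(struct kvm *kvm, bool write_fault,
				  bool async, bool private)
{
	unsigned int flags = FOLL_HWPOISON;

	if (write_fault)
		flags |= FOLL_WRITE;
	if (async)
		flags |= FOLL_NOWAIT;

	/*
	 * Only callers that are about to install the pfn into the
	 * guest may see poisoned (i.e. protected) pages; everything
	 * else, including __kvm_map_gfn(), keeps failing on them.
	 */
	if (private && kvm->mem_protected)
		flags |= FOLL_ALLOW_POISONED;

	return flags;
}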
> > -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > -				 void *data, int offset, int len)
> > +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
> > +{
> > +	int offset = offset_in_page(hva);
> > +	struct page *page;
> > +	int npages, seg;
> > +	void *vaddr;
> > +
> > +	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> > +	    !kvm->mem_protected) {
> > +		return __copy_from_user(data, (void __user *)hva, len);
> > +	}
> > +
> > +	might_fault();
> > +	kasan_check_write(data, len);
> > +	check_object_size(data, len, false);
> > +
> > +	while ((seg = next_segment(len, offset)) != 0) {
> > +		npages = get_user_pages_unlocked(hva, 1, &page,
> > +						 FOLL_ALLOW_POISONED);
> > +		if (npages != 1)
> > +			return -EFAULT;
> > +
> > +		if (!kvm_page_allowed(kvm, page))
> > +			return -EFAULT;
> > +
> > +		vaddr = kmap_atomic(page);
> > +		memcpy(data, vaddr + offset, seg);
> > +		kunmap_atomic(vaddr);
>
> Why is KVM allowed to access a poisoned page?  I would expect shared
> pages to _not_ be poisoned.  Except for pure software emulation of SEV,
> KVM can't access guest private memory.

Again, it's not going to be in the TDX implementation.

> I like the idea of using a "special" PTE value to denote guest private
> memory, e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM
> involved in the manipulation of the special flag/value.
>
> Today, userspace owns the gfn->hva translations and the kernel
> effectively owns the hva->pfn translations (with input from userspace).
> KVM just connects the dots.
>
> Having KVM own the shared/private transitions means KVM is now part
> owner of the entire gfn->hva->pfn translation, i.e. KVM is effectively
> now a secondary MMU and a co-owner of the primary MMU.  This creates
> locking madness, e.g. KVM taking mmap_sem for write, mmu_lock under
> page lock, etc..., and also takes control away from userspace.  E.g.
> userspace strategy could be to use a separate backing/pool for shared
> memory and change the gfn->hva translation (memslots) in reaction to a
> shared/private conversion.  Automatically swizzling things in KVM takes
> away that option.
>
> IMO, KVM should be entirely "passive" in this process, e.g. the guest
> shares or protects memory, userspace calls into the kernel to change
> state, and the kernel manages the page tables to prevent bad actors.
> KVM simply does the plumbing for the guest page tables.

That's a new perspective for me. Very interesting.

Let's see what it could look like:

 - KVM only allows poisoned pages (or whatever flag we end up using for
   protection) in the private mappings. SIGBUS otherwise.

 - Poisoned pages must be tied to the KVM instance to be allowed in the
   private mappings. Like kvm->id in the current prototype. SIGBUS
   otherwise.

 - Pages get poisoned on fault in if the VMA has the new vmflag set.

 - Fault in of a poisoned page leads to a hwpoison entry. Userspace
   cannot access such pages.

 - Poisoned pages produced this way get unpoisoned on free.

 - The new VMA flag is set by userspace. mprotect(2)?

 - Add a new GUP flag to retrieve such pages from the userspace mapping.
   Used only for private mapping population.

 - Shared gfn ranges are managed by userspace, based on hypercalls from
   the guest.

 - Shared mappings get populated via a normal VMA. Any poisoned pages
   here would lead to SIGBUS.

So far it looks pretty straight-forward.
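For the userspace side I imagine something along these lines. Again a
sketch only: PROT_GUEST is a made-up placeholder for however the new
vmflag ends up being set (mprotect(2), madvise(2), whatever), and the
layout is arbitrary.

/*
 * VMM side of the scheme above. PROT_GUEST is hypothetical; no such
 * flag exists today.
 */
#include <stddef.h>
#include <sys/mman.h>

#define PROT_GUEST	0x40	/* made up, for illustration only */

static void *setup_guest_ram(size_t private_size, size_t shared_size,
			     void **shared)
{
	void *priv;

	/*
	 * Private guest memory: with the new vmflag set, pages get
	 * poisoned on fault in and direct userspace access results
	 * in SIGBUS.
	 */
	priv = mmap(NULL, private_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (priv == MAP_FAILED)
		return NULL;
	if (mprotect(priv, private_size, PROT_READ | PROT_WRITE | PROT_GUEST))
		return NULL;

	/*
	 * Shared memory is a separate, normal VMA; shared gfn ranges
	 * are pointed here by adjusting the memslots.
	 */
	*shared = mmap(NULL, shared_size, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (*shared == MAP_FAILED)
		return NULL;

	return priv;
}

This way KVM stays passive: on a share/unshare hypercall the VMM just
adjusts the memslots to point into the shared pool.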
The only thing that I don't understand is at what point the page gets
tied to the KVM instance. Currently we do it just before populating
shadow entries, but it would not work with the new scheme: as we poison
pages on fault in, they may never get inserted into shadow entries.
That's not good, as we rely on that info to unpoison the page on free.

Maybe we should tie the VMA to the KVM instance on setting the vmflag?
I dunno.

Any comments?

-- 
 Kirill A. Shutemov