Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp3154238rwb; Tue, 8 Nov 2022 00:55:57 -0800 (PST) X-Google-Smtp-Source: AMsMyM6bfpTAeXVUohsV/NSqQ4lNESWyqAPW9PGMFDOI0BkuqqAFAefXrmjgdHIPvO74BDiBxYet X-Received: by 2002:a05:6402:a47:b0:462:a70e:31a6 with SMTP id bt7-20020a0564020a4700b00462a70e31a6mr54005921edb.233.1667897757058; Tue, 08 Nov 2022 00:55:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1667897757; cv=none; d=google.com; s=arc-20160816; b=Hq9ktkvbZ9Pa0+/kOy/aUuU2uee3xvVhpWPRjODr9a196lYOdjgM3PQVuolp2lLs94 QXAAY5chXvf6bzVimKoOGQsQZRT5kdQC4zfzxsmGEN5X57MclloD38mxq8iIiSYmCbC4 MKy3De3nhbK3MyjJsJrGy6c0W0nbYq7XWJV1nsi1e5fatDPJ8JExo73Dn5oci40Itmqc nV5ZQk/FcfVdX8a74heAjWHEFSqbUihzjoi8tXu2H97xa0MptmzCp7kAziqZxtDJmM1h lkFZ18Jqe/G2f+u2pnnN9frpCeCVP7i1kCM7twgoS+c7fX1VB+WL2+C0EUPOb8razpAb Ei6Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:reply-to:message-id:subject:cc:to:from:date :dkim-signature; bh=55h2JgkWltKfzXIereN3kkaFUM+D7TNNjzuE7HgUgRs=; b=gxW4Yevz99L3Sd5HkfPEF9lqaOuAo/F2mt2bK7m/rx6qpHM+WcNfvhUvOsvfbt35QE DvySZhna4igMlFVZPt8t4MI6k9h+ij9c8OO01FgoZhS8mELA3mORyHiQUlwizI2/K+TF GO58B2X7rM2bMXw++BvWinyXnF6ccy/M6Er1aB85Tv39jG4b5ihBkn40sQsVOXJvgrIA qcSs/+K5KsrGQ6CHCufGNLxr2Y53BLfjtA4q7ufmqEd2ZStCpZ/OcxMRtBvl89KVhs5K VzcV6aiuoJHoU5aURHlBme/+o0Yi9QpvlYbOGr9rzkil1xdCHWDWdOiCVIaPjyo8eKdb XbAQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=aWBNPcrb; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id q11-20020a056402518b00b00460faf7d2a3si13423141edd.277.2022.11.08.00.55.35; Tue, 08 Nov 2022 00:55:57 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=aWBNPcrb; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233313AbiKHI2z (ORCPT + 90 others); Tue, 8 Nov 2022 03:28:55 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42310 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230154AbiKHI2u (ORCPT ); Tue, 8 Nov 2022 03:28:50 -0500 Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CEA6C2791F; Tue, 8 Nov 2022 00:28:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1667896128; x=1699432128; h=date:from:to:cc:subject:message-id:reply-to:references: mime-version:in-reply-to; bh=WIPURCPIsIJH60CXb2Fq3UuwkzckzrtWTTFyYgbY51I=; b=aWBNPcrbX+BLGev3E1xOsWythLVkTi6/AiT24RMYu8IG8Ur8z/UO7EaV EWOrVN20KpmneXXjp39Ignaj5egFn0y4MU/qIWOPIf57bZN84KnxRwJAo pFxVrBL1z8UZOUPa+2XAuTRqMGTxuE+b1C3iqS/uy8p4iNsMyOCK+JtwI KY/NsAbyDwGGp4gVYG011+cdUEM7lR8BKG4YaVhoO7wysI82WHOTjxyRB PwNkzy/0msZ3O3WFG9BsBs7p1IvLEV4ip8KxqW1QffyciAVTWdvGHVzSN bgs1V4Cp6AYqQTVTqn2hzp6cp/dP2DoJIJNiIW9r2cO9VxMdzwM6xJSqh g==; X-IronPort-AV: E=McAfee;i="6500,9779,10524"; a="291027452" X-IronPort-AV: E=Sophos;i="5.96,147,1665471600"; d="scan'208";a="291027452" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Nov 2022 00:28:48 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10524"; a="630798909" X-IronPort-AV: E=Sophos;i="5.96,147,1665471600"; d="scan'208";a="630798909" Received: from chaop.bj.intel.com (HELO localhost) ([10.240.193.75]) by orsmga007.jf.intel.com with ESMTP; 08 Nov 2022 00:28:36 -0800 Date: Tue, 8 Nov 2022 16:24:10 +0800 From: Chao Peng To: Sean Christopherson Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini , Jonathan Corbet , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , tabba@google.com, Michael Roth , mhocko@suse.com, Muchun Song , wei.w.wang@intel.com Subject: Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Message-ID: <20221108082410.GA84375@chaop.bj.intel.com> Reply-To: Chao Peng References: <20221025151344.3784230-1-chao.p.peng@linux.intel.com> <20221025151344.3784230-6-chao.p.peng@linux.intel.com> <20221104082843.GA4142342@chaop.bj.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-4.3 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 04, 2022 at 09:19:31PM +0000, Sean Christopherson wrote: > Paolo, any thoughts before I lead things further astray? > > On Fri, Nov 04, 2022, Chao Peng wrote: > > On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote: > > > On Tue, Oct 25, 2022, Chao Peng wrote: > > > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp, > > > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem); > > > > break; > > > > } > > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM > > > > + case KVM_MEMORY_ENCRYPT_REG_REGION: > > > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { > > > > > > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION. Aside > > > from the fact that restricted/protected memory may not be encrypted, there are > > > other potential use cases for per-page memory attributes[*], e.g. to make memory > > > read-only (or no-exec, or exec-only, etc...) without having to modify memslots. > > > > > > Any paravirt use case where the attributes of a page are effectively dictated by > > > the guest is going to run into the exact same performance problems with memslots, > > > which isn't suprising in hindsight since shared vs. private is really just an > > > attribute, albeit with extra special semantics. > > > > > > And if we go with a brand new ioctl(), maybe someday in the very distant future > > > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION. > > > > > > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big > > > of a wrench into things. > > > > > > Something like: > > > > > > KVM_SET_MEMORY_ATTRIBUTES > > > > > > struct kvm_memory_attributes { > > > __u64 address; > > > __u64 size; > > > __u64 flags; > > Oh, this is half-baked. I lost track of which flags were which. What I intended > was a separate, initially-unused flags, e.g. That makes sense. > > struct kvm_memory_attributes { > __u64 address; > __u64 size; > __u64 attributes; > __u64 flags; > } > > so that KVM can tweak behavior and/or extend the effective size of the struct. > > > I like the idea of adding a new ioctl(). But putting all attributes into > > a flags in uAPI sounds not good to me, e.g. forcing userspace to set all > > attributes in one call can cause pain for userspace, probably for KVM > > implementation as well. For private<->shared memory conversion, we > > actually only care the KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit, > > Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => RO+SHARED > or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the memory > while it's accessible from the host. > > And if this does extend beyond shared/private, dropping from RWX=>R, i.e. dropping > WX permissions, would also be a common operation. > > Hmm, typing that out makes me think that if we do end up supporting other "attributes", > i.e. protections, we should go straight to full RWX protections instead of doing > things piecemeal, i.e. add individual protections instead of combinations like > NO_EXEC and READ_ONLY. The protections would have to be inverted for backwards > compatibility, but that's easy enough to handle. The semantics could be like > protection keys, which also have inverted persmissions, where the final protections > are the combination of memslot+attributes, i.e. a read-only memslot couldn't be made > writable via attributes. > > E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block access > to memory without needing to delete the memslot. KVM would need to disallow > unsupported combinations, e.g. disallowed effective protections would be: > > - W or WX [unless there's an arch that supports write-only memory] > - R or RW [until KVM plumbs through support for no-exec, or it's unsupported in hardware] > - X [until KVM plumbs through support for exec-only, or it's unsupported in hardware] > > Anyways, that's all future work... > > > but we force userspace to set other irrelevant bits as well if use this > > API. > > They aren't irrelevant though, as the memory attributes are all describing the > allowed protections for a given page. The 'allowed' protections seems answer my concern. But after we enabled "NO_READ | NO_WRITE | NO_EXEC", are we going to check "NO_READ | NO_WRITE | NO_EXEC" are also set together with the PRIVATE bit? I just can't imagine what the semantic would be if we have the PRIVATE bit set but other bits indicate it's actually can READ/WRITE/EXEC from usrspace. > If there's a use case where userspace "can't" > keep track of the attributes for whatever reason, then userspace could do a RMW > to set/clear attributes. Alternatively, the ioctl() could take an "operation" and > support WRITE/OR/AND to allow setting/clearing individual flags, e.g. tweak the > above to be: A getter would be good, it might also be needed for live migration. > > struct kvm_memory_attributes { > __u64 address; > __u64 size; > __u64 attributes; > __u32 operation; > __u32 flags; > } > > > I looked at kvm_device_attr, sounds we can do similar: > > The device attributes deal with isolated, arbitrary values, whereas memory attributes > are flags, i.e. devices are 1:1 whereas memory is 1:MANY. There is no "unset" for > device attributes, because they aren't flags. Device attributes vs. memory attributes > really are two very different things that just happen to use a common name. > > If it helped clarify things without creating naming problems, we could even use > PROTECTIONS instead of ATTRIBUTES. > > > KVM_SET_MEMORY_ATTR > > > > struct kvm_memory_attr { > > __u64 address; > > __u64 size; > > #define KVM_MEM_ATTR_SHARED BIT(0) > > #define KVM_MEM_ATTR_READONLY BIT(1) > > #define KVM_MEM_ATTR_NOEXEC BIT(2) > > __u32 attr; > > As above, letting userspace set only a single attribute would prevent setting > (or clearing) multiple attributes in a single ioctl(). > > > __u32 pad; > > } > > > > I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well, > > Definitely would need to communicate to userspace that various attributes are > supported. That doesn't necessarily require a common ioctl(), but I don't see > any reason not to add a common helper, and adding a common helper would mean > KVM_CAP_PRIVATE_MEM can go away. But it should return a bitmask so that userspace > can do a single query to get all supported attributes, e.g. KVM_SUPPORTED_MEMORY_ATTRIBUTES. Do you have preference on using a new ioctl or just keep it as a cap? E.g. KVM_CAP_MEMORY_ATTIBUTES can also returns a mask. > > As for KVM_GET_MEMORY_ATTRIBUTES, we wouldn't necessarily have to provide such an > API, e.g. we could hold off until someone came along with a RMW use case (as above). > That said, debug would likely be a nightmare without KVM_GET_MEMORY_ATTRIBUTES, > so it's probably best to add it straightway. Dive into the implementation a bit, for KVM_GET_MEMORY_ATTRIBUTES we can have different attributes for different pages in the same user-provided range, in that case we will have to either return a list or just a error number. Or we only support per-page attributes for the getter? Chao > > > but sounds like we need a KVM_UNSET_MEMORY_ATTR. > > No need if the setter operates on all attributes. > > > Since we are exposing the attribute directly to userspace I also think > > we'd better treat shared memory as the default, so even when the private > > memory is not used, the bit can still be meaningful. So define BIT(0) as > > KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED. > > Ah, right.