Date: Tue, 17 Jan 2023 21:12:51 +0800
From: Chao Peng
To: Sean Christopherson
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org,
    linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org,
    Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
    Naoya Horiguchi, Miaohe Lin, x86@kernel.org, "H. Peter Anvin", Hugh Dickins,
    Jeff Layton, "J. Bruce Fields", Andrew Morton, Shuah Khan, Mike Rapoport,
    Steven Price, "Maciej S. Szmigiero", Vlastimil Babka, Vishal Annapurve,
    Yu Zhang, "Kirill A. Shutemov", luto@kernel.org, jun.nakajima@intel.com,
    dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com,
    aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
    Quentin Perret, tabba@google.com, Michael Roth, mhocko@suse.com,
    wei.w.wang@intel.com
Subject: Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
Message-ID: <20230117131251.GC273037@chaop.bj.intel.com>
References: <20221202061347.1070246-1-chao.p.peng@linux.intel.com>
 <20221202061347.1070246-10-chao.p.peng@linux.intel.com>
List: linux-kernel@vger.kernel.org

On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >
> >  		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> >  			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > +
> > +		if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > +			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>
> Synthesizing triple fault shutdown is not the right approach.  Even with TDX's
> MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> guest have a paravirt interface for handling memory errors without killing the
> host.

Agree, shutdown is not the correct choice. I see you made the change below:

  send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)

The MCE may happen in any thread other than the KVM thread, so sending the
signal to 'current' may not be the expected behavior.
Also, how can userspace tell whether the MCE is on a shared page or a private
page? Do we care?

>
> > +			r = 0;
> > +			goto out;
> > +		}
> >  	}
> >
> > @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	    !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> >  			mem->memory_size))
> >  		return -EINVAL;
> > +	if (mem->flags & KVM_MEM_PRIVATE &&
> > +		(mem->restricted_offset & (PAGE_SIZE - 1) ||
>
> Align indentation.
>
> > +		 mem->restricted_offset > U64_MAX - mem->memory_size))
>
> Strongly prefer to use similar logic to existing code that detects wraps:
>
> 	mem->restricted_offset + mem->memory_size < mem->restricted_offset
>
> This is also where I'd like to add the "gfn is aligned to offset" check, though
> my brain is too fried to figure that out right now.
>
> > +		return -EINVAL;
> >  	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> >  		return -EINVAL;
> >  	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> >  			return -EINVAL;
> >  	} else { /* Modify an existing slot. */
> > +		/* Private memslots are immutable, they can only be deleted. */
>
> I'm 99% certain I suggested this, but if we're going to make these memslots
> immutable, then we should straight up disallow dirty logging, otherwise we'll
> end up with a bizarre uAPI.

But in my mind dirty logging will be needed in the near term, once live
migration gets supported?
> > +		if (mem->flags & KVM_MEM_PRIVATE)
> > +			return -EINVAL;
> >  		if ((mem->userspace_addr != old->userspace_addr) ||
> >  		    (npages != old->npages) ||
> >  		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  	new->npages = npages;
> >  	new->flags = mem->flags;
> >  	new->userspace_addr = mem->userspace_addr;
> > +	if (mem->flags & KVM_MEM_PRIVATE) {
> > +		new->restricted_file = fget(mem->restricted_fd);
> > +		if (!new->restricted_file ||
> > +		    !file_is_restrictedmem(new->restricted_file)) {
> > +			r = -EINVAL;
> > +			goto out;
> > +		}
> > +		new->restricted_offset = mem->restricted_offset;

I see you changed the type of slot->restricted_offset from loff_t to gfn_t and
used pgoff_t when doing the restrictedmem_bind/unbind(). Using a page index is
reasonable for KVM internally and sounds simpler than loff_t. But then we also
need to initialize it to a page index here, as well as make the changes in the
other two places. This is needed when restricted_offset != 0:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 547b92215002..49e375e78f30 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
 					     gfn_t gfn, kvm_pfn_t *pfn,
 					     int *order)
 {
-	pgoff_t index = gfn - slot->base_gfn +
-			(slot->restricted_offset >> PAGE_SHIFT);
+	pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset;
 	struct page *page;
 	int ret;

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 01db35ddd5b3..7439bdcb0d04 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -935,7 +935,7 @@ static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
 					 pgoff_t start, pgoff_t end,
 					 gfn_t *gfn_start, gfn_t *gfn_end)
 {
-	unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
+	unsigned long base_pgoff = slot->restricted_offset;

 	if (start > base_pgoff)
 		*gfn_start = slot->base_gfn + start - base_pgoff;
@@ -2275,7 +2275,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			r = -EINVAL;
 			goto out;
 		}
-		new->restricted_offset = mem->restricted_offset;
+		new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT;
 	}

 	r = kvm_set_memslot(kvm, old, new, change);

Chao

> > +	}
> > +
> > +	new->kvm = kvm;
>
> Set this above, just so that the code flows better.