Message-ID: <714b56eb83e94aca19e35a8c258e6f28edc0a60d.camel@redhat.com>
Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
From: Maxim Levitsky
To: Sean Christopherson
Cc: kvm@vger.kernel.org, "open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)", Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov, Wanpeng Li, Paolo Bonzini, Thomas Gleixner, "H. Peter Anvin", Ingo Molnar, "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)", Ben Gardon
Date: Tue, 27 Jul 2021 16:05:48 +0300
References: <20210713142023.106183-1-mlevitsk@redhat.com> <20210713142023.106183-9-mlevitsk@redhat.com> <64ed28249c1895a59c9f2e2aa2e4c09a381f69e5.camel@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> +Ben
>
> On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > > I am more inclined to fix this by just tracking if we hold the srcu
> > > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > > and then kvm_request_apicv_update can use this to drop the srcu
> > > > lock when needed.
> > >
> > > The entire approach of dynamically adding/removing the memslot seems doomed to
> > > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > > and again on re-enable.
> > >
> > > Rather than pile on more gunk, what about special casing the APIC access page
> > > memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> > > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > > fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> > > also give us a good excuse to rename try_async_pf() :-)
> > >
> > > If lack of MMIO caching is a performance problem, an alternative solution would
> > > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > > clear their cache.
> > >
> > > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > > less awful than the current memslot+SRCU mess.
> >
> > Hi!
> >
> > I am testing your approach and it actually works very well! I can't seem to break it.
> >
> > Could you explain why I need to do something with kvm_{inc,dec}_notifier_count()?
>
> Glad you asked, there's one more change needed. kvm_zap_gfn_range() currently
> takes mmu_lock for read, but it needs to take mmu_lock for write for this case
> (more way below).
>
> The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy. The
> whole thing is a grey area because KVM is trying to ensure it honors the guest's
> UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
> i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
> same transition. So in practice there's likely no observable bug, but it also
> means that taking mmu_lock for read is likely pointless, because for things to
> work the guest has to serialize all running vCPUs.
>
> Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()? It would
> effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
> operate under the mmu read lock"); see attached patch. And we could even bump
> the notifier count in that helper, e.g.
> on top of the attached:
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b607e8763aa2..7174058e982b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
>         write_lock(&kvm->mmu_lock);
>
> +       kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
> +
>         if (kvm_memslots_have_rmaps(kvm)) {
>                 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>                         slots = __kvm_memslots(kvm, i);
> @@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>         if (flush)
>                 kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
>
> +       kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
> +
>         write_unlock(&kvm->mmu_lock);
>  }
>

I understand what you mean now. I thought that I needed to change the code of
kvm_inc_notifier_count()/kvm_dec_notifier_count().

>
> Back to Maxim's original question...
>
> Elevating mmu_notifier_count and bumping mmu_notifier_seq will handle the case
> where APICv is being disabled while a different vCPU is concurrently faulting in a
> new mapping for the APIC page. E.g. it handles this race:
>
>     vCPU0                                    vCPU1
>     apic_access_memslot_enabled = true;
>                                              #NPF on APIC
>                                              apic_access_memslot_enabled==true, proceed with #NPF
>     apic_access_memslot_enabled = false
>     kvm_zap_gfn_range(APIC);
>                                              __direct_map(APIC)
>
>     mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate

I understand this now. I guess this can't happen with the original memslot disable,
which I guess has the needed locking and flushing to avoid this.
(I didn't study the code in depth, though.)

>
> The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> to bail and resume the guest without fixing the #NPF. After acquiring mmu_lock,
> vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> count was decremented).
>
> This is why kvm_zap_gfn_range() needs to take mmu_lock for write. If it's allowed
> to run in parallel with the page fault handler, there's no guarantee that the
> correct apic_access_memslot_enabled will be observed.

I understand now.

So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
Do you agree to revert the usage of the read lock?

I will post a new series using this approach very soon, since I already have
most of the code done.

Best regards,
        Maxim Levitsky

>
>         if (is_tdp_mmu_fault)
>                 read_lock(&vcpu->kvm->mmu_lock);
>         else
>                 write_lock(&vcpu->kvm->mmu_lock);
>
>         if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
>                 goto out_unlock;
>
>         if (is_tdp_mmu_fault)
>                 r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
>                                     pfn, prefault);
>         else
>                 r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
>                                  prefault, is_tdp);
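
For readers following the thread, the retry check marked "look here!" can be illustrated with a small single-threaded model in userspace C. This is only a sketch under stated assumptions, not the kernel code: `struct model_kvm` and every `model_*` function are hypothetical stand-ins, the hva range check of mmu_notifier_retry_hva() is left out, and mmu_lock and the memory barriers are not modelled at all. It replays the interleaving from the race diagram once without and once with the notifier count/sequence update around the zap.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few fields of struct kvm discussed in the thread. */
struct model_kvm {
	unsigned long mmu_notifier_seq;   /* bumped when an invalidation completes */
	long mmu_notifier_count;          /* > 0 while an invalidation is in flight */
	bool apic_access_memslot_enabled; /* the flag from the race diagram */
	bool apic_page_mapped;            /* stand-in for "an SPTE for the APIC page exists" */
};

/* Model of kvm_inc_notifier_count(): mark an invalidation as in progress. */
static void model_inc_notifier_count(struct model_kvm *kvm)
{
	kvm->mmu_notifier_count++;
}

/* Model of kvm_dec_notifier_count(): bump the sequence, then drop the count. */
static void model_dec_notifier_count(struct model_kvm *kvm)
{
	assert(kvm->mmu_notifier_count > 0);
	kvm->mmu_notifier_seq++;
	kvm->mmu_notifier_count--;
}

/* Model of the retry check: bail if a zap is in flight or has completed since mmu_seq. */
static bool model_notifier_retry(const struct model_kvm *kvm, unsigned long mmu_seq)
{
	return kvm->mmu_notifier_count > 0 || kvm->mmu_notifier_seq != mmu_seq;
}

/*
 * Replay the interleaving from the diagram in one thread.
 * Returns true if the APIC page ends up mapped after APICv was disabled,
 * i.e. if the race is lost.
 */
static bool model_replay_race(bool use_notifier_protocol)
{
	struct model_kvm kvm = { .apic_access_memslot_enabled = true };

	/* vCPU1: #NPF on APIC; sample the sequence and the flag before the zap runs. */
	unsigned long mmu_seq = kvm.mmu_notifier_seq;
	bool saw_enabled = kvm.apic_access_memslot_enabled;

	/* vCPU0: disable the memslot and zap the GFN. */
	kvm.apic_access_memslot_enabled = false;
	if (use_notifier_protocol)
		model_inc_notifier_count(&kvm);
	kvm.apic_page_mapped = false; /* kvm_zap_gfn_range(APIC) */
	if (use_notifier_protocol)
		model_dec_notifier_count(&kvm);

	/* vCPU1: recheck before installing the mapping. */
	if (saw_enabled && !model_notifier_retry(&kvm, mmu_seq))
		kvm.apic_page_mapped = true; /* __direct_map(APIC) */

	return kvm.apic_page_mapped;
}
```

In this model, `model_replay_race(false)` returns true (the stale mapping is installed after the zap), while `model_replay_race(true)` returns false because the changed sequence number makes the fault path bail. The model runs the steps strictly in order, so it says nothing about why mmu_lock must be held for write; that part of the argument depends on real concurrency between the zap and the fault handler.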