References: <20210713142023.106183-1-mlevitsk@redhat.com> <20210713142023.106183-9-mlevitsk@redhat.com> <64ed28249c1895a59c9f2e2aa2e4c09a381f69e5.camel@redhat.com> <714b56eb83e94aca19e35a8c258e6f28edc0a60d.camel@redhat.com>
In-Reply-To: <714b56eb83e94aca19e35a8c258e6f28edc0a60d.camel@redhat.com>
From: Ben Gardon
Date: Tue, 27 Jul 2021 10:48:26 -0700
Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
To: Maxim Levitsky
Cc: Sean Christopherson, kvm, "open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)", Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov, Wanpeng Li, Paolo Bonzini, Thomas Gleixner, "H.
Peter Anvin", Ingo Molnar, "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jul 27, 2021 at 6:06 AM Maxim Levitsky wrote:
>
> On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> > +Ben
> >
> > On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > > > I am more inclined to fix this by just tracking if we hold the srcu
> > > > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > > > and then kvm_request_apicv_update can use this to drop the srcu
> > > > > lock when needed.
> > > >
> > > > The entire approach of dynamically adding/removing the memslot seems doomed to
> > > > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > > > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > > > and again on re-enable.
> > > >
> > > > Rather than pile on more gunk, what about special casing the APIC access page
> > > > memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> > > > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > > > fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> > > > also give us a good excuse to rename try_async_pf() :-)
> > > >
> > > > If lack of MMIO caching is a performance problem, an alternative solution would
> > > > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > > > clear their cache.
> > > >
> > > > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > > > less awful than the current memslot+SRCU mess.
> > >
> > > Hi!
> > >
> > > I am testing your approach and it actually works very well! I can't seem to break it.
> > >
> > > Could you explain why I need to do something with kvm_{inc,dec}_notifier_count()?
> >
> > Glad you asked, there's one more change needed. kvm_zap_gfn_range() currently
> > takes mmu_lock for read, but it needs to take mmu_lock for write for this case
> > (more way below).
> >
> > The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy. The
> > whole thing is a grey area because KVM is trying to ensure it honors the guest's
> > UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
> > i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
> > same transition. So in practice there's likely no observable bug, but it also
> > means that taking mmu_lock for read is likely pointless, because for things to
> > work the guest has to serialize all running vCPUs.
> >
> > Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()? It would
> > effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
> > operate under the mmu read lock"); see attached patch. And we could even bump
> > the notifier count in that helper, e.g. on top of the attached:
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b607e8763aa2..7174058e982b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >
> >  	write_lock(&kvm->mmu_lock);
> >
> > +	kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
> > +
> >  	if (kvm_memslots_have_rmaps(kvm)) {
> >  		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> >  			slots = __kvm_memslots(kvm, i);
> > @@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >  	if (flush)
> >  		kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
> >
> > +	kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
> > +
> >  	write_unlock(&kvm->mmu_lock);
> >  }
> >
>
> I understand what you mean now.
> I thought that I needed to change the code of
> kvm_inc_notifier_count/kvm_dec_notifier_count.
>
> >
> > Back to Maxim's original question...
> >
> > Elevating mmu_notifier_count and bumping mmu_notifier_seq will handle the case
> > where APICv is being disabled while a different vCPU is concurrently faulting in a
> > new mapping for the APIC page. E.g. it handles this race:
> >
> >  vCPU0                                 vCPU1
> >   apic_access_memslot_enabled = true;
> >                                         #NPF on APIC
> >                                         apic_access_memslot_enabled==true, proceed with #NPF
> >   apic_access_memslot_enabled = false
> >   kvm_zap_gfn_range(APIC);
> >                                         __direct_map(APIC)
> >
> >                                         mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate
>
> I understand this now. I guess this can't happen with original memslot disable
> which I guess has the needed locking and flushing to avoid this.
> (I didn't study the code in depth though)
>
> >
> > The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> > to bail and resume the guest without fixing the #NPF. After acquiring mmu_lock,
> > vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> > to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> > count was decremented).
> >
> > This is why kvm_zap_gfn_range() needs to take mmu_lock for write. If it's allowed
> > to run in parallel with the page fault handler, there's no guarantee that the
> > correct apic_access_memslot_enabled will be observed.
>
> I understand now.
>
> So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
> Do you agree to revert the usage of the read lock?
>
> I will post a new series using this approach very soon, since I already have
> most of the code done.
>
> Best regards,
> Maxim Levitsky

From reading through this thread, it seems like switching from read
lock to write lock is only necessary for a small range of GFNs
(i.e. the APIC access page). Is that correct?
My initial reaction was that switching kvm_zap_gfn_range back to the
write lock would be terrible for performance, but given that it has only
two callers, I think it would actually be fine.
If you do that though, you should pass shared=false to
kvm_tdp_mmu_zap_gfn_range in that function, so that it knows it's
operating with exclusive access to the MMU lock.

>
> >
> >	if (is_tdp_mmu_fault)
> >		read_lock(&vcpu->kvm->mmu_lock);
> >	else
> >		write_lock(&vcpu->kvm->mmu_lock);
> >
> >	if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
> >		goto out_unlock;
> >
> >	if (is_tdp_mmu_fault)
> >		r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
> >				    pfn, prefault);
> >	else
> >		r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
> >				 prefault, is_tdp);
> >
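
For reference, the count/seq retry protocol discussed in this thread can be sketched in plain userspace C. This is only a toy model under stated assumptions, not kernel code: every toy_* name is hypothetical, locking and memory barriers are left out, and the race Sean describes is replayed sequentially (vCPU1 starts its fault, vCPU0 disables and zaps, vCPU1 tries to finish). It only illustrates why a fault that snapshotted the sequence number before the zap has to bail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few struct kvm fields that matter here. */
struct toy_kvm {
	unsigned long notifier_count;	/* non-zero while a zap is in flight */
	unsigned long notifier_seq;	/* bumped whenever a zap completes */
	bool apic_mapped;		/* toy "SPTE" for the APIC access page */
	bool apic_memslot_enabled;
};

/* State a faulting vCPU captures before it would take the MMU lock. */
struct toy_fault {
	unsigned long seq_snapshot;
	bool saw_enabled;
};

static void toy_inc_notifier_count(struct toy_kvm *kvm)
{
	kvm->notifier_count++;
}

static void toy_dec_notifier_count(struct toy_kvm *kvm)
{
	assert(kvm->notifier_count > 0);
	kvm->notifier_seq++;	/* make the completed zap visible to faulters */
	kvm->notifier_count--;
}

/* True if a fault that snapshotted seq_snapshot must be retried. */
static bool toy_notifier_retry(const struct toy_kvm *kvm,
			       unsigned long seq_snapshot)
{
	return kvm->notifier_count != 0 || kvm->notifier_seq != seq_snapshot;
}

/* "vCPU0": disable the page and zap it (write lock assumed to be held). */
static void toy_disable_and_zap(struct toy_kvm *kvm)
{
	kvm->apic_memslot_enabled = false;
	toy_inc_notifier_count(kvm);
	kvm->apic_mapped = false;
	toy_dec_notifier_count(kvm);
}

/* "vCPU1", phase 1: observe state before taking the lock. */
static struct toy_fault toy_fault_begin(const struct toy_kvm *kvm)
{
	struct toy_fault f = { kvm->notifier_seq, kvm->apic_memslot_enabled };
	return f;
}

/* "vCPU1", phase 2: install the mapping unless a zap raced with us. */
static bool toy_fault_finish(struct toy_kvm *kvm, const struct toy_fault *f)
{
	if (!f->saw_enabled)
		return false;	/* would go to emulation instead */
	if (toy_notifier_retry(kvm, f->seq_snapshot))
		return false;	/* bail; the guest simply re-faults */
	kvm->apic_mapped = true;
	return true;
}

/* Replays the race; returns true if a stale mapping got installed. */
static bool toy_replay_race(void)
{
	struct toy_kvm kvm = { .apic_memslot_enabled = true };
	struct toy_fault f = toy_fault_begin(&kvm);

	toy_disable_and_zap(&kvm);
	return toy_fault_finish(&kvm, &f);
}
```

In this model toy_replay_race() reports that no stale mapping is installed, because the sequence number changed between the snapshot and the check. The model says nothing about read vs. write locking; it assumes the zap and the check are already serialized, which is the property the write lock is meant to provide.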