Date: Wed, 5 Jan 2022 16:33:53 +0000
From: Sean Christopherson
To: David Matlack
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ben Gardon, Mingwei Zhang
Subject: Re: [PATCH v2 25/30] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
References: <20211223222318.1039223-1-seanjc@google.com>
 <20211223222318.1039223-26-seanjc@google.com>

On Wed, Jan 05, 2022, David Matlack wrote:
> On Thu, Dec 23, 2021 at 10:23:13PM +0000, Sean Christopherson wrote:
> > Zapping at 1gb in the first pass is not arbitrary. First and foremost,
> > KVM relies on being able to zap 1gb shadow pages in a single shot when
> > replacing a shadow page with a hugepage.
>
> When dirty logging is disabled, zap_collapsible_spte_range() does the
> bulk of the work zapping leaf SPTEs and allows yielding. I guess that
> could race with a vCPU faulting in the huge page though, and the vCPU
> could do the bulk of the work.
> Are there any other scenarios where KVM relies on zapping 1GB worth of
> 4KB SPTEs without yielding?

Yes. Zapping executable shadow pages that were forced to be small because
of the iTLB multihit mitigation. If the VM is using nested EPT and a
shadow page is unaccounted, decrementing disallow_lpage could allow a
hugepage, and a fault in the 1gb region that installs a 1gb hugepage
would then zap the shadow page. There are other scenarios, though they
are much more contrived, e.g. if the guest changes its MTRRs such that a
previously disallowed hugepage is now allowed.

> In any case, 100ms is a long time to hog the CPU. Why not just do the
> safe thing and zap each level? 4K, then 2M, then 1GB, ..., then root
> level. The only argument against it I can think of is performance (lots
> of redundant walks through the page table). But I don't think root
> zapping is especially latency critical.

I'm not opposed to that approach, assuming putting a root is done
asynchronously so that the high latency isn't problematic for vCPUs.
Though from a test coverage perspective, I do like zapping at the worst
case level (for the above flows). And regarding the latency, if it's
problematic we could track the number of present SPTEs and skip the walk
if there are none. The downside is that doing so would require an atomic
operation when installing SPTEs to play nice with parallel page faults.

> > @@ -846,6 +858,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> >  		}
> >  	}
> >
> > +	if (zap_level < root->role.level) {
> > +		zap_level = root->role.level;
> > +		goto start;
> > +	}
>
> This is probably just personal opinion but I find the 2 iteration goto
> loop harder to understand than just open-coding the 2 passes.

Yeah, but it's clever! I'll add another helper unless it turns out gross
for some reason. :-)