Date: Wed, 2 Mar 2022 18:01:39 +0000
From: Sean Christopherson
To: Paolo Bonzini
Cc: Christian Borntraeger, Janosch Frank, Claudio Imbrenda, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, David Hildenbrand, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, David Matlack, Ben Gardon, Mingwei Zhang
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker
References: <20220226001546.360188-1-seanjc@google.com> <20220226001546.360188-23-seanjc@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 2/26/22 01:15, Sean Christopherson wrote:
> > Zap defunct roots, a.k.a. roots that have been invalidated after their
> > last reference was initially dropped, asynchronously via the system work
> > queue instead of forcing the work upon the unfortunate task that happened
> > to drop the last reference.
> >
> > If a vCPU task drops the last reference, the vCPU is effectively blocked
> > by the host for the entire duration of the zap.  If the root being zapped
> > happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> > being active, the zap can take several hundred seconds.  Unsurprisingly,
> > most guests are unhappy if a vCPU disappears for hundreds of seconds.
> >
> > E.g. running a synthetic selftest that triggers a vCPU root zap with
> > ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> > Offloading the zap to a worker drops the block time to <100ms.
> >
> > Signed-off-by: Sean Christopherson
> > ---
>
> Do we even need kvm_tdp_mmu_zap_invalidated_roots() now?  That is,
> something like the following:

Nice!  I initially did something similar (moving invalidated roots to a separate
list), but never circled back to the idea after implementing the worker stuff.

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bd3625a875ef..5fd8bc858c6f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5698,6 +5698,16 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  {
>          lockdep_assert_held(&kvm->slots_lock);
> +        /*
> +         * kvm_tdp_mmu_invalidate_all_roots() needs a nonzero reference
> +         * count.  If we're dying, zap everything as it's going to happen
> +         * soon anyway.
> +         */
> +        if (!refcount_read(&kvm->users_count)) {
> +                kvm_mmu_zap_all(kvm);
> +                return;
> +        }

I'd prefer we make this an assertion and shove this logic to set_nx_huge_pages(),
because in that case there's no need to zap anything, the guest can never run
again.  E.g. (I'm trying to remember why I didn't do this before...)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2c1c4eb6007..d4d25ab88ae7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6132,7 +6132,8 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
                 list_for_each_entry(kvm, &vm_list, vm_list) {
                         mutex_lock(&kvm->slots_lock);
-                        kvm_mmu_zap_all_fast(kvm);
+                        if (refcount_read(&kvm->users_count))
+                                kvm_mmu_zap_all_fast(kvm);
                         mutex_unlock(&kvm->slots_lock);
                         wake_up_process(kvm->arch.nx_lpage_recovery_thread);

> +
>          write_lock(&kvm->mmu_lock);
>          trace_kvm_mmu_zap_all_fast(kvm);
> @@ -5732,20 +5742,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>          kvm_zap_obsolete_pages(kvm);
>          write_unlock(&kvm->mmu_lock);
> -
> -        /*
> -         * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
> -         * returning to the caller, e.g. if the zap is in response to a memslot
> -         * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
> -         * associated with the deleted memslot once the update completes, and
> -         * Deferring the zap until the final reference to the root is put would
> -         * lead to use-after-free.
> -         */
> -        if (is_tdp_mmu_enabled(kvm)) {
> -                read_lock(&kvm->mmu_lock);
> -                kvm_tdp_mmu_zap_invalidated_roots(kvm);
> -                read_unlock(&kvm->mmu_lock);
> -        }
>  }
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cd1bf68e7511..af9db5b8f713 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -142,10 +142,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>          WARN_ON(!root->tdp_mmu_page);
>          /*
> -         * The root now has refcount=0 and is valid.  Readers cannot acquire
> -         * a reference to it (they all visit valid roots only, except for
> -         * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
> -         * any reference itself.
> +         * The root now has refcount=0.  It is valid, but readers already
> +         * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
> +         * rejects it.  This remains true for the rest of the execution
> +         * of this function, because readers visit valid roots only

One thing that keeps tripping me up is the "readers" verbiage.  I get confused
because taking mmu_lock for read vs. write doesn't really have anything to do with
reading or writing state, e.g. "readers" still write SPTEs, and so I keep thinking
"readers" means anything iterating over the set of roots.  Not sure if there's a
shorthand that won't be confusing.

> +         * (except for tdp_mmu_zap_root_work(), which however operates only
> +         * on one specific root and does not acquire any reference itself).
>
>           *
>           * Even though there are flows that need to visit all roots for
>           * correctness, they all take mmu_lock for write, so they cannot yet

...

> It passes a smoke test, and also resolves the debate on the fate of patch 1.

+1000, I love this approach.  Do you want me to work on a v3, or shall I let you
have the honors?
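
For anyone following along without the series applied, a rough userspace sketch
of the pattern under discussion may help.  It is not KVM code: toy_root,
toy_get_root(), toy_put_root() and toy_run_pending_work() are hypothetical names,
the pending list merely stands in for the system work queue, and everything runs
single-threaded.  It only illustrates the two properties discussed above, namely
that a root whose refcount has hit zero can no longer be acquired, and that the
task dropping the last reference queues the zap instead of performing it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a TDP MMU root; NOT the kernel's struct kvm_mmu_page. */
struct toy_root {
        atomic_int refcount;        /* references held by users of the root */
        bool zapped;                /* set once the expensive teardown has run */
        struct toy_root *next_work; /* link in the pending-work list */
};

/* Pending work list; an assumption standing in for the system work queue. */
static struct toy_root *pending_work;

/* Take a reference unless the count already hit zero (inc-not-zero semantics). */
static bool toy_get_root(struct toy_root *root)
{
        int old = atomic_load(&root->refcount);

        while (old != 0) {
                /* On failure, 'old' is refreshed with the current value; retry. */
                if (atomic_compare_exchange_weak(&root->refcount, &old, old + 1))
                        return true;
        }
        return false;
}

/* Drop a reference; the final put only queues the zap, it does not run it. */
static void toy_put_root(struct toy_root *root)
{
        if (atomic_fetch_sub(&root->refcount, 1) != 1)
                return; /* not the last reference */

        root->next_work = pending_work;
        pending_work = root;
}

/* What the worker would do later; returns the number of roots zapped. */
static int toy_run_pending_work(void)
{
        int zapped = 0;

        while (pending_work) {
                struct toy_root *root = pending_work;

                pending_work = root->next_work;
                root->next_work = NULL;
                root->zapped = true; /* the long-running zap would happen here */
                zapped++;
        }
        return zapped;
}
```

The point of the split is that toy_put_root() returns immediately even for the
last reference, so the caller is never stuck behind the teardown, while
toy_get_root() failing at refcount=0 is what keeps anyone else from picking the
root back up before the deferred work runs.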