Date: Tue, 14 Dec 2021 23:34:59 +0000
From: Sean Christopherson
To: Ben Gardon
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Hou Wenlong
Subject: Re: [PATCH 15/28] KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots
References: <20211120045046.3940942-1-seanjc@google.com> <20211120045046.3940942-16-seanjc@google.com>

On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Mon, Nov 22, 2021 at 3:08 PM Sean Christopherson wrote:
> >
> > On Mon, Nov 22, 2021, Ben Gardon wrote:
> > > On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson wrote:
> > > >
> > > > Take TDP MMU roots off the list of roots when they're invalidated instead
> > > > of walking later on to find the roots that were just invalidated.  In
> > > > addition to making the flow more straightforward, this allows warning
> > > > if something attempts to elevate the refcount of an invalid root, which
> > > > should be unreachable (no longer on the list so can't be reached by MMU
> > > > notifier, and vCPUs must reload a new root before installing new SPTEs).
> > > >
> > > > Signed-off-by: Sean Christopherson
> > >
> > > There are a bunch of awesome little cleanups and unrelated fixes
> > > included in this commit that could be factored out.
> > >
> > > I'm skeptical of immediately moving the invalidated roots into another
> > > list as that seems like it has a lot of potential for introducing
> > > weird races.
> >
> > I disagree, the entire premise of fast invalidate is that there can't be
> > races, i.e. mmu_lock must be held for write.  IMO, it's actually the
> > opposite, as the only reason leaving roots on the per-VM list doesn't have
> > weird races is because slots_lock is held.  If slots_lock weren't required
> > to do a fast zap, which is feasible for the TDP MMU since it doesn't rely
> > on the memslots generation, then it would be possible for multiple calls
> > to kvm_tdp_mmu_zap_invalidated_roots() to run in parallel.  And in that
> > case, leaving roots on the per-VM list would lead to a single instance of
> > a "fast zap" zapping roots it didn't invalidate.  That wouldn't be
> > problematic per se, but I don't like not having a clear "owner" of the
> > invalidated root.
>
> That's a good point, the potential interleaving of zap_alls would be gross.
>
> My mental model for the invariant here was "roots that are still in
> use are on the roots list," but I can see how "the roots list contains
> all valid, in-use roots" could be a more useful invariant.

Sadly, my idea of taking invalid roots off the list ain't gonna happen.  The
immediate issue is that the TDP MMU doesn't zap invalid roots in mmu_notifier
callbacks.  This leads to use-after-free and other issues if the mmu_notifier
runs to completion while an invalid root zapper yields, as KVM fails to honor
the requirement that there must be _no_ references to the page after the
mmu_notifier returns.

This is most noticeable with set_nx_huge_pages() + kvm_mmu_notifier_release(),
but the bug exists between kvm_mmu_notifier_invalidate_range_start() and
memslot updates as well.
The pages aren't accessible by the guest, and KVM won't read or write the data
itself, but KVM will trigger e.g. kvm_set_pfn_dirty() when zapping SPTEs,
_after_ the mmu_notifier returns.
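
For concreteness, a very rough sketch of the idea discussed above; this is
illustrative only, not the actual patch, the helper names (the *_sketch()
functions) are made up, and the fields/locking are approximations of the TDP
MMU code of this era:

/*
 * Rough sketch, NOT the real patch: with mmu_lock held for write, mark each
 * root invalid and move it from the per-VM list to a list owned by this
 * invocation, so nothing else can find it and take a new reference.
 */
static void tdp_mmu_invalidate_all_roots_sketch(struct kvm *kvm,
                                                struct list_head *invalidated)
{
        struct kvm_mmu_page *root, *tmp;

        /* The fast-invalidate premise: mmu_lock is held for write. */
        lockdep_assert_held_write(&kvm->mmu_lock);

        list_for_each_entry_safe(root, tmp, &kvm->arch.tdp_mmu_roots, link) {
                root->role.invalid = true;
                /*
                 * Off the per-VM list => unreachable by MMU notifiers; the
                 * local list gives this invocation clear ownership of the
                 * roots it invalidated.
                 */
                list_move_tail(&root->link, invalidated);
        }
}

/* Elevating the refcount of an invalid root would then indicate a bug. */
static bool tdp_mmu_get_root_sketch(struct kvm_mmu_page *root)
{
        if (WARN_ON_ONCE(root->role.invalid))
                return false;

        return refcount_inc_not_zero(&root->tdp_mmu_root_count);
}

Even with that structure, the deferred zap of the now-private list is exactly
what breaks the mmu_notifier contract described above: the SPTEs under those
roots still reference the pages, and zapping them (with e.g.
kvm_set_pfn_dirty()) can happen after the notifier has already returned.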