Date: Tue, 9 Jan 2024 17:20:45 -0800
Message-ID: <20240110012045.505046-1-seanjc@google.com>
Subject: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
From: Sean Christopherson
To: Sean Christopherson, Paolo Bonzini
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Yan Zhao, Kai Huang

Retry page faults without acquiring mmu_lock if the resolved gfn is covered
by an active invalidation.  Contending for mmu_lock is especially
problematic on preemptible kernels as the mmu_notifier invalidation task
will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
invalidation, and ultimately increase the latency of resolving the page
fault.  And in the worst case scenario, yielding will be accompanied by a
remote TLB flush, e.g. if the invalidation covers a large range of memory
and vCPUs are accessing addresses that were already zapped.

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Add a dedicated lockless version of the range-based retry check to avoid
false positives on the sanity check on start+end WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.  Force a load
of mmu_invalidate_seq as well, even though it isn't strictly necessary to
avoid an infinite loop, as doing so improves the probability that KVM will
detect an invalidation that already completed before acquiring mmu_lock
and bailing anyways.

Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock.  This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.

Reported-by: Yan Zhao
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Acked-by: Kai Huang
Signed-off-by: Sean Christopherson
---

Note, this version adds a dedicated helper, mmu_invalidate_retry_gfn_unsafe(),
instead of making mmu_invalidate_retry_gfn() play nice with being called
without mmu_lock held.  I was hesitant to drop the lockdep assertion
before, and the recently introduced sanity check on the gfn start/end
values pushed this past the threshold of being worth the duplicate code
(preserving the start/end sanity check in lock-free code would be comically
difficult, and would add almost no value since it would have to be quite
conservative to avoid false positives).

Kai, I kept your Ack even though the code is obviously a little different.
Holler if you want me to drop it.

v2:
 - Introduce a dedicated helper and collapse to a single patch (because
   adding an unused helper would be quite silly).
 - Add a comment to explain the "unsafe" check in kvm_faultin_pfn(). [Kai]
 - Add Kai's Ack.

v1: https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@google.com

 arch/x86/kvm/mmu/mmu.c   | 16 ++++++++++++++++
 include/linux/kvm_host.h | 26 ++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c844e428684..92f51540c4a7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4415,6 +4415,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
 
+	/*
+	 * Pre-check for a relevant mmu_notifier invalidation event prior to
+	 * acquiring mmu_lock.  If there is an in-progress invalidation and the
+	 * kernel allows preemption, the invalidation task may drop mmu_lock
+	 * and yield in response to mmu_lock being contended, which is *very*
+	 * counter-productive as this vCPU can't actually make forward progress
+	 * until the invalidation completes.  This "unsafe" check can get false
+	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.  Do
+	 * the pre-check even for non-preemptible kernels, i.e. even if KVM will
+	 * never yield mmu_lock in response to contention, as this vCPU is
+	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
+	 * to detect retry guarantees the worst case latency for the vCPU.
+	 */
+	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
+		return RET_PF_RETRY;
+
 	return RET_PF_CONTINUE;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..179df96b20f8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2031,6 +2031,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 		return 1;
 	return 0;
 }
+
+/*
+ * This lockless version of the range-based retry check *must* be paired with a
+ * call to the locked version after acquiring mmu_lock, i.e. this is safe to
+ * use only as a pre-check to avoid contending mmu_lock.  This version *will*
+ * get false negatives and false positives.
+ */
+static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
+						   unsigned long mmu_seq,
+						   gfn_t gfn)
+{
+	/*
+	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
+	 * are always read from memory, e.g. so that checking for retry in a
+	 * loop won't result in an infinite retry loop.  Don't force loads for
+	 * start+end, as the key to avoiding infinite retry loops is observing
+	 * the 1=>0 transition of in-progress, i.e. getting false negatives
+	 * due to stale start+end values is acceptable.
+	 */
+	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
+	    gfn >= kvm->mmu_invalidate_range_start &&
+	    gfn < kvm->mmu_invalidate_range_end)
+		return true;
+
+	return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
+}
 #endif
 
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING

base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
-- 
2.43.0.472.g3155946c3a-goog
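
For anyone who wants to poke at the pre-check semantics outside the kernel,
here is a minimal userspace sketch of the same logic.  It is not KVM code:
struct demo_kvm, the demo_*() helpers and the local READ_ONCE() macro are
all hypothetical stand-ins invented for illustration, the begin/end helpers
only model a single invalidation at a time, and there is no locking or
memory ordering here at all.  It assumes a GCC/Clang-style __typeof__.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's READ_ONCE(): force a real load. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical miniature of the invalidation state kept in struct kvm. */
struct demo_kvm {
	long in_progress;	/* number of active invalidations */
	unsigned long seq;	/* bumped when an invalidation completes */
	uint64_t range_start;	/* first gfn covered (inclusive) */
	uint64_t range_end;	/* end of covered gfns (exclusive) */
};

/*
 * Lockless pre-check.  A "true" result means "retry now"; a "false" result
 * proves nothing, so a real caller must re-check with the lock held.
 */
static inline bool demo_retry_gfn_unsafe(struct demo_kvm *kvm,
					 unsigned long seq, uint64_t gfn)
{
	if (READ_ONCE(kvm->in_progress) &&
	    gfn >= kvm->range_start && gfn < kvm->range_end)
		return true;

	return READ_ONCE(kvm->seq) != seq;
}

/* Simplified model of an invalidation starting (one range at a time). */
static inline void demo_invalidate_begin(struct demo_kvm *kvm,
					 uint64_t start, uint64_t end)
{
	assert(start < end);
	kvm->in_progress++;
	kvm->range_start = start;
	kvm->range_end = end;
}

/* Simplified model of an invalidation completing. */
static inline void demo_invalidate_end(struct demo_kvm *kvm)
{
	assert(kvm->in_progress > 0);
	kvm->seq++;
	kvm->in_progress--;
}
```

With a snapshot of seq taken before demo_invalidate_begin(&kvm, 0x100, 0x200),
the pre-check asks for a retry for gfns 0x100..0x1ff while the invalidation
is active, leaves gfns outside the range alone, and asks for a retry for any
gfn once demo_invalidate_end() has bumped seq past the snapshot.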