Received: by 2002:a05:6500:1b8f:b0:1fa:5c73:8e2d with SMTP id df15csp812519lqb; Wed, 29 May 2024 11:10:20 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCWc1w1T24CvWvPzqpLsRUHOFo3C4lesmBjIRub4r2aGc6dRtDp+jjzpYfojvoFbkB8aBUBofyv+ftAJzJpAFxg3UPLnL6eArzVjoLByGA== X-Google-Smtp-Source: AGHT+IG5tI3Pm0uemCVW2qUk4lMnDoSAFDh2GtavxW0EIZLjORbJ+HJcKAPzYOUocnIRaB8lrrt0 X-Received: by 2002:a05:6a20:de91:b0:1af:cecb:b64a with SMTP id adf61e73a8af0-1b212cf0063mr14561707637.9.1717006219553; Wed, 29 May 2024 11:10:19 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1717006219; cv=pass; d=google.com; s=arc-20160816; b=aN2UYht5F5fSFAdWBWJvVUhFmDPP3i9z+uD4bthfbDdCeu4l7rtRri0OzB67TlV4n5 9KdKaTiTNtvHcWsxrfZHWTBRCj2L3QbF38h3dr4Rh6DMzff+nMHhoIulynpoN6Oa7eGH QCaAEFoWGO1exUB1scPnvPSfAw9/7GqytzZ6m6289GDjDNB7PtbFaQLasI5pUDxXOaIJ I/UExMASahIJTwtHUd8SaKQDntQFBn5QaqoJLbO+Tl2YZeSr6d6eq4M/dkny5pK9gy8V ZnA84T7t2rMJEWy2xOSIKoeF4jU7N4cBcNh+zaYoONA2st7LeiC41EjmssdL3KTLpZkH qhzA== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=cc:to:from:subject:message-id:references:mime-version :list-unsubscribe:list-subscribe:list-id:precedence:in-reply-to:date :dkim-signature; bh=Bhng0mHOZ4irF/CjBRL9b9lRPnBOuiSl+dtfP+mXSOo=; fh=vOOg+zCe+c5Qbnx1osxdZB8HzXJ3K4V7rEcnClV4P8k=; b=xeBPMv8ukmE7JZH+N7NxnlUPd2ErRV9j8gGSgjThYo0hUrX8pzSff34Jcf1AP/XYW3 c/l9k6REiC1CrnGnLlmqIqGiQgXwDOfruFbiUOZ+72AUEq2CzvxOPJrfw/VesE0KEvyC +LTnpoV/JnGrMWT4oKYLyxZuLaUeNl7D10taDa3yMZCk1ScpvDGWnroNs7KqTcZe3rXz I6dgAW/DcIicJKTOS61+CbNmLPvQL2zDxHtbIUtd1YBkoGXTxy/qhTnA+ROoPoHQgDR/ zhAsnn0Sbiz733GPn3cTacEftSy+HsSTYxVIXmk8ppd5UEv7R86WQdjYQYn9PK3BUHVN QqZQ==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=EdwmZSoX; arc=pass (i=1 spf=pass spfdomain=flex--jthoughton.bounces.google.com dkim=pass dkdomain=google.com dmarc=pass fromdomain=google.com); spf=pass (google.com: domain of linux-kernel+bounces-194550-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) smtp.mailfrom="linux-kernel+bounces-194550-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from sy.mirrors.kernel.org (sy.mirrors.kernel.org. [147.75.48.161]) by mx.google.com with ESMTPS id 41be03b00d2f7-687af8ecafdsi157640a12.344.2024.05.29.11.10.19 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 May 2024 11:10:19 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-194550-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) client-ip=147.75.48.161; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=EdwmZSoX; arc=pass (i=1 spf=pass spfdomain=flex--jthoughton.bounces.google.com dkim=pass dkdomain=google.com dmarc=pass fromdomain=google.com); spf=pass (google.com: domain of linux-kernel+bounces-194550-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.48.161 as permitted sender) smtp.mailfrom="linux-kernel+bounces-194550-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sy.mirrors.kernel.org (Postfix) with ESMTPS id A56D8B23660 for ; Wed, 29 May 2024 18:07:11 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id BCCD41C9EB3; Wed, 29 May 2024 18:05:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="EdwmZSoX" Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 24F851C2317 for ; Wed, 29 May 2024 18:05:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717005926; cv=none; b=oF8+Jg8HFe27R/af69mRQg/KOLZVcewsZf2ewbJ06sH7A8U+hbe8Dljr/PHuGJ3obTCiAkcZcVflYcuI98/8mNOWVGv7dHxMbJuybSFqSW8hhM8zhcPGiJncjhBD8zb0inbsNlo90SAeNq99yjc8yG5Ml6i9PvA2Apo9fKcGpLE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717005926; c=relaxed/simple; bh=qTgSsDtdVZ+dwqZdm4QKshcvn/0b7ZtYjvzgdhBVuqg=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=FbfzCtVOo0lw+8R7YoyjtAiLU2axefcad0F1PFLt8saQOkd48E42Np5VrV3nHM021uyC27je/757KumH4mAru/eC2SasfyG5b74hnnENQH34ZM1m2NivwCGeqaHlK0PR0Uyu2XF/Sz4Nwn1eAOyUWx+7TlawaNGgUGmiaDriSzk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jthoughton.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=EdwmZSoX; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jthoughton.bounces.google.com Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-627f43bec13so36543977b3.0 for ; Wed, 29 May 2024 11:05:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1717005922; x=1717610722; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Bhng0mHOZ4irF/CjBRL9b9lRPnBOuiSl+dtfP+mXSOo=; b=EdwmZSoXdj1gFUoUE9+fGDKF3Aj19v17OcE6itdMJ/ukBxVSGnd8if+LOGevh5It2s weVxYW4jHqPYGbQECb7Ejmj9cvbiI8cTZKljFgJAEtWDsYe97fwIkhOcmR2CgCVAUmoi yz7upr72s/d0ZmBRylivS11LFDhV33HTNWDlkpy9cdV4sSf1fy2DFDo4FR4Ji3XPIbvc IuHiuPhozvsTOWWabGgbYfBJCaTFtlovZllJhwVQ6FwPPiQ1jCzZr9ZJdQa0RGHIgDV6 xVoyrR+DdhEfS+CWknUMcg8XqlfjkFPesKeied84aoaMdyKuJI+UBBSqIxmtfU0dlp68 U9YA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1717005922; x=1717610722; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Bhng0mHOZ4irF/CjBRL9b9lRPnBOuiSl+dtfP+mXSOo=; b=tj896MTj1AvMn3zUqtCqVT8aagSGXbpxbvyHgxs4FDAN66BUmC+3IxXWee9Mfq+fz6 bHgfLD1FKXEWAC34dva95PjmaN+GV1L5SUlFOll4fASyTF0+q17NsI1HA7vO9fWJolFU 2mowJqfsUqU6Q6HaZVDSc76atTKsvuRSd4SXaRXgy+QH0DeEoUH0aLBuTW3HS3fWqMIo jod+a0d8aiBFxenr8P21KoJI0/JMjYsX8eCyR8M7aCkIKSiHBZ7NgZFXcNNRGDc2uRFl RYhphXvlIpLm/jmdDY1i8pb5R3COy5NmFJW8Oe+dHxzoU0UGIRUl5aEsaQeeYYNa84Lq py0Q== X-Forwarded-Encrypted: i=1; AJvYcCVNmW0Zm4q4itQX6VAvI29zkb0o3BxmwrGLzaKXtsG9t5EL9jIOHtedD4cYWp3oYn4q5kpDX//58TbePfvx3cUwX4wS9M7vZFnxQvvl X-Gm-Message-State: AOJu0YyIz20tOGBxpMbRZS/JbFu7cqDHcqCHUDO2CobuVxRX5twnHwbr uTNqWs6lhp1oK01rz1x1WQ5nzoCG16Hr3EV9nrxk6yJXTXwgC+XhtDR+ACMcjz0YFoy/cT8NHoF p17dcyzadrlnvMUHRdg== X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:14:4d90:c0a8:2a4f]) (user=jthoughton job=sendgmr) by 2002:a05:6902:72d:b0:dee:60e9:69f4 with SMTP id 3f1490d57ef6-df77239930emr4274727276.10.1717005921915; Wed, 29 May 2024 11:05:21 -0700 (PDT) Date: Wed, 29 May 2024 18:05:08 +0000 In-Reply-To: <20240529180510.2295118-1-jthoughton@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240529180510.2295118-1-jthoughton@google.com> X-Mailer: git-send-email 2.45.1.288.g0e0cd299f1-goog Message-ID: <20240529180510.2295118-6-jthoughton@google.com> Subject: [PATCH v4 5/7] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn From: James Houghton To: Andrew Morton , Paolo Bonzini Cc: Albert Ou , Ankit Agrawal , Anup Patel , Atish Patra , Axel Rasmussen , Bibo Mao , Catalin Marinas , David Matlack , David Rientjes , Huacai Chen , James Houghton , James Morse , Jonathan Corbet , Marc Zyngier , Michael Ellerman , Nicholas Piggin , Oliver Upton , Palmer Dabbelt , Paul Walmsley , Raghavendra Rao Ananta , Ryan Roberts , Sean Christopherson , Shaoqin Huang , Shuah Khan , Suzuki K Poulose , Tianrui Zhao , Will Deacon , Yu Zhao , Zenghui Yu , kvm-riscv@lists.infradead.org, kvm@vger.kernel.org, kvmarm@lists.linux.dev, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mips@vger.kernel.org, linux-mm@kvack.org, linux-riscv@lists.infradead.org, linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev Content-Type: text/plain; charset="UTF-8" Walk the TDP MMU in an RCU read-side critical section. This requires a way to do RCU-safe walking of the tdp_mmu_roots; do this with a new macro. The PTE modifications are now done atomically, and kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the fact that kvm_age_gfn can now lockless update the accessed bit and the R/X bits). If the cmpxchg for marking the spte for access tracking fails, we simply retry if the spte is still a leaf PTE. If it isn't, we return false to continue the walk. Harvesting age information from the shadow MMU is still done while holding the MMU write lock. Suggested-by: Yu Zhao Signed-off-by: James Houghton --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/mmu/mmu.c | 18 ++++----- arch/x86/kvm/mmu/tdp_iter.h | 27 +++++++------ arch/x86/kvm/mmu/tdp_mmu.c | 67 +++++++++++++++++++++++++-------- 4 files changed, 76 insertions(+), 37 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ece45b3f6f20..48fb29bb782f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1438,6 +1438,7 @@ struct kvm_arch { * tdp_mmu_page set. * * For reads, this list is protected by: + * RCU alone or * the MMU lock in read mode + RCU or * the MMU lock in write mode * diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6a2a557c2c31..956834da8a0e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1630,16 +1630,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { bool young = false; - write_lock(&kvm->mmu_lock); - - if (kvm_memslots_have_rmaps(kvm)) + if (kvm_memslots_have_rmaps(kvm)) { + write_lock(&kvm->mmu_lock); young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); + write_unlock(&kvm->mmu_lock); + } if (tdp_mmu_enabled) young |= kvm_tdp_mmu_age_gfn_range(kvm, range); - write_unlock(&kvm->mmu_lock); - return young; } @@ -1647,16 +1646,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { bool young = false; - write_lock(&kvm->mmu_lock); - - if (kvm_memslots_have_rmaps(kvm)) + if (kvm_memslots_have_rmaps(kvm)) { + write_lock(&kvm->mmu_lock); young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap); + write_unlock(&kvm->mmu_lock); + } if (tdp_mmu_enabled) young |= kvm_tdp_mmu_test_age_gfn(kvm, range); - write_unlock(&kvm->mmu_lock); - return young; } diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index fae559559a80..f558ae9054af 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -24,16 +24,24 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte) return xchg(rcu_dereference(sptep), new_spte); } +static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask) +{ + atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep); + + return (u64)atomic64_fetch_and(~mask, sptep_atomic); +} + static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte) { WRITE_ONCE(*rcu_dereference(sptep), new_spte); } /* - * SPTEs must be modified atomically if they are shadow-present, leaf - * SPTEs, and have volatile bits, i.e. has bits that can be set outside - * of mmu_lock. The Writable bit can be set by KVM's fast page fault - * handler, and Accessed and Dirty bits can be set by the CPU. + * SPTEs must be modified atomically if they have bits that can be set outside + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the + * Writable bit can be set by KVM's fast page fault handler, the Accessed and + * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be + * cleared by age_gfn_range. * * Note, non-leaf SPTEs do have Accessed bits and those bits are * technically volatile, but KVM doesn't consume the Accessed bit of @@ -44,8 +52,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte) static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level) { return is_shadow_present_pte(old_spte) && - is_last_spte(old_spte, level) && - spte_has_volatile_bits(old_spte); + is_last_spte(old_spte, level); } static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte, @@ -61,12 +68,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte, static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte, u64 mask, int level) { - atomic64_t *sptep_atomic; - - if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) { - sptep_atomic = (atomic64_t *)rcu_dereference(sptep); - return (u64)atomic64_fetch_and(~mask, sptep_atomic); - } + if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) + return tdp_mmu_clear_spte_bits_atomic(sptep, mask); __kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask); return old_spte; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 1259dd63defc..c74b0221dae0 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm, return true; } +static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void) +{ + WARN_ON_ONCE(!rcu_read_lock_held()); + return true; +} void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) { @@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, ((_only_valid) && (_root)->role.invalid))) { \ } else +/* + * Iterate over all TDP MMU roots in an RCU read-side critical section. + */ +#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id) \ + list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link) \ + if (kvm_lockdep_assert_rcu_read_lock_held() && \ + (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) { \ + } else + #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ __for_each_tdp_mmu_root(_kvm, _root, _as_id, false) @@ -1223,6 +1237,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm, return ret; } +static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless( + struct kvm *kvm, + struct kvm_gfn_range *range, + tdp_handler_t handler) +{ + struct kvm_mmu_page *root; + struct tdp_iter iter; + bool ret = false; + + rcu_read_lock(); + + for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) { + tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) + ret |= handler(kvm, &iter, range); + } + + rcu_read_unlock(); + + return ret; +} + /* * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero * if any of the GFNs in the range have been accessed. @@ -1236,28 +1271,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter, { u64 new_spte; +retry: /* If we have a non-accessed entry we don't need to change the pte. */ if (!is_accessed_spte(iter->old_spte)) return false; if (spte_ad_enabled(iter->old_spte)) { - iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep, - iter->old_spte, - shadow_accessed_mask, - iter->level); + iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep, + shadow_accessed_mask); new_spte = iter->old_spte & ~shadow_accessed_mask; } else { - /* - * Capture the dirty status of the page, so that it doesn't get - * lost when the SPTE is marked for access tracking. - */ + new_spte = mark_spte_for_access_track(iter->old_spte); + if (__tdp_mmu_set_spte_atomic(iter, new_spte)) { + /* + * The cmpxchg failed. If the spte is still a + * last-level spte, we can safely retry. + */ + if (is_shadow_present_pte(iter->old_spte) && + is_last_spte(iter->old_spte, iter->level)) + goto retry; + /* Otherwise, continue walking. */ + return false; + } if (is_writable_pte(iter->old_spte)) kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte)); - - new_spte = mark_spte_for_access_track(iter->old_spte); - iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep, - iter->old_spte, new_spte, - iter->level); } trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level, @@ -1267,7 +1304,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter, bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) { - return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range); + return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, age_gfn_range); } static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter, @@ -1278,7 +1315,7 @@ static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter, bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { - return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn); + return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, test_age_gfn); } /* -- 2.45.1.288.g0e0cd299f1-goog