References: <20240611002145.2078921-1-jthoughton@google.com> <20240611002145.2078921-5-jthoughton@google.com>
From: James Houghton
Date: Tue, 11 Jun 2024 16:04:47 -0700
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier
To: Sean Christopherson
Cc: Yu Zhao, Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack, David Rientjes,
James Morse, Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon, Zenghui Yu, kvmarm@lists.linux.dev, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson wrote:
>
> On Tue, Jun 11, 2024, James Houghton wrote:
> > On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao wrote:
> > >
> > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton wrote:
> > > >
> > > > This new notifier is for multi-gen LRU specifically
> > >
> > > Let me call it out before others do: we can't be this self-serving.
> > >
> > > > as it wants to be
> > > > able to get and clear age information from secondary MMUs only if it can
> > > > be done "fast".
> > > >
> > > > By having this notifier specifically created for MGLRU, what "fast"
> > > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > > reclaim most of the time.
> > > >
> > > > Signed-off-by: James Houghton
> > >
> > > If we'd like this to pass other MM reviewers, especially the MMU
> > > notifier maintainers, we'd need to design a generic API that can
> > > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > > and MGLRU.
> > >
> > > Also I personally prefer to extend the existing callbacks by adding
> > > new parameters, and on top of that, I'd try to consolidate the
> > > existing callbacks -- it'd be less of a hard sell if my changes result
> > > in less code, not more.
> > >
> > > (v2 did all these, btw.)
> >
> > I think consolidating the callbacks is cleanest, like you had it in
> > v2. I really wasn't sure about this change honestly, but it was my
> > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > the callbacks like you had in v2.
>
> James, wait for others to chime in before committing yourself to a course of
> action, otherwise you're going to get ping-ponged to hell and back.

Ah yeah. I really mean "I'll do it, provided the other feedback is in
line with this".

>
> > Instead of the bitmap like you had, I imagine we'll have some kind of
> > flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> > MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> > that sound ok?
>
> Why do we need a bundle of flags?  If we extend .clear_young() and .test_young()
> as Yu suggests, then we only need a single "bool fast_only".

We don't need to. In my head it's a little easier to collapse them
(slightly less code, and at the callsite you have a flag with a name
instead of a true/false). Making it a bool SGTM.

> As for adding a fast_only versus dedicated APIs, I don't have a strong preference.
> Extending will require a small amount of additional churn, e.g. to pass in false,
> but that doesn't seem problematic on its own.  On the plus side, there would be
> less copy+paste in include/linux/mmu_notifier.h (though that could be solved with
> macros :-) ).

I think having the extra bool is cleaner than the new fast_only
notifier, definitely.

>
> E.g.
>
> --
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 7b77ad6cf833..07872ae00fa6 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>
>  int __mmu_notifier_clear_young(struct mm_struct *mm,
>                                unsigned long start,
> -                              unsigned long end)
> +                              unsigned long end,
> +                              bool fast_only)
>  {
>         struct mmu_notifier *subscription;
>         int young = 0, id;
> @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
>         hlist_for_each_entry_rcu(subscription,
>                                  &mm->notifier_subscriptions->list, hlist,
>                                  srcu_read_lock_held(&srcu)) {
> -               if (subscription->ops->clear_young)
> -                       young |= subscription->ops->clear_young(subscription,
> -                                                               mm, start, end);
> +               if (!subscription->ops->clear_young ||
> +                   fast_only && !subscription->ops->has_fast_aging)
> +                       continue;
> +
> +               young |= subscription->ops->clear_young(subscription,
> +                                                       mm, start, end);

KVM changing has_fast_aging dynamically would be slow, wouldn't it? I
feel like it's simpler to just pass fast_only into `clear_young`
itself (and this is how I interpreted what you wrote above anyway).
>         }
>         srcu_read_unlock(&srcu, id);
>
> @@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
>  }
>
>  int __mmu_notifier_test_young(struct mm_struct *mm,
> -                             unsigned long address)
> +                             unsigned long address,
> +                             bool fast_only)
>  {
>         struct mmu_notifier *subscription;
>         int young = 0, id;
> @@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
>         hlist_for_each_entry_rcu(subscription,
>                                  &mm->notifier_subscriptions->list, hlist,
>                                  srcu_read_lock_held(&srcu)) {
> -               if (subscription->ops->test_young) {
> -                       young = subscription->ops->test_young(subscription, mm,
> -                                                             address);
> -                       if (young)
> -                               break;
> -               }
> +               if (!subscription->ops->test_young)
> +                       continue;
> +
> +               if (fast_only && !subscription->ops->has_fast_aging)
> +                       continue;
> +
> +               young = subscription->ops->test_young(subscription, mm, address);
> +               if (young)
> +                       break;
>         }
>         srcu_read_unlock(&srcu, id);
> --
>
> It might also require multiplexing the return value to differentiate between
> "young" and "failed".  Ugh, but the code already does that, just in a bespoke way.

Yeah, that is necessary.

> Double ugh.  Peeking ahead at the "failure" code, NAK to adding
> kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
> kvm_arch_has_test_clear_young() in v1.  Please stop trying to do anything like
> that, I will NAK each and every attempt to have core mm/ code call directly into KVM.

Sorry to make you repeat yourself; I'll leave it out of v6. I don't
like it either, but I wasn't sure how important it was to avoid
calling into unnecessary notifiers if the TDP MMU were completely
disabled.

> Anyways, back to this code, before we spin another version, we need to agree on
> exactly what behavior we want out of secondary MMUs.  Because to me, the behavior
> proposed in this version doesn't make any sense.
>
> Signalling failure because KVM _might_ have relevant aging information in SPTEs
> that require taking kvm->mmu_lock is a terrible tradeoff.  And for the test_young
> case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
> KVM should return "young", not "failed".

Sorry for this oversight. What about something like:

1. test (and maybe clear) A bits on TDP MMU
2. if accessed && !should_clear: return (fast)
3. if (fast_only): return (fast)
4. if !(must check shadow MMU): return (fast)
5. test (and maybe clear) A bits in shadow MMU
6. return (slow)

Some of this reordering (and maybe a change from
kvm_shadow_root_allocated() to checking indirect_shadow_pages or
something else) can be done in its own patch.

> If KVM is using the TDP MMU, i.e. has_fast_aging=true, then there will be rmaps
> if and only if L1 ran a nested VM at some point.  But as proposed, KVM doesn't
> actually check if there are any shadow TDP entries to process.  That could be
> fixed by looking at kvm->arch.indirect_shadow_pages, but even then it's not clear
> that bailing if kvm->arch.indirect_shadow_pages > 0 makes sense.
>
> E.g. if L1 happens to be running an L2, but <10% of the VM's memory is exposed to
> L2, then "failure" is pretty much guaranteed to be a false positive.  And even for
> the pages that are exposed to L2, "failure" will occur if and only if the pages
> are being accessed _only_ by L2.
>
> There most definitely are use cases where the majority of a VM's memory is accessed
> only by L2.  But if those use cases are performing poorly under MGLRU, then IMO
> we should figure out a way to enhance KVM to do a fast harvest of nested TDP
> Accessed information, not make MGLRU+KVM suck for VMs that run nested VMs.

This makes sense. I don't have data today showing that speeding up
harvesting Accessed information from the shadow MMU would be a big
win.
Getting this information for the TDP MMU is at least better than no
information at all.

>
> Oh, and calling into mmu_notifiers to do the "slow" version if the fast version
> fails is suboptimal.

Agreed. I didn't like this when I wrote it. This can be easily fixed
by making mmu_notifier_clear_young() return "fast" and "young or
not", which I will do.

> So rather than failing the fast aging, I think what we want is to know if an
> mmu_notifier found a young SPTE during a fast lookup.  E.g. something like this
> in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> is an optional optimization to avoid taking mmu_lock for write in paths where a
> (very rare) false negative is acceptable.
>
> static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> {
>         return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> }
>
> static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
>                          bool fast_only)
> {
>         int young = 0;
>
>         if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
>                 write_lock(&kvm->mmu_lock);
>                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>                 write_unlock(&kvm->mmu_lock);
>         }
>
>         if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
>                 young = 1 | MMU_NOTIFY_WAS_FAST;

I don't think this line is quite right. We might set
MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what
you mean though, thanks.

>
>         return (int)young;
> }
>
> and then in lru_gen_look_around():
>
>         if (spin_is_contended(pvmw->ptl))
>                 return false;
>
>         /* exclude special VMAs containing anon pages from COW */
>         if (vma->vm_flags & VM_SPECIAL)
>                 return false;
>
>         young = ptep_clear_young_notify(vma, addr, pte);
>         if (!young)
>                 return false;
>
>         if (!(young & MMU_NOTIFY_WAS_FAST))
>                 return true;
>
>         young = 1;
>
> with the lookaround done using ptep_clear_young_notify_fast().
>
> The MMU_NOTIFY_WAS_FAST flag is gross, but AFAICT it would Just Work without
> needing to update all users of ptep_clear_young_notify() and friends.

Sounds good to me. Thanks for all the feedback!
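
P.S. To convince myself I have the corrected semantics straight (i.e.,
only report "fast" when the slow mmu_lock path was never taken), here's
a tiny userspace model of the multiplexed return value. All names and
bit values below are illustrative placeholders only, not the actual
kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values: "young" in bit 0, "the answer came from a
 * lockless (fast) lookup" in bit 1.  Real names/values TBD for v6. */
#define MMU_NOTIFY_YOUNG	0x1
#define MMU_NOTIFY_WAS_FAST	0x2

/*
 * Toy model of the aging flow discussed above.  tdp_young and
 * shadow_young stand in for the real page-table walks; WAS_FAST is
 * reported only if the slow (mmu_lock) path never ran.
 */
static int model_age(bool tdp_young, bool shadow_young, bool fast_only,
		     bool has_shadow_sptes)
{
	int young = 0;
	bool was_fast = true;

	if (tdp_young)				/* lockless TDP MMU harvest */
		young |= MMU_NOTIFY_YOUNG;

	if (!fast_only && has_shadow_sptes) {	/* slow path: takes mmu_lock */
		was_fast = false;
		if (shadow_young)
			young |= MMU_NOTIFY_YOUNG;
	}

	if (young && was_fast)
		young |= MMU_NOTIFY_WAS_FAST;

	return young;
}

/* Caller-side decode, mirroring the lru_gen_look_around() snippet:
 * only bother with the lookaround when the "young" answer came from
 * the fast path. */
static bool worth_lookaround(int young)
{
	return (young & MMU_NOTIFY_YOUNG) && (young & MMU_NOTIFY_WAS_FAST);
}
```

With something like this, a caller can distinguish "young via the fast
path" from "young via the slow path" without disturbing the existing
"!young" checks.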