Received: by 2002:ab2:7903:0:b0:1fb:b500:807b with SMTP id a3csp1261115lqj; Mon, 3 Jun 2024 15:56:57 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCUu2m5ayQjBfAiyUMUlNAUZdOFbr0+C/CiQsW/FKjmhEz+XB/M9t5s4goNJF9/lCE31zn8NA30GcFQFd0ApSE9p1m4xJNGDKonoxBvynA== X-Google-Smtp-Source: AGHT+IETj3dbuqSz9A21MOLfJdFmN7JJInapAx3f0dbvquapXlwGhwFWhv7vD6dEv5Z+xe1FMIRe X-Received: by 2002:a05:6359:6a8:b0:192:5306:228 with SMTP id e5c5f4694b2df-19b490d6ademr1011616355d.24.1717455416922; Mon, 03 Jun 2024 15:56:56 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1717455416; cv=pass; d=google.com; s=arc-20160816; b=znS2jfn5r+flr7an8lFWOKqw3CRkyWw3BfxZ+Lww4HTSeE9V9DF7wVAtmizaT2oBHP eOvRF+fRb1tSGvl/QQQ5d/MrXwneSFJr6DEGOicxRqtZxWjxh3HOnK0QuPgw6bA/yhWb NCGbNF9Nzh5fQ7+WMI/IE1g1OLZiOh83bnl9Kem4IfwSLk37bbMOQLcf0ErZOvwTNT0N szn1Nx5UHnTGgQz6+vK94ezz3z/QPrGzpFE1919DSNOSFn5QOIO0kHIxJrccGJyh/THB xulc60s3oIRkz0CROfiwtMReWMXxrPLPqLIasHpRcPR8yMqBOu46GOayZbd/l7cmmJlX GuDw== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:list-unsubscribe:list-subscribe :list-id:precedence:dkim-signature; bh=zwApvMmCxi6kRvhFpeZNrHVjs/fSYi202Vq8O0y7Pvc=; fh=98qqvuH6I21HuieurZ+MInJTyrzQ8RZIVBjz7j6de2o=; b=VJ04sxijMyrm/NEpkwHUzqdSerU7mczG31ofRoTDytVTlOTcTLS7SegM0Y6Ldpfa0q XYgw53pX9pwxGVGnUbEbD+1avYeSiV94akem6D94DRvAsTmvcEYWSGXigUjJmB7kyT7C 8MUk23kFpBWtqUswHG88fwGsCkNpEGayJPXsoWzUGVjNAeYjnJ9TCSzf7G01JbWpZQr0 lLI9KPV/im14DL+Loh9UCvST8Bn5rMDEr2Ujxqm6X5glt6aT2gcIVcUiNlty0Rp1bRPX LJMUBNdcTALg33CAVHIL+J+YxQ7Gi6iW0jQUegSuYtSYZhjv2cWtA3NsJkiD1NbOOWQu ljug==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=2YEB+x82; arc=pass (i=1 spf=pass spfdomain=google.com dkim=pass dkdomain=google.com dmarc=pass fromdomain=google.com); spf=pass (google.com: domain of linux-kernel+bounces-199794-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-199794-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id 41be03b00d2f7-6c35c181cdfsi7019580a12.639.2024.06.03.15.56.56 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 03 Jun 2024 15:56:56 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-199794-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=2YEB+x82; arc=pass (i=1 spf=pass spfdomain=google.com dkim=pass dkdomain=google.com dmarc=pass fromdomain=google.com); spf=pass (google.com: domain of linux-kernel+bounces-199794-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-199794-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 9454D2869C6 for ; Mon, 3 Jun 2024 22:46:53 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 36C7E13CFA3; Mon, 3 Jun 2024 22:46:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="2YEB+x82" Received: from mail-qt1-f171.google.com (mail-qt1-f171.google.com [209.85.160.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 31BB31877 for ; Mon, 3 Jun 2024 22:46:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.160.171 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717454799; cv=none; b=fO9verYxuei6zBcaCbKtTFBWJ/YOkOvd8KItt/IKHssqb2zAxYeYBtmDTKo9GNQrHeTubkerhrJ3EFuW+vI/yY9avPffdh+1q9hKxmnd6hIR6pz4/qmrOmKwNnyEgGqf7lCCOAYGZkBJkb8ZUjCfjDxFv3xxWCYsgR4rX8O9fNU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717454799; c=relaxed/simple; bh=0SDC8BkbIu6webNUXhyFkkBsSVcdj+Sq6u/l89WhDUU=; h=MIME-Version:References:In-Reply-To:From:Date:Message-ID:Subject: To:Cc:Content-Type; b=JrKeRcdDK/tX7jHWU2MS250TyZuaWDDGjXEsTe9xK7RgROJgEJIOtNR7qLgBlQr0Ft2XsSWS24K8pqOdWGTITPNFdQJeYSQrvU5AXLFUNnpz4Pjxdl3O967lu8k4X542axF288cH0o7WL9PfnlB8Tg6Dy3u9FEayJozTacovsjQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=2YEB+x82; arc=none smtp.client-ip=209.85.160.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=google.com Received: by mail-qt1-f171.google.com with SMTP id d75a77b69052e-43dfe020675so119701cf.0 for ; Mon, 03 Jun 2024 15:46:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1717454796; x=1718059596; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=zwApvMmCxi6kRvhFpeZNrHVjs/fSYi202Vq8O0y7Pvc=; b=2YEB+x82iaPcRGgd758Wy0mcQv/zPOKnypfOsXwRqgtxaDBACdtrIJl9FVHN7nGTyB ev9S31pkK8Blef/fm/dVZ5QuTEmXM0RBQRF0tFNYF3D1fJ8EsNW5ww+rFKGaWTD6g5Fz az9/wdiJrTuQLUfBBMB/+9tMrh+WDScVdUwHEthj+IKVYorJQtKUYJLOXMmYDl5T5+p4 92iG8gHWXb794e/vtE7iyZZnFTFD+QDzoMPnzSFbIiqL3qRIyYOyDLok+SrYmnaYX66Y 1J0f383XKM8jo/SdztHum0F5latvbmw0xnnn6jZeBJN9V54Hw10OS40G9eqbjlzCcTxZ M9Fw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1717454796; x=1718059596; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=zwApvMmCxi6kRvhFpeZNrHVjs/fSYi202Vq8O0y7Pvc=; b=NRIjhSwOmFWuC8ilrI2L2yF4Gs8XAfyVcFnzoDWV3tC1R/lAMgbsg1rv8Rpo6qFfS8 wkUHaDh0fLEzFSxw/NPZepR9/6tbFMWEHNVbaDCiDIfKqC89N9hcp6MHq+rmVhbEWvIA IOcEfc/RfgyfunGSVdK2T2L3nUhTrz+r4mfrL4+0iciHhbyrEQFlHr6cNEZlhKFR0mVp moM893yrPyTkD7ZmydhUnvzRvEV+QK1vVL7h+CBbEwoytxnHOuM5YgdFXwNdjIzOW0mA lUyE+OSZr+0xIH7Wb83pEJVdZa9oNOXfZH40uzNDt8+nD3+xIbFG19ZK74HT5YTlATjQ GDQg== X-Forwarded-Encrypted: i=1; AJvYcCXZlZIt5Ahknf+YMTtBFjdmuQJER7VqTsQsuOhQDUEqFSwywZZQKDmgdvdyagt6bvPUlCQLW6NmuygCAioi3J3UH/uhCsSSMNPav4E+ X-Gm-Message-State: AOJu0YxhCyRLK2SkN9jJ3IaK8TyDncly3D6K3Ayn3/nSKYU1dZEtfkSr nAK3vXqOUg1KOjDf6pGeZLWubth2IXjV6ghwESEa3v3FCk7oDiTgl8kIKSATTZ0KQqJKXXEMTLB YNGlZYPeBAAftiCK4lRjTsmkivWn71lN1RihE X-Received: by 2002:a05:622a:59ce:b0:43f:ff89:dfb9 with SMTP id d75a77b69052e-4401bd281f4mr1732131cf.6.1717454795966; Mon, 03 Jun 2024 15:46:35 -0700 (PDT) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 References: <20240529180510.2295118-1-jthoughton@google.com> <20240529180510.2295118-3-jthoughton@google.com> In-Reply-To: From: James Houghton Date: Mon, 3 Jun 2024 15:45:59 -0700 Message-ID: Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging To: Yu Zhao Cc: Sean Christopherson , Andrew Morton , Paolo Bonzini , Albert Ou , Ankit Agrawal , Anup Patel , Atish Patra , Axel Rasmussen , Bibo Mao , Catalin Marinas , David Matlack , David Rientjes , Huacai Chen , James Morse , Jonathan Corbet , Marc Zyngier , Michael Ellerman , Nicholas Piggin , Oliver Upton , Palmer Dabbelt , Paul Walmsley , Raghavendra Rao Ananta , Ryan Roberts , Shaoqin Huang , Shuah Khan , Suzuki K Poulose , Tianrui Zhao , Will Deacon , Zenghui Yu , kvm-riscv@lists.infradead.org, kvm@vger.kernel.org, kvmarm@lists.linux.dev, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mips@vger.kernel.org, linux-mm@kvack.org, linux-riscv@lists.infradead.org, linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, May 30, 2024 at 11:06=E2=80=AFPM Yu Zhao wrote: > > On Wed, May 29, 2024 at 7:08=E2=80=AFPM James Houghton wrote: > > > > Hi Yu, Sean, > > > > Perhaps I "simplified" this bit of the series a little bit too much. > > Being able to opportunistically do aging with KVM (even without > > setting the Kconfig) is valuable. > > > > IIUC, we have the following possibilities: > > - v4: aging with KVM is done if the new Kconfig is set. > > - v3: aging with KVM is always done. > > This is not true -- in v3, MGLRU only scans secondary MMUs if it can > be done locklessly on x86. It uses a bitmap to imply this requirement. > > > - v2: aging with KVM is done when the architecture reports that it can > > probably be done locklessly, set at KVM MMU init time. > > Not really -- it's only done if it can be done locklessly on both x86 and= arm64. > > > - Another possibility?: aging with KVM is only done exactly when it > > can be done locklessly (i.e., mmu_notifier_test/clear_young() called > > such that it will not grab any locks). > > This is exactly the case for v2. Thanks for clarifying; sorry for getting this wrong. > > > I like the v4 approach because: > > 1. We can choose whether or not to do aging with KVM no matter what > > architecture we're using (without requiring userspace be aware to > > disable the feature at runtime with sysfs to avoid regressing > > performance if they don't care about proactive reclaim). > > 2. If we check the new feature bit (0x8) in sysfs, we can know for > > sure if aging is meant to be working or not. The selftest changes I > > made won't work properly unless there is a way to be sure that aging > > is working with KVM. > > I'm not convinced, but it doesn't mean your point of view is invalid. > If you fully understand the implications of your design choice and > document them, I will not object. > > All optimizations in v2 were measured step by step. Even that bitmap, > which might be considered overengineered, brought a readily > measuarable 4% improvement in memcached throughput on Altra Max > swapping to Optane: > > Using the bitmap (64 KVM PTEs for each call) > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 > Latency p99 Latency p99.9 Latency KB/sec > -------------------------------------------------------------------------= --------------------------------------------------- > Sets 0.00 --- --- --- > --- --- --- 0.00 > Gets 1012801.92 431436.92 14965.11 0.06246 > 0.04700 0.16700 4.31900 39635.83 > Waits 0.00 --- --- --- > --- --- --- --- > Totals 1012801.92 431436.92 14965.11 0.06246 > 0.04700 0.16700 4.31900 39635.83 > > > Not using the bitmap (1 KVM PTEs for each call) > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 > Latency p99 Latency p99.9 Latency KB/sec > -------------------------------------------------------------------------= --------------------------------------------------- > Sets 0.00 --- --- --- > --- --- --- 0.00 > Gets 968210.02 412443.85 14303.89 0.06517 > 0.04700 0.15900 7.42300 37890.74 > Waits 0.00 --- --- --- > --- --- --- --- > Totals 968210.02 412443.85 14303.89 0.06517 > 0.04700 0.15900 7.42300 37890.74 > > > FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached. > > What I don't think is acceptable is simplifying those optimizations > out without documenting your justifications (I would even call it a > design change, rather than simplification, from v3 to v4). I'll put back something similar to what you had before (like a test_clear_young() with a "fast" parameter instead of "bitmap"). I like the idea of having a new mmu notifier, like fast_test_clear_young(), while leaving test_young() and clear_young() unchanged (where "fast" means "prioritize speed over accuracy"). It seems a little more straightforward that way. > > > For look-around at eviction time: > > - v4: done if the main mm PTE was young and no MMU notifiers are subscr= ibed. > > - v2/v3: done if the main mm PTE was young or (the SPTE was young and > > the MMU notifier was lockless/fast). > > The host and secondary MMUs are two *independent* cases, IMO: > 1. lookaround the host MMU if the PTE mapping the folio under reclaim is = young. > 2. lookaround the secondary MMU if it can be done locklessly. > > So the v2/v3 behavior sounds a lot more reasonable to me. I'll restore the v2/v3 behavior. I initially removed it because, without batching, we (mostly) lose the spatial locality that, IIUC, look-around is designed to exploit. > > Also a nit -- don't use 'else' in the following case (should_look_around(= )): > > if (foo) > return bar; > else > do_something(); Oh, yes, sorry. I wrote and rewrote should_look_around() quite a few times while trying to figure out what made sense in a no-batching series. I'll fix this. > > > I made this logic change as part of removing batching. > > > > I'd really appreciate guidance on what the correct thing to do is. > > > > In my mind, what would work great is: by default, do aging exactly > > when KVM can do it locklessly, and then have a Kconfig to always have > > MGLRU to do aging with KVM if a user really cares about proactive > > reclaim (when the feature bit is set). The selftest can check the > > Kconfig + feature bit to know for sure if aging will be done. > > I still don't see how that Kconfig helps. Or why the new static branch > isn't enough? Without a special Kconfig, the feature bit just tells us that aging with KVM is possible, not that it will necessarily be done. For the self-test, it'd be good to know exactly when aging is being done or not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would help make the self-test set the right expectations for aging. The Kconfig would also allow a user to know that, no matter what, we're going to get correct age data for VMs, even if, say, we're using the shadow MMU. This is somewhat important for me/Google Cloud. Is that reasonable? Maybe there's a better solution.