Date: Wed, 27 Mar 2024 14:30:59 -0700
Message-ID: <20240327213108.2384666-1-yuanchu@google.com>
Subject: [RFC PATCH v3 0/8] mm: workingset reporting
From: Yuanchu Xie
To: David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz, Henry Huang,
    Yu Zhao, Dan Williams, Gregory Price, Huang Ying
Cc: Wei Xu, David Rientjes, Greg Kroah-Hartman, "Rafael J. Wysocki",
    Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
    Muchun Song, Shuah Khan, Yosry Ahmed, Matthew Wilcox,
    Sudarshan Rajagopalan, Kairui Song, "Michael S. Tsirkin",
    Vasily Averin, Nhat Pham, Miaohe Lin, Qi Zheng, Abel Wu,
    "Vishal Moola (Oracle)", Kefeng Wang, Yuanchu Xie,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org

This patch series provides workingset reporting of user pages in
lruvecs, whose coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or the userspace. IMO, the
kernel should provide a set of workingset interfaces that is generic
enough to accommodate the various use cases and extensible to potential
future use cases. The currently proposed interfaces are not sufficient
in that regard, but I would like to start somewhere, solicit feedback,
and iterate.

Use cases
==========
Job scheduling
For data center machines, workingset information allows the job
scheduler to right-size each job and land more jobs on the same host or
NUMA node. In the case of a job with an increasing workingset, policy
decisions can be made to migrate other jobs off the host/NUMA node, or
to oom-kill the misbehaving job. If the job shape is very different from
the machine shape, knowing the workingset per-node can also help inform
page allocation policies.

Proactive reclaim
Workingset information allows a container manager to proactively
reclaim memory without impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed too
much, workingset reporting enables the policy to be more accurate and
flexible.

Ballooning (similar to proactive reclaim)
While this patch series does not extend the virtio-balloon device,
balloon policies benefit from workingset to more precisely determine the
size of the memory balloon. On desktops/laptops/mobile devices where
memory is scarce and overcommitted, the balloon sizing in multiple VMs
running on the same device can be orchestrated with workingset reports
from each one.

Promotion/Demotion
Similar to proactive reclaim, a workingset report enables demotion to a
slower tier of memory.
For promotion, the workingset report interfaces need to be extended to
report hotness and gather hotness information from the devices[1].

[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1

Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea here is we break down the workingset per-node per-memcg into time
intervals (ms), e.g.

1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892

I realize this does not generalize well to hotness information, but I
lack the intuition for an abstraction that presents hotness in a useful
way. Based on a recent proposal for move_phys_pages[2], it seems like
userspace tiering software would like to move specific physical pages,
instead of informing the kernel "move x number of hot pages to y
device". Please advise.

[2]
https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/

Implementation
==========
Currently, the reporting of user pages is based off of MGLRU, and
therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU
generations for a more fine-grained workingset report. I will make the
generation count configurable in the next version. The workingset
reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the
aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.
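
To make the report format under "Sysfs and Cgroup Interfaces" more
concrete, here is a minimal userspace sketch in Python that parses the
example text and sums what sits beyond an idle-time threshold, roughly
the question a proactive reclaim policy might ask. It is not part of the
series. The names Bin, parse_report and colder_than are hypothetical,
and two things are assumptions on my part: that the first field of each
line is the upper bound of its interval in ms (with the previous line's
bound as the lower bound), and that the anon/file values can be treated
as opaque counts, since this cover letter does not state their units.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bin:
    upper_ms: int  # assumed: upper bound of the idle-time interval, in ms
    anon: int      # anon amount in this interval (units not stated here)
    file: int      # file amount in this interval (units not stated here)


def parse_report(text: str) -> List[Bin]:
    """Parse lines of the form '<ms> anon=<n> file=<n>' into bins."""
    bins = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts = dict(f.split("=", 1) for f in fields[1:])
        bins.append(Bin(int(fields[0]), int(counts["anon"]), int(counts["file"])))
    return bins


def colder_than(bins: List[Bin], threshold_ms: int) -> Tuple[int, int]:
    """Sum (anon, file) over bins whose assumed lower bound >= threshold_ms."""
    anon = file = 0
    lower = 0  # the first bin is assumed to start at 0 ms
    for b in bins:
        if lower >= threshold_ms:
            anon += b.anon
            file += b.file
        lower = b.upper_ms
    return anon, file


# The example report from this cover letter.
SAMPLE = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""

bins = parse_report(SAMPLE)
cold_anon, cold_file = colder_than(bins, 30000)
print(f"idle >= 30000 ms: anon={cold_anon} file={cold_file}")
```

The last line of the example uses 9223372036854775807 (the largest
signed 64-bit value) as its bound, so the sketch simply treats it as a
catch-all bin for everything older than the previous bound.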

--
Changes from RFC v2 -> RFC v3:
- Update to v6.8
- Added an aging kernel thread (gated behind config)
- Added basic selftests for sysfs interface files
- Track swapped out pages for reaccesses
- Refactoring and cleanup
- Dropped the virtio-balloon extension to make things manageable

Changes from RFC v1 -> RFC v2:
- Refactored the patches into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon

[rfc v1]
https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com/
[rfc v2]
https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/

Yuanchu Xie (8):
  mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true
  mm: aggregate working set information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend working set reporting to memcgs
  mm: add per-memcg reaccess histogram
  mm: add kernel aging thread for workingset reporting
  mm: test system-wide workingset reporting

 drivers/base/node.c                           |   3 +
 include/linux/memcontrol.h                    |   5 +
 include/linux/mmzone.h                        |   4 +
 include/linux/workingset_report.h             | 107 +++
 mm/Kconfig                                    |  15 +
 mm/Makefile                                   |   2 +
 mm/internal.h                                 |  45 ++
 mm/memcontrol.c                               | 386 ++++++++-
 mm/mmzone.c                                   |   2 +
 mm/vmscan.c                                   |  95 ++-
 mm/workingset.c                               |   9 +-
 mm/workingset_report.c                        | 757 ++++++++++++++++++
 mm/workingset_report_aging.c                  | 127 +++
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   3 +
 .../testing/selftests/mm/workingset_report.c  | 315 ++++++++
 .../testing/selftests/mm/workingset_report.h  |  37 +
 .../selftests/mm/workingset_report_test.c     | 328 ++++++++
 18 files changed, 2231 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c

--
2.44.0.396.g6e790dbe36-goog