Date: Thu, 9 Jun 2022 20:34:45 +0800
Subject: Re: [PATCH v11 06/14] mm: multi-gen LRU: minimal implementation
From: zhong jiang <zhongjiang-ali@linux.alibaba.com>
To: Yu Zhao, Andrew Morton, linux-mm@kvack.org
Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe, Johannes Weiner, Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport, Peter Zijlstra, Tejun Heo, Vlastimil Babka, Will Deacon, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org, page-reclaim@google.com, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr, Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai, Sofia Trinh, Vaibhav Jain
References: <20220518014632.922072-1-yuzhao@google.com> <20220518014632.922072-7-yuzhao@google.com>
In-Reply-To: <20220518014632.922072-7-yuzhao@google.com>

On 2022/5/18 9:46 AM, Yu Zhao wrote:
> To avoid confusion, the terms "promotion" and "demotion" will be > applied to the multi-gen LRU, as a new convention; the terms > "activation" and "deactivation" will be applied to the active/inactive > LRU, as usual. > > The aging produces young generations.
Given an lruvec, it increments > max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging > promotes hot pages to the youngest generation when it finds them > accessed through page tables; the demotion of cold pages happens > consequently when it increments max_seq. The aging has the complexity > O(nr_hot_pages), since it is only interested in hot pages. Promotion > in the aging path does not involve any LRU list operations, only the > updates of the gen counter and lrugen->nr_pages[]; demotion, unless as > the result of the increment of max_seq, requires LRU list operations, > e.g., lru_deactivate_fn(). > > The eviction consumes old generations. Given an lruvec, it increments > min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A > feedback loop modeled after the PID controller monitors refaults over > anon and file types and decides which type to evict when both types > are available from the same generation. > > Each generation is divided into multiple tiers. Tiers represent > different ranges of numbers of accesses through file descriptors. A > page accessed N times through file descriptors is in tier > order_base_2(N). Tiers do not have dedicated lrugen->lists[], only > bits in folio->flags. In contrast to moving across generations, which > requires the LRU lock, moving across tiers only involves operations on > folio->flags. The feedback loop also monitors refaults over all tiers > and decides when to protect pages in which tiers (N>1), using the > first tier (N=0,1) as a baseline. The first tier contains single-use > unmapped clean pages, which are most likely the best choices. The > eviction moves a page to the next generation, i.e., min_seq+1, if the > feedback loop decides so. This approach has the following advantages: > 1. It removes the cost of activation in the buffered access path by > inferring whether pages accessed multiple times through file > descriptors are statistically hot and thus worth protecting in the > eviction path. > 2. It takes pages accessed through page tables into account and avoids > overprotecting pages accessed multiple times through file > descriptors. (Pages accessed through page tables are in the first > tier, since N=0.) > 3. More tiers provide better protection for pages accessed more than > twice through file descriptors, when under heavy buffered I/O > workloads. > > Server benchmark results: > Single workload: > fio (buffered I/O): +[40, 42]% > IOPS BW > 5.18-rc1: 2463k 9621MiB/s > patch1-6: 3484k 13.3GiB/s > > Single workload: > memcached (anon): +[44, 46]% > Ops/sec KB/sec > 5.18-rc1: 771403.27 30004.17 > patch1-6: 1120643.70 43588.06 > > Configurations: > CPU: two Xeon 6154 > Mem: total 256G > > Node 1 was only used as a ram disk to reduce the variance in the > results. 
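Here is a minimal userspace sketch of the tier mapping described above (not the kernel code, just an illustration): a page accessed N times through file descriptors is in tier order_base_2(N); since only N-1 is stored in folio->flags (see folio_lru_refs()), lru_tier_from_refs() adds the one back. The order_base_2() below is a stand-in for the kernel macro and the loop bound is arbitrary.

#include <stdio.h>

/* stand-in for the kernel's order_base_2(): ceil(log2(n)) for n >= 1 */
static int order_base_2(unsigned long n)
{
	int order = 0;

	while ((1UL << order) < n)
		order++;
	return order;
}

int main(void)
{
	int refs;

	/* refs = accesses beyond PG_referenced, i.e. N-1, as folio_lru_refs() returns */
	for (refs = 0; refs <= 7; refs++)
		printf("N=%d accesses -> tier %d\n", refs + 1, order_base_2(refs + 1));

	return 0;
}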
> > patch drivers/block/brd.c < 99,100c99,100 > < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; > < page = alloc_page(gfp_flags); > --- > > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > > page = alloc_pages_node(1, gfp_flags, 0); > EOF > > cat >>/etc/systemd/system.conf < CPUAffinity=numa > NUMAPolicy=bind > NUMAMask=0 > EOF > > cat >>/etc/memcached.conf < -m 184320 > -s /var/run/memcached/memcached.sock > -a 0766 > -t 36 > -B binary > EOF > > cat fio.sh > modprobe brd rd_nr=1 rd_size=113246208 > swapoff -a > mkfs.ext4 /dev/ram0 > mount -t ext4 /dev/ram0 /mnt > > mkdir /sys/fs/cgroup/user.slice/test > echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max > echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs > fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ > --buffered=1 --ioengine=io_uring --iodepth=128 \ > --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ > --rw=randread --random_distribution=random --norandommap \ > --time_based --ramp_time=10m --runtime=5m --group_reporting > > cat memcached.sh > modprobe brd rd_nr=1 rd_size=113246208 > swapoff -a > mkswap /dev/ram0 > swapon /dev/ram0 > > memtier_benchmark -S /var/run/memcached/memcached.sock \ > -P memcache_binary -n allkeys --key-minimum=1 \ > --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ > --ratio 1:0 --pipeline 8 -d 2000 > > memtier_benchmark -S /var/run/memcached/memcached.sock \ > -P memcache_binary -n allkeys --key-minimum=1 \ > --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ > --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed > > Client benchmark results: > kswapd profiles: > 5.18-rc1 > 40.53% page_vma_mapped_walk > 20.37% lzo1x_1_do_compress (real work) > 6.99% do_raw_spin_lock > 3.93% _raw_spin_unlock_irq > 2.08% vma_interval_tree_subtree_search > 2.06% vma_interval_tree_iter_next > 1.95% folio_referenced_one > 1.93% anon_vma_interval_tree_iter_first > 1.51% ptep_clear_flush > 1.35% __anon_vma_interval_tree_subtree_search > > patch1-6 > 35.99% lzo1x_1_do_compress (real work) > 19.40% page_vma_mapped_walk > 6.31% _raw_spin_unlock_irq > 3.95% do_raw_spin_lock > 2.39% anon_vma_interval_tree_iter_first > 2.25% ptep_clear_flush > 1.92% __anon_vma_interval_tree_subtree_search > 1.70% folio_referenced_one > 1.68% __zram_bvec_write > 1.43% anon_vma_interval_tree_iter_next > > Configurations: > CPU: single Snapdragon 7c > Mem: total 4G > > Chrome OS MemoryPressure [1] > > [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ > > Signed-off-by: Yu Zhao > Acked-by: Brian Geffon > Acked-by: Jan Alexander Steffens (heftig) > Acked-by: Oleksandr Natalenko > Acked-by: Steven Barrett > Acked-by: Suleiman Souhlal > Tested-by: Daniel Byrne > Tested-by: Donald Carr > Tested-by: Holger Hoffstätte > Tested-by: Konstantin Kharlamov > Tested-by: Shuang Zhai > Tested-by: Sofia Trinh > Tested-by: Vaibhav Jain > --- > include/linux/mm_inline.h | 36 ++ > include/linux/mmzone.h | 42 ++ > include/linux/page-flags-layout.h | 5 +- > kernel/bounds.c | 2 + > mm/Kconfig | 11 + > mm/swap.c | 39 ++ > mm/vmscan.c | 799 +++++++++++++++++++++++++++++- > mm/workingset.c | 110 +++- > 8 files changed, 1034 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h > index 98ae22bfaf12..85fe78832436 100644 > --- a/include/linux/mm_inline.h > +++ b/include/linux/mm_inline.h > @@ -119,6 +119,33 @@ static inline int lru_gen_from_seq(unsigned long seq) > return seq % MAX_NR_GENS; > } > > +static inline int lru_hist_from_seq(unsigned 
long seq) > +{ > + return seq % NR_HIST_GENS; > +} > + > +static inline int lru_tier_from_refs(int refs) > +{ > + VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH)); > + > + /* see the comment in folio_lru_refs() */ > + return order_base_2(refs + 1); > +} > + > +static inline int folio_lru_refs(struct folio *folio) > +{ > + unsigned long flags = READ_ONCE(folio->flags); > + bool workingset = flags & BIT(PG_workingset); > + > + /* > + * Return the number of accesses beyond PG_referenced, i.e., N-1 if the > + * total number of accesses is N>1, since N=0,1 both map to the first > + * tier. lru_tier_from_refs() will account for this off-by-one. Also see > + * the comment on MAX_NR_TIERS. > + */ > + return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset; > +} > + > static inline int folio_lru_gen(struct folio *folio) > { > unsigned long flags = READ_ONCE(folio->flags); > @@ -171,6 +198,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli > __update_lru_size(lruvec, lru, zone, -delta); > return; > } > + > + /* promotion */ > + if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) { > + __update_lru_size(lruvec, lru, zone, -delta); > + __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta); > + } > + > + /* demotion requires isolation, e.g., lru_deactivate_fn() */ > + VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen)); > } > > static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming) > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 6994acef63cb..2d023d243e73 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -348,6 +348,29 @@ enum lruvec_flags { > #define MIN_NR_GENS 2U > #define MAX_NR_GENS 4U > > +/* > + * Each generation is divided into multiple tiers. Tiers represent different > + * ranges of numbers of accesses through file descriptors. A page accessed N > + * times through file descriptors is in tier order_base_2(N). A page in the > + * first tier (N=0,1) is marked by PG_referenced unless it was faulted in > + * though page tables or read ahead. A page in any other tier (N>1) is marked > + * by PG_referenced and PG_workingset. This implies a minimum of two tiers is > + * supported without using additional bits in folio->flags. > + * > + * In contrast to moving across generations which requires the LRU lock, moving > + * across tiers only involves atomic operations on folio->flags and therefore > + * has a negligible cost in the buffered access path. In the eviction path, > + * comparisons of refaulted/(evicted+protected) from the first tier and the > + * rest infer whether pages accessed multiple times through file descriptors > + * are statistically hot and thus worth protecting. > + * > + * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the > + * number of categories of the active/inactive LRU when keeping track of > + * accesses through file descriptors. It uses MAX_NR_TIERS-2 spare bits in > + * folio->flags (LRU_REFS_MASK). 
> + */ > +#define MAX_NR_TIERS 4U > + > #ifndef __GENERATING_BOUNDS_H > > struct lruvec; > @@ -362,6 +385,16 @@ enum { > LRU_GEN_FILE, > }; > > +#define MIN_LRU_BATCH BITS_PER_LONG > +#define MAX_LRU_BATCH (MIN_LRU_BATCH * 128) > + > +/* whether to keep historical stats from evicted generations */ > +#ifdef CONFIG_LRU_GEN_STATS > +#define NR_HIST_GENS MAX_NR_GENS > +#else > +#define NR_HIST_GENS 1U > +#endif > + > /* > * The youngest generation number is stored in max_seq for both anon and file > * types as they are aged on an equal footing. The oldest generation numbers are > @@ -384,6 +417,15 @@ struct lru_gen_struct { > struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; > /* the sizes of the above lists */ > long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; > + /* the exponential moving average of refaulted */ > + unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS]; > + /* the exponential moving average of evicted+protected */ > + unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS]; > + /* the first tier doesn't need protection, hence the minus one */ > + unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1]; > + /* can be modified without holding the LRU lock */ > + atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; > + atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS]; > }; > > void lru_gen_init_lruvec(struct lruvec *lruvec); > diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h > index 240905407a18..7d79818dc065 100644 > --- a/include/linux/page-flags-layout.h > +++ b/include/linux/page-flags-layout.h > @@ -106,7 +106,10 @@ > #error "Not enough bits in page flags" > #endif > > -#define LRU_REFS_WIDTH 0 > +/* see the comment on MAX_NR_TIERS */ > +#define LRU_REFS_WIDTH min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \ > + ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \ > + NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH) > > #endif > #endif /* _LINUX_PAGE_FLAGS_LAYOUT */ > diff --git a/kernel/bounds.c b/kernel/bounds.c > index 5ee60777d8e4..b529182e8b04 100644 > --- a/kernel/bounds.c > +++ b/kernel/bounds.c > @@ -24,8 +24,10 @@ int main(void) > DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t)); > #ifdef CONFIG_LRU_GEN > DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1)); > + DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2); > #else > DEFINE(LRU_GEN_WIDTH, 0); > + DEFINE(__LRU_REFS_WIDTH, 0); > #endif > /* End of constants */ > > diff --git a/mm/Kconfig b/mm/Kconfig > index e62bd501082b..0aeacbd3361c 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -909,6 +909,7 @@ config ANON_VMA_NAME > area from being merged with adjacent virtual memory areas due to the > difference in their name. > > +# multi-gen LRU { > config LRU_GEN > bool "Multi-Gen LRU" > depends on MMU > @@ -917,6 +918,16 @@ config LRU_GEN > help > A high performance LRU implementation to overcommit memory. > > +config LRU_GEN_STATS > + bool "Full stats for debugging" > + depends on LRU_GEN > + help > + Do not enable this option unless you plan to look at historical stats > + from evicted generations for debugging purpose. > + > + This option has a per-memcg and per-node memory overhead. 
> +# } > + > source "mm/damon/Kconfig" > > endmenu > diff --git a/mm/swap.c b/mm/swap.c > index a6870ba0bd83..a99d22308f28 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -405,6 +405,40 @@ static void __lru_cache_activate_folio(struct folio *folio) > local_unlock(&lru_pvecs.lock); > } > > +#ifdef CONFIG_LRU_GEN > +static void folio_inc_refs(struct folio *folio) > +{ > + unsigned long new_flags, old_flags = READ_ONCE(folio->flags); > + > + if (folio_test_unevictable(folio)) > + return; > + > + if (!folio_test_referenced(folio)) { > + folio_set_referenced(folio); > + return; > + } > + > + if (!folio_test_workingset(folio)) { > + folio_set_workingset(folio); > + return; > + } > + > + /* see the comment on MAX_NR_TIERS */ > + do { > + new_flags = old_flags & LRU_REFS_MASK; > + if (new_flags == LRU_REFS_MASK) > + break; > + > + new_flags += BIT(LRU_REFS_PGOFF); > + new_flags |= old_flags & ~LRU_REFS_MASK; > + } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); > +} > +#else > +static void folio_inc_refs(struct folio *folio) > +{ > +} > +#endif /* CONFIG_LRU_GEN */ > + > /* > * Mark a page as having seen activity. > * > @@ -417,6 +451,11 @@ static void __lru_cache_activate_folio(struct folio *folio) > */ > void folio_mark_accessed(struct folio *folio) > { > + if (lru_gen_enabled()) { > + folio_inc_refs(folio); > + return; > + } > + > if (!folio_test_referenced(folio)) { > folio_set_referenced(folio); > } else if (folio_test_unevictable(folio)) { > diff --git a/mm/vmscan.c b/mm/vmscan.c > index b41ff9765cc7..891f0ab69b3a 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1275,9 +1275,11 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, > > if (folio_test_swapcache(folio)) { > swp_entry_t swap = folio_swap_entry(folio); > - mem_cgroup_swapout(folio, swap); > + > + /* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */ > if (reclaimed && !mapping_exiting(mapping)) > shadow = workingset_eviction(folio, target_memcg); > + mem_cgroup_swapout(folio, swap); > __delete_from_swap_cache(&folio->page, swap, shadow); > xa_unlock_irq(&mapping->i_pages); > put_swap_page(&folio->page, swap); > @@ -2649,6 +2651,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc) > unsigned long file; > struct lruvec *target_lruvec; > > + if (lru_gen_enabled()) > + return; > + > target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); > > /* > @@ -2974,6 +2979,17 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, > * shorthand helpers > ******************************************************************************/ > > +#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset)) > + > +#define DEFINE_MAX_SEQ(lruvec) \ > + unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq) > + > +#define DEFINE_MIN_SEQ(lruvec) \ > + unsigned long min_seq[ANON_AND_FILE] = { \ > + READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \ > + READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \ > + } > + > #define for_each_gen_type_zone(gen, type, zone) \ > for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \ > for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \ > @@ -2999,6 +3015,753 @@ static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int ni > return pgdat ? 
&pgdat->__lruvec : NULL; > } > > +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) > +{ > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > + struct pglist_data *pgdat = lruvec_pgdat(lruvec); > + > + if (!can_demote(pgdat->node_id, sc) && > + mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) > + return 0; > + > + return mem_cgroup_swappiness(memcg); > +} > + > +static int get_nr_gens(struct lruvec *lruvec, int type) > +{ > + return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1; > +} > + > +static bool __maybe_unused seq_is_valid(struct lruvec *lruvec) > +{ > + /* see the comment on lru_gen_struct */ > + return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS && > + get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) && > + get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS; > +} > + > +/****************************************************************************** > + * refault feedback loop > + ******************************************************************************/ > + > +/* > + * A feedback loop based on Proportional-Integral-Derivative (PID) controller. > + * > + * The P term is refaulted/(evicted+protected) from a tier in the generation > + * currently being evicted; the I term is the exponential moving average of the > + * P term over the generations previously evicted, using the smoothing factor > + * 1/2; the D term isn't supported. > + * > + * The setpoint (SP) is always the first tier of one type; the process variable > + * (PV) is either any tier of the other type or any other tier of the same > + * type. > + * > + * The error is the difference between the SP and the PV; the correction is > + * turn off protection when SP>PV or turn on protection when SP + * > + * For future optimizations: > + * 1. The D term may discount the other two terms over time so that long-lived > + * generations can resist stale information. > + */ > +struct ctrl_pos { > + unsigned long refaulted; > + unsigned long total; > + int gain; > +}; > + > +static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain, > + struct ctrl_pos *pos) > +{ > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + int hist = lru_hist_from_seq(lrugen->min_seq[type]); > + > + pos->refaulted = lrugen->avg_refaulted[type][tier] + > + atomic_long_read(&lrugen->refaulted[hist][type][tier]); > + pos->total = lrugen->avg_total[type][tier] + > + atomic_long_read(&lrugen->evicted[hist][type][tier]); > + if (tier) > + pos->total += lrugen->protected[hist][type][tier - 1]; > + pos->gain = gain; > +} > + > +static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover) > +{ > + int hist, tier; > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1; > + unsigned long seq = carryover ? 
lrugen->min_seq[type] : lrugen->max_seq + 1; > + > + lockdep_assert_held(&lruvec->lru_lock); > + > + if (!carryover && !clear) > + return; > + > + hist = lru_hist_from_seq(seq); > + > + for (tier = 0; tier < MAX_NR_TIERS; tier++) { > + if (carryover) { > + unsigned long sum; > + > + sum = lrugen->avg_refaulted[type][tier] + > + atomic_long_read(&lrugen->refaulted[hist][type][tier]); > + WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2); > + > + sum = lrugen->avg_total[type][tier] + > + atomic_long_read(&lrugen->evicted[hist][type][tier]); > + if (tier) > + sum += lrugen->protected[hist][type][tier - 1]; > + WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2); > + } > + > + if (clear) { > + atomic_long_set(&lrugen->refaulted[hist][type][tier], 0); > + atomic_long_set(&lrugen->evicted[hist][type][tier], 0); > + if (tier) > + WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0); > + } > + } > +} > + > +static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv) > +{ > + /* > + * Return true if the PV has a limited number of refaults or a lower > + * refaulted/total than the SP. > + */ > + return pv->refaulted < MIN_LRU_BATCH || > + pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <= > + (sp->refaulted + 1) * pv->total * pv->gain; > +} > + > +/****************************************************************************** > + * the aging > + ******************************************************************************/ > + > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming) > +{ > + int type = folio_is_file_lru(folio); > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); > + unsigned long new_flags, old_flags = READ_ONCE(folio->flags); > + > + VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio); > + > + do { > + new_gen = (old_gen + 1) % MAX_NR_GENS; > + > + new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS); > + new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF; > + /* for folio_end_writeback() */ > + if (reclaiming) > + new_flags |= BIT(PG_reclaim); > + } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags)); > + > + lru_gen_update_size(lruvec, folio, old_gen, new_gen); > + > + return new_gen; > +} > + > +static void inc_min_seq(struct lruvec *lruvec, int type) > +{ > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + > + reset_ctrl_pos(lruvec, type, true); > + WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); > +} > + > +static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) > +{ > + int gen, type, zone; > + bool success = false; > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + DEFINE_MIN_SEQ(lruvec); > + > + VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); > + > + /* find the oldest populated generation */ > + for (type = !can_swap; type < ANON_AND_FILE; type++) { > + while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) { > + gen = lru_gen_from_seq(min_seq[type]); > + > + for (zone = 0; zone < MAX_NR_ZONES; zone++) { > + if (!list_empty(&lrugen->lists[gen][type][zone])) > + goto next; > + } > + > + min_seq[type]++; > + } > +next: > + ; > + } > + > + /* see the comment on lru_gen_struct */ > + if (can_swap) { > + min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]); > + min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]); > + } > + > + for (type = !can_swap; type < ANON_AND_FILE; type++) { > + if (min_seq[type] == lrugen->min_seq[type]) > + continue; > + 
> + reset_ctrl_pos(lruvec, type, true); > + WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); > + success = true; > + } > + > + return success; > +} > + > +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap) > +{ > + int prev, next; > + int type, zone; > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + > + spin_lock_irq(&lruvec->lru_lock); > + > + VM_WARN_ON_ONCE(!seq_is_valid(lruvec)); > + > + if (max_seq != lrugen->max_seq) > + goto unlock; > + > + for (type = 0; type < ANON_AND_FILE; type++) { > + if (get_nr_gens(lruvec, type) != MAX_NR_GENS) > + continue; > + > + VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap); > + > + inc_min_seq(lruvec, type); > + } > + > + /* > + * Update the active/inactive LRU sizes for compatibility. Both sides of > + * the current max_seq need to be covered, since max_seq+1 can overlap > + * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do > + * overlap, cold/hot inversion happens. > + */ > + prev = lru_gen_from_seq(lrugen->max_seq - 1); > + next = lru_gen_from_seq(lrugen->max_seq + 1); > + > + for (type = 0; type < ANON_AND_FILE; type++) { > + for (zone = 0; zone < MAX_NR_ZONES; zone++) { > + enum lru_list lru = type * LRU_INACTIVE_FILE; > + long delta = lrugen->nr_pages[prev][type][zone] - > + lrugen->nr_pages[next][type][zone]; > + > + if (!delta) > + continue; > + > + __update_lru_size(lruvec, lru, zone, delta); > + __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta); > + } > + } > + > + for (type = 0; type < ANON_AND_FILE; type++) > + reset_ctrl_pos(lruvec, type, false); > + > + /* make sure preceding modifications appear */ > + smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1); > +unlock: > + spin_unlock_irq(&lruvec->lru_lock); > +} > + > +static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq, > + unsigned long *min_seq, bool can_swap, bool *need_aging) > +{ > + int gen, type, zone; > + long old = 0; > + long young = 0; > + long total = 0; > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + > + for (type = !can_swap; type < ANON_AND_FILE; type++) { > + unsigned long seq; > + > + for (seq = min_seq[type]; seq <= max_seq; seq++) { > + long size = 0; > + > + gen = lru_gen_from_seq(seq); > + > + for (zone = 0; zone < MAX_NR_ZONES; zone++) > + size += READ_ONCE(lrugen->nr_pages[gen][type][zone]); > + > + total += size; > + if (seq == max_seq) > + young += size; > + if (seq + MIN_NR_GENS == max_seq) > + old += size; > + } > + } > + > + /* > + * The aging tries to be lazy to reduce the overhead. On the other hand, > + * the eviction stalls when the number of generations reaches > + * MIN_NR_GENS. So ideally, there should be MIN_NR_GENS+1 generations, > + * hence the first two if's. > + * > + * Also it's ideal to spread pages out evenly, meaning 1/(MIN_NR_GENS+1) > + * of the total number of pages for each generation. A reasonable range > + * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The > + * eviction cares about the lower bound of cold pages, whereas the aging > + * cares about the upper bound of hot pages. > + */ > + if (min_seq[!can_swap] + MIN_NR_GENS > max_seq) > + *need_aging = true; > + else if (min_seq[!can_swap] + MIN_NR_GENS < max_seq) > + *need_aging = false; > + else if (young * MIN_NR_GENS > total) > + *need_aging = true; > + else if (old * (MIN_NR_GENS + 2) < total) > + *need_aging = true; > + else > + *need_aging = false; > + > + return total > 0 ? 
total : 0; > +} > + > +static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc) > +{ > + bool need_aging; > + long nr_to_scan; > + int swappiness = get_swappiness(lruvec, sc); > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > + DEFINE_MAX_SEQ(lruvec); > + DEFINE_MIN_SEQ(lruvec); > + > + VM_WARN_ON_ONCE(sc->memcg_low_reclaim); > + > + mem_cgroup_calculate_protection(NULL, memcg); > + > + if (mem_cgroup_below_min(memcg)) > + return; > + > + nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging); > + if (!nr_to_scan) > + return; > + > + nr_to_scan >>= sc->priority; > + > + if (!mem_cgroup_online(memcg)) > + nr_to_scan++; > + > + if (nr_to_scan && need_aging) > + inc_max_seq(lruvec, max_seq, swappiness); > +} > + > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) > +{ > + struct mem_cgroup *memcg; > + > + VM_WARN_ON_ONCE(!current_is_kswapd()); > + > + memcg = mem_cgroup_iter(NULL, NULL, NULL); > + do { > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > + > + age_lruvec(lruvec, sc); > + > + cond_resched(); > + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); > +} > + > +/****************************************************************************** > + * the eviction > + ******************************************************************************/ > + > +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx) > +{ > + bool success; > + int gen = folio_lru_gen(folio); > + int type = folio_is_file_lru(folio); > + int zone = folio_zonenum(folio); > + int delta = folio_nr_pages(folio); > + int refs = folio_lru_refs(folio); > + int tier = lru_tier_from_refs(refs); > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + > + VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio); > + > + /* unevictable */ > + if (!folio_evictable(folio)) { > + success = lru_gen_del_folio(lruvec, folio, true); > + VM_WARN_ON_ONCE_FOLIO(!success, folio); > + folio_set_unevictable(folio); > + lruvec_add_folio(lruvec, folio); > + __count_vm_events(UNEVICTABLE_PGCULLED, delta); > + return true; > + } > + > + /* dirtied lazyfree */ > + if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) { > + success = lru_gen_del_folio(lruvec, folio, true); > + VM_WARN_ON_ONCE_FOLIO(!success, folio); > + folio_set_swapbacked(folio); > + lruvec_add_folio_tail(lruvec, folio); > + return true; > + } > + > + /* protected */ > + if (tier > tier_idx) { > + int hist = lru_hist_from_seq(lrugen->min_seq[type]); > + > + gen = folio_inc_gen(lruvec, folio, false); > + list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]); > + > + WRITE_ONCE(lrugen->protected[hist][type][tier - 1], > + lrugen->protected[hist][type][tier - 1] + delta); > + __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta); > + return true; > + } > + > + /* waiting for writeback */ > + if (folio_test_locked(folio) || folio_test_writeback(folio) || > + (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > + gen = folio_inc_gen(lruvec, folio, true); > + list_move(&folio->lru, &lrugen->lists[gen][type][zone]); > + return true; > + } > + > + return false; > +} > + > +static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc) > +{ > + bool success; > + > + if (!sc->may_unmap && folio_mapped(folio)) > + return false; > + > + if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) && > + (folio_test_dirty(folio) || > + (folio_test_anon(folio) && !folio_test_swapcache(folio)))) > + return 
false; > + > + if (!folio_try_get(folio)) > + return false; > + > + if (!folio_test_clear_lru(folio)) { > + folio_put(folio); > + return false; > + } > + > + success = lru_gen_del_folio(lruvec, folio, true); > + VM_WARN_ON_ONCE_FOLIO(!success, folio); > + > + return true; > +} > + > +static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, > + int type, int tier, struct list_head *list) > +{ > + int gen, zone; > + enum vm_event_item item; > + int sorted = 0; > + int scanned = 0; > + int isolated = 0; > + int remaining = MAX_LRU_BATCH; > + struct lru_gen_struct *lrugen = &lruvec->lrugen; > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > + > + VM_WARN_ON_ONCE(!list_empty(list)); > + > + if (get_nr_gens(lruvec, type) == MIN_NR_GENS) > + return 0; > + > + gen = lru_gen_from_seq(lrugen->min_seq[type]); > + > + for (zone = sc->reclaim_idx; zone >= 0; zone--) { > + LIST_HEAD(moved); > + int skipped = 0; > + struct list_head *head = &lrugen->lists[gen][type][zone]; > + > + while (!list_empty(head)) { > + struct folio *folio = lru_to_folio(head); > + int delta = folio_nr_pages(folio); > + > + VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio); > + VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio); > + VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio); > + VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio); > + > + scanned += delta; > + > + if (sort_folio(lruvec, folio, tier)) > + sorted += delta; > + else if (isolate_folio(lruvec, folio, sc)) { > + list_add(&folio->lru, list); > + isolated += delta; > + } else { > + list_move(&folio->lru, &moved); > + skipped += delta; > + } > + > + if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH) > + break; > + } > + > + if (skipped) { > + list_splice(&moved, head); > + __count_zid_vm_events(PGSCAN_SKIP, zone, skipped); > + } > + > + if (!remaining || isolated >= MIN_LRU_BATCH) > + break; > + } > + > + item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; > + if (!cgroup_reclaim(sc)) { > + __count_vm_events(item, isolated); > + __count_vm_events(PGREFILL, sorted); > + } > + __count_memcg_events(memcg, item, isolated); > + __count_memcg_events(memcg, PGREFILL, sorted); > + __count_vm_events(PGSCAN_ANON + type, isolated); > + > + /* > + * There might not be eligible pages due to reclaim_idx, may_unmap and > + * may_writepage. Check the remaining to prevent livelock if there is no > + * progress. > + */ > + return isolated || !remaining ? scanned : 0; > +} > + > +static int get_tier_idx(struct lruvec *lruvec, int type) > +{ > + int tier; > + struct ctrl_pos sp, pv; > + > + /* > + * To leave a margin for fluctuations, use a larger gain factor (1:2). > + * This value is chosen because any other tier would have at least twice > + * as many refaults as the first tier. > + */ > + read_ctrl_pos(lruvec, type, 0, 1, &sp); > + for (tier = 1; tier < MAX_NR_TIERS; tier++) { > + read_ctrl_pos(lruvec, type, tier, 2, &pv); > + if (!positive_ctrl_err(&sp, &pv)) > + break; > + } > + > + return tier - 1; > +} > + > +static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx) > +{ > + int type, tier; > + struct ctrl_pos sp, pv; > + int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness }; > + > + /* > + * Compare the first tier of anon with that of file to determine which > + * type to scan. Also need to compare other tiers of the selected type > + * with the first tier of the other type to determine the last tier (of > + * the selected type) to evict. 
> + */ > + read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp); > + read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv); > + type = positive_ctrl_err(&sp, &pv); > + > + read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp); > + for (tier = 1; tier < MAX_NR_TIERS; tier++) { > + read_ctrl_pos(lruvec, type, tier, gain[type], &pv); > + if (!positive_ctrl_err(&sp, &pv)) > + break; > + } > + > + *tier_idx = tier - 1; > + > + return type; > +} > + > +static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness, > + int *type_scanned, struct list_head *list) > +{ > + int i; > + int type; > + int scanned; > + int tier = -1; > + DEFINE_MIN_SEQ(lruvec); > + > + /* > + * Try to make the obvious choice first. When anon and file are both > + * available from the same generation, interpret swappiness 1 as file > + * first and 200 as anon first. > + */ > + if (!swappiness) > + type = LRU_GEN_FILE; > + else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) > + type = LRU_GEN_ANON; > + else if (swappiness == 1) > + type = LRU_GEN_FILE; > + else if (swappiness == 200) > + type = LRU_GEN_ANON; > + else > + type = get_type_to_scan(lruvec, swappiness, &tier); > + > + for (i = !swappiness; i < ANON_AND_FILE; i++) { > + if (tier < 0) > + tier = get_tier_idx(lruvec, type); > + > + scanned = scan_folios(lruvec, sc, type, tier, list); > + if (scanned) > + break; > + > + type = !type; > + tier = -1; > + } > + > + *type_scanned = type; > + > + return scanned; > +} > + > +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness) > +{ > + int type; > + int scanned; > + int reclaimed; > + LIST_HEAD(list); > + struct folio *folio; > + enum vm_event_item item; > + struct reclaim_stat stat; > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > + struct pglist_data *pgdat = lruvec_pgdat(lruvec); > + > + spin_lock_irq(&lruvec->lru_lock); > + > + scanned = isolate_folios(lruvec, sc, swappiness, &type, &list); > + > + if (try_to_inc_min_seq(lruvec, swappiness)) > + scanned++; > + > + if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS) > + scanned = 0; > + > + spin_unlock_irq(&lruvec->lru_lock); > + > + if (list_empty(&list)) > + return scanned; > + > + reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false); > + > + /* > + * To avoid livelock, don't add rejected pages back to the same lists > + * they were isolated from. See lru_gen_add_folio(). > + */ > + list_for_each_entry(folio, &list, lru) { > + folio_clear_referenced(folio); > + folio_clear_workingset(folio); > + > + if (folio_test_reclaim(folio) && > + (folio_test_dirty(folio) || folio_test_writeback(folio))) > + folio_clear_active(folio); > + else > + folio_set_active(folio); > + } > + > + spin_lock_irq(&lruvec->lru_lock); > + > + move_pages_to_lru(lruvec, &list); > + > + item = current_is_kswapd() ? 
PGSTEAL_KSWAPD : PGSTEAL_DIRECT; > + if (!cgroup_reclaim(sc)) > + __count_vm_events(item, reclaimed); > + __count_memcg_events(memcg, item, reclaimed); > + __count_vm_events(PGSTEAL_ANON + type, reclaimed); > + > + spin_unlock_irq(&lruvec->lru_lock); > + > + mem_cgroup_uncharge_list(&list); > + free_unref_page_list(&list); > + > + sc->nr_reclaimed += reclaimed; > + > + return scanned; > +} > + > +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap) > +{ > + bool need_aging; > + long nr_to_scan; > + struct mem_cgroup *memcg = lruvec_memcg(lruvec); > + DEFINE_MAX_SEQ(lruvec); > + DEFINE_MIN_SEQ(lruvec); > + > + if (mem_cgroup_below_min(memcg) || > + (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim)) > + return 0; > + > + nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging); > + if (!nr_to_scan) > + return 0; > + > + /* reset the priority if the target has been met */ > + nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY; > + > + if (!mem_cgroup_online(memcg)) > + nr_to_scan++; > + > + if (!nr_to_scan) > + return 0; > + > + if (!need_aging) > + return nr_to_scan; > + > + /* leave the work to lru_gen_age_node() */ > + if (current_is_kswapd()) > + return 0; > + > + /* try other memcgs before going to the aging path */ > + if (!cgroup_reclaim(sc) && !sc->force_deactivate) { > + sc->skipped_deactivate = true; > + return 0; > + } > + > + inc_max_seq(lruvec, max_seq, can_swap); > + > + return nr_to_scan; > +} > + > +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) > +{ > + struct blk_plug plug; > + long scanned = 0; > + > + lru_add_drain(); > + > + blk_start_plug(&plug); > + > + while (true) { > + int delta; > + int swappiness; > + long nr_to_scan; > + > + if (sc->may_swap) > + swappiness = get_swappiness(lruvec, sc); > + else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc)) > + swappiness = 1; > + else > + swappiness = 0; > + > + nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness); > + if (!nr_to_scan) > + break; > + > + delta = evict_folios(lruvec, sc, swappiness); > + if (!delta) > + break; > + > + scanned += delta; > + if (scanned >= nr_to_scan) > + break; > + > + cond_resched(); > + } > + > + blk_finish_plug(&plug); > +} > + > /****************************************************************************** > * initialization > ******************************************************************************/ > @@ -3041,6 +3804,16 @@ static int __init init_lru_gen(void) > }; > late_initcall(init_lru_gen); > > +#else > + > +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) > +{ > +} > + > +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) > +{ > +} > + > #endif /* CONFIG_LRU_GEN */ > > static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) > @@ -3054,6 +3827,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) > struct blk_plug plug; > bool scan_adjusted; > > + if (lru_gen_enabled()) { > + lru_gen_shrink_lruvec(lruvec, sc); > + return; > + } > + > get_scan_count(lruvec, sc, nr); > > /* Record the original scan target for proportional adjustments later */ > @@ -3558,6 +4336,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) > struct lruvec *target_lruvec; > unsigned long refaults; > > + if (lru_gen_enabled()) > + return; > + > target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > refaults = 
lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); > target_lruvec->refaults[0] = refaults; > @@ -3922,12 +4703,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > } > #endif > > -static void age_active_anon(struct pglist_data *pgdat, > +static void kswapd_age_node(struct pglist_data *pgdat, > struct scan_control *sc) > { > struct mem_cgroup *memcg; > struct lruvec *lruvec; > > + if (lru_gen_enabled()) { > + lru_gen_age_node(pgdat, sc); > + return; > + } > + > if (!can_age_anon_pages(pgdat, sc)) > return; > > @@ -4247,12 +5033,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) > sc.may_swap = !nr_boost_reclaim; > > /* > - * Do some background aging of the anon list, to give > - * pages a chance to be referenced before reclaiming. All > - * pages are rotated regardless of classzone as this is > - * about consistent aging. > + * Do some background aging, to give pages a chance to be > + * referenced before reclaiming. All pages are rotated > - regardless of classzone as this is about consistent aging. > */ > - age_active_anon(pgdat, &sc); > + kswapd_age_node(pgdat, &sc); > > /* > * If we're getting trouble reclaiming, start doing writepage > diff --git a/mm/workingset.c b/mm/workingset.c > index 592569a8974c..db6f0c8a98c2 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly; > static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, > bool workingset) > { > - eviction >>= bucket_order; > eviction &= EVICTION_MASK; > eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; > eviction = (eviction << NODES_SHIFT) | pgdat->node_id; > @@ -212,10 +211,107 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > > *memcgidp = memcgid; > *pgdat = NODE_DATA(nid); > - *evictionp = entry << bucket_order; > + *evictionp = entry; > *workingsetp = workingset; > } > > +#ifdef CONFIG_LRU_GEN > + > +static void *lru_gen_eviction(struct folio *folio) > +{ > + int hist; > + unsigned long token; > + unsigned long min_seq; > + struct lruvec *lruvec; > + struct lru_gen_struct *lrugen; > + int type = folio_is_file_lru(folio); > + int delta = folio_nr_pages(folio); > + int refs = folio_lru_refs(folio); > + int tier = lru_tier_from_refs(refs); > + struct mem_cgroup *memcg = folio_memcg(folio); > + struct pglist_data *pgdat = folio_pgdat(folio); > + > + BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT); > + > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + lrugen = &lruvec->lrugen; > + min_seq = READ_ONCE(lrugen->min_seq[type]); > + token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0); > + > + hist = lru_hist_from_seq(min_seq); > + atomic_long_add(delta, &lrugen->evicted[hist][type][tier]); > + > + return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);

pack_shadow() is passed refs rather than PageWorkingset(page); is it done on purpose? In my opinion, PageWorkingset can be cleared when reclaim fails. (See the sketch at the end of this mail.)

> +} > + > +static void lru_gen_refault(struct folio *folio, void *shadow) > +{ > + int hist, tier, refs; > + int memcg_id; > + bool workingset; > + unsigned long token; > + unsigned long min_seq; > + struct lruvec *lruvec; > + struct lru_gen_struct *lrugen; > + struct mem_cgroup *memcg; > + struct pglist_data *pgdat; > + int type = folio_is_file_lru(folio); > + int delta = folio_nr_pages(folio); > + > + unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset); > + > + if (folio_pgdat(folio) != pgdat) > + return; > + > + /* see the comment in folio_lru_refs() */ > + refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset; > + tier = lru_tier_from_refs(refs); > + > + rcu_read_lock(); > + memcg = folio_memcg_rcu(folio); > + if (mem_cgroup_id(memcg) != memcg_id) > + goto unlock; > + > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + lrugen = &lruvec->lrugen; > + min_seq = READ_ONCE(lrugen->min_seq[type]); > + > + token >>= LRU_REFS_WIDTH; > + if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH))) > + goto unlock; > + > + hist = lru_hist_from_seq(min_seq); > + atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]); > + mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta); > + > + /* > + * Count the following two cases as stalls: > + * 1. For pages accessed through page tables, hotter pages pushed out > + * hot pages which refaulted immediately. > + * 2. For pages accessed through file descriptors, numbers of accesses > + * might have been beyond the limit. > + */ > + if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) { > + folio_set_workingset(folio); > + mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta); > + } > +unlock: > + rcu_read_unlock(); > +} > + > +#else > + > +static void *lru_gen_eviction(struct folio *folio) > +{ > + return NULL; > +} > + > +static void lru_gen_refault(struct folio *folio, void *shadow) > +{ > +} > + > +#endif /* CONFIG_LRU_GEN */ > + > /** > * workingset_age_nonresident - age non-resident entries as LRU ages > * @lruvec: the lruvec that was aged > @@ -264,10 +360,14 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg) > VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); > VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); > > + if (lru_gen_enabled()) > + return lru_gen_eviction(folio); > + > lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > /* XXX: target_memcg can be NULL, go through lruvec */ > memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); > eviction = atomic_long_read(&lruvec->nonresident_age); > + eviction >>= bucket_order; > workingset_age_nonresident(lruvec, folio_nr_pages(folio)); > return pack_shadow(memcgid, pgdat, eviction, > folio_test_workingset(folio)); > @@ -298,7 +398,13 @@ void workingset_refault(struct folio *folio, void *shadow) > int memcgid; > long nr; > > + if (lru_gen_enabled()) { > + lru_gen_refault(folio, shadow); > + return; > + } > + > unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); > + eviction <<= bucket_order; > > rcu_read_lock(); > /*
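Regarding my pack_shadow() question above, here is a minimal userspace sketch of my reading of the quoted code (not authoritative): passing refs as the "workingset" argument, together with the refs-1 stored in the token, appears to let lru_gen_refault() reconstruct the full refs count. LRU_REFS_WIDTH and min_seq below are made-up values for illustration only.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define LRU_REFS_WIDTH	2	/* assumed width, for illustration only */

int main(void)
{
	unsigned long min_seq = 5;	/* arbitrary example value */
	int refs;

	for (refs = 0; refs <= (1 << LRU_REFS_WIDTH); refs++) {
		/* eviction side, mirroring the quoted lru_gen_eviction() */
		unsigned long token = (min_seq << LRU_REFS_WIDTH) |
				      (unsigned long)(refs > 0 ? refs - 1 : 0);
		bool workingset = refs;	/* the last argument to pack_shadow() */

		/* refault side, mirroring the quoted lru_gen_refault() */
		int restored = (int)(token & ((1UL << LRU_REFS_WIDTH) - 1)) + workingset;

		printf("refs=%d token=%#lx workingset=%d -> restored refs=%d\n",
		       refs, token, (int)workingset, restored);
		assert(restored == refs);
	}

	return 0;
}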