Date: Thu, 10 Feb 2022 15:41:57 -0500
From: Johannes Weiner
To: Yu Zhao
Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen, Aneesh Kumar,
	Barry Song <21cnbao@gmail.com>, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	page-reclaim@google.com, x86@kernel.org, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh
Subject: Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
References: <20220208081902.3550911-1-yuzhao@google.com>
	<20220208081902.3550911-5-yuzhao@google.com>
In-Reply-To: <20220208081902.3550911-5-yuzhao@google.com>

Hello Yu,

On Tue, Feb 08, 2022 at 01:18:54AM -0700, Yu Zhao wrote:
> @@ -92,11 +92,196 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
>  	return lru;
>  }
>  
> +#ifdef CONFIG_LRU_GEN
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +	return true;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +	return current->in_lru_fault;
> +}
> +
> +static inline int lru_gen_from_seq(unsigned long seq)
> +{
> +	return seq % MAX_NR_GENS;
> +}
> +
> +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> +{
> +	unsigned long max_seq = lruvec->lrugen.max_seq;
> +
> +	VM_BUG_ON(gen >= MAX_NR_GENS);
> +
> +	/* see the comment on MIN_NR_GENS */
> +	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> +}

I'm still reading the series, so correct me if I'm wrong: the "active"
set is split into two generations for the sole purpose of the
second-chance policy for fresh faults, right?

If so, it'd be better to have the comment here instead of down by
MIN_NR_GENS. This is the place that defines what "active" is, so this
is where the reader asks what it means and what it implies. The
definition of MIN_NR_GENS can be briefer: "need at least two for
second chance, see lru_gen_is_active() for details".

> +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> +				       int zone, long delta)
> +{
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +	lockdep_assert_held(&lruvec->lru_lock);
> +	WARN_ON_ONCE(delta != (int)delta);
> +
> +	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> +	__mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> +}

This is a duplicate of update_lru_size(), please use that instead.

Yeah technically you don't need the mem_cgroup_update_lru_size() but
that's not worth sweating over, better to keep it simple.
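Untested, but roughly the shape I have in mind - assuming
update_lru_size() from earlier in mm_inline.h is usable at this point,
either keep a thin wrapper for the assertions or just call it from the
sites directly:

	static inline void lru_gen_update_size(struct lruvec *lruvec,
					       enum lru_list lru, int zone,
					       long delta)
	{
		lockdep_assert_held(&lruvec->lru_lock);
		WARN_ON_ONCE(delta != (int)delta);

		/*
		 * update_lru_size() covers the node and zone counters; the
		 * memcg LRU size update it also does is harmless here.
		 */
		update_lru_size(lruvec, lru, zone, delta);
	}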
> +static inline void lru_gen_balance_size(struct lruvec *lruvec, struct folio *folio,
> +					int old_gen, int new_gen)

lru_gen_update_lru_sizes() for this one would be more descriptive imo
and in line with update_lru_size() that it's built on.

> +{
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	int delta = folio_nr_pages(folio);
> +	enum lru_list lru = type * LRU_INACTIVE_FILE;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
> +	VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
> +	VM_BUG_ON(old_gen == -1 && new_gen == -1);

Could be a bit easier to read quickly with high-level descriptions:

> +	if (old_gen >= 0)
> +		WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
> +			   lrugen->nr_pages[old_gen][type][zone] - delta);
> +	if (new_gen >= 0)
> +		WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
> +			   lrugen->nr_pages[new_gen][type][zone] + delta);
> +

	/* Addition */

> +	if (old_gen < 0) {
> +		if (lru_gen_is_active(lruvec, new_gen))
> +			lru += LRU_ACTIVE;
> +		lru_gen_update_size(lruvec, lru, zone, delta);
> +		return;
> +	}
> +

	/* Removal */

> +	if (new_gen < 0) {
> +		if (lru_gen_is_active(lruvec, old_gen))
> +			lru += LRU_ACTIVE;
> +		lru_gen_update_size(lruvec, lru, zone, -delta);
> +		return;
> +	}
> +

	/* Promotion */

> +	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
> +		lru_gen_update_size(lruvec, lru, zone, -delta);
> +		lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, delta);
> +	}
> +
> +	/* Promotion is legit while a page is on an LRU list, but demotion isn't. */

	/* Demotion happens during aging when pages are isolated, never on-LRU */

> +	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
> +}

On that note, please move introduction of the promotion and demotion
bits to the next patch. They aren't used here yet, and I spent some
time jumping around patches to verify the promotion callers and
confirm the validity of the BUG_ON.

> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +	int gen;
> +	unsigned long old_flags, new_flags;
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	if (folio_test_unevictable(folio) || !lrugen->enabled)
> +		return false;

These two checks should be in the callsite and the function should
return void. Otherwise you can't understand the callsite without
drilling down into lrugen code, even if lrugen is disabled.
folio_add_lru() gets it right.

> +	/*
> +	 * There are three common cases for this page:
> +	 * 1) If it shouldn't be evicted, e.g., it was just faulted in, add it
> +	 * to the youngest generation.

"shouldn't be evicted" makes it sound like mlock. But they should
just be evicted last, right? Maybe:

	/*
	 * Pages start in different generations depending on
	 * advance knowledge we have about their hotness and
	 * evictability:
	 *
	 * 1. Already active pages start out youngest. This can be
	 *    fresh faults, or refaults of previously hot pages.
	 * 2. Cold pages that require writeback before becoming
	 *    evictable start on the second oldest generation.
	 * 3. Everything else (clean, cold) starts old.
	 */

On that note, I think #1 is reintroducing a problem we have fixed
before, which is trashing the workingset with a flood of use-once
mmapped pages. It's the classic scenario where LFU beats LRU.

Mapped streaming IO isn't very common, but it does happen.
See these commits:

dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
31c0569c3b0b6cc8a867ac6665ca081553f7984c
645747462435d84c6c6a64269ed49cc3015f753d

From the changelog:

    The used-once mapped file page detection patchset.

    It is meant to help workloads with large amounts of shortly used
    file mappings, like rtorrent hashing a file or git when dealing
    with loose objects (git gc on a bigger site?).

    Right now, the VM activates referenced mapped file pages on first
    encounter on the inactive list and it takes a full memory cycle to
    reclaim them again. When those pages dominate memory, the system
    no longer has a meaningful notion of 'working set' and is required
    to give up the active list to make reclaim progress. Obviously,
    this results in rather bad scanning latencies and the wrong pages
    being reclaimed.

    This patch makes the VM be more careful about activating mapped
    file pages in the first place. The minimum granted lifetime
    without another memory access becomes an inactive list cycle
    instead of the full memory cycle, which is more natural given the
    mentioned loads.

Translating this to multigen, it seems fresh faults should really
start on the second oldest rather than on the youngest generation, to
get a second chance but without jeopardizing the workingset if they
don't take it.

> +	 * 2) If it can't be evicted immediately, i.e., it's an anon page and
> +	 * not in swapcache, or a dirty page pending writeback, add it to the
> +	 * second oldest generation.
> +	 * 3) If it may be evicted immediately, e.g., it's a clean page, add it
> +	 * to the oldest generation.
> +	 */
> +	if (folio_test_active(folio))
> +		gen = lru_gen_from_seq(lrugen->max_seq);
> +	else if ((!type && !folio_test_swapcache(folio)) ||
> +		 (folio_test_reclaim(folio) &&
> +		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
> +		gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
> +	else
> +		gen = lru_gen_from_seq(lrugen->min_seq[type]);

Condition #2 is not quite clear to me, and the comment is incomplete:
The code does put dirty/writeback pages on the oldest gen as long as
they haven't been marked for immediate reclaim by the scanner
yet. HOWEVER, once the scanner does see those pages and sets
PG_reclaim, it will also activate them to move them out of the way
until writeback finishes (see shrink_page_list()) - at which point
we'll trigger #1. So that second part of #2 appears unreachable.

It could be a good exercise to describe how cache pages move through
the generations, similar to the comment on lru_deactivate_file_fn().
It's a good example of intent vs implementation.

On another note, "!type" meaning "anon" is a bit rough. Please follow
the "bool file" convention used elsewhere.

> @@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
>  {
>  	enum lru_list lru = folio_lru_list(folio);
>  
> +	if (lru_gen_add_folio(lruvec, folio, true))
> +		return;
> +

bool parameters are notoriously hard to follow in the callsite. Can
you please add lru_gen_add_folio_tail() instead and have them use a
common helper?
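Untested, and the names are only meant to illustrate the shape: one
tail-aware internal helper plus two trivial wrappers, so the callsites
read naturally:

	static __always_inline void __lru_gen_add_folio(struct lruvec *lruvec,
							struct folio *folio,
							bool tail)
	{
		/* current body of lru_gen_add_folio(), minus the callsite checks */
	}

	static __always_inline void lru_gen_add_folio(struct lruvec *lruvec,
						      struct folio *folio)
	{
		__lru_gen_add_folio(lruvec, folio, false);
	}

	static __always_inline void lru_gen_add_folio_tail(struct lruvec *lruvec,
							   struct folio *folio)
	{
		__lru_gen_add_folio(lruvec, folio, true);
	}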
> @@ -127,6 +315,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
>  static __always_inline
>  void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
>  {
> +	if (lru_gen_del_folio(lruvec, folio, false))
> +		return;
> +
>  	list_del(&folio->lru);
>  	update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
>  			-folio_nr_pages(folio));
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aed44e9b5d89..0f5e8a995781 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -303,6 +303,78 @@ enum lruvec_flags {
>  	 */
>  };
>  
> +struct lruvec;
> +
> +#define LRU_GEN_MASK	((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> +#define LRU_REFS_MASK	((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> +
> +#ifdef CONFIG_LRU_GEN
> +
> +#define MIN_LRU_BATCH		BITS_PER_LONG
> +#define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)

Those two aren't used in this patch, so it's hard to say whether they
are chosen correctly.

> + * Evictable pages are divided into multiple generations. The youngest and the
> + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> + * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
> + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> + * one of lrugen->lists[]. Otherwise it stores 0.
> + *
> + * A page is added to the youngest generation on faulting. The aging needs to
> + * check the accessed bit at least twice before handing this page over to the
> + * eviction. The first check takes care of the accessed bit set on the initial
> + * fault; the second check makes sure this page hasn't been used since then.
> + * This process, AKA second chance, requires a minimum of two generations,
> + * hence MIN_NR_GENS. And to be compatible with the active/inactive LRU, these
> + * two generations are mapped to the active; the rest of generations, if they
> + * exist, are mapped to the inactive. PG_active is always cleared while a page
> + * is on one of lrugen->lists[] so that demotion, which happens consequently
> + * when the aging produces a new generation, needs not to worry about it.
> + */
> +#define MIN_NR_GENS		2U
> +#define MAX_NR_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
> +
> +struct lru_gen_struct {

struct lrugen?

In fact, "lrugen" for the general function and variable namespace
might be better, the _ doesn't seem to pull its weight.

CONFIG_LRUGEN
struct lrugen
lrugen_foo()
etc.

> +	/* the aging increments the youngest generation number */
> +	unsigned long max_seq;
> +	/* the eviction increments the oldest generation numbers */
> +	unsigned long min_seq[ANON_AND_FILE];

The singular max_seq vs the split min_seq raises questions. Please
add a comment that explains or points to an explanation.

> +	/* the birth time of each generation in jiffies */
> +	unsigned long timestamps[MAX_NR_GENS];

This isn't in use until the thrashing-based OOM killing patch.

> +	/* the multigenerational LRU lists */
> +	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +	/* the sizes of the above lists */
> +	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +	/* whether the multigenerational LRU is enabled */
> +	bool enabled;

Not (really) in use until the runtime switch. Best to keep everybody
checking the global flag for now, and have the runtime switch patch
introduce this flag and switch necessary callsites over.
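To illustrate what I mean by the callsites (untested sketch, based on
the existing lruvec_add_folio() and assuming the reworked void
lru_gen_add_folio() from above; the unevictable test carried over from
the current lrugen check):

	static __always_inline
	void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
	{
		enum lru_list lru = folio_lru_list(folio);

		/* lru_gen_enabled() is the compile-time switch for now */
		if (lru_gen_enabled() && !folio_test_unevictable(folio)) {
			lru_gen_add_folio(lruvec, folio);
			return;
		}

		update_lru_size(lruvec, lru, folio_zonenum(folio),
				folio_nr_pages(folio));
		list_add(&folio->lru, &lruvec->lists[lru]);
	}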
> +void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);

"state" is what we usually init :) How about lrugen_init_lruvec()?

You can drop the memcg parameter and use lruvec_memcg().

> +#ifdef CONFIG_MEMCG
> +void lru_gen_init_memcg(struct mem_cgroup *memcg);
> +void lru_gen_free_memcg(struct mem_cgroup *memcg);

This should be either init+exit, or alloc+free.

Thanks