The global MGLRU debugfs node shows every memcg's lru_gen info in
"lru_gen" or "lru_gen_full", and accepts commands written to "lru_gen".
But that output mixes all memcgs together, and a command must specify
the target memcg's ID.

This patchset adds a per-memcg "lru_gen" node. With it, we can read
each memcg's lru_gen info directly.
We can also write commands to control each memcg's lru_gen seq, but
this node does not support multiple commands; a memcg processes one
command per write.
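As a rough sketch of the intended usage (the cgroup mount point, memcg name,
and the "memory." file-name prefix below are assumptions, not part of this
series), the command strings would be composed and written one at a time:

```shell
# Sketch only: path and memcg name are hypothetical; the file name assumes
# the cftype "lru_gen" appears as "memory.lru_gen" under cgroup v2.
MEMCG="/sys/fs/cgroup/example"

# Compose one command string: '+' runs aging, '-' runs eviction.
lru_gen_cmd() {
    printf '%s %s %s' "$1" "$2" "$3"
}

# Age node 0 up to max_gen_nr 4, then evict node 0 down to min_gen_nr 2.
# A real run would redirect each string into "$MEMCG/memory.lru_gen".
lru_gen_cmd '+' 0 4; echo
lru_gen_cmd '-' 0 2; echo
```

Note that, unlike the debugfs node, only one such command can be written per
`write()` call.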
HuanYang (3):
mm: multi-gen LRU: fold lru_gen run cmd
mm: memcg: add per memcg "lru_gen" node
mm: multi-gen LRU: add per memcg "lru_gen" document
Documentation/admin-guide/mm/multigen_lru.rst | 10 ++
include/linux/mm_inline.h | 9 +
include/linux/mmzone.h | 4 +-
mm/memcontrol.c | 163 ++++++++++++++++++
mm/vmscan.c | 82 ++++++---
5 files changed, 246 insertions(+), 22 deletions(-)
--
2.34.1
From: HuanYang <[email protected]>
This patch moves the LRU_GEN DEFINE_MAX_SEQ/DEFINE_MIN_SEQ helpers into
mm_inline.h, where the per-memcg lru_gen node can use them.
It also factors the aging/eviction dispatch out into a helper function
that the per-memcg code will reuse.
Signed-off-by: HuanYang <[email protected]>
---
include/linux/mm_inline.h | 9 ++++++++
mm/vmscan.c | 45 +++++++++++++++++++++------------------
2 files changed, 33 insertions(+), 21 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 8148b30a9df1..b953b305c8a2 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -101,6 +101,15 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
#ifdef CONFIG_LRU_GEN
+#define DEFINE_MAX_SEQ(lruvec) \
+ unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec) \
+ unsigned long min_seq[ANON_AND_FILE] = { \
+ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \
+ READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \
+ }
+
#ifdef CONFIG_LRU_GEN_ENABLED
static inline bool lru_gen_enabled(void)
{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ea57a43ebd6b..f59977964e81 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3285,15 +3285,6 @@ static bool should_clear_pmd_young(void)
#define LRU_REFS_FLAGS (BIT(PG_referenced) | BIT(PG_workingset))
-#define DEFINE_MAX_SEQ(lruvec) \
- unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
-
-#define DEFINE_MIN_SEQ(lruvec) \
- unsigned long min_seq[ANON_AND_FILE] = { \
- READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]), \
- READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \
- }
-
#define for_each_gen_type_zone(gen, type, zone) \
for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
@@ -6058,6 +6049,29 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
return -EINTR;
}
+static int __process_one_cmd(char cmd, struct lruvec *lruvec, unsigned long seq,
+ struct scan_control *sc, int swappiness,
+ unsigned long opt)
+{
+ int err = -EINVAL; /* covers an unrecognized cmd */
+
+ if (swappiness < 0)
+ swappiness = get_swappiness(lruvec, sc);
+ else if (swappiness > 200)
+ return -EINVAL;
+
+ switch (cmd) {
+ case '+':
+ err = run_aging(lruvec, seq, sc, swappiness, opt);
+ break;
+ case '-':
+ err = run_eviction(lruvec, seq, sc, swappiness, opt);
+ break;
+ }
+
+ return err;
+}
+
static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
struct scan_control *sc, int swappiness, unsigned long opt)
{
@@ -6086,19 +6100,8 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
lruvec = get_lruvec(memcg, nid);
- if (swappiness < 0)
- swappiness = get_swappiness(lruvec, sc);
- else if (swappiness > 200)
- goto done;
+ err = __process_one_cmd(cmd, lruvec, seq, sc, swappiness, opt);
- switch (cmd) {
- case '+':
- err = run_aging(lruvec, seq, sc, swappiness, opt);
- break;
- case '-':
- err = run_eviction(lruvec, seq, sc, swappiness, opt);
- break;
- }
done:
mem_cgroup_put(memcg);
--
2.34.1
From: HuanYang <[email protected]>
Document the per-memcg "lru_gen" node in the multi-gen LRU admin guide.
Signed-off-by: Huan Yang <[email protected]>
---
Documentation/admin-guide/mm/multigen_lru.rst | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..078056b8cc7c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -160,3 +160,13 @@ cold pages because of the overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impacts on the
existing jobs.
+
+Per-memcg lru_gen node
+----------------------
+Each memcg directory contains an ``lru_gen`` file. Writing the commands
+below does the same thing as the debugfs interface, minus the memcg ID:
+
+    ``+ node_id max_gen_nr [can_swap [force_scan]]``
+    ``- node_id min_gen_nr [swappiness [nr_to_reclaim]]``
+
+When read, the per-memcg file always shows the full info.
--
2.34.1
From: HuanYang <[email protected]>
This patch adds an "lru_gen" node to the memory cgroup, for both cgroup
v1 and v2.
mem_cgroup_lru_gen_show works like the global node but always shows the
full info, since per-memcg output has no need for the abbreviated form.
Like the global node, the per-memcg "lru_gen" node accepts commands, but
the memcg ID no longer needs to be given to select a memcg.
Signed-off-by: HuanYang <[email protected]>
---
include/linux/mmzone.h | 4 +-
mm/memcontrol.c | 163 +++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 37 ++++++++++
3 files changed, 203 insertions(+), 1 deletion(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..3d399ef177a4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -553,7 +553,9 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg);
void lru_gen_offline_memcg(struct mem_cgroup *memcg);
void lru_gen_release_memcg(struct mem_cgroup *memcg);
void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
-
+int mem_cgroup_lru_gen_cmd(char cmd, struct mem_cgroup *memcg, int nid,
+ unsigned long seq, int swappiness,
+ unsigned long opt);
#else /* !CONFIG_MEMCG */
#define MEMCG_NR_GENS 1
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ecc07b47e813..56385142c5b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5052,6 +5052,155 @@ static int mem_cgroup_slab_show(struct seq_file *m, void *p)
static int memory_stat_show(struct seq_file *m, void *v);
+#ifdef CONFIG_LRU_GEN
+static ssize_t mem_cgroup_lru_gen_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int n;
+ int end;
+ char cmd;
+ unsigned int nid;
+ unsigned long seq;
+ unsigned int swappiness = -1;
+ unsigned long opt = -1;
+ int ret;
+
+ buf = strstrip(buf);
+ n = sscanf(buf, "%c %u %lu %n %u %n %lu %n", &cmd, &nid, &seq, &end,
+ &swappiness, &end, &opt, &end);
+ if (n < 3 || buf[end])
+ return -EINVAL;
+
+ if (nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
+ return -EINVAL;
+
+ ret = mem_cgroup_lru_gen_cmd(cmd, memcg, nid, seq, swappiness, opt);
+ if (ret)
+ return ret;
+
+ return nbytes;
+}
+
+static void __lru_gen_show_info_full(struct seq_file *m, struct lruvec *lruvec,
+ unsigned long max_seq, unsigned long *min_seq,
+ unsigned long seq)
+{
+ int i;
+ int type, tier;
+ int hist = lru_hist_from_seq(seq);
+ struct lru_gen_folio *lrugen = &lruvec->lrugen;
+
+ for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+ seq_printf(m, " %10d", tier);
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ const char *s = " ";
+ unsigned long n[3] = {};
+
+ if (seq == max_seq) {
+ s = "RT ";
+ n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
+ n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+ } else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
+ s = "rep";
+ n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+ n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
+ if (tier)
+ n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]);
+ }
+
+ for (i = 0; i < 3; i++)
+ seq_printf(m, " %10lu%c", n[i], s[i]);
+ }
+ seq_putc(m, '\n');
+ }
+
+ seq_puts(m, " ");
+ for (i = 0; i < NR_MM_STATS; i++) {
+ const char *s = " ";
+ unsigned long n = 0;
+
+ if (seq == max_seq && NR_HIST_GENS == 1) {
+ s = "LOYNFA";
+ n = READ_ONCE(lruvec->mm_state.stats[hist][i]);
+ } else if (seq != max_seq && NR_HIST_GENS > 1) {
+ s = "loynfa";
+ n = READ_ONCE(lruvec->mm_state.stats[hist][i]);
+ }
+
+ seq_printf(m, " %10lu%c", n, s[i]);
+ }
+ seq_putc(m, '\n');
+}
+
+
+static int __lru_gen_show_info(struct seq_file *m, struct mem_cgroup *memcg, int nid)
+{
+ unsigned long seq;
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ DEFINE_MAX_SEQ(lruvec);
+ DEFINE_MIN_SEQ(lruvec);
+ char *path = kvmalloc(PATH_MAX, GFP_KERNEL);
+
+ if (unlikely(!path))
+ return -ENOMEM;
+
+ if (nid == first_memory_node) {
+ cgroup_path(memcg->css.cgroup, path, PATH_MAX);
+ seq_printf(m, "memcg %5u %s\n", mem_cgroup_id(memcg), path);
+ }
+
+ seq_printf(m, " node %5d\n", nid);
+
+ if (max_seq >= MAX_NR_GENS)
+ seq = max_seq - MAX_NR_GENS + 1;
+ else
+ seq = 0;
+
+ for (; seq <= max_seq; seq++) {
+ int type, zone;
+ int gen = lru_gen_from_seq(seq);
+ unsigned long birth = READ_ONCE(lrugen->timestamps[gen]);
+
+ seq_printf(m, " %10lu %10u", seq, jiffies_to_msecs(jiffies - birth));
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ unsigned long size = 0;
+ char mark = seq < min_seq[type] ? 'x' : ' ';
+
+ for (zone = 0; zone < MAX_NR_ZONES; zone++)
+ size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
+
+ seq_printf(m, " %10lu%c", size, mark);
+ }
+
+ seq_putc(m, '\n');
+
+
+ __lru_gen_show_info_full(m, lruvec, max_seq, min_seq, seq);
+ }
+
+ kvfree(path);
+
+ return 0;
+}
+
+static int mem_cgroup_lru_gen_show(struct seq_file *m, void *v)
+{
+ int nid, ret;
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ for_each_node_state(nid, N_MEMORY) {
+ ret = __lru_gen_show_info(m, memcg, nid);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+#endif
+
static struct cftype mem_cgroup_legacy_files[] = {
{
.name = "usage_in_bytes",
@@ -5172,6 +5321,13 @@ static struct cftype mem_cgroup_legacy_files[] = {
.write = mem_cgroup_reset,
.read_u64 = mem_cgroup_read_u64,
},
+#ifdef CONFIG_LRU_GEN
+ {
+ .name = "lru_gen",
+ .write = mem_cgroup_lru_gen_write,
+ .seq_show = mem_cgroup_lru_gen_show,
+ },
+#endif
{ }, /* terminate */
};
@@ -6831,6 +6987,13 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+#ifdef CONFIG_LRU_GEN
+ {
+ .name = "lru_gen",
+ .write = mem_cgroup_lru_gen_write,
+ .seq_show = mem_cgroup_lru_gen_show,
+ },
+#endif
{ } /* terminate */
};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f59977964e81..4da200cda0b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6072,6 +6072,43 @@ static int __process_one_cmd(char cmd, struct lruvec *lruvec, unsigned long seq,
return err;
}
+#ifdef CONFIG_MEMCG
+int mem_cgroup_lru_gen_cmd(char cmd, struct mem_cgroup *memcg, int nid,
+ unsigned long seq, int swappiness, unsigned long opt)
+{
+ int err;
+ struct lruvec *lruvec;
+ unsigned int flags;
+ struct blk_plug plug;
+ struct scan_control sc = {
+ .may_writepage = true,
+ .may_unmap = true,
+ .may_swap = true,
+ .reclaim_idx = MAX_NR_ZONES - 1,
+ .gfp_mask = GFP_KERNEL,
+ };
+
+ set_task_reclaim_state(current, &sc.reclaim_state);
+ flags = memalloc_noreclaim_save();
+ blk_start_plug(&plug);
+ if (!set_mm_walk(NULL, true)) {
+ err = -ENOMEM;
+ goto done;
+ }
+
+ lruvec = get_lruvec(memcg, nid);
+ err = __process_one_cmd(cmd, lruvec, seq, &sc, swappiness, opt);
+
+done:
+ clear_mm_walk();
+ blk_finish_plug(&plug);
+ memalloc_noreclaim_restore(flags);
+ set_task_reclaim_state(current, NULL);
+
+ return err;
+}
+#endif
+
static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
struct scan_control *sc, int swappiness, unsigned long opt)
{
--
2.34.1
On Sun, Oct 8, 2023 at 8:57 PM Huan Yang <[email protected]> wrote:
>
> The global MGLRU debugfs node shows every memcg's lru_gen info in
> "lru_gen" or "lru_gen_full", and accepts commands written to "lru_gen".
> But that output mixes all memcgs together, and a command must specify
> the target memcg's ID.
>
> This patchset adds a per-memcg "lru_gen" node. With it, we can read
> each memcg's lru_gen info directly.
> We can also write commands to control each memcg's lru_gen seq, but
> this node does not support multiple commands; a memcg processes one
> command per write.
Adding TJ from the Android team. (The other TJ you CC'ed is from the
ChromeOS team.)
This series introduced a new ABI, which has to be maintained forever.
How exactly would it be used in *production*?
Android doesn't officially support memcgs. So I want to understand the
real-world use cases first.
On Wed, Oct 18, 2023 at 9:34 AM Yu Zhao <[email protected]> wrote:
>
> On Sun, Oct 8, 2023 at 8:57 PM Huan Yang <[email protected]> wrote:
> >
> > The global MGLRU debugfs node shows every memcg's lru_gen info in
> > "lru_gen" or "lru_gen_full", and accepts commands written to "lru_gen".
> > But that output mixes all memcgs together, and a command must specify
> > the target memcg's ID.
> >
> > This patchset adds a per-memcg "lru_gen" node. With it, we can read
> > each memcg's lru_gen info directly.
> > We can also write commands to control each memcg's lru_gen seq, but
> > this node does not support multiple commands; a memcg processes one
> > command per write.
>
> Adding TJ from the Android team. (The other TJ you CC'ed is from the
> ChromeOS team.)
>
> This series introduced a new ABI, which has to be maintained forever.
> How exactly would it be used in *production*?
>
> Android doesn't officially support memcgs. So I want to understand the
> real-world use cases first.
Not sure how Android came up but I'm happy to chat. We want to turn on
memcg v2 for Android but I'm currently working through perf impacts
before that happens. Android can't use debugfs in production, but I
think we'd prefer to use memory.reclaim for eviction anyway because it
respects memcg limits and reclaims from slab.
So maybe it's possible to add just aging functionality specific to
MGLRU? It'd be nice to know how you're going to use the aging, or why
you want this version of eviction instead of what memory.reclaim does.
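For reference, the memory.reclaim interface mentioned above can be sketched
like this (the cgroup path is an assumption; writing a byte count asks the
kernel to proactively reclaim up to that much from the memcg, including slab):

```shell
# Hedged sketch of cgroup v2 proactive reclaim; the path is hypothetical.
reclaim() {
    cg="$1"; amount="$2"
    if [ -w "$cg/memory.reclaim" ]; then
        # Ask the kernel to reclaim up to $amount from this memcg.
        echo "$amount" > "$cg/memory.reclaim"
    else
        echo "skip: $cg/memory.reclaim not writable"
    fi
}

reclaim /sys/fs/cgroup/example 64M
```

On systems without cgroup v2 (or without write permission) the helper just
reports that the knob is unavailable instead of failing.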
On 2023/10/19 3:59, T.J. Mercier wrote:
>
> On Wed, Oct 18, 2023 at 9:34 AM Yu Zhao <[email protected]> wrote:
>> On Sun, Oct 8, 2023 at 8:57 PM Huan Yang <[email protected]> wrote:
>>> The global MGLRU debugfs node shows every memcg's lru_gen info in
>>> "lru_gen" or "lru_gen_full", and accepts commands written to "lru_gen".
>>> But that output mixes all memcgs together, and a command must specify
>>> the target memcg's ID.
>>>
>>> This patchset adds a per-memcg "lru_gen" node. With it, we can read
>>> each memcg's lru_gen info directly.
>>> We can also write commands to control each memcg's lru_gen seq, but
>>> this node does not support multiple commands; a memcg processes one
>>> command per write.
>> Adding TJ from the Android team. (The other TJ you CC'ed is from the
>> ChromeOS team.)
>>
>> This series introduced a new ABI, which has to be maintained forever.
>> How exactly would it be used in *production*?
>>
>> Android doesn't officially support memcgs. So I want to understand the
>> real-world use cases first.
> Not sure how Android came up but I'm happy to chat. We want to turn on
> memcg v2 for Android but I'm currently working through perf impacts
> before that happens. Android can't use debugfs in production, but I
> think we'd prefer to use memory.reclaim for eviction anyway because it
> respects memcg limits and reclaims from slab.
Yes, for shrink control we can indeed use proactive reclaim.
>
> So maybe it's possible to add just aging functionality specific to
> MGLRU? It'd be nice to know how you're going to use the aging, or why
Since debugfs is not always mounted, if we want to know the lru_gen info,
it may be nice to offer a memcg node that shows each memcg's lru_gen info.
> you want this version of eviction instead of what memory.reclaim does.
This node is not meant to replace memory.reclaim; memory.reclaim is good
enough for that. Aging and the other controls just follow the debugfs
global node's behavior. If they are not needed, dropping the write
support is OK.
Thanks
On Wed, Oct 18, 2023 at 7:32 PM Huan Yang <[email protected]> wrote:
> > Android can't use debugfs in production, but I
> > think we'd prefer to use memory.reclaim for eviction anyway because it
> > respects memcg limits and reclaims from slab.
> Yes, for shrink control we can indeed use proactive reclaim.
> >
> > So maybe it's possible to add just aging functionality specific to
> > MGLRU? It'd be nice to know how you're going to use the aging, or why
> Since debugfs is not always mounted, if we want to know the lru_gen info,
> it may be nice to offer a memcg node that shows each memcg's lru_gen info.
> > you want this version of eviction instead of what memory.reclaim does.
>
I think Yu's inquiry was about whether you will look at the lrugen
info from the memcg file for a production use case, or just for
debugging where you don't have debugfs for some reason. Because it's a
long term commitment to add the file.
On 2023/10/20 0:01, T.J. Mercier wrote:
>
> On Wed, Oct 18, 2023 at 7:32 PM Huan Yang <[email protected]> wrote:
>>> Android can't use debugfs in production, but I
>>> think we'd prefer to use memory.reclaim for eviction anyway because it
>>> respects memcg limits and reclaims from slab.
>> Yes, for shrink control we can indeed use proactive reclaim.
>>> So maybe it's possible to add just aging functionality specific to
>>> MGLRU? It'd be nice to know how you're going to use the aging, or why
>> Since debugfs is not always mounted, if we want to know the lru_gen info,
>> it may be nice to offer a memcg node that shows each memcg's lru_gen info.
>>> you want this version of eviction instead of what memory.reclaim does.
> I think Yu's inquiry was about whether you will look at the lrugen
Thanks for pointing that out.
> info from the memcg file for a production use case, or just for
> debugging where you don't have debugfs for some reason. Because it's a
Yes, for now we just collect logs from it; there is no control logic.
In the future we may need to control a memcg's lru_gen anon seq reclaim
(just an assumption).
> long term commitment to add the file.
OK, once I can offer an actual use case, I will send this again.
Thanks very much.