2023-11-23 19:40:11

by Dmitry Rokosov

Subject: [PATCH v3 0/2] mm: memcg: improve vmscan tracepoints

The motivation behind this series is to enhance the traceability and
understanding of memcg events. By integrating cgroup_ino() into the
existing memcg tracepoints, this patch series introduces a new
tracepoint template for the begin() and end() events. It uses a new
entry field, ino, to store the cgroup inode number, enabling developers
to easily identify the cgroup associated with a specific memcg
tracepoint event.

Additionally, this patch series introduces new shrink_memcg tracepoints
to facilitate non-direct memcg reclaim tracing and debugging.

Changes in v3 since v2 [2]:
- use cgroup_ino() instead of cgroup_name() for the memcg tracepoints,
  because cgroup_name() acquires a global rw_lock, which can
  potentially slow down the system
- introduce a stub macro for each shrink_memcg tracepoint to avoid
  using ifdefs within the common vmscan code

Changes in v2 since v1 [1]:
- change the position of the "memcg" parameter to ensure backward
  compatibility with userspace tools that use memcg tracepoints
- add additional CONFIG_MEMCG ifdefs to prevent the use of memcg
  tracepoints when memcg is disabled

Links:
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

Dmitry Rokosov (2):
mm: memcg: print out cgroup ino in the memcg tracepoints
mm: memcg: introduce new event to trace shrink_memcg

include/trace/events/vmscan.h | 95 ++++++++++++++++++++++++++++++-----
mm/vmscan.c | 17 +++++--
2 files changed, 95 insertions(+), 17 deletions(-)

--
2.36.0


2023-11-23 19:41:46

by Dmitry Rokosov

Subject: [PATCH v3 1/2] mm: memcg: print out cgroup ino in the memcg tracepoints

Sometimes it is necessary to determine which memcg a tracepoint event
refers to. This is particularly relevant with a large cgroup hierarchy,
where users may wish to trace reclamation within a specific cgroup (or
set of cgroups) by applying a filter.
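
For example, filtering on the new field might look like the following
sketch (illustrative only: tracefs is assumed to be mounted at
/sys/kernel/tracing, and the cgroup path and ino value are examples):

    # resolve the inode number of the cgroup of interest
    stat -c %i /sys/fs/cgroup/workload

    # report only reclaim events for that cgroup (here ino 16)
    echo 'ino == 16' > \
        /sys/kernel/tracing/events/vmscan/mm_vmscan_memcg_reclaim_begin/filter
    echo 1 > \
        /sys/kernel/tracing/events/vmscan/mm_vmscan_memcg_reclaim_begin/enable
    cat /sys/kernel/tracing/trace_pipe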

The function cgroup_ino() is a useful tool for this purpose.
To integrate cgroup_ino() into the existing memcg tracepoints, this
patch introduces a new tracepoint template for the begin() and end()
events.

Signed-off-by: Dmitry Rokosov <[email protected]>
---
include/trace/events/vmscan.h | 73 +++++++++++++++++++++++++++++------
mm/vmscan.c | 10 ++---
2 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d2123dd960d5..e9093fa1c924 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -141,19 +141,45 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_b
);

#ifdef CONFIG_MEMCG
-DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,

- TP_PROTO(int order, gfp_t gfp_flags),
+DECLARE_EVENT_CLASS(mm_vmscan_memcg_reclaim_begin_template,

- TP_ARGS(order, gfp_flags)
+ TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
+
+ TP_ARGS(order, gfp_flags, memcg),
+
+ TP_STRUCT__entry(
+ __field(int, order)
+ __field(unsigned long, gfp_flags)
+ __field(ino_t, ino)
+ ),
+
+ TP_fast_assign(
+ __entry->order = order;
+ __entry->gfp_flags = (__force unsigned long)gfp_flags;
+ __entry->ino = cgroup_ino(memcg->css.cgroup);
+ ),
+
+ TP_printk("order=%d gfp_flags=%s memcg=%ld",
+ __entry->order,
+ show_gfp_flags(__entry->gfp_flags),
+ __entry->ino)
);

-DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
+DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,

- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),

- TP_ARGS(order, gfp_flags)
+ TP_ARGS(order, gfp_flags, memcg)
+);
+
+DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
+
+ TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
+
+ TP_ARGS(order, gfp_flags, memcg)
);
+
#endif /* CONFIG_MEMCG */

DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
@@ -181,19 +207,42 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_direct_reclaim_end
);

#ifdef CONFIG_MEMCG
-DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_reclaim_end,

- TP_PROTO(unsigned long nr_reclaimed),
+DECLARE_EVENT_CLASS(mm_vmscan_memcg_reclaim_end_template,

- TP_ARGS(nr_reclaimed)
+ TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),
+
+ TP_ARGS(nr_reclaimed, memcg),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, nr_reclaimed)
+ __field(ino_t, ino)
+ ),
+
+ TP_fast_assign(
+ __entry->nr_reclaimed = nr_reclaimed;
+ __entry->ino = cgroup_ino(memcg->css.cgroup);
+ ),
+
+ TP_printk("nr_reclaimed=%lu memcg=%ld",
+ __entry->nr_reclaimed,
+ __entry->ino)
);

-DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_reclaim_end,
+DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_reclaim_end,

- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),

- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg)
);
+
+DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_softlimit_reclaim_end,
+
+ TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),
+
+ TP_ARGS(nr_reclaimed, memcg)
+);
+
#endif /* CONFIG_MEMCG */

TRACE_EVENT(mm_shrink_slab_start,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..45780952f4b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7088,8 +7088,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);

- trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order,
- sc.gfp_mask);
+ trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order, sc.gfp_mask,
+ memcg);

/*
* NOTE: Although we can get the priority field, using it
@@ -7100,7 +7100,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
*/
shrink_lruvec(lruvec, &sc);

- trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
+ trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed, memcg);

*nr_scanned = sc.nr_scanned;

@@ -7134,13 +7134,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);

set_task_reclaim_state(current, &sc.reclaim_state);
- trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
+ trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask, memcg);
noreclaim_flag = memalloc_noreclaim_save();

nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

memalloc_noreclaim_restore(noreclaim_flag);
- trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
+ trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed, memcg);
set_task_reclaim_state(current, NULL);

return nr_reclaimed;
--
2.36.0

2023-11-23 19:41:52

by Dmitry Rokosov

Subject: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

The shrink_memcg flow plays a crucial role in memcg reclamation.
Currently, it is not possible to trace this point from non-direct
reclaim paths. However, direct reclaim has its own tracepoint, so there
is no issue there. In certain cases, when debugging memcg pressure,
developers may need to identify all potential requests for memcg
reclamation including kswapd(). The patchset introduces the tracepoints
mm_vmscan_memcg_shrink_{begin|end}() to address this problem.

Example of output in the kswapd context (non-direct reclaim):
kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
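
For reference, output like the above can be captured with a minimal
tracefs session along these lines (a sketch; the usual mount point is
assumed):

    cd /sys/kernel/tracing
    echo 1 > events/vmscan/mm_vmscan_memcg_shrink_begin/enable
    echo 1 > events/vmscan/mm_vmscan_memcg_shrink_end/enable
    cat trace_pipe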

Signed-off-by: Dmitry Rokosov <[email protected]>
---
include/trace/events/vmscan.h | 22 ++++++++++++++++++++++
mm/vmscan.c | 7 +++++++
2 files changed, 29 insertions(+)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index e9093fa1c924..a4686afe571d 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -180,6 +180,17 @@ DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_softlimit_r
TP_ARGS(order, gfp_flags, memcg)
);

+DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_shrink_begin,
+
+ TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
+
+ TP_ARGS(order, gfp_flags, memcg)
+);
+
+#else
+
+#define trace_mm_vmscan_memcg_shrink_begin(...)
+
#endif /* CONFIG_MEMCG */

DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
@@ -243,6 +254,17 @@ DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_softlimit_rec
TP_ARGS(nr_reclaimed, memcg)
);

+DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_shrink_end,
+
+ TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),
+
+ TP_ARGS(nr_reclaimed, memcg)
+);
+
+#else
+
+#define trace_mm_vmscan_memcg_shrink_end(...)
+
#endif /* CONFIG_MEMCG */

TRACE_EVENT(mm_shrink_slab_start,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 45780952f4b5..f7e3ddc5a7ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6461,6 +6461,10 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
*/
cond_resched();

+ trace_mm_vmscan_memcg_shrink_begin(sc->order,
+ sc->gfp_mask,
+ memcg);
+
mem_cgroup_calculate_protection(target_memcg, memcg);

if (mem_cgroup_below_min(target_memcg, memcg)) {
@@ -6491,6 +6495,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
sc->priority);

+ trace_mm_vmscan_memcg_shrink_end(sc->nr_reclaimed - reclaimed,
+ memcg);
+
/* Record the group's reclaim efficiency */
if (!sc->proactive)
vmpressure(sc->gfp_mask, memcg, false,
--
2.36.0

2023-11-25 04:11:37

by Shakeel Butt

Subject: Re: [PATCH v3 1/2] mm: memcg: print out cgroup ino in the memcg tracepoints

On Thu, Nov 23, 2023 at 10:39:36PM +0300, Dmitry Rokosov wrote:
> Sometimes it is necessary to determine which memcg a tracepoint event
> refers to. This is particularly relevant with a large cgroup hierarchy,
> where users may wish to trace reclamation within a specific cgroup (or
> set of cgroups) by applying a filter.
>
> The function cgroup_ino() is a useful tool for this purpose.
> To integrate cgroup_ino() into the existing memcg tracepoints, this
> patch introduces a new tracepoint template for the begin() and end()
> events.
>
> Signed-off-by: Dmitry Rokosov <[email protected]>

Acked-by: Shakeel Butt <[email protected]>

2023-11-25 06:39:17

by Shakeel Butt

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Thu, Nov 23, 2023 at 10:39:37PM +0300, Dmitry Rokosov wrote:
> The shrink_memcg flow plays a crucial role in memcg reclamation.
> Currently, it is not possible to trace this point from non-direct
> reclaim paths. However, direct reclaim has its own tracepoint, so there
> is no issue there. In certain cases, when debugging memcg pressure,
> developers may need to identify all potential requests for memcg
> reclamation including kswapd(). The patchset introduces the tracepoints
> mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
>
> Example of output in the kswapd context (non-direct reclaim):
> kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
>
> Signed-off-by: Dmitry Rokosov <[email protected]>
> ---
> include/trace/events/vmscan.h | 22 ++++++++++++++++++++++
> mm/vmscan.c | 7 +++++++
> 2 files changed, 29 insertions(+)
>
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index e9093fa1c924..a4686afe571d 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -180,6 +180,17 @@ DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_softlimit_r
> TP_ARGS(order, gfp_flags, memcg)
> );
>
> +DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_shrink_begin,
> +
> + TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
> +
> + TP_ARGS(order, gfp_flags, memcg)
> +);
> +
> +#else
> +
> +#define trace_mm_vmscan_memcg_shrink_begin(...)
> +
> #endif /* CONFIG_MEMCG */
>
> DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
> @@ -243,6 +254,17 @@ DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_softlimit_rec
> TP_ARGS(nr_reclaimed, memcg)
> );
>
> +DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_shrink_end,
> +
> + TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),
> +
> + TP_ARGS(nr_reclaimed, memcg)
> +);
> +
> +#else
> +
> +#define trace_mm_vmscan_memcg_shrink_end(...)
> +
> #endif /* CONFIG_MEMCG */
>
> TRACE_EVENT(mm_shrink_slab_start,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 45780952f4b5..f7e3ddc5a7ad 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6461,6 +6461,10 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> */
> cond_resched();
>
> + trace_mm_vmscan_memcg_shrink_begin(sc->order,
> + sc->gfp_mask,
> + memcg);
> +

If you place the start of the trace here, you may have only the begin
trace for memcgs whose usage is below their min or low limits. Is that
fine? Otherwise you can put it just before the shrink_lruvec() call.

> mem_cgroup_calculate_protection(target_memcg, memcg);
>
> if (mem_cgroup_below_min(target_memcg, memcg)) {
> @@ -6491,6 +6495,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> sc->priority);
>
> + trace_mm_vmscan_memcg_shrink_end(sc->nr_reclaimed - reclaimed,
> + memcg);
> +
> /* Record the group's reclaim efficiency */
> if (!sc->proactive)
> vmpressure(sc->gfp_mask, memcg, false,
> --
> 2.36.0
>

2023-11-25 08:02:04

by Dmitry Rokosov

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Sat, Nov 25, 2023 at 06:36:16AM +0000, Shakeel Butt wrote:
> On Thu, Nov 23, 2023 at 10:39:37PM +0300, Dmitry Rokosov wrote:
> > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > Currently, it is not possible to trace this point from non-direct
> > reclaim paths. However, direct reclaim has its own tracepoint, so there
> > is no issue there. In certain cases, when debugging memcg pressure,
> > developers may need to identify all potential requests for memcg
> > reclamation including kswapd(). The patchset introduces the tracepoints
> > mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
> >
> > Example of output in the kswapd context (non-direct reclaim):
> > kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> > kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> > kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> > kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> > kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> > kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
> >
> > Signed-off-by: Dmitry Rokosov <[email protected]>
> > ---
> > include/trace/events/vmscan.h | 22 ++++++++++++++++++++++
> > mm/vmscan.c | 7 +++++++
> > 2 files changed, 29 insertions(+)
> >
> > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> > index e9093fa1c924..a4686afe571d 100644
> > --- a/include/trace/events/vmscan.h
> > +++ b/include/trace/events/vmscan.h
> > @@ -180,6 +180,17 @@ DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_softlimit_r
> > TP_ARGS(order, gfp_flags, memcg)
> > );
> >
> > +DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_shrink_begin,
> > +
> > + TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
> > +
> > + TP_ARGS(order, gfp_flags, memcg)
> > +);
> > +
> > +#else
> > +
> > +#define trace_mm_vmscan_memcg_shrink_begin(...)
> > +
> > #endif /* CONFIG_MEMCG */
> >
> > DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
> > @@ -243,6 +254,17 @@ DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_softlimit_rec
> > TP_ARGS(nr_reclaimed, memcg)
> > );
> >
> > +DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_shrink_end,
> > +
> > + TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),
> > +
> > + TP_ARGS(nr_reclaimed, memcg)
> > +);
> > +
> > +#else
> > +
> > +#define trace_mm_vmscan_memcg_shrink_end(...)
> > +
> > #endif /* CONFIG_MEMCG */
> >
> > TRACE_EVENT(mm_shrink_slab_start,
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 45780952f4b5..f7e3ddc5a7ad 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6461,6 +6461,10 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > */
> > cond_resched();
> >
> > + trace_mm_vmscan_memcg_shrink_begin(sc->order,
> > + sc->gfp_mask,
> > + memcg);
> > +
>
> If you place the start of the trace here, you may have only the begin
> trace for memcgs whose usage is below their min or low limits. Is that
> fine? Otherwise you can put it just before the shrink_lruvec() call.
>

From my point of view, it's fine. In situations like the one you
described, when we see only the begin() tracepoint raised without the
end(), we understand that reclaim requests are being made but cannot
be satisfied due to certain conditions within the memcg (such as
limits).

The disadvantage of this approach is that it may flood the trace pipe
with begin() events that have no matching end().

How important do you think it is to catch such situations? Or do you
suggest moving the begin() tracepoint after the memcg limit checks and
not worrying about them?

> > mem_cgroup_calculate_protection(target_memcg, memcg);
> >
> > if (mem_cgroup_below_min(target_memcg, memcg)) {
> > @@ -6491,6 +6495,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> > sc->priority);
> >
> > + trace_mm_vmscan_memcg_shrink_end(sc->nr_reclaimed - reclaimed,
> > + memcg);
> > +
> > /* Record the group's reclaim efficiency */
> > if (!sc->proactive)
> > vmpressure(sc->gfp_mask, memcg, false,
> > --
> > 2.36.0
> >

--
Thank you,
Dmitry

2023-11-25 17:38:17

by Shakeel Butt

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Sat, Nov 25, 2023 at 11:01:37AM +0300, Dmitry Rokosov wrote:
[...]
> > > + trace_mm_vmscan_memcg_shrink_begin(sc->order,
> > > + sc->gfp_mask,
> > > + memcg);
> > > +
> >
> > If you place the start of the trace here, you may have only the begin
> > trace for memcgs whose usage is below their min or low limits. Is that
> > fine? Otherwise you can put it just before the shrink_lruvec() call.
> >
>
> From my point of view, it's fine. In situations like the one you
> described, when we see only the begin() tracepoint raised without the
> end(), we understand that reclaim requests are being made but cannot
> be satisfied due to certain conditions within the memcg (such as
> limits).
>
> The disadvantage of this approach is that it may flood the trace pipe
> with begin() events that have no matching end().
>
> How important do you think it is to catch such situations? Or do you
> suggest moving the begin() tracepoint after the memcg limit checks and
> not worrying about them?
>

I was mainly wondering whether that is intentional. It seems that you,
as the first user of this tracepoint, need to know that reclaim for a
given memcg was triggered but, due to min/low limits, no reclaim was
done. That is a totally reasonable use case.

2023-11-25 17:47:56

by Shakeel Butt

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Thu, Nov 23, 2023 at 10:39:37PM +0300, Dmitry Rokosov wrote:
> The shrink_memcg flow plays a crucial role in memcg reclamation.
> Currently, it is not possible to trace this point from non-direct
> reclaim paths. However, direct reclaim has its own tracepoint, so there
> is no issue there. In certain cases, when debugging memcg pressure,
> developers may need to identify all potential requests for memcg
> reclamation including kswapd(). The patchset introduces the tracepoints
> mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
>
> Example of output in the kswapd context (non-direct reclaim):
> kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
>
> Signed-off-by: Dmitry Rokosov <[email protected]>

Acked-by: Shakeel Butt <[email protected]>

2023-11-27 09:34:03

by Michal Hocko

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Thu 23-11-23 22:39:37, Dmitry Rokosov wrote:
> The shrink_memcg flow plays a crucial role in memcg reclamation.
> Currently, it is not possible to trace this point from non-direct
> reclaim paths. However, direct reclaim has its own tracepoint, so there
> is no issue there. In certain cases, when debugging memcg pressure,
> developers may need to identify all potential requests for memcg
> reclamation including kswapd(). The patchset introduces the tracepoints
> mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
>
> Example of output in the kswapd context (non-direct reclaim):
> kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16

In the previous version I have asked why do we need this specific
tracepoint when we already do have trace_mm_vmscan_lru_shrink_{in}active
which already give you a very good insight. That includes the number of
reclaimed pages but also more. I do see that we do not include memcg id
of the reclaimed LRU, but that shouldn't be a big problem to add, no?

--
Michal Hocko
SUSE Labs

2023-11-27 11:37:01

by Dmitry Rokosov

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Mon, Nov 27, 2023 at 10:33:49AM +0100, Michal Hocko wrote:
> On Thu 23-11-23 22:39:37, Dmitry Rokosov wrote:
> > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > Currently, it is not possible to trace this point from non-direct
> > reclaim paths. However, direct reclaim has its own tracepoint, so there
> > is no issue there. In certain cases, when debugging memcg pressure,
> > developers may need to identify all potential requests for memcg
> > reclamation including kswapd(). The patchset introduces the tracepoints
> > mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
> >
> > Example of output in the kswapd context (non-direct reclaim):
> > kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> > kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> > kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> > kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> > kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> > kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
>
> In the previous version I have asked why do we need this specific
> tracepoint when we already do have trace_mm_vmscan_lru_shrink_{in}active
> which already give you a very good insight. That includes the number of
> reclaimed pages but also more. I do see that we do not include memcg id
> of the reclaimed LRU, but that shouldn't be a big problem to add, no?

From my point of view, memcg reclaim includes two points: LRU shrink and
slab shrink, as mentioned in the vmscan.c file.


static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
...
reclaimed = sc->nr_reclaimed;
scanned = sc->nr_scanned;

shrink_lruvec(lruvec, sc);

shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
sc->priority);
...

So, both of these operations are important for understanding whether
memcg reclaiming was successful or not, as well as its effectiveness. I
believe it would be beneficial to summarize them, which is why I have
created new tracepoints.

--
Thank you,
Dmitry

2023-11-27 16:17:20

by Dmitry Rokosov

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Mon, Nov 27, 2023 at 01:50:22PM +0100, Michal Hocko wrote:
> On Mon 27-11-23 14:36:44, Dmitry Rokosov wrote:
> > On Mon, Nov 27, 2023 at 10:33:49AM +0100, Michal Hocko wrote:
> > > On Thu 23-11-23 22:39:37, Dmitry Rokosov wrote:
> > > > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > > > Currently, it is not possible to trace this point from non-direct
> > > > reclaim paths. However, direct reclaim has its own tracepoint, so there
> > > > is no issue there. In certain cases, when debugging memcg pressure,
> > > > developers may need to identify all potential requests for memcg
> > > > reclamation including kswapd(). The patchset introduces the tracepoints
> > > > mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
> > > >
> > > > Example of output in the kswapd context (non-direct reclaim):
> > > > kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> > > > kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> > > > kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> > > > kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> > > > kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> > > > kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
> > >
> > > In the previous version I have asked why do we need this specific
> > > tracepoint when we already do have trace_mm_vmscan_lru_shrink_{in}active
> > > which already give you a very good insight. That includes the number of
> > > reclaimed pages but also more. I do see that we do not include memcg id
> > > of the reclaimed LRU, but that shouldn't be a big problem to add, no?
> >
> > >From my point of view, memcg reclaim includes two points: LRU shrink and
> > slab shrink, as mentioned in the vmscan.c file.
> >
> >
> > static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > ...
> > reclaimed = sc->nr_reclaimed;
> > scanned = sc->nr_scanned;
> >
> > shrink_lruvec(lruvec, sc);
> >
> > shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> > sc->priority);
> > ...
> >
> > So, both of these operations are important for understanding whether
> > memcg reclaiming was successful or not, as well as its effectiveness. I
> > believe it would be beneficial to summarize them, which is why I have
> > created new tracepoints.
>
> This sounds like nice to have rather than must. Put it differently. If
> you make existing reclaim trace points memcg aware (print memcg id) then
> what prevents you from making analysis you need?

You are right, nothing prevents me from making this analysis... but...

This approach does have some disadvantages:
1) It requires more changes to vmscan. At the very least, the memcg
object should be forwarded to all subfunctions for LRU and SLAB
shrinkers.

2) With this approach, we will not have the ability to trace a situation
where the kernel is requesting reclaim for a specific memcg, but due to
limits issues, we are unable to run it.

3) LRU and SLAB shrinkers are too common places to handle memcg-related
tasks. Additionally, memcg can be disabled in the kernel configuration.

--
Thank you,
Dmitry

2023-11-29 15:21:18

by Dmitry Rokosov

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Tue, Nov 28, 2023 at 10:32:50AM +0100, Michal Hocko wrote:
> On Mon 27-11-23 19:16:37, Dmitry Rokosov wrote:
> > On Mon, Nov 27, 2023 at 01:50:22PM +0100, Michal Hocko wrote:
> > > On Mon 27-11-23 14:36:44, Dmitry Rokosov wrote:
> > > > On Mon, Nov 27, 2023 at 10:33:49AM +0100, Michal Hocko wrote:
> > > > > On Thu 23-11-23 22:39:37, Dmitry Rokosov wrote:
> > > > > > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > > > > > Currently, it is not possible to trace this point from non-direct
> > > > > > reclaim paths. However, direct reclaim has its own tracepoint, so there
> > > > > > is no issue there. In certain cases, when debugging memcg pressure,
> > > > > > developers may need to identify all potential requests for memcg
> > > > > > reclamation including kswapd(). The patchset introduces the tracepoints
> > > > > > mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
> > > > > >
> > > > > > Example of output in the kswapd context (non-direct reclaim):
> > > > > > kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
> > > > >
> > > > > In the previous version I have asked why do we need this specific
> > > > > tracepoint when we already do have trace_mm_vmscan_lru_shrink_{in}active
> > > > > which already give you a very good insight. That includes the number of
> > > > > reclaimed pages but also more. I do see that we do not include memcg id
> > > > > of the reclaimed LRU, but that shouldn't be a big problem to add, no?
> > > >
> > > > >From my point of view, memcg reclaim includes two points: LRU shrink and
> > > > slab shrink, as mentioned in the vmscan.c file.
> > > >
> > > >
> > > > static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > > > ...
> > > > reclaimed = sc->nr_reclaimed;
> > > > scanned = sc->nr_scanned;
> > > >
> > > > shrink_lruvec(lruvec, sc);
> > > >
> > > > shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> > > > sc->priority);
> > > > ...
> > > >
> > > > So, both of these operations are important for understanding whether
> > > > memcg reclaiming was successful or not, as well as its effectiveness. I
> > > > believe it would be beneficial to summarize them, which is why I have
> > > > created new tracepoints.
> > >
> > > This sounds like nice to have rather than must. Put it differently. If
> > > you make existing reclaim trace points memcg aware (print memcg id) then
> > > what prevents you from making analysis you need?
> >
> > You are right, nothing prevents me from making this analysis... but...
> >
> > This approach does have some disadvantages:
> > 1) It requires more changes to vmscan. At the very least, the memcg
> > object should be forwarded to all subfunctions for LRU and SLAB
> > shrinkers.
>
> We should have lruvec or memcg available. lruvec_memcg() could be used
> to get the memcg from the lruvec. It might mean more places to add the
> id, but arguably this would improve them by identifying where the
> memory has been scanned/reclaimed from.
>

Oh, thank you, didn't see this conversion function before...

> > 2) With this approach, we will not have the ability to trace a situation
> > where the kernel is requesting reclaim for a specific memcg, but due to
> > limits issues, we are unable to run it.
>
> I do not follow. Could you be more specific please?
>

I'm referring to a situation where kswapd() or another kernel mm code
requests some reclaim pages from memcg, but memcg rejects it due to
limits checkers. This occurs in the shrink_node_memcgs() function.

===
mem_cgroup_calculate_protection(target_memcg, memcg);

if (mem_cgroup_below_min(target_memcg, memcg)) {
/*
* Hard protection.
* If there is no reclaimable memory, OOM.
*/
continue;
} else if (mem_cgroup_below_low(target_memcg, memcg)) {
/*
* Soft protection.
* Respect the protection only as long as
* there is an unprotected supply
* of reclaimable memory from other cgroups.
*/
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
}
memcg_memory_event(memcg, MEMCG_LOW);
}
===

With separate shrink begin()/end() tracepoints we can detect such
problem.


> > 3) LRU and SLAB shrinkers are too common places to handle memcg-related
> > tasks. Additionally, memcg can be disabled in the kernel configuration.
>
> Right. This could be all hidden in the tracing code. You simply do not
> print memcg id when the controller is disabled. Or just simply print 0.
> I do not really see any major problems with that.
>
> I would really prefer to focus on that direction rather than adding
> another begin/end tracepoint which overalaps with existing begin/end
> traces and provides much more limited information because I would bet we
> will have somebody complaining that mere nr_reclaimed is not sufficient.

Okay, I will try to prepare a new patch version with memcg printing from
lruvec and slab tracepoints.

Then Andrew should drop the previous patchsets, I suppose. Please advise
on the correct workflow steps here.

--
Thank you,
Dmitry

2023-11-29 15:27:18

by Dmitry Rokosov

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Wed, Nov 29, 2023 at 06:20:57PM +0300, Dmitry Rokosov wrote:
> On Tue, Nov 28, 2023 at 10:32:50AM +0100, Michal Hocko wrote:
> > On Mon 27-11-23 19:16:37, Dmitry Rokosov wrote:
> > > On Mon, Nov 27, 2023 at 01:50:22PM +0100, Michal Hocko wrote:
> > > > On Mon 27-11-23 14:36:44, Dmitry Rokosov wrote:
> > > > > On Mon, Nov 27, 2023 at 10:33:49AM +0100, Michal Hocko wrote:
> > > > > > On Thu 23-11-23 22:39:37, Dmitry Rokosov wrote:
> > > > > > > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > > > > > > Currently, it is not possible to trace this point from non-direct
> > > > > > > reclaim paths. However, direct reclaim has its own tracepoint, so there
> > > > > > > is no issue there. In certain cases, when debugging memcg pressure,
> > > > > > > developers may need to identify all potential requests for memcg
> > > > > > > reclamation including kswapd(). The patchset introduces the tracepoints
> > > > > > > mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
> > > > > > >
> > > > > > > Example of output in the kswapd context (non-direct reclaim):
> > > > > > > kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > > kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
> > > > > >
> > > > > > In the previous version I have asked why do we need this specific
> > > > > > tracepoint when we already do have trace_mm_vmscan_lru_shrink_{in}active
> > > > > > which already give you a very good insight. That includes the number of
> > > > > > reclaimed pages but also more. I do see that we do not include memcg id
> > > > > > of the reclaimed LRU, but that shouldn't be a big problem to add, no?
> > > > >
> > > > > >From my point of view, memcg reclaim includes two points: LRU shrink and
> > > > > slab shrink, as mentioned in the vmscan.c file.
> > > > >
> > > > >
> > > > > static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > > > > ...
> > > > > reclaimed = sc->nr_reclaimed;
> > > > > scanned = sc->nr_scanned;
> > > > >
> > > > > shrink_lruvec(lruvec, sc);
> > > > >
> > > > > shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> > > > > sc->priority);
> > > > > ...
> > > > >
> > > > > So, both of these operations are important for understanding whether
> > > > > memcg reclaiming was successful or not, as well as its effectiveness. I
> > > > > believe it would be beneficial to summarize them, which is why I have
> > > > > created new tracepoints.
> > > >
> > > > This sounds like nice to have rather than must. Put it differently. If
> > > > you make existing reclaim trace points memcg aware (print memcg id) then
> > > > what prevents you from making analysis you need?
> > >
> > > You are right, nothing prevents me from making this analysis... but...
> > >
> > > This approach does have some disadvantages:
> > > 1) It requires more changes to vmscan. At the very least, the memcg
> > > object should be forwarded to all subfunctions for LRU and SLAB
> > > shrinkers.
> >
> > We should have lruvec or memcg available. lruvec_memcg() could be used
> > to get the memcg from the lruvec. It might mean more places to add the
> > id, but arguably this would improve them by identifying where the
> > memory has been scanned/reclaimed from.
> >
>
> Oh, thank you, didn't see this conversion function before...
>
> > > 2) With this approach, we will not have the ability to trace a situation
> > > where the kernel is requesting reclaim for a specific memcg, but due to
> > > limits issues, we are unable to run it.
> >
> > I do not follow. Could you be more specific please?
> >
>
> I'm referring to a situation where kswapd() or another kernel mm code
> requests some reclaim pages from memcg, but memcg rejects it due to
> limits checkers. This occurs in the shrink_node_memcgs() function.
>
> ===
> mem_cgroup_calculate_protection(target_memcg, memcg);
>
> if (mem_cgroup_below_min(target_memcg, memcg)) {
> /*
> * Hard protection.
> * If there is no reclaimable memory, OOM.
> */
> continue;
> } else if (mem_cgroup_below_low(target_memcg, memcg)) {
> /*
> * Soft protection.
> * Respect the protection only as long as
> * there is an unprotected supply
> * of reclaimable memory from other cgroups.
> */
> if (!sc->memcg_low_reclaim) {
> sc->memcg_low_skipped = 1;
> continue;
> }
> memcg_memory_event(memcg, MEMCG_LOW);
> }
> ===
>
> With separate shrink begin()/end() tracepoints we can detect such
> problem.
>
>
> > > 3) LRU and SLAB shrinkers are too common places to handle memcg-related
> > > tasks. Additionally, memcg can be disabled in the kernel configuration.
> >
> > Right. This could be all hidden in the tracing code. You simply do not
> > print memcg id when the controller is disabled. Or just simply print 0.
> > I do not really see any major problems with that.
> >
> > I would really prefer to focus on that direction rather than adding
> > another begin/end tracepoint which overalaps with existing begin/end
> > traces and provides much more limited information because I would bet we
> > will have somebody complaining that mere nr_reclaimed is not sufficient.
>
> Okay, I will try to prepare a new patch version with memcg printing from
> lruvec and slab tracepoints.
>
> Then Andrew should drop the previous patchsets, I suppose. Please advise
> on the correct workflow steps here.

Actually, it has already been merged into linux-next... I just checked.
Maybe it would be better to prepare lruvec and slab memcg printing as a
separate patch series?

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=0e7f0c52a76cb22c8633f21bff6e48fabff6016e

--
Thank you,
Dmitry

2023-11-29 16:58:49

by Dmitry Rokosov

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Wed, Nov 29, 2023 at 05:06:37PM +0100, Michal Hocko wrote:
> On Wed 29-11-23 18:20:57, Dmitry Rokosov wrote:
> > On Tue, Nov 28, 2023 at 10:32:50AM +0100, Michal Hocko wrote:
> > > On Mon 27-11-23 19:16:37, Dmitry Rokosov wrote:
> [...]
> > > > 2) With this approach, we will not have the ability to trace a situation
> > > > where the kernel is requesting reclaim for a specific memcg, but due to
> > > > limits issues, we are unable to run it.
> > >
> > > I do not follow. Could you be more specific please?
> > >
> >
> > I'm referring to a situation where kswapd() or another kernel mm code
> > requests some reclaim pages from memcg, but memcg rejects it due to
> > limits checkers. This occurs in the shrink_node_memcgs() function.
>
> Ohh, you mean reclaim protection
>
> > ===
> > mem_cgroup_calculate_protection(target_memcg, memcg);
> >
> > if (mem_cgroup_below_min(target_memcg, memcg)) {
> > /*
> > * Hard protection.
> > * If there is no reclaimable memory, OOM.
> > */
> > continue;
> > } else if (mem_cgroup_below_low(target_memcg, memcg)) {
> > /*
> > * Soft protection.
> > * Respect the protection only as long as
> > * there is an unprotected supply
> > * of reclaimable memory from other cgroups.
> > */
> > if (!sc->memcg_low_reclaim) {
> > sc->memcg_low_skipped = 1;
> > continue;
> > }
> > memcg_memory_event(memcg, MEMCG_LOW);
> > }
> > ===
> >
> > With separate shrink begin()/end() tracepoints we can detect such
> > problem.
>
> How? You are only reporting the number of reclaimed pages and no
> reclaimed pages could be not just because of low/min limits but
> generally because of other reasons. You would need to report also the
> number of scanned/isolated pages.
>

From my perspective, if memory control group (memcg) protection
restrictions occur, we can identify them by the absence of the end()
pair of begin(). Other reasons will have both tracepoints raised.
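
For instance, a crude post-processing pass over a captured trace could
flag cgroups with unmatched begin() events (illustrative awk only,
assuming the memcg=<ino> pair is the last field on each line, as in the
example output above):

    awk '/mm_vmscan_memcg_shrink_begin:/ { split($NF, f, "="); b[f[2]]++ }
         /mm_vmscan_memcg_shrink_end:/   { split($NF, f, "="); e[f[2]]++ }
         END { for (ino in b) if (b[ino] != e[ino])
                   print "memcg", ino, "begin", b[ino], "end", e[ino] }' trace.txt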

> > > > 3) LRU and SLAB shrinkers are too common places to handle memcg-related
> > > > tasks. Additionally, memcg can be disabled in the kernel configuration.
> > >
> > > Right. This could be all hidden in the tracing code. You simply do not
> > > print memcg id when the controller is disabled. Or just simply print 0.
> > > I do not really see any major problems with that.
> > >
> > > I would really prefer to focus on that direction rather than adding
> > > another begin/end tracepoint which overalaps with existing begin/end
> > > traces and provides much more limited information because I would bet we
> > > will have somebody complaining that mere nr_reclaimed is not sufficient.
> >
> > Okay, I will try to prepare a new patch version with memcg printing from
> > lruvec and slab tracepoints.
> >
> > Then Andrew should drop the previous patchsets, I suppose. Please advise
> > on the correct workflow steps here.
>
> Andrew usually just drops the patch from his tree and it will disappaer
> from the linux-next as well.

Okay, I understand, thank you!

Andrew, could you please take a look? I am planning to prepare a new
patch version based on Michal's suggestion, so previous one should be
dropped.

--
Thank you,
Dmitry

2023-11-29 17:10:57

by Michal Hocko

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Wed 29-11-23 19:57:52, Dmitry Rokosov wrote:
> On Wed, Nov 29, 2023 at 05:06:37PM +0100, Michal Hocko wrote:
> > On Wed 29-11-23 18:20:57, Dmitry Rokosov wrote:
> > > On Tue, Nov 28, 2023 at 10:32:50AM +0100, Michal Hocko wrote:
> > > > On Mon 27-11-23 19:16:37, Dmitry Rokosov wrote:
> > [...]
> > > > > 2) With this approach, we will not have the ability to trace a situation
> > > > > where the kernel is requesting reclaim for a specific memcg, but due to
> > > > > limits issues, we are unable to run it.
> > > >
> > > > I do not follow. Could you be more specific please?
> > > >
> > >
> > > I'm referring to a situation where kswapd() or another kernel mm code
> > > requests some reclaim pages from memcg, but memcg rejects it due to
> > > limits checkers. This occurs in the shrink_node_memcgs() function.
> >
> > Ohh, you mean reclaim protection
> >
> > > ===
> > > mem_cgroup_calculate_protection(target_memcg, memcg);
> > >
> > > if (mem_cgroup_below_min(target_memcg, memcg)) {
> > > /*
> > > * Hard protection.
> > > * If there is no reclaimable memory, OOM.
> > > */
> > > continue;
> > > } else if (mem_cgroup_below_low(target_memcg, memcg)) {
> > > /*
> > > * Soft protection.
> > > * Respect the protection only as long as
> > > * there is an unprotected supply
> > > * of reclaimable memory from other cgroups.
> > > */
> > > if (!sc->memcg_low_reclaim) {
> > > sc->memcg_low_skipped = 1;
> > > continue;
> > > }
> > > memcg_memory_event(memcg, MEMCG_LOW);
> > > }
> > > ===
> > >
> > > With separate shrink begin()/end() tracepoints we can detect such
> > > problem.
> >
> > How? You are only reporting the number of reclaimed pages and no
> > reclaimed pages could be not just because of low/min limits but
> > generally because of other reasons. You would need to report also the
> > number of scanned/isolated pages.
> >
>
> From my perspective, if memory control group (memcg) protection
> restrictions occur, we can identify them by the absence of the end()
> pair of begin(). Other reasons will have both tracepoints raised.

That is not really a great way to detect that, TBH. Trace events can be
lost, and then you simply do not know what has happened.

--
Michal Hocko
SUSE Labs

2023-11-29 17:33:58

by Andrew Morton

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Wed, 29 Nov 2023 18:20:57 +0300 Dmitry Rokosov <[email protected]> wrote:

> Okay, I will try to prepare a new patch version with memcg printing from
> lruvec and slab tracepoints.
>
> Then Andrew should drop the previous patchsets, I suppose. Please advise
> on the correct workflow steps here.

This series is present in mm.git's mm-unstable branch. Note
"unstable". So dropping the v3 series and merging v4 is totally not a
problem. It's why this branch exists - it's daily rebasing, in high
flux.

When a patchset is considered stabilized and ready, I'll move it into
the mm-stable branch, which is (supposed to be) the non-rebasing tree
for next merge window.

If you have small fixes then I prefer little fixup patches against what
is presently in mm-unstable.

If you send replacement patches then no problem, I'll check to see
whether I should turn them into little fixup deltas.

I prefer little fixups so that people can see what has changed, so I
can see which review/test issues were addressed and so that people
don't feel a need to re-review the whole patchset.

If generation of little fixups is impractical, I'll drop the old series
entirely and I'll merge the new one.

Each case is a judgement call, please send whatever you think makes
most sense given the above.

2023-11-29 17:34:38

by Steven Rostedt

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Wed, 29 Nov 2023 18:10:33 +0100
Michal Hocko <[email protected]> wrote:

> > > How? You are only reporting the number of reclaimed pages and no
> > > reclaimed pages could be not just because of low/min limits but
> > > generally because of other reasons. You would need to report also the
> > > number of scanned/isolated pages.
> > >
> >
> > From my perspective, if memory control group (memcg) protection
> > restrictions occur, we can identify them by the absence of the end()
> > pair of begin(). Other reasons will have both tracepoints raised.
>
> That is not really a great way to detect that, TBH. Trace events can be
> lost, and then you simply do not know what has happened.

Note, you can detect dropped events. If there's a dropped event, you can
ignore the "missing end" from a beginning. You could also make synthetic
events that pair an end event with a beginning event (which uses the last
begin event found). Synthetic event creation is not affected by dropped
events.
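
A rough sketch of that pairing with histogram triggers (untested, and
assuming CONFIG_HIST_TRIGGERS plus the field names from this patch):

    cd /sys/kernel/tracing
    # synthetic event carrying the cgroup ino and the begin->end latency
    echo 'memcg_shrink_lat u64 ino; u64 lat' >> synthetic_events
    # stamp a per-ino timestamp on begin
    echo 'hist:keys=ino:ts0=common_timestamp.usecs' \
        >> events/vmscan/mm_vmscan_memcg_shrink_begin/trigger
    # on end, compute the latency and fire the synthetic event
    echo 'hist:keys=ino:lat=common_timestamp.usecs-$ts0:onmatch(vmscan.mm_vmscan_memcg_shrink_begin).trace(memcg_shrink_lat,ino,$lat)' \
        >> events/vmscan/mm_vmscan_memcg_shrink_end/trigger
    echo 1 > events/synthetic/memcg_shrink_lat/enable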

There's a lot you can do to get information even with the prospect of
dropped events. I would not use that as a rationale for not using
events.

-- Steve

2023-11-29 17:36:04

by Dmitry Rokosov

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Wed, Nov 29, 2023 at 06:10:33PM +0100, Michal Hocko wrote:
> On Wed 29-11-23 19:57:52, Dmitry Rokosov wrote:
> > On Wed, Nov 29, 2023 at 05:06:37PM +0100, Michal Hocko wrote:
> > > On Wed 29-11-23 18:20:57, Dmitry Rokosov wrote:
> > > > On Tue, Nov 28, 2023 at 10:32:50AM +0100, Michal Hocko wrote:
> > > > > On Mon 27-11-23 19:16:37, Dmitry Rokosov wrote:
> > > [...]
> > > > > > 2) With this approach, we will not have the ability to trace a situation
> > > > > > where the kernel is requesting reclaim for a specific memcg, but due to
> > > > > > limits issues, we are unable to run it.
> > > > >
> > > > > I do not follow. Could you be more specific please?
> > > > >
> > > >
> > > > I'm referring to a situation where kswapd() or another kernel mm code
> > > > requests some reclaim pages from memcg, but memcg rejects it due to
> > > > limits checkers. This occurs in the shrink_node_memcgs() function.
> > >
> > > Ohh, you mean reclaim protection
> > >
> > > > ===
> > > > mem_cgroup_calculate_protection(target_memcg, memcg);
> > > >
> > > > if (mem_cgroup_below_min(target_memcg, memcg)) {
> > > > /*
> > > > * Hard protection.
> > > > * If there is no reclaimable memory, OOM.
> > > > */
> > > > continue;
> > > > } else if (mem_cgroup_below_low(target_memcg, memcg)) {
> > > > /*
> > > > * Soft protection.
> > > > * Respect the protection only as long as
> > > > * there is an unprotected supply
> > > > * of reclaimable memory from other cgroups.
> > > > */
> > > > if (!sc->memcg_low_reclaim) {
> > > > sc->memcg_low_skipped = 1;
> > > > continue;
> > > > }
> > > > memcg_memory_event(memcg, MEMCG_LOW);
> > > > }
> > > > ===
> > > >
> > > > With separate shrink begin()/end() tracepoints we can detect such
> > > > problem.
> > >
> > > How? You are only reporting the number of reclaimed pages and no
> > > reclaimed pages could be not just because of low/min limits but
> > > generally because of other reasons. You would need to report also the
> > > number of scanned/isolated pages.
> > >
> >
> > From my perspective, if memory control group (memcg) protection
> > restrictions occur, we can identify them by the absence of the end()
> > pair of begin(). Other reasons will have both tracepoints raised.
>
> That is not really a great way to detect that, TBH. Trace events can be
> lost, and then you simply do not know what has happened.

I see, thank you very much for the detailed review! I will prepare a new
patchset with memcg names in the lruvec and slab paths and will be back
soon.

--
Thank you,
Dmitry

2023-11-29 17:50:20

by Dmitry Rokosov

Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace shrink_memcg

On Wed, Nov 29, 2023 at 09:33:41AM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2023 18:20:57 +0300 Dmitry Rokosov <[email protected]> wrote:
>
> > Okay, I will try to prepare a new patch version with memcg printing from
> > lruvec and slab tracepoints.
> >
> > Then Andrew should drop the previous patchsets, I suppose. Please advise
> > on the correct workflow steps here.
>
> This series is present in mm.git's mm-unstable branch. Note
> "unstable". So dropping the v3 series and merging v4 is totally not a
> problem. It's why this branch exists - it's daily rebasing, in high
> flux.
>
> When a patchset is considered stabilized and ready, I'll move it into
> the mm-stable branch, which is (supposed to be) the non-rebasing tree
> for next merge window.
>
> If you have small fixes then I prefer little fixup patches against what
> is presently in mm-unstable.
>
> If you send replacement patches then no problem, I'll check to see
> whether I should turn them into little fixup deltas.
>
> I prefer little fixups so that people can see what has changed, so I
> can see which review/test issues were addressed and so that people
> don't feel a need to re-review the whole patchset.
>
> If generation of little fixups is impractical, I'll drop the old series
> entirely and I'll merge the new one.
>
> Each case is a judgement call, please send whatever you think makes
> most sense given the above.

Thank you for the detailed explanation! It is now completely clear to
me! I will be sending the new patch series soon.

--
Thank you,
Dmitry