2022-08-22 00:18:56

by Shakeel Butt

Subject: [PATCH 0/3] memcg: optimize charge codepath

Recently, the Linux networking stack moved from a very old per-socket
pre-charge caching scheme to per-cpu caching to avoid pre-charge fragmentation
and unwarranted OOMs. One impact of this change is that, for network traffic
workloads, the memcg charging codepath can become a bottleneck. The kernel
test robot has also reported this regression. This patch series aims to
improve the memcg charging path for such workloads.

This patch series implements three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing
between usage and high.
(C) Increase the memcg charge batch to 64.

To evaluate the impact of these optimizations, on a 72 CPUs machine, we ran
the following workload in the root memcg and then compared it against running
the same workload in a three-level cgroup hierarchy with the top level having
memory.min and memory.low set up appropriately.

$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf, in Mbps):
1. root memcg 21694.8
2. 6.0-rc1 10482.7 (-51.6%)
3. 6.0-rc1 + (A) 14542.5 (-32.9%)
4. 6.0-rc1 + (B) 12413.7 (-42.7%)
5. 6.0-rc1 + (C) 17063.7 (-21.3%)
6. 6.0-rc1 + (A+B+C) 20120.3 (-7.2%)

With all three optimizations, the memcg overhead of this workload has
been reduced from 51.6% to just 7.2%.

Shakeel Butt (3):
mm: page_counter: remove unneeded atomic ops for low/min
mm: page_counter: rearrange struct page_counter fields
memcg: increase MEMCG_CHARGE_BATCH to 64

include/linux/memcontrol.h | 7 ++++---
include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
mm/page_counter.c | 13 ++++++-------
3 files changed, 33 insertions(+), 21 deletions(-)

--
2.37.1.595.g718a3a8f04-goog


2022-08-22 00:25:13

by Shakeel Butt

Subject: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

With memcg v2 enabled, memcg->memory.usage is a very hot member for workloads
doing memcg charging on multiple CPUs concurrently, particularly network
intensive workloads. In addition, there is false cache sharing between
memory.usage and memory.high on the charge path. This patch moves 'usage'
into its own cacheline and moves all the read-mostly fields into a separate
cacheline.
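
For illustration only, the intended separation can be checked with a
hypothetical userspace model of the new layout (field types are simplified
and a 64-byte cacheline is assumed; this is not part of the patch):

/* Hypothetical userspace model of the new layout; prints whether 'usage'
 * and 'high' still share a cacheline.
 */
#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64

struct pc_model {
	char _pad1_[0] __attribute__((aligned(CACHELINE)));
	long usage;
	char _pad2_[0] __attribute__((aligned(CACHELINE)));
	long emin, min_usage, children_min_usage;
	long elow, low_usage, children_low_usage;
	long watermark, failcnt;
	char _pad3_[0] __attribute__((aligned(CACHELINE)));
	long min, low, high, max;
	void *parent;
};

int main(void)
{
	printf("usage at %zu, high at %zu, same cacheline: %s\n",
	       offsetof(struct pc_model, usage),
	       offsetof(struct pc_model, high),
	       offsetof(struct pc_model, usage) / CACHELINE ==
	       offsetof(struct pc_model, high) / CACHELINE ? "yes" : "no");
	return 0;
}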

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the
following workload in a three-level cgroup hierarchy with the top level having
memory.min and memory.low set up appropriately; more specifically, memory.min
equal to the size of the netperf binary and memory.low double that.

$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 12413.7 Mbps (18.4% improvement)

With the patch, the throughput improved by 18.4%.

One side-effect of this patch is an increase in the size of struct
mem_cgroup. However, the additional size is worth it for the performance
improvement. In addition, there are opportunities to reduce the size of
struct mem_cgroup, such as deprecating the kmem and tcpmem page counters
and better packing.

Signed-off-by: Shakeel Butt <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 679591301994..8ce99bde645f 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -3,15 +3,27 @@
#define _LINUX_PAGE_COUNTER_H

#include <linux/atomic.h>
+#include <linux/cache.h>
#include <linux/kernel.h>
#include <asm/page.h>

+#if defined(CONFIG_SMP)
+struct pc_padding {
+ char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define PC_PADDING(name) struct pc_padding name
+#else
+#define PC_PADDING(name)
+#endif
+
struct page_counter {
+ /*
+ * Make sure 'usage' does not share a cacheline with any other field. The
+ * memcg->memory.usage is a hot member of struct mem_cgroup.
+ */
+ PC_PADDING(_pad1_);
atomic_long_t usage;
- unsigned long min;
- unsigned long low;
- unsigned long high;
- unsigned long max;
+ PC_PADDING(_pad2_);

/* effective memory.min and memory.min usage tracking */
unsigned long emin;
@@ -23,16 +35,16 @@ struct page_counter {
atomic_long_t low_usage;
atomic_long_t children_low_usage;

- /* legacy */
unsigned long watermark;
unsigned long failcnt;

- /*
- * 'parent' is placed here to be far from 'usage' to reduce
- * cache false sharing, as 'usage' is written mostly while
- * parent is frequently read for cgroup's hierarchical
- * counting nature.
- */
+ /* Keep all the read-mostly fields in a separate cacheline. */
+ PC_PADDING(_pad3_);
+
+ unsigned long min;
+ unsigned long low;
+ unsigned long high;
+ unsigned long max;
struct page_counter *parent;
};

--
2.37.1.595.g718a3a8f04-goog

2022-08-22 00:51:10

by Soheil Hassas Yeganeh

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <[email protected]> wrote:
>
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 12413.7 Mbps (18.4% improvement)
>
> With the patch, the throughput improved by 18.4%.

Shakeel, for my understanding: is this on top of the gains from the
previous patch?

> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> ---
> include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
> 1 file changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..8ce99bde645f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,27 @@
> #define _LINUX_PAGE_COUNTER_H
>
> #include <linux/atomic.h>
> +#include <linux/cache.h>
> #include <linux/kernel.h>
> #include <asm/page.h>
>
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> + char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name) struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif
> +
> struct page_counter {
> + /*
> + * Make sure 'usage' does not share cacheline with any other field. The
> + * memcg->memory.usage is a hot member of struct mem_cgroup.
> + */
> + PC_PADDING(_pad1_);
> atomic_long_t usage;
> - unsigned long min;
> - unsigned long low;
> - unsigned long high;
> - unsigned long max;
> + PC_PADDING(_pad2_);
>
> /* effective memory.min and memory.min usage tracking */
> unsigned long emin;
> @@ -23,16 +35,16 @@ struct page_counter {
> atomic_long_t low_usage;
> atomic_long_t children_low_usage;
>
> - /* legacy */
> unsigned long watermark;
> unsigned long failcnt;
>
> - /*
> - * 'parent' is placed here to be far from 'usage' to reduce
> - * cache false sharing, as 'usage' is written mostly while
> - * parent is frequently read for cgroup's hierarchical
> - * counting nature.
> - */
> + /* Keep all the read most fields in a separete cacheline. */
> + PC_PADDING(_pad3_);
> +
> + unsigned long min;
> + unsigned long low;
> + unsigned long high;
> + unsigned long max;
> struct page_counter *parent;
> };
>
> --
> 2.37.1.595.g718a3a8f04-goog
>

2022-08-22 00:53:36

by Shakeel Butt

Subject: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
unconditionally. It only needs to do so when the new protection value
differs from the old one. This patch makes the xchg() conditional on that.

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the
following workload in a three-level cgroup hierarchy with the top level having
memory.min and memory.low set up appropriately; more specifically, memory.min
equal to the size of the netperf binary and memory.low double that.

$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 14542.5 Mbps (38.7% improvement)

With the patch, the throughput improved by 38.7%.

Signed-off-by: Shakeel Butt <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
mm/page_counter.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/page_counter.c b/mm/page_counter.c
index eb156ff5d603..47711aa28161 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
unsigned long usage)
{
unsigned long protected, old_protected;
- unsigned long low, min;
long delta;

if (!c->parent)
return;

- min = READ_ONCE(c->min);
- if (min || atomic_long_read(&c->min_usage)) {
- protected = min(usage, min);
+ protected = min(usage, READ_ONCE(c->min));
+ old_protected = atomic_long_read(&c->min_usage);
+ if (protected != old_protected) {
old_protected = atomic_long_xchg(&c->min_usage, protected);
delta = protected - old_protected;
if (delta)
atomic_long_add(delta, &c->parent->children_min_usage);
}

- low = READ_ONCE(c->low);
- if (low || atomic_long_read(&c->low_usage)) {
- protected = min(usage, low);
+ protected = min(usage, READ_ONCE(c->low));
+ old_protected = atomic_long_read(&c->low_usage);
+ if (protected != old_protected) {
old_protected = atomic_long_xchg(&c->low_usage, protected);
delta = protected - old_protected;
if (delta)
--
2.37.1.595.g718a3a8f04-goog

2022-08-22 00:53:44

by Shakeel Butt

Subject: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

For several years, MEMCG_CHARGE_BATCH has been kept at 32, but with bigger
machines and network intensive workloads requiring throughput in Gbps, 32 is
too small and makes the memcg charging path a bottleneck. For now, increase it
to 64 for easy acceptance into 6.0. We will need to revisit this in the future
for the ever increasing demand of higher performance.

Please note that the memcg charge path drains the per-cpu memcg charge stock,
so there should not be any OOM behavior change.
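
For context, here is a heavily simplified sketch of how the batch interacts
with the per-cpu stock on the charge path (illustrative only; the function
names follow mm/memcontrol.c but the bodies are not the exact upstream code):

/*
 * Illustrative sketch, not the exact mm/memcontrol.c code. A charge first
 * tries the per-cpu stock; on a miss it charges a full batch to the shared
 * page counters (the expensive atomic part) and parks the surplus in the
 * per-cpu stock, so subsequent charges on this CPU avoid the shared
 * counters. A bigger MEMCG_CHARGE_BATCH therefore means fewer trips to the
 * contended page counters.
 */
static int try_charge_sketch(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);

	if (consume_stock(memcg, nr_pages))	/* per-cpu fast path */
		return 0;

	page_counter_charge(&memcg->memory, batch);	/* shared, atomic */
	refill_stock(memcg, batch - nr_pages);		/* cache the surplus */
	return 0;
}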

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the
following workload in a three-level cgroup hierarchy with the top level having
memory.min and memory.low set up appropriately; more specifically, memory.min
equal to the size of the netperf binary and memory.low double that.

$ netserver -6
# 36 instances of netperf with following params
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 17064.7 Mbps (62.7% improvement)

With the patch, the throughput improved by 62.7%.

Signed-off-by: Shakeel Butt <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
include/linux/memcontrol.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..70ae91188e16 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -354,10 +354,11 @@ struct mem_cgroup {
};

/*
- * size of first charge trial. "32" comes from vmscan.c's magic value.
- * TODO: maybe necessary to use big numbers in big irons.
+ * size of first charge trial.
+ * TODO: maybe necessary to use big numbers in big irons or make it dynamic
+ * based on the workload.
*/
-#define MEMCG_CHARGE_BATCH 32U
+#define MEMCG_CHARGE_BATCH 64U

extern struct mem_cgroup *root_mem_cgroup;

--
2.37.1.595.g718a3a8f04-goog

2022-08-22 00:54:32

by Soheil Hassas Yeganeh

Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Sun, Aug 21, 2022 at 8:17 PM Shakeel Butt <[email protected]> wrote:
>
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>

Nice speed up!

Acked-by: Soheil Hassas Yeganeh <[email protected]>

> ---
> mm/page_counter.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> unsigned long usage)
> {
> unsigned long protected, old_protected;
> - unsigned long low, min;
> long delta;
>
> if (!c->parent)
> return;
>
> - min = READ_ONCE(c->min);
> - if (min || atomic_long_read(&c->min_usage)) {
> - protected = min(usage, min);
> + protected = min(usage, READ_ONCE(c->min));
> + old_protected = atomic_long_read(&c->min_usage);
> + if (protected != old_protected) {
> old_protected = atomic_long_xchg(&c->min_usage, protected);
> delta = protected - old_protected;
> if (delta)
> atomic_long_add(delta, &c->parent->children_min_usage);
> }
>
> - low = READ_ONCE(c->low);
> - if (low || atomic_long_read(&c->low_usage)) {
> - protected = min(usage, low);
> + protected = min(usage, READ_ONCE(c->low));
> + old_protected = atomic_long_read(&c->low_usage);
> + if (protected != old_protected) {
> old_protected = atomic_long_xchg(&c->low_usage, protected);
> delta = protected - old_protected;
> if (delta)
> --
> 2.37.1.595.g718a3a8f04-goog
>

2022-08-22 01:17:04

by Soheil Hassas Yeganeh

Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <[email protected]> wrote:
>
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring througput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance to 6.0. We will need to
> revisit this in future for ever increasing demand of higher performance.
>
> Please note that the memcg charge path drain the per-cpu memcg charge
> stock, so there should not be any oom behavior change.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>

Nice!

Acked-by: Soheil Hassas Yeganeh <[email protected]>

> ---
> include/linux/memcontrol.h | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
> };
>
> /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
> */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>
> extern struct mem_cgroup *root_mem_cgroup;
>
> --
> 2.37.1.595.g718a3a8f04-goog
>

2022-08-22 02:32:04

by Feng Tang

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Mon, Aug 22, 2022 at 08:17:36AM +0800, Shakeel Butt wrote:
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 12413.7 Mbps (18.4% improvement)
>
> With the patch, the throughput improved by 18.4%.
>
> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>

Looks good to me, with one nit below.

Reviewed-by: Feng Tang <[email protected]>

> ---
> include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
> 1 file changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..8ce99bde645f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,27 @@
> #define _LINUX_PAGE_COUNTER_H
>
> #include <linux/atomic.h>
> +#include <linux/cache.h>
> #include <linux/kernel.h>
> #include <asm/page.h>
>
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> + char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name) struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif

There are 2 similar padding definitions in mmzone.h and memcontrol.h:

struct memcg_padding {
char x[0];
} ____cacheline_internodealigned_in_smp;
#define MEMCG_PADDING(name) struct memcg_padding name

struct zone_padding {
char x[0];
} ____cacheline_internodealigned_in_smp;
#define ZONE_PADDING(name) struct zone_padding name;

Maybe we can generalize them and lift the definition into
include/linux/cache.h, so that more places can reuse it in the future?
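
Something like the below, perhaps (just a sketch of the idea; the name and
location are placeholders):

/* Sketch only: a generalized helper, e.g. in include/linux/cache.h, that
 * mmzone.h, memcontrol.h and page_counter.h could all reuse.
 */
#if defined(CONFIG_SMP)
struct cacheline_padding {
	char x[0];
} ____cacheline_internodealigned_in_smp;
#define CACHELINE_PADDING(name)	struct cacheline_padding name
#else
#define CACHELINE_PADDING(name)
#endif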

Thanks,
Feng

2022-08-22 02:47:23

by Feng Tang

Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon, Aug 22, 2022 at 08:17:37AM +0800, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring througput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance to 6.0. We will need to
> revisit this in future for ever increasing demand of higher performance.
>
> Please note that the memcg charge path drain the per-cpu memcg charge
> stock, so there should not be any oom behavior change.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>

This batch number has long been a pain point :) thanks for the work!

Reviewed-by: Feng Tang <[email protected]>

- Feng

> ---
> include/linux/memcontrol.h | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
> };
>
> /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
> */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>
> extern struct mem_cgroup *root_mem_cgroup;
>
> --
> 2.37.1.595.g718a3a8f04-goog
>

2022-08-22 03:16:39

by Feng Tang

Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Mon, Aug 22, 2022 at 08:17:35AM +0800, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>

Reviewed-by: Feng Tang <[email protected]>

Thanks!

- Feng

> ---
> mm/page_counter.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> unsigned long usage)
> {
> unsigned long protected, old_protected;
> - unsigned long low, min;
> long delta;
>
> if (!c->parent)
> return;
>
> - min = READ_ONCE(c->min);
> - if (min || atomic_long_read(&c->min_usage)) {
> - protected = min(usage, min);
> + protected = min(usage, READ_ONCE(c->min));
> + old_protected = atomic_long_read(&c->min_usage);
> + if (protected != old_protected) {
> old_protected = atomic_long_xchg(&c->min_usage, protected);
> delta = protected - old_protected;
> if (delta)
> atomic_long_add(delta, &c->parent->children_min_usage);
> }
>
> - low = READ_ONCE(c->low);
> - if (low || atomic_long_read(&c->low_usage)) {
> - protected = min(usage, low);
> + protected = min(usage, READ_ONCE(c->low));
> + old_protected = atomic_long_read(&c->low_usage);
> + if (protected != old_protected) {
> old_protected = atomic_long_xchg(&c->low_usage, protected);
> delta = protected - old_protected;
> if (delta)
> --
> 2.37.1.595.g718a3a8f04-goog
>

2022-08-22 05:10:26

by Shakeel Butt

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Sun, Aug 21, 2022 at 5:24 PM Soheil Hassas Yeganeh <[email protected]> wrote:
>
> On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <[email protected]> wrote:
> >
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > the workloads doing memcg charging on multiple CPUs concurrently.
> > Particularly the network intensive workloads. In addition, there is a
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves the usage into a separate cacheline and move all
> > the read most fields into separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
> >
> > $ netserver -6
> > # 36 instances of netperf with following params
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1) 10482.7 Mbps
> > With patch 12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
>
> Shakeel, for my understanding: is this on top of the gains from the
> previous patch?
>

No, this is independent of the previous patch. The cover letter has
the numbers for all three optimizations applied together.

2022-08-22 05:37:54

by Shakeel Butt

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Sun, Aug 21, 2022 at 7:12 PM Feng Tang <[email protected]> wrote:
>
> On Mon, Aug 22, 2022 at 08:17:36AM +0800, Shakeel Butt wrote:
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > the workloads doing memcg charging on multiple CPUs concurrently.
> > Particularly the network intensive workloads. In addition, there is a
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves the usage into a separate cacheline and move all
> > the read most fields into separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
> >
> > $ netserver -6
> > # 36 instances of netperf with following params
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1) 10482.7 Mbps
> > With patch 12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
> >
> > One side-effect of this patch is the increase in the size of struct
> > mem_cgroup. However for the performance improvement, this additional
> > size is worth it. In addition there are opportunities to reduce the size
> > of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> > and better packing.
> >
> > Signed-off-by: Shakeel Butt <[email protected]>
> > Reported-by: kernel test robot <[email protected]>
>
> Looks good to me, with one nit below.
>
> Reviewed-by: Feng Tang <[email protected]>

Thanks.

>
> > ---
> > include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
> > 1 file changed, 23 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 679591301994..8ce99bde645f 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -3,15 +3,27 @@
> > #define _LINUX_PAGE_COUNTER_H
> >
> > #include <linux/atomic.h>
> > +#include <linux/cache.h>
> > #include <linux/kernel.h>
> > #include <asm/page.h>
> >
> > +#if defined(CONFIG_SMP)
> > +struct pc_padding {
> > + char x[0];
> > +} ____cacheline_internodealigned_in_smp;
> > +#define PC_PADDING(name) struct pc_padding name
> > +#else
> > +#define PC_PADDING(name)
> > +#endif
>
> There are 2 similar padding definitions in mmzone.h and memcontrol.h:
>
> struct memcg_padding {
> char x[0];
> } ____cacheline_internodealigned_in_smp;
> #define MEMCG_PADDING(name) struct memcg_padding name
>
> struct zone_padding {
> char x[0];
> } ____cacheline_internodealigned_in_smp;
> #define ZONE_PADDING(name) struct zone_padding name;
>
> Maybe we can generalize them, and lift it into include/cache.h? so
> that more places can reuse it in future.
>

This makes sense but let me do that in a separate patch.

2022-08-22 10:38:58

by Michal Hocko

Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
[...]
> > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > index eb156ff5d603..47711aa28161 100644
> > --- a/mm/page_counter.c
> > +++ b/mm/page_counter.c
> > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > unsigned long usage)
> > {
> > unsigned long protected, old_protected;
> > - unsigned long low, min;
> > long delta;
> >
> > if (!c->parent)
> > return;
> >
> > - min = READ_ONCE(c->min);
> > - if (min || atomic_long_read(&c->min_usage)) {
> > - protected = min(usage, min);
> > + protected = min(usage, READ_ONCE(c->min));
> > + old_protected = atomic_long_read(&c->min_usage);
> > + if (protected != old_protected) {
>
> I have to cache that code back into brain. It is really subtle thing and
> it is not really obvious why this is still correct. I will think about
> that some more but the changelog could help with that a lot.

OK, so this patch will be most useful when min > 0 && min < usage, because
then the protection doesn't really change since the last call, i.e. when the
usage grows above the protection. Your workload benefits from this change
because that happens a lot, as only a part of the workload is protected.
Correct?

Unless I have missed anything, this shouldn't break correctness, but I still
have to think about the proportional distribution of the protection because
that adds to the complexity here.
--
Michal Hocko
SUSE Labs

2022-08-22 10:44:38

by Michal Hocko

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Mon 22-08-22 00:17:36, Shakeel Butt wrote:
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.

Again, the workload description is not particularly useful. I guess the only
important aspects are the netserver part below and the number of CPUs, because
the min and low setup doesn't have much to do with this, right? At least that
is my reading of the memory.high mentioned above.

> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 12413.7 Mbps (18.4% improvement)
>
> With the patch, the throughput improved by 18.4%.
>
> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> ---
> include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
> 1 file changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..8ce99bde645f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,27 @@
> #define _LINUX_PAGE_COUNTER_H
>
> #include <linux/atomic.h>
> +#include <linux/cache.h>
> #include <linux/kernel.h>
> #include <asm/page.h>
>
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> + char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name) struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif
> +
> struct page_counter {
> + /*
> + * Make sure 'usage' does not share cacheline with any other field. The
> + * memcg->memory.usage is a hot member of struct mem_cgroup.
> + */
> + PC_PADDING(_pad1_);

Why don't you simply require alignment for the structure?

Other than that, looks good to me and it makes sense.

--
Michal Hocko
SUSE Labs

2022-08-22 11:03:02

by Michal Hocko

Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.

This doesn't really explain why.

> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.

I have a hard time grasping what the actual setup is, why it matters, and why
the patch makes any difference. Please elaborate some more here.

> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> ---
> mm/page_counter.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> unsigned long usage)
> {
> unsigned long protected, old_protected;
> - unsigned long low, min;
> long delta;
>
> if (!c->parent)
> return;
>
> - min = READ_ONCE(c->min);
> - if (min || atomic_long_read(&c->min_usage)) {
> - protected = min(usage, min);
> + protected = min(usage, READ_ONCE(c->min));
> + old_protected = atomic_long_read(&c->min_usage);
> + if (protected != old_protected) {

I have to cache that code back into my brain. It is a really subtle thing and
it is not obvious why this is still correct. I will think about it some more,
but the changelog could help with that a lot.

--
Michal Hocko
SUSE Labs

2022-08-22 11:06:31

by Michal Hocko

Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon 22-08-22 00:17:37, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring througput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance to 6.0. We will need to
> revisit this in future for ever increasing demand of higher performance.

Yes, the batch size has always been an arbitrary number. I do not think there
have ever been any solid grounds for the value we have now, except that we
needed something and SWAP_CLUSTER_MAX was a good enough template.

Increasing it to 64 sounds like a reasonable step. It would be great to
have it scale based on the number of CPUs and potentially other factors
but that would be hard to get right and actually hard to evaluate
because it will depend on the specific workload.

> Please note that the memcg charge path drain the per-cpu memcg charge
> stock, so there should not be any oom behavior change.

It will have an effect on other stuff as well, like high limit reclaim backoff
and stats flushing.

> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.

A similar feedback on the test case description as with the other patches.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>

Anyway
Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> include/linux/memcontrol.h | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
> };
>
> /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
> */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>
> extern struct mem_cgroup *root_mem_cgroup;
>
> --
> 2.37.1.595.g718a3a8f04-goog

--
Michal Hocko
SUSE Labs

2022-08-22 13:12:47

by Soheil Hassas Yeganeh

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Mon, Aug 22, 2022 at 12:55 AM Shakeel Butt <[email protected]> wrote:
>
> On Sun, Aug 21, 2022 at 5:24 PM Soheil Hassas Yeganeh <[email protected]> wrote:
> >
> > On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <[email protected]> wrote:
> > >
> > > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > > the workloads doing memcg charging on multiple CPUs concurrently.
> > > Particularly the network intensive workloads. In addition, there is a
> > > false cache sharing between memory.usage and memory.high on the charge
> > > path. This patch moves the usage into a separate cacheline and move all
> > > the read most fields into separate cacheline.
> > >
> > > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > > ran the following workload in a three level of cgroup hierarchy with top
> > > level having min and low setup appropriately. More specifically
> > > memory.min equal to size of netperf binary and memory.low double of
> > > that.
> > >
> > > $ netserver -6
> > > # 36 instances of netperf with following params
> > > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> > >
> > > Results (average throughput of netperf):
> > > Without (6.0-rc1) 10482.7 Mbps
> > > With patch 12413.7 Mbps (18.4% improvement)
> > >
> > > With the patch, the throughput improved by 18.4%.
> >
> > Shakeel, for my understanding: is this on top of the gains from the
> > previous patch?
> >
>
> No, this is independent of the previous patch. The cover letter has
> the numbers for all three optimizations applied together.

Acked-by: Soheil Hassas Yeganeh <[email protected]>

2022-08-22 15:07:09

by Shakeel Butt

Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> [...]
> > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > index eb156ff5d603..47711aa28161 100644
> > > --- a/mm/page_counter.c
> > > +++ b/mm/page_counter.c
> > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > > unsigned long usage)
> > > {
> > > unsigned long protected, old_protected;
> > > - unsigned long low, min;
> > > long delta;
> > >
> > > if (!c->parent)
> > > return;
> > >
> > > - min = READ_ONCE(c->min);
> > > - if (min || atomic_long_read(&c->min_usage)) {
> > > - protected = min(usage, min);
> > > + protected = min(usage, READ_ONCE(c->min));
> > > + old_protected = atomic_long_read(&c->min_usage);
> > > + if (protected != old_protected) {
> >
> > I have to cache that code back into brain. It is really subtle thing and
> > it is not really obvious why this is still correct. I will think about
> > that some more but the changelog could help with that a lot.
>
> OK, so the this patch will be most useful when the min > 0 && min <
> usage because then the protection doesn't really change since the last
> call. In other words when the usage grows above the protection and your
> workload benefits from this change because that happens a lot as only a
> part of the workload is protected. Correct?

Yes, that is correct. I hope the experiment setup is clear now.

>
> Unless I have missed anything this shouldn't break the correctness but I
> still have to think about the proportional distribution of the
> protection because that adds to the complexity here.

The patch is not changing any semantics. It is just removing an
unnecessary atomic xchg() for a specific scenario (min > 0 && min <
usage). I don't think there will be any change related to proportional
distribution of the protection.
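
To make the scenario concrete, a tiny made-up example:

/*
 * Made-up numbers: memory.min = 100 pages and usage grows 150 -> 160 on
 * the charge path.
 *
 *   protected     = min(usage, c->min) = min(160, 100) = 100
 *   old_protected = atomic_long_read(&c->min_usage)    = 100
 *
 * protected == old_protected, so the atomic_long_xchg() and the update of
 * parent->children_min_usage are skipped. Before the patch, the xchg() was
 * issued on every call simply because min != 0.
 */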

2022-08-22 15:32:21

by Shakeel Butt

Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <[email protected]> wrote:
>
[...]
>
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
>
> a similar feedback to the test case description as with other patches.

What more info should I add to the description? Why I set up min and low, or
something else?

> >
> > $ netserver -6
> > # 36 instances of netperf with following params
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1) 10482.7 Mbps
> > With patch 17064.7 Mbps (62.7% improvement)
> >
> > With the patch, the throughput improved by 62.7%.
> >
> > Signed-off-by: Shakeel Butt <[email protected]>
> > Reported-by: kernel test robot <[email protected]>
>
> Anyway
> Acked-by: Michal Hocko <[email protected]>

Thanks

2022-08-22 15:41:38

by Michal Hocko

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Mon 22-08-22 08:06:14, Shakeel Butt wrote:
[...]
> > > struct page_counter {
> > > + /*
> > > + * Make sure 'usage' does not share cacheline with any other field. The
> > > + * memcg->memory.usage is a hot member of struct mem_cgroup.
> > > + */
> > > + PC_PADDING(_pad1_);
> >
> > Why don't you simply require alignment for the structure?
>
> I don't just want the alignment of the structure. I want different
> fields of this structure to not share the cache line. More
> specifically the 'high' and 'usage' fields. With this change the usage
> will be its own cache line, the read-most fields will be on separate
> cache line and the fields which sometimes get updated on charge path
> based on some condition will be a different cache line from the
> previous two.

I do not follow. If you make an explicit alignment requirement for the
structure, then the first field in the structure will be guaranteed to have
that alignment, and you get the rest onto another cache line by adding
padding after it.

--
Michal Hocko
SUSE Labs

2022-08-22 15:42:50

by Michal Hocko

Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon 22-08-22 08:09:01, Shakeel Butt wrote:
> On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <[email protected]> wrote:
> >
> [...]
> >
> > > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > > ran the following workload in a three level of cgroup hierarchy with top
> > > level having min and low setup appropriately. More specifically
> > > memory.min equal to size of netperf binary and memory.low double of
> > > that.
> >
> > a similar feedback to the test case description as with other patches.
>
> What more info should I add to the description? Why did I set up min
> and low or something else?

I do see why you wanted to keep the test consistent over those three
patches. I would just drop the reference to the protection configuration
because it likely doesn't make much of an impact, does it? It is the
multi-CPU setup and false sharing that makes the real difference. Or am
I wrong in assuming that?

--
Michal Hocko
SUSE Labs

2022-08-22 16:02:50

by Shakeel Butt

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Mon, Aug 22, 2022 at 3:23 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 22-08-22 00:17:36, Shakeel Butt wrote:
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > the workloads doing memcg charging on multiple CPUs concurrently.
> > Particularly the network intensive workloads. In addition, there is a
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves the usage into a separate cacheline and move all
> > the read most fields into separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
>
> Again the workload description is not particularly useful. I guess the
> only important aspect is the netserver part below and the number of CPUs
> because min and low setup doesn't have much to do with this, right? At
> least that is my reading of the memory.high mentioned above.
>

The experiment numbers below are for this patch applied independently, i.e.
the unnecessary min/low atomic xchg() is still happening in both setups. I
could run the experiment without setting min and low, but I wanted to keep
the setup exactly the same for all three optimizations.

This patch and the following perf numbers show only the impact of removing
the false sharing in struct page_counter for memcg->memory on the charging
code path.

> > $ netserver -6
> > # 36 instances of netperf with following params
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1) 10482.7 Mbps
> > With patch 12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
> >
> > One side-effect of this patch is the increase in the size of struct
> > mem_cgroup. However for the performance improvement, this additional
> > size is worth it. In addition there are opportunities to reduce the size
> > of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> > and better packing.
> >
> > Signed-off-by: Shakeel Butt <[email protected]>
> > Reported-by: kernel test robot <[email protected]>
> > ---
> > include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
> > 1 file changed, 23 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 679591301994..8ce99bde645f 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -3,15 +3,27 @@
> > #define _LINUX_PAGE_COUNTER_H
> >
> > #include <linux/atomic.h>
> > +#include <linux/cache.h>
> > #include <linux/kernel.h>
> > #include <asm/page.h>
> >
> > +#if defined(CONFIG_SMP)
> > +struct pc_padding {
> > + char x[0];
> > +} ____cacheline_internodealigned_in_smp;
> > +#define PC_PADDING(name) struct pc_padding name
> > +#else
> > +#define PC_PADDING(name)
> > +#endif
> > +
> > struct page_counter {
> > + /*
> > + * Make sure 'usage' does not share cacheline with any other field. The
> > + * memcg->memory.usage is a hot member of struct mem_cgroup.
> > + */
> > + PC_PADDING(_pad1_);
>
> Why don't you simply require alignment for the structure?

I don't just want alignment of the structure. I want different fields of this
structure to not share a cache line, more specifically the 'high' and 'usage'
fields. With this change, 'usage' will be on its own cache line, the
read-mostly fields will be on a separate cache line, and the fields which
sometimes get updated on the charge path based on some condition will be on
a cache line different from the previous two.

2022-08-22 16:12:07

by Michal Hocko

Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> > [...]
> > > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > > index eb156ff5d603..47711aa28161 100644
> > > > --- a/mm/page_counter.c
> > > > +++ b/mm/page_counter.c
> > > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > > > unsigned long usage)
> > > > {
> > > > unsigned long protected, old_protected;
> > > > - unsigned long low, min;
> > > > long delta;
> > > >
> > > > if (!c->parent)
> > > > return;
> > > >
> > > > - min = READ_ONCE(c->min);
> > > > - if (min || atomic_long_read(&c->min_usage)) {
> > > > - protected = min(usage, min);
> > > > + protected = min(usage, READ_ONCE(c->min));
> > > > + old_protected = atomic_long_read(&c->min_usage);
> > > > + if (protected != old_protected) {
> > >
> > > I have to cache that code back into brain. It is really subtle thing and
> > > it is not really obvious why this is still correct. I will think about
> > > that some more but the changelog could help with that a lot.
> >
> > OK, so the this patch will be most useful when the min > 0 && min <
> > usage because then the protection doesn't really change since the last
> > call. In other words when the usage grows above the protection and your
> > workload benefits from this change because that happens a lot as only a
> > part of the workload is protected. Correct?
>
> Yes, that is correct. I hope the experiment setup is clear now.

Maybe it is just me and it took a bit to grasp, but maybe we want to save our
future selves from going through that mental process again. So please just be
explicit about that in the changelog. It is really the point that workloads
exceeding the protection benefit the most which would help in understanding
this patch.

> > Unless I have missed anything this shouldn't break the correctness but I
> > still have to think about the proportional distribution of the
> > protection because that adds to the complexity here.
>
> The patch is not changing any semantics. It is just removing an
> unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> usage). I don't think there will be any change related to proportional
> distribution of the protection.

Yes, I suspect you are right. I just remembered previous fixes
like 503970e42325 ("mm: memcontrol: fix memory.low proportional
distribution") which just made me nervous that this is a tricky area.

I will have another look tomorrow with a fresh brain and send an ack.
--
Michal Hocko
SUSE Labs

2022-08-22 16:20:00

by Shakeel Butt

Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Mon, Aug 22, 2022 at 8:15 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 22-08-22 08:06:14, Shakeel Butt wrote:
> [...]
> > > > struct page_counter {
> > > > + /*
> > > > + * Make sure 'usage' does not share cacheline with any other field. The
> > > > + * memcg->memory.usage is a hot member of struct mem_cgroup.
> > > > + */
> > > > + PC_PADDING(_pad1_);
> > >
> > > Why don't you simply require alignment for the structure?
> >
> > I don't just want the alignment of the structure. I want different
> > fields of this structure to not share the cache line. More
> > specifically the 'high' and 'usage' fields. With this change the usage
> > will be its own cache line, the read-most fields will be on separate
> > cache line and the fields which sometimes get updated on charge path
> > based on some condition will be a different cache line from the
> > previous two.
>
> I do not follow. If you make an explicit requirement for the structure
> alignement then the first field in the structure will be guarantied to
> have that alignement and you achieve the rest to be in the other cache
> line by adding padding behind that.

Oh, you were talking specifically about _pad1_. Yes, we can remove it and
make the struct cacheline aligned. I will do it in the next version.
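
Something along these lines, i.e. a rough sketch of what v2 might look like
(not the final patch):

struct page_counter {
	/*
	 * The struct itself is cacheline aligned, so 'usage' gets its own
	 * cacheline without any leading padding.
	 */
	atomic_long_t usage;
	PC_PADDING(_pad1_);

	/* effective memory.min and memory.min usage tracking */
	unsigned long emin;
	atomic_long_t min_usage;
	atomic_long_t children_min_usage;

	/* effective memory.low and memory.low usage tracking */
	unsigned long elow;
	atomic_long_t low_usage;
	atomic_long_t children_low_usage;

	unsigned long watermark;
	unsigned long failcnt;

	/* Keep all the read-mostly fields in a separate cacheline. */
	PC_PADDING(_pad2_);
	unsigned long min;
	unsigned long low;
	unsigned long high;
	unsigned long max;
	struct page_counter *parent;
} ____cacheline_internodealigned_in_smp;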

2022-08-22 17:03:37

by Shakeel Butt

Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Mon, Aug 22, 2022 at 8:20 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > > > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> > > [...]
> > > > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > > > index eb156ff5d603..47711aa28161 100644
> > > > > --- a/mm/page_counter.c
> > > > > +++ b/mm/page_counter.c
> > > > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > > > > unsigned long usage)
> > > > > {
> > > > > unsigned long protected, old_protected;
> > > > > - unsigned long low, min;
> > > > > long delta;
> > > > >
> > > > > if (!c->parent)
> > > > > return;
> > > > >
> > > > > - min = READ_ONCE(c->min);
> > > > > - if (min || atomic_long_read(&c->min_usage)) {
> > > > > - protected = min(usage, min);
> > > > > + protected = min(usage, READ_ONCE(c->min));
> > > > > + old_protected = atomic_long_read(&c->min_usage);
> > > > > + if (protected != old_protected) {
> > > >
> > > > I have to cache that code back into brain. It is really subtle thing and
> > > > it is not really obvious why this is still correct. I will think about
> > > > that some more but the changelog could help with that a lot.
> > >
> > > OK, so the this patch will be most useful when the min > 0 && min <
> > > usage because then the protection doesn't really change since the last
> > > call. In other words when the usage grows above the protection and your
> > > workload benefits from this change because that happens a lot as only a
> > > part of the workload is protected. Correct?
> >
> > Yes, that is correct. I hope the experiment setup is clear now.
>
> Maybe it is just me that it took a bit to grasp, but maybe we want to
> save our future selves from going through that mental process again. So
> please just be explicit about that in the changelog. It is really the
> point that workloads exceeding the protection benefit the most that
> would help in understanding this patch.
>

I will add more detail in the commit message in the next version.

> > > Unless I have missed anything this shouldn't break the correctness but I
> > > still have to think about the proportional distribution of the
> > > protection because that adds to the complexity here.
> >
> > The patch is not changing any semantics. It is just removing an
> > unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> > usage). I don't think there will be any change related to proportional
> > distribution of the protection.
>
> Yes, I suspect you are right. I just remembered previous fixes
> like 503970e42325 ("mm: memcontrol: fix memory.low proportional
> distribution") which just made me nervous that this is a tricky area.
>
> I will have another look tomorrow with a fresh brain and send an ack.

I will wait for your ack before sending the next version.
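
For reference, a minimal userspace sketch of the fast path discussed above
(C11 atomics; the names are illustrative, not the kernel API). When the
clamped protection equals the last propagated value, which is the common
case once usage exceeds min, both the atomic exchange and the propagation
to the parent are skipped:

#include <stdatomic.h>

struct pc_model {
        long min;                       /* configured protection */
        atomic_long min_usage;          /* last propagated value */
        atomic_long children_min_usage; /* parent-side aggregate */
        struct pc_model *parent;
};

static void propagate_min(struct pc_model *c, long usage)
{
        long protected, old, delta;

        if (!c->parent)
                return;

        protected = usage < c->min ? usage : c->min;
        old = atomic_load(&c->min_usage);
        if (protected == old)
                return;         /* unchanged: no xchg, no atomic add */

        old = atomic_exchange(&c->min_usage, protected);
        delta = protected - old;
        if (delta)
                atomic_fetch_add(&c->parent->children_min_usage, delta);
}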

2022-08-22 17:10:12

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon, Aug 22, 2022 at 8:22 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 22-08-22 08:09:01, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <[email protected]> wrote:
> > >
> > [...]
> > >
> > > > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > > > ran the following workload in a three level of cgroup hierarchy with top
> > > > level having min and low setup appropriately. More specifically
> > > > memory.min equal to size of netperf binary and memory.low double of
> > > > that.
> > >
> > > Similar feedback on the test case description as with the other patches.
> >
> > What more info should I add to the description? Why did I set up min
> > and low or something else?
>
> I do see why you wanted to keep the test consistent over those three
> patches. I would just drop the reference to the protection configuration
> because it likely doesn't make much of an impact, does it? It is the
> multi cpu setup and false sharing that makes the real difference. Or am
> I wrong in assuming that?
>

No, you are correct. I will clean up the commit message in the next version.

2022-08-22 18:29:10

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Mon, Aug 22, 2022 at 12:17:35AM +0000, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> unconditionally. It only needs to do that operation if the new value of
> the protection is different from the old one. This patch does that.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%.

Nice savings!

>
> Signed-off-by: Shakeel Butt <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> ---
> mm/page_counter.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> unsigned long usage)
> {
> unsigned long protected, old_protected;
> - unsigned long low, min;
> long delta;
>
> if (!c->parent)
> return;
>
> - min = READ_ONCE(c->min);
> - if (min || atomic_long_read(&c->min_usage)) {
> - protected = min(usage, min);
> + protected = min(usage, READ_ONCE(c->min));
> + old_protected = atomic_long_read(&c->min_usage);
> + if (protected != old_protected) {
> old_protected = atomic_long_xchg(&c->min_usage, protected);
> delta = protected - old_protected;
> if (delta)
> atomic_long_add(delta, &c->parent->children_min_usage);

What if there is a concurrent update of c->min_usage? Then the patched version
can miss an update. I can't imagine a case where it will lead to bad consequences,
so it's probably ok, but it is not super obvious.
I think the way to think of it is that a missed update will be fixed by the next
one, so it's ok to run for some time with old numbers.

Acked-by: Roman Gushchin <[email protected]>

Thanks!

2022-08-22 18:49:02

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields

On Mon, Aug 22, 2022 at 09:04:59AM -0700, Shakeel Butt wrote:
> On Mon, Aug 22, 2022 at 8:15 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 22-08-22 08:06:14, Shakeel Butt wrote:
> > [...]
> > > > > struct page_counter {
> > > > > + /*
> > > > > + * Make sure 'usage' does not share cacheline with any other field. The
> > > > > + * memcg->memory.usage is a hot member of struct mem_cgroup.
> > > > > + */
> > > > > + PC_PADDING(_pad1_);
> > > >
> > > > Why don't you simply require alignment for the structure?
> > >
> > > I don't just want the alignment of the structure. I want different
> > > fields of this structure to not share a cache line. More
> > > specifically the 'high' and 'usage' fields. With this change the usage
> > > will be on its own cache line, the read-mostly fields will be on a separate
> > > cache line, and the fields which sometimes get updated on the charge path
> > > based on some condition will be on a different cache line from the
> > > previous two.
> >
> > I do not follow. If you make an explicit requirement for the structure
> > alignment then the first field in the structure will be guaranteed to
> > have that alignment, and you can get the rest onto another cache
> > line by adding padding behind it.
>
> Oh, you were talking explicitly about _pad1_. Yes, we can remove it
> and make the struct cacheline aligned. I will do it in the next version.

Yes, please, it caught my eye too.
With this change:
Acked-by: Roman Gushchin <[email protected]>

Also, can you please include the numbers on the additional memory overhead?
I think it's still worth it, I just think we need to include them for the record.

Thanks!

2022-08-22 18:55:18

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon, Aug 22, 2022 at 12:17:37AM +0000, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and network intensive workloads requiring throughput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance into 6.0. We will need to
> revisit this in the future for the ever increasing demand for higher performance.
>
> Please note that the memcg charge path drains the per-cpu memcg charge
> stock, so there should not be any oom behavior change.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1) 10482.7 Mbps
> With patch 17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.

This is pretty significant!

Acked-by: Roman Gushchin <[email protected]>

I wonder only if we want to make it configurable (Idk a sysctl or maybe
a config option) and close the topic.

Thanks!
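
For context, the change under discussion amounts to bumping a single
constant (the definition lives in include/linux/memcontrol.h; this is a
sketch, not the verbatim hunk):

/* Per-cpu charge cache size, in pages: raised from 32 to 64 by this
 * patch. With 4K pages that is up to 256KiB of charged-but-unused
 * memory per CPU, which is the accuracy trade-off discussed below. */
#define MEMCG_CHARGE_BATCH      64U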

2022-08-22 19:45:01

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
[...]
> I wonder only if we want to make it configurable (Idk a sysctl or maybe
> a config option) and close the topic.

I do not think this is a good idea. We have other examples where we have
outsourced internal tuning to the userspace and it has mostly proven
impractical and long term more problematic than useful (e.g.
lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
name some that come to my mind). I have more often seen these used
incorrectly than usefully.

In this case, I guess we should consider either moving to per memcg
charge batching and see whether the pcp overhead x memcg_count is worth
that or some automagic tuning of the batch size depending on how
effectively the batch is used. Certainly a lot of room for
experimenting.

--
Michal Hocko
SUSE Labs

2022-08-23 02:42:53

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon, Aug 22, 2022 at 09:34:59PM +0200, Michal Hocko wrote:
> On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
> [...]
> > I wonder only if we want to make it configurable (Idk a sysctl or maybe
> > a config option) and close the topic.
>
> I do not think this is a good idea. We have other examples where we have
> outsourced internal tuning to the userspace and it has mostly proven
> impractical and long term more problematic than useful (e.g.
> lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
> name some that come to my mind). I have more often seen these used
> incorrectly than usefully.

I agree, not a strong opinion here. But I wonder if somebody will
complain about Shakeel's change because of the reduced accuracy.
I know some users are using memory cgroups to track the size of various
workloads (including relatively small ones) and the 32->64 pages per cpu change
can be noticeable for them. But we can wait for an actual bug report :)

>
> In this case, I guess we should consider either moving to per memcg
> charge batching and see whether the pcp overhead x memcg_count is worth
> that or some automagic tuning of the batch size depending on how
> effectively the batch is used. Certainly a lot of room for
> experimenting.

I'm not a big believer in automagic tuning here because it's a fundamental
trade-off of accuracy vs performance, and various users might make a different
choice depending on their needs, not on the cpu count or something else.

Per-memcg batching sounds interesting though. For example, we can likely
batch updates on leaf cgroups and have a single atomic update instead of
multiple ones most of the time. Or do you mean something different?

Thanks!

2022-08-23 05:14:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64

On Mon 22-08-22 19:22:26, Roman Gushchin wrote:
> On Mon, Aug 22, 2022 at 09:34:59PM +0200, Michal Hocko wrote:
> > On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
> > [...]
> > > I wonder only if we want to make it configurable (Idk a sysctl or maybe
> > > a config option) and close the topic.
> >
> > I do not think this is a good idea. We have other examples where we have
> > outsourced internal tuning to the userspace and it has mostly proven
> > impractical and long term more problematic than useful (e.g.
> > lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
> > name some that come to my mind). I have more often seen these used
> > incorrectly than usefully.
>
> I agree, not a strong opinion here. But I wonder if somebody will
> complain about Shakeel's change because of the reduced accuracy.
> I know some users are using memory cgroups to track the size of various
> workloads (including relatively small ones) and the 32->64 pages per cpu change
> can be noticeable for them. But we can wait for an actual bug report :)

Yes, that would be my approach. I have seen reports like that already
but that was mostly because of heavy caching on the SLUB side on older
kernels. So there surely are workloads with small limits configured
(e.g. 20MB). On the other hand, those users were receptive to adapting their
limits as they were kinda arbitrary anyway.

> > In this case, I guess we should consider either moving to per memcg
> > charge batching and see whether the pcp overhead x memcg_count is worth
> > that or some automagic tuning of the batch size depending on how
> > effectively the batch is used. Certainly a lot of room for
> > experimenting.
>
> I'm not a big believer in automagic tuning here because it's a fundamental
> trade-off of accuracy vs performance, and various users might make a different
> choice depending on their needs, not on the cpu count or something else.

Yes, this is not an easy thing to get right. I was mostly thinking of some
auto scaling based on the limit size, or growing the stock if cache hits
are common and shrinking it when stocks get flushed often because multiple
memcgs compete over the same pcp stock. But to me it seems like a per
memcg approach might yield better results without too many heuristics
(albeit more memory hungry).

> Per-memcg batching sounds interesting though. For example, we can likely
> batch updates on leaf cgroups and have a single atomic update instead of
> multiple ones most of the time. Or do you mean something different?

No, that was exactly my thinking as well.

--
Michal Hocko
SUSE Labs
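
As a purely speculative illustration of the per-memcg batching idea being
discussed (not existing kernel code; a userspace model with made-up names
and an assumed batch size), the point is that each memcg keeps its own
per-cpu pre-charge, so concurrent chargers of different memcgs on one CPU
no longer evict each other's stock:

#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS   72      /* assumed CPU count, as in the test machine */
#define BATCH   64      /* assumed per-cpu batch, in pages */

struct memcg_model {
        atomic_long usage;              /* stands in for memcg->memory.usage */
        long pcpu_cache[NCPUS];         /* per-memcg, per-cpu pre-charge */
};

/* Charge nr_pages on 'cpu': consume the per-memcg cache first and touch
 * the shared atomic counter only when the cache runs dry. */
static bool try_charge_model(struct memcg_model *memcg, int cpu, long nr_pages)
{
        if (memcg->pcpu_cache[cpu] >= nr_pages) {
                memcg->pcpu_cache[cpu] -= nr_pages;     /* no atomics here */
                return true;
        }
        /* Refill: charge the request plus one batch ahead of time. */
        atomic_fetch_add(&memcg->usage, nr_pages + BATCH);
        memcg->pcpu_cache[cpu] += BATCH;
        return true;
}

The memory-hungry part mentioned above is visible in the model: the
pre-charged slack now scales with memcg count times CPU count rather than
with CPU count alone.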

2022-08-23 12:34:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min

On Mon 22-08-22 17:20:02, Michal Hocko wrote:
> On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <[email protected]> wrote:
[...]
> > > Unless I have missed anything this shouldn't break the correctness but I
> > > still have to think about the proportional distribution of the
> > > protection because that adds to the complexity here.
> >
> > The patch is not changing any semantics. It is just removing an
> > unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> > usage). I don't think there will be any change related to proportional
> > distribution of the protection.
>
> Yes, I suspect you are right. I just remembered previous fixes
> like 503970e42325 ("mm: memcontrol: fix memory.low proportional
> distribution") which just made me nervous that this is a tricky area.
>
> I will have another look tomorrow with a fresh brain and send an ack.

I cannot spot any problem. But I guess it would be good to have a little
comment to explain that races on the min_usage update (mentioned by Roman)
are acceptable and that the savings from skipping the atomic update are preferred.

The worst case I can imagine would be something like an uncharge of 4kB racing
with a charge of 2MB. The first reduces the protection (min_usage) while the other one
misses that update and doesn't increase it. But even then the effect
shouldn't be really large. At least I have a hard time imagining this would
throw things off too much.
--
Michal Hocko
SUSE Labs