2021-01-19 07:26:52

by Feng Tang

Subject: [PATCH v2] mm: page_counter: relayout structure to reduce false sharing

When checking a memory cgroup related performance regression [1],
from the perf c2c profiling data, we found high false sharing for
accessing 'usage' and 'parent'.

On 64-bit systems, 'usage' and 'parent' are close to each other and
likely to share one cache line (for cache line sizes of 64 bytes or
more). 'usage' is usually written, while 'parent' is usually read
because of the cgroup's hierarchical counting.

So move the 'parent' to the end of the structure to make sure they
are in different cache lines.

Following is some performance data with the patch, against
v5.11-rc1. [ In the data, A is a platform with 2 sockets and 48C/96T,
B is a platform with 4 sockets and 72C/144T, a %stddev is shown
only if it is bigger than 2%, and P100/P50 means the number of test
tasks equals 100%/50% of nr_cpu ]

will-it-scale/malloc1
---------------------
             v5.11-rc1             v5.11-rc1+patch

A-P100       15782 ± 2%    -0.1%    15765 ± 3%  will-it-scale.per_process_ops
A-P50        21511         +8.9%    23432       will-it-scale.per_process_ops
B-P100        9155         +2.2%     9357       will-it-scale.per_process_ops
B-P50        10967         +7.1%    11751 ± 2%  will-it-scale.per_process_ops

will-it-scale/pagefault2
------------------------
             v5.11-rc1             v5.11-rc1+patch

A-P100       79028         +3.0%    81411       will-it-scale.per_process_ops
A-P50       183960 ± 2%    +4.4%   192078 ± 2%  will-it-scale.per_process_ops
B-P100       85966         +9.9%    94467 ± 3%  will-it-scale.per_process_ops
B-P50       198195         +9.8%   217526       will-it-scale.per_process_ops

fio (4k/1M is block size)
-------------------------
             v5.11-rc1             v5.11-rc1+patch

A-P50-r-4k   16881 ± 2%    +1.2%    17081 ± 2%  fio.read_bw_MBps
A-P50-w-4k    3931         +4.5%     4111 ± 2%  fio.write_bw_MBps
A-P50-r-1M   15178         -0.2%    15154       fio.read_bw_MBps
A-P50-w-1M    3924         +0.1%     3929       fio.write_bw_MBps

[1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
---
Changelogs:

v2:
* Adjust the format of performance data to be more readable,
as suggested by Michal Hocko

include/linux/page_counter.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 85bd413..6795913 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -12,7 +12,6 @@ struct page_counter {
 	unsigned long low;
 	unsigned long high;
 	unsigned long max;
-	struct page_counter *parent;

 	/* effective memory.min and memory.min usage tracking */
 	unsigned long emin;
@@ -27,6 +26,14 @@ struct page_counter {
 	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
+
+	/*
+	 * 'parent' is placed here to be far from 'usage' to reduce
+	 * cache false sharing, as 'usage' is written mostly while
+	 * parent is frequently read for cgroup's hierarchical
+	 * counting nature.
+	 */
+	struct page_counter *parent;
 };

 #if BITS_PER_LONG == 32
--
2.7.4


2021-01-19 16:44:43

by Shakeel Butt

Subject: Re: [PATCH v2] mm: page_counter: relayout structure to reduce false sharing

On Mon, Jan 18, 2021 at 11:20 PM Feng Tang <[email protected]> wrote:
>
> When checking a memory cgroup related performance regression [1],
> from the perf c2c profiling data, we found high false sharing for
> accessing 'usage' and 'parent'.
>
> On 64 bit system, the 'usage' and 'parent' are close to each other,
> and easy to be in one cacheline (for cacheline size == 64+ B). 'usage'
> is usually written, while 'parent' is usually read as the cgroup's
> hierarchical counting nature.
>
> So move the 'parent' to the end of the structure to make sure they
> are in different cache lines.
>
> Following are some performance data with the patch, against
> v5.11-rc1. [ In the data, A means a platform with 2 sockets 48C/96T,
> B is a platform of 4 sockets 72C/144T, and if a %stddev will be
> shown bigger than 2%, P100/P50 means number of test tasks equals
> to 100%/50% of nr_cpu]
>
> will-it-scale/malloc1
> ---------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P100 15782 ± 2% -0.1% 15765 ± 3% will-it-scale.per_process_ops
> A-P50 21511 +8.9% 23432 will-it-scale.per_process_ops
> B-P100 9155 +2.2% 9357 will-it-scale.per_process_ops
> B-P50 10967 +7.1% 11751 ± 2% will-it-scale.per_process_ops
>
> will-it-scale/pagefault2
> ------------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P100 79028 +3.0% 81411 will-it-scale.per_process_ops
> A-P50 183960 ± 2% +4.4% 192078 ± 2% will-it-scale.per_process_ops
> B-P100 85966 +9.9% 94467 ± 3% will-it-scale.per_process_ops
> B-P50 198195 +9.8% 217526 will-it-scale.per_process_ops
>
> fio (4k/1M is block size)
> -------------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P50-r-4k 16881 ± 2% +1.2% 17081 ± 2% fio.read_bw_MBps
> A-P50-w-4k 3931 +4.5% 4111 ± 2% fio.write_bw_MBps
> A-P50-r-1M 15178 -0.2% 15154 fio.read_bw_MBps
> A-P50-w-1M 3924 +0.1% 3929 fio.write_bw_MBps
>
> [1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> Signed-off-by: Feng Tang <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2021-01-19 17:05:00

by Johannes Weiner

Subject: Re: [PATCH v2] mm: page_counter: relayout structure to reduce false sharing

On Tue, Jan 19, 2021 at 03:20:14PM +0800, Feng Tang wrote:
> When checking a memory cgroup related performance regression [1],
> from the perf c2c profiling data, we found high false sharing for
> accessing 'usage' and 'parent'.
>
> On 64 bit system, the 'usage' and 'parent' are close to each other,
> and easy to be in one cacheline (for cacheline size == 64+ B). 'usage'
> is usually written, while 'parent' is usually read as the cgroup's
> hierarchical counting nature.
>
> So move the 'parent' to the end of the structure to make sure they
> are in different cache lines.
>
> Following are some performance data with the patch, against
> v5.11-rc1. [ In the data, A means a platform with 2 sockets 48C/96T,
> B is a platform of 4 sockets 72C/144T, and if a %stddev will be
> shown bigger than 2%, P100/P50 means number of test tasks equals
> to 100%/50% of nr_cpu]
>
> will-it-scale/malloc1
> ---------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P100 15782 ± 2% -0.1% 15765 ± 3% will-it-scale.per_process_ops
> A-P50 21511 +8.9% 23432 will-it-scale.per_process_ops
> B-P100 9155 +2.2% 9357 will-it-scale.per_process_ops
> B-P50 10967 +7.1% 11751 ± 2% will-it-scale.per_process_ops
>
> will-it-scale/pagefault2
> ------------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P100 79028 +3.0% 81411 will-it-scale.per_process_ops
> A-P50 183960 ± 2% +4.4% 192078 ± 2% will-it-scale.per_process_ops
> B-P100 85966 +9.9% 94467 ± 3% will-it-scale.per_process_ops
> B-P50 198195 +9.8% 217526 will-it-scale.per_process_ops
>
> fio (4k/1M is block size)
> -------------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P50-r-4k 16881 ± 2% +1.2% 17081 ± 2% fio.read_bw_MBps
> A-P50-w-4k 3931 +4.5% 4111 ± 2% fio.write_bw_MBps
> A-P50-r-1M 15178 -0.2% 15154 fio.read_bw_MBps
> A-P50-w-1M 3924 +0.1% 3929 fio.write_bw_MBps
>
> [1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> Signed-off-by: Feng Tang <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

Thanks!

2021-01-20 08:01:39

by Michal Hocko

Subject: Re: [PATCH v2] mm: page_counter: relayout structure to reduce false sharing

On Tue 19-01-21 15:20:14, Feng Tang wrote:
> When checking a memory cgroup related performance regression [1],
> from the perf c2c profiling data, we found high false sharing for
> accessing 'usage' and 'parent'.
>
> On 64 bit system, the 'usage' and 'parent' are close to each other,
> and easy to be in one cacheline (for cacheline size == 64+ B). 'usage'
> is usually written, while 'parent' is usually read as the cgroup's
> hierarchical counting nature.
>
> So move the 'parent' to the end of the structure to make sure they
> are in different cache lines.
>
> Following are some performance data with the patch, against
> v5.11-rc1. [ In the data, A means a platform with 2 sockets 48C/96T,
> B is a platform of 4 sockets 72C/144T, and if a %stddev will be
> shown bigger than 2%, P100/P50 means number of test tasks equals
> to 100%/50% of nr_cpu]
>
> will-it-scale/malloc1
> ---------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P100 15782 ± 2% -0.1% 15765 ± 3% will-it-scale.per_process_ops
> A-P50 21511 +8.9% 23432 will-it-scale.per_process_ops
> B-P100 9155 +2.2% 9357 will-it-scale.per_process_ops
> B-P50 10967 +7.1% 11751 ± 2% will-it-scale.per_process_ops
>
> will-it-scale/pagefault2
> ------------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P100 79028 +3.0% 81411 will-it-scale.per_process_ops
> A-P50 183960 ± 2% +4.4% 192078 ± 2% will-it-scale.per_process_ops
> B-P100 85966 +9.9% 94467 ± 3% will-it-scale.per_process_ops
> B-P50 198195 +9.8% 217526 will-it-scale.per_process_ops
>
> fio (4k/1M is block size)
> -------------------------
> v5.11-rc1 v5.11-rc1+patch
>
> A-P50-r-4k 16881 ± 2% +1.2% 17081 ± 2% fio.read_bw_MBps
> A-P50-w-4k 3931 +4.5% 4111 ± 2% fio.write_bw_MBps
> A-P50-r-1M 15178 -0.2% 15154 fio.read_bw_MBps
> A-P50-w-1M 3924 +0.1% 3929 fio.write_bw_MBps

Thanks for making results easier to read and understand.

> [1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> Signed-off-by: Feng Tang <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>

Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> Changelogs:
>
> v2:
> * Adjust the format of performance data to be more readable,
> as suggested by Michal Hocko
>
> include/linux/page_counter.h | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 85bd413..6795913 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -12,7 +12,6 @@ struct page_counter {
> unsigned long low;
> unsigned long high;
> unsigned long max;
> - struct page_counter *parent;
>
> /* effective memory.min and memory.min usage tracking */
> unsigned long emin;
> @@ -27,6 +26,14 @@ struct page_counter {
> /* legacy */
> unsigned long watermark;
> unsigned long failcnt;
> +
> + /*
> + * 'parent' is placed here to be far from 'usage' to reduce
> + * cache false sharing, as 'usage' is written mostly while
> + * parent is frequently read for cgroup's hierarchical
> + * counting nature.
> + */
> + struct page_counter *parent;
> };
>
> #if BITS_PER_LONG == 32
> --
> 2.7.4
>

--
Michal Hocko
SUSE Labs