2023-03-24 07:24:29

by Feng Tang

[permalink] [raw]
Subject: [PATCH v1] Documentation: Add document for false sharing

When doing performance tuning or debugging performance regressions,
more and more cases are found to be related to false sharing [1][2],
and the situation can be worse for newer platforms with hundreds of
CPUs. There are already many commits in current kernel specially
for mitigating the performance degradation due to false sharing.

False sharing could harm the performance silently without being
noticed, due to reasons like:
* data members of a big data structure randomly sitting together
in one cache line
* global data of small size are linked compactly together

So it's better to make a simple document about the normal pattern
of false sharing, basic ways to mitigate it and call out to
developers to pay attention during code-writing.

[ Many thanks to Dave Hansen, Ying Huang, Tim Chen, Julie Du and
Yu Chen for their contributions ]

[1]. https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
[2]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Signed-off-by: Feng Tang <[email protected]>
---
Changelog:

since RFC:
* Reword paragraphs with grammar issues; fixes many format and
typo issues (Randy Dunlap)

.../kernel-hacking/false-sharing.rst | 201 ++++++++++++++++++
Documentation/kernel-hacking/index.rst | 1 +
2 files changed, 202 insertions(+)
create mode 100644 Documentation/kernel-hacking/false-sharing.rst

diff --git a/Documentation/kernel-hacking/false-sharing.rst b/Documentation/kernel-hacking/false-sharing.rst
new file mode 100644
index 000000000000..c67745e8da3d
--- /dev/null
+++ b/Documentation/kernel-hacking/false-sharing.rst
@@ -0,0 +1,201 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+False Sharing
+=============
+
+What is False Sharing
+=====================
+False sharing is related with cache mechanism of maintaining the data
+coherence of one cache line stored in multiple CPU's caches; then
+academic definition for it is in [1]_. Consider a struct with a
+refcount and a string::
+
+ struct foo {
+ refcount_t refcount;
+ ...
+ char name[16];
+ } ____cacheline_internodealigned_in_smp;
+
+Member 'refcount'(A) and 'name'(B) _share_ one cache line like below::
+
+ +-----------+ +-----------+
+ | CPU 0 | | CPU 1 |
+ +-----------+ +-----------+
+ / |
+ / |
+ V V
+ +----------------------+ +----------------------+
+ | A B | Cache 0 | A B | Cache 1
+ +----------------------+ +----------------------+
+ | |
+ ---------------------------+------------------+-----------------------------
+ | |
+ +----------------------+
+ | |
+ +----------------------+
+ Main Memory | A B |
+ +----------------------+
+
+'refcount' is modified frequently, but 'name' is set once at object
+creation time and is never modified. When many CPUs access 'foo' at
+the same time, with 'refcount' being only bumped by one CPU frequently
+and 'name' being read by other CPUs, all those reading CPUs have to
+reload the whole cache line over and over due to the 'sharing', even
+though 'name' is never changed.
+
+There are many real-world cases of performance regressions caused by
+false sharing, and one is a rw_semaphore 'mmap_lock' inside struct
+mm_struct, whose cache line layout change triggered a regression
+and Linus analyzed in [2]_.
+
+There are two key factors for a harmful false sharing:
+
+* A global datum accessed (shared) by many CPUs
+* In the concurrent accesses to the data, there is at least one write
+ operation: write/write or write/read cases.
+
+The sharing could be from totally unrelated kernel components, or
+different code paths of the same kernel component.
+
+
+False Sharing Pitfalls
+======================
+Back in time when one platform had only one or a few CPUs, hot data
+members could be purposely put in the same cache line to make them
+cache hot and save cacheline/TLB, like a lock and the data protected
+by it. But for recent large system with hundreds of CPUs, this may
+not work when the lock is heavily contended, as the lock owner CPU
+could write to the data, while other CPUs are busy spinning the lock.
+
+Looking at past cases, there are several frequently occurring patterns
+for false sharing:
+
+* lock (spinlock/mutex/semaphore) and data protected by it are
+ purposely put in one cache line.
+* global data being put together in one cache line. Some kernel
+ subsystems have many global parameters of small size (4 bytes),
+ which can easily be grouped together and put into one cache line.
+* data members of a big data structure randomly sitting together
+ without being noticed (cache line is usually 64 bytes or more),
+ like struct 'mem_cgroup'.
+
+Following 'mitigation' section provides real-world examples.
+
+False sharing could easily happen unless they are intentionally
+checked, and it is valuable to run specific tools for performance
+critical workloads to detect false sharing affecting performance case
+and optimize accordingly.
+
+
+How to detect and analysis False Sharing
+========================================
+perf record/report/stat are widely used for performance tuning, and
+once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
+be further used to detect and pinpoint the possible false sharing
+data structures. 'addr2line' is also good at decoding instruction
+pointer when there are multiple layers of inline functions.
+
+perf-c2c can capture the cache lines with most false sharing hits,
+decoded functions (line number of file) accessing that cache line,
+and in-line offset of the data. Simple commands are::
+
+ #perf c2c record -ag sleep 3
+ #perf c2c report --call-graph none -k vmlinux
+
+Run it when testing will-it-scale's tlb_flush1 case, and the report
+has pieces like::
+
+ Total records : 1658231
+ Locked Load/Store Operations : 89439
+ Load Operations : 623219
+ Load Local HITM : 92117
+ Load Remote HITM : 139
+
+ #----------------------------------------------------------------------
+ 4 0 2374 0 0 0 0xff1100088366d880
+ #----------------------------------------------------------------------
+ 0.00% 42.29% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff81373b7b 0 231 129 5312 64 [k] __mod_lruvec_page_state [kernel.vmlinux] memcontrol.h:752 1
+ 0.00% 13.10% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff81374718 0 226 97 3551 64 [k] folio_lruvec_lock_irqsave [kernel.vmlinux] memcontrol.h:752 1
+ 0.00% 11.20% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff812c29bf 0 170 136 555 64 [k] lru_add_fn [kernel.vmlinux] mm_inline.h:41 1
+ 0.00% 7.62% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff812c3ec5 0 175 108 632 64 [k] release_pages [kernel.vmlinux] mm_inline.h:41 1
+ 0.00% 23.29% 0.00% 0.00% 0.00% 0x10 1 1 0xffffffff81372d0a 0 234 279 1051 64 [k] __mod_memcg_lruvec_state [kernel.vmlinux] memcontrol.c:736 1
+
+A nice introduction for perf-c2c is [3]_.
+
+'pahole' decodes data structure layouts delimited in cache line
+granularity. Users can match the offset in perf-c2c output with
+pahole's decoding to locate the exact data members. For global
+data, users can search the data address in System.map.
+
+
+Possible Mitigations
+====================
+False sharing does not always need to be mitigated. False sharing
+mitigations need to balance performance gains with complexity and
+space consumption. Sometimes, lower performance is OK, and it's
+unnecessary to hyper-optimize every rarely used data structure or
+a cold data path.
+
+False sharing hurting performance cases are seen more frequently with
+core count increasing, and there have been many patches merged to
+solve it, like in networking and memory management subsystems. Some
+common mitigations (with examples) are:
+
+* Separate hot global data in its own dedicated cache line, even if it
+ is just a 'short' type. The downside is more consumption of memory,
+ cache line and TLB entries.
+
+ Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")
+
+* Reorganize the data structure, separate the interfering members to
+ different cache lines. One downside is it may introduce new false
+ sharing of other members.
+
+ Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")
+
+* Replace 'write' with 'read' when possible, especially in loops.
+ Like for some global variable, use compare(read)-then-write instead
+ of unconditional write. For example, use:
+
+ if (!test_bit(XXX))
+ set_bit(XXX);
+
+ instead of directly "set_bit(XXX);", similarly for atomic_t data.
+
+ Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
+ Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
+
+* Turn hot global data to 'per-cpu data + global data' when possible,
+ or reasonably increase the threshold for syncing per-cpu data to
+ global data, to reduce or postpone the 'write' to that global data.
+
+ Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
+ Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
+
+Surely, all mitigations should be carefully verified to not cause side
+effects. And to avoid false sharing in advance during coding, it's
+better to:
+
+* Be aware of cache line boundaries
+* Group mostly read-only fields together
+* Group things that are written at the same time together
+* Separate known read-mostly and written-mostly fields
+
+and better add a comment stating the false sharing consideration.
+
+One note is, sometimes even after a severe false sharing is detected
+and solved, the performance may still has no obvious improvement as
+the hotspot switches to a new place.
+
+
+Miscellaneous
+=============
+One open issue is that kernel has an optional data structure
+randomization mechanism, which also randomizes the situation of cache
+line sharing of data members.
+
+
+.. [1] https://en.wikipedia.org/wiki/False_sharing
+.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
+.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/
diff --git a/Documentation/kernel-hacking/index.rst b/Documentation/kernel-hacking/index.rst
index f53027652290..79c03bac99a2 100644
--- a/Documentation/kernel-hacking/index.rst
+++ b/Documentation/kernel-hacking/index.rst
@@ -9,3 +9,4 @@ Kernel Hacking Guides

hacking
locking
+ false-sharing
--
2.34.1


2023-03-24 13:02:45

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v1] Documentation: Add document for false sharing

On Fri, Mar 24, 2023 at 03:13:16PM +0800, Feng Tang wrote:
> +There are many real-world cases of performance regressions caused by
> +false sharing, and one is a rw_semaphore 'mmap_lock' inside struct
"... . One of these is rw_semaphore 'mmap_lock' ..."

But I think in English we commonly name things as "foobar struct"
instead of "struct foobar" (that is, common noun follow the proper noun
that names something).

> +* A global datum accessed (shared) by many CPUs
Global data?

> +Following 'mitigation' section provides real-world examples.
"The real-world examples are given in 'Possible mitigations' sections."
> + #perf c2c record -ag sleep 3
> + #perf c2c report --call-graph none -k vmlinux

Are these commands really run as root?

> +
> +Run it when testing will-it-scale's tlb_flush1 case, and the report
> +has pieces like::

"When running above during testing ..., perf reports something like::"

> +False sharing hurting performance cases are seen more frequently with
> +core count increasing, and there have been many patches merged to
> +solve it, like in networking and memory management subsystems. Some
> +common mitigations (with examples) are:

"... Because of these detrimental effects, many patches have been
proposed across variety of subsystems (like networking and memory
management) and merged."

> +
> +* Separate hot global data in its own dedicated cache line, even if it
> + is just a 'short' type. The downside is more consumption of memory,
> + cache line and TLB entries.
> +
> + Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")
> +
> +* Reorganize the data structure, separate the interfering members to
> + different cache lines. One downside is it may introduce new false
> + sharing of other members.
> +
> + Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")
> +
> +* Replace 'write' with 'read' when possible, especially in loops.
> + Like for some global variable, use compare(read)-then-write instead
> + of unconditional write. For example, use:
"... For example, write::"
> +
> + if (!test_bit(XXX))
> + set_bit(XXX);
> +
> + instead of directly "set_bit(XXX);", similarly for atomic_t data.
> +
> + Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
> + Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
> +
> +* Turn hot global data to 'per-cpu data + global data' when possible,
> + or reasonably increase the threshold for syncing per-cpu data to
> + global data, to reduce or postpone the 'write' to that global data.
> +
> + Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
> + Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

IMO it's odd to jump to specifying example commits without some sort of
conjuction (e.g. "for example, see commit <commit>").

> +
> +Surely, all mitigations should be carefully verified to not cause side
> +effects. And to avoid false sharing in advance during coding, it's
> +better to:
> +
> +* Be aware of cache line boundaries
> +* Group mostly read-only fields together
> +* Group things that are written at the same time together
> +* Separate known read-mostly and written-mostly fields

Proactively prevent false sharing with above tips?

Thanks.

--
An old man doll... just what I always wanted! - Clara


Attachments:
(No filename) (3.43 kB)
signature.asc (235.00 B)
Download all attachments

2023-03-24 15:15:15

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v1] Documentation: Add document for false sharing



On 3/24/23 00:13, Feng Tang wrote:
> When doing performance tuning or debugging performance regressions,
> more and more cases are found to be related to false sharing [1][2],
> and the situation can be worse for newer platforms with hundreds of
> CPUs. There are already many commits in current kernel specially
> for mitigating the performance degradation due to false sharing.
>
> False sharing could harm the performance silently without being
> noticed, due to reasons like:
> * data members of a big data structure randomly sitting together
> in one cache line
> * global data of small size are linked compactly together
>
> So it's better to make a simple document about the normal pattern
> of false sharing, basic ways to mitigate it and call out to
> developers to pay attention during code-writing.
>
> [ Many thanks to Dave Hansen, Ying Huang, Tim Chen, Julie Du and
> Yu Chen for their contributions ]
>
> [1]. https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> [2]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
>
> Signed-off-by: Feng Tang <[email protected]>

Reviewed-by: Randy Dunlap <[email protected]>

Thanks.

> ---
> Changelog:
>
> since RFC:
> * Reword paragraphs with grammar issues; fixes many format and
> typo issues (Randy Dunlap)
>
> .../kernel-hacking/false-sharing.rst | 201 ++++++++++++++++++
> Documentation/kernel-hacking/index.rst | 1 +
> 2 files changed, 202 insertions(+)
> create mode 100644 Documentation/kernel-hacking/false-sharing.rst
>
> diff --git a/Documentation/kernel-hacking/false-sharing.rst b/Documentation/kernel-hacking/false-sharing.rst
> new file mode 100644
> index 000000000000..c67745e8da3d
> --- /dev/null
> +++ b/Documentation/kernel-hacking/false-sharing.rst
> @@ -0,0 +1,201 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +False Sharing
> +=============
> +
> +What is False Sharing
> +=====================
> +False sharing is related with cache mechanism of maintaining the data
> +coherence of one cache line stored in multiple CPU's caches; then
> +academic definition for it is in [1]_. Consider a struct with a
> +refcount and a string::
> +
> + struct foo {
> + refcount_t refcount;
> + ...
> + char name[16];
> + } ____cacheline_internodealigned_in_smp;
> +
> +Member 'refcount'(A) and 'name'(B) _share_ one cache line like below::
> +
> + +-----------+ +-----------+
> + | CPU 0 | | CPU 1 |
> + +-----------+ +-----------+
> + / |
> + / |
> + V V
> + +----------------------+ +----------------------+
> + | A B | Cache 0 | A B | Cache 1
> + +----------------------+ +----------------------+
> + | |
> + ---------------------------+------------------+-----------------------------
> + | |
> + +----------------------+
> + | |
> + +----------------------+
> + Main Memory | A B |
> + +----------------------+
> +
> +'refcount' is modified frequently, but 'name' is set once at object
> +creation time and is never modified. When many CPUs access 'foo' at
> +the same time, with 'refcount' being only bumped by one CPU frequently
> +and 'name' being read by other CPUs, all those reading CPUs have to
> +reload the whole cache line over and over due to the 'sharing', even
> +though 'name' is never changed.
> +
> +There are many real-world cases of performance regressions caused by
> +false sharing, and one is a rw_semaphore 'mmap_lock' inside struct
> +mm_struct, whose cache line layout change triggered a regression
> +and Linus analyzed in [2]_.
> +
> +There are two key factors for a harmful false sharing:
> +
> +* A global datum accessed (shared) by many CPUs
> +* In the concurrent accesses to the data, there is at least one write
> + operation: write/write or write/read cases.
> +
> +The sharing could be from totally unrelated kernel components, or
> +different code paths of the same kernel component.
> +
> +
> +False Sharing Pitfalls
> +======================
> +Back in time when one platform had only one or a few CPUs, hot data
> +members could be purposely put in the same cache line to make them
> +cache hot and save cacheline/TLB, like a lock and the data protected
> +by it. But for recent large system with hundreds of CPUs, this may
> +not work when the lock is heavily contended, as the lock owner CPU
> +could write to the data, while other CPUs are busy spinning the lock.
> +
> +Looking at past cases, there are several frequently occurring patterns
> +for false sharing:
> +
> +* lock (spinlock/mutex/semaphore) and data protected by it are
> + purposely put in one cache line.
> +* global data being put together in one cache line. Some kernel
> + subsystems have many global parameters of small size (4 bytes),
> + which can easily be grouped together and put into one cache line.
> +* data members of a big data structure randomly sitting together
> + without being noticed (cache line is usually 64 bytes or more),
> + like struct 'mem_cgroup'.
> +
> +Following 'mitigation' section provides real-world examples.
> +
> +False sharing could easily happen unless they are intentionally
> +checked, and it is valuable to run specific tools for performance
> +critical workloads to detect false sharing affecting performance case
> +and optimize accordingly.
> +
> +
> +How to detect and analysis False Sharing
> +========================================
> +perf record/report/stat are widely used for performance tuning, and
> +once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
> +be further used to detect and pinpoint the possible false sharing
> +data structures. 'addr2line' is also good at decoding instruction
> +pointer when there are multiple layers of inline functions.
> +
> +perf-c2c can capture the cache lines with most false sharing hits,
> +decoded functions (line number of file) accessing that cache line,
> +and in-line offset of the data. Simple commands are::
> +
> + #perf c2c record -ag sleep 3
> + #perf c2c report --call-graph none -k vmlinux
> +
> +Run it when testing will-it-scale's tlb_flush1 case, and the report
> +has pieces like::
> +
> + Total records : 1658231
> + Locked Load/Store Operations : 89439
> + Load Operations : 623219
> + Load Local HITM : 92117
> + Load Remote HITM : 139
> +
> + #----------------------------------------------------------------------
> + 4 0 2374 0 0 0 0xff1100088366d880
> + #----------------------------------------------------------------------
> + 0.00% 42.29% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff81373b7b 0 231 129 5312 64 [k] __mod_lruvec_page_state [kernel.vmlinux] memcontrol.h:752 1
> + 0.00% 13.10% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff81374718 0 226 97 3551 64 [k] folio_lruvec_lock_irqsave [kernel.vmlinux] memcontrol.h:752 1
> + 0.00% 11.20% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff812c29bf 0 170 136 555 64 [k] lru_add_fn [kernel.vmlinux] mm_inline.h:41 1
> + 0.00% 7.62% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff812c3ec5 0 175 108 632 64 [k] release_pages [kernel.vmlinux] mm_inline.h:41 1
> + 0.00% 23.29% 0.00% 0.00% 0.00% 0x10 1 1 0xffffffff81372d0a 0 234 279 1051 64 [k] __mod_memcg_lruvec_state [kernel.vmlinux] memcontrol.c:736 1
> +
> +A nice introduction for perf-c2c is [3]_.
> +
> +'pahole' decodes data structure layouts delimited in cache line
> +granularity. Users can match the offset in perf-c2c output with
> +pahole's decoding to locate the exact data members. For global
> +data, users can search the data address in System.map.
> +
> +
> +Possible Mitigations
> +====================
> +False sharing does not always need to be mitigated. False sharing
> +mitigations need to balance performance gains with complexity and
> +space consumption. Sometimes, lower performance is OK, and it's
> +unnecessary to hyper-optimize every rarely used data structure or
> +a cold data path.
> +
> +False sharing hurting performance cases are seen more frequently with
> +core count increasing, and there have been many patches merged to
> +solve it, like in networking and memory management subsystems. Some
> +common mitigations (with examples) are:
> +
> +* Separate hot global data in its own dedicated cache line, even if it
> + is just a 'short' type. The downside is more consumption of memory,
> + cache line and TLB entries.
> +
> + Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")
> +
> +* Reorganize the data structure, separate the interfering members to
> + different cache lines. One downside is it may introduce new false
> + sharing of other members.
> +
> + Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")
> +
> +* Replace 'write' with 'read' when possible, especially in loops.
> + Like for some global variable, use compare(read)-then-write instead
> + of unconditional write. For example, use:
> +
> + if (!test_bit(XXX))
> + set_bit(XXX);
> +
> + instead of directly "set_bit(XXX);", similarly for atomic_t data.
> +
> + Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
> + Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
> +
> +* Turn hot global data to 'per-cpu data + global data' when possible,
> + or reasonably increase the threshold for syncing per-cpu data to
> + global data, to reduce or postpone the 'write' to that global data.
> +
> + Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
> + Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
> +
> +Surely, all mitigations should be carefully verified to not cause side
> +effects. And to avoid false sharing in advance during coding, it's
> +better to:
> +
> +* Be aware of cache line boundaries
> +* Group mostly read-only fields together
> +* Group things that are written at the same time together
> +* Separate known read-mostly and written-mostly fields
> +
> +and better add a comment stating the false sharing consideration.
> +
> +One note is, sometimes even after a severe false sharing is detected
> +and solved, the performance may still has no obvious improvement as
> +the hotspot switches to a new place.
> +
> +
> +Miscellaneous
> +=============
> +One open issue is that kernel has an optional data structure
> +randomization mechanism, which also randomizes the situation of cache
> +line sharing of data members.
> +
> +
> +.. [1] https://en.wikipedia.org/wiki/False_sharing
> +.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
> +.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/
> diff --git a/Documentation/kernel-hacking/index.rst b/Documentation/kernel-hacking/index.rst
> index f53027652290..79c03bac99a2 100644
> --- a/Documentation/kernel-hacking/index.rst
> +++ b/Documentation/kernel-hacking/index.rst
> @@ -9,3 +9,4 @@ Kernel Hacking Guides
>
> hacking
> locking
> + false-sharing

--
~Randy

2023-03-27 00:47:59

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v1] Documentation: Add document for false sharing

Hi Bagas Sanjaya,

Many thanks for the reviews!

On Fri, Mar 24, 2023 at 07:45:28PM +0700, Bagas Sanjaya wrote:
> On Fri, Mar 24, 2023 at 03:13:16PM +0800, Feng Tang wrote:
> > +There are many real-world cases of performance regressions caused by
> > +false sharing, and one is a rw_semaphore 'mmap_lock' inside struct
> "... . One of these is rw_semaphore 'mmap_lock' ..."

OK, will use this.

> But I think in English we commonly name things as "foobar struct"
> instead of "struct foobar" (that is, common noun follow the proper noun
> that names something).

I can change that. And IIRC, I saw 'struct XXX' and 'XXX struct' both
frequently used in kernel. I just run '# git log | grep -w struct'
and the majority use 'struct XXX'

> > +* A global datum accessed (shared) by many CPUs
> Global data?

In RFC version, I used 'data' and Randy suggested 'datum'. TBH, I
looked it up in a dictionary :), and found:
"Data" is the Latin plural form of "datum"

> > +Following 'mitigation' section provides real-world examples.
> "The real-world examples are given in 'Possible mitigations' sections."

Will use this, thanks.

> > + #perf c2c record -ag sleep 3
> > + #perf c2c report --call-graph none -k vmlinux
>
> Are these commands really run as root?

You are right, people can run it as 'root' or a normal user. And I
guess this won't confuse kernel developers.

My original version is kind of too long and full of explainations,
and some kernel developer suggested that this doc is under
'kernel-hacking' and its audience is kernel developers, and I should
make it clear and short, and not make it look like a wiki page or
man page.

> > +
> > +Run it when testing will-it-scale's tlb_flush1 case, and the report
> > +has pieces like::
>
> "When running above during testing ..., perf reports something like::"

This is more logical, will change.

> > +False sharing hurting performance cases are seen more frequently with
> > +core count increasing, and there have been many patches merged to
> > +solve it, like in networking and memory management subsystems. Some
> > +common mitigations (with examples) are:
>
> "... Because of these detrimental effects, many patches have been
> proposed across variety of subsystems (like networking and memory
> management) and merged."

This is much better, thanks

> > +
> > +* Separate hot global data in its own dedicated cache line, even if it
> > + is just a 'short' type. The downside is more consumption of memory,
> > + cache line and TLB entries.
> > +
> > + Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")
> > +
> > +* Reorganize the data structure, separate the interfering members to
> > + different cache lines. One downside is it may introduce new false
> > + sharing of other members.
> > +
> > + Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")
> > +
> > +* Replace 'write' with 'read' when possible, especially in loops.
> > + Like for some global variable, use compare(read)-then-write instead
> > + of unconditional write. For example, use:
> "... For example, write::"

The following is a coding pattern (for bit operation, atomic, etc.),
and I think 'use' may also be good?

> > +
> > + if (!test_bit(XXX))
> > + set_bit(XXX);
> > +
> > + instead of directly "set_bit(XXX);", similarly for atomic_t data.
> > +
> > + Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
> > + Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
> > +
> > +* Turn hot global data to 'per-cpu data + global data' when possible,
> > + or reasonably increase the threshold for syncing per-cpu data to
> > + global data, to reduce or postpone the 'write' to that global data.
> > +
> > + Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
> > + Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
>
> IMO it's odd to jump to specifying example commits without some sort of
> conjuction (e.g. "for example, see commit <commit>").

I agree, and I had the same concern, but I was also afraid of that
too many repeating of this, so the previous
"Following 'mitigation' section provides real-world examples."
in last section (which you helped to improve) was added trying
to address this.

> > +
> > +Surely, all mitigations should be carefully verified to not cause side
> > +effects. And to avoid false sharing in advance during coding, it's
> > +better to:
> > +
> > +* Be aware of cache line boundaries
> > +* Group mostly read-only fields together
> > +* Group things that are written at the same time together
> > +* Separate known read-mostly and written-mostly fields
>
> Proactively prevent false sharing with above tips?

You are right. And most of these bullets are directly taken from
Dave Hansen's reviews (thanks to Dave)

Thanks,
Feng

> Thanks.
>
> --
> An old man doll... just what I always wanted! - Clara


2023-03-28 08:47:55

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v1] Documentation: Add document for false sharing

On 3/27/23 07:39, Feng Tang wrote:
> Hi Bagas Sanjaya,
>
> Many thanks for the reviews!
>
> On Fri, Mar 24, 2023 at 07:45:28PM +0700, Bagas Sanjaya wrote:
>> On Fri, Mar 24, 2023 at 03:13:16PM +0800, Feng Tang wrote:
>>> +There are many real-world cases of performance regressions caused by
>>> +false sharing, and one is a rw_semaphore 'mmap_lock' inside struct
>> "... . One of these is rw_semaphore 'mmap_lock' ..."
>
> OK, will use this.
>
>> But I think in English we commonly name things as "foobar struct"
>> instead of "struct foobar" (that is, common noun follow the proper noun
>> that names something).
>
> I can change that. And IIRC, I saw 'struct XXX' and 'XXX struct' both
> frequently used in kernel. I just run '# git log | grep -w struct'
> and the majority use 'struct XXX'
>
>>> +* A global datum accessed (shared) by many CPUs
>> Global data?
>
> In RFC version, I used 'data' and Randy suggested 'datum'. TBH, I
> looked it up in a dictionary :), and found:
> "Data" is the Latin plural form of "datum"
>

OK, I understand.

>>> + #perf c2c record -ag sleep 3
>>> + #perf c2c report --call-graph none -k vmlinux
>>
>> Are these commands really run as root?
>
> You are right, people can run it as 'root' or a normal user. And I
> guess this won't confuse kernel developers.
>
> My original version is kind of too long and full of explainations,
> and some kernel developer suggested that this doc is under
> 'kernel-hacking' and its audience is kernel developers, and I should
> make it clear and short, and not make it look like a wiki page or
> man page.
>

So something like below, right?

```
$ perf <command> <args>...
$ perf <command> <args>...
```

>>> +* Replace 'write' with 'read' when possible, especially in loops.
>>> + Like for some global variable, use compare(read)-then-write instead
>>> + of unconditional write. For example, use:
>> "... For example, write::"
>
> The following is a coding pattern (for bit operation, atomic, etc.),
> and I think 'use' may also be good?
>

I tend to say "write" when the context is typing code.

>>> +
>>> + if (!test_bit(XXX))
>>> + set_bit(XXX);
>>> +
>>> + instead of directly "set_bit(XXX);", similarly for atomic_t data.
>>> +
>>> + Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
>>> + Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
>>> +
>>> +* Turn hot global data to 'per-cpu data + global data' when possible,
>>> + or reasonably increase the threshold for syncing per-cpu data to
>>> + global data, to reduce or postpone the 'write' to that global data.
>>> +
>>> + Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
>>> + Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
>>
>> IMO it's odd to jump to specifying example commits without some sort of
>> conjuction (e.g. "for example, see commit <commit>").
>
> I agree, and I had the same concern, but I was also afraid of that
> too many repeating of this, so the previous
> "Following 'mitigation' section provides real-world examples."
> in last section (which you helped to improve) was added trying
> to address this.
>

OK.

And see you in v2!

--
An old man doll... just what I always wanted! - Clara