When running the UnixBench/Execl throughput case, false sharing is
observed due to frequent reads of base_addr and writes to free_bytes
and chunk_md.

UnixBench/Execl represents a class of workloads in which bash scripts
are spawned frequently to do short jobs. Such a workload calls execl
frequently, and execl calls mm_init to initialize the mm_struct of the
process. mm_init calls __percpu_counter_init to initialize percpu
counters, which then calls pcpu_alloc, reading the base_addr of
pcpu_chunk for memory allocation. Inside pcpu_alloc, pcpu_alloc_area
allocates memory from a specified chunk and updates "free_bytes" and
"chunk_md" to record the remaining free bytes and other metadata for
that chunk. Correspondingly, pcpu_free_area also updates these two
members when freeing memory.
Call trace from perf is as below:
+ 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
+ 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
- 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
- 53.54% 0x654278696e552f34
main
__execve
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_execve
do_execveat_common.isra.47
alloc_bprm
mm_init
__percpu_counter_init
pcpu_alloc
- __mutex_lock.isra.17
In the current pcpu_chunk layout, 'base_addr' is in the same cache
line as 'free_bytes' and 'chunk_md', occupying the last 8 bytes of
that line. This patch moves 'bound_map' up before 'base_addr', so
that 'base_addr' starts on a new cacheline.
With this change, on Intel Sapphire Rapids 112C/224T platform,
based on v6.4-rc4, the 160 parallel score improves by 24%.
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
mm/percpu-internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..981eeb2ad0a9 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -41,10 +41,10 @@ struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
int free_bytes; /* free bytes in the chunk */
struct pcpu_block_md chunk_md;
+ unsigned long *bound_map; /* boundary map */
void *base_addr; /* base address of this chunk */

unsigned long *alloc_map; /* allocation map */
- unsigned long *bound_map; /* boundary map */
struct pcpu_block_md *md_blocks; /* metadata blocks */

void *data; /* chunk data */
--
2.39.3
* Yu Ma <[email protected]> [230606 08:27]:
> When running UnixBench/Execl throughput case, false sharing is observed
> due to frequent read on base_addr and write on free_bytes, chunk_md.
>
> UnixBench/Execl represents a class of workload where bash scripts
> are spawned frequently to do some short jobs. It will do system call on
> execl frequently, and execl will call mm_init to initialize mm_struct
> of the process. mm_init will call __percpu_counter_init for
> percpu_counters initialization. Then pcpu_alloc is called to read
> the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> it will call pcpu_alloc_area to allocate memory from a specified chunk.
> This function will update "free_bytes" and "chunk_md" to record the
> rest free bytes and other meta data for this chunk. Correspondingly,
> pcpu_free_area will also update these 2 members when free memory.
> Call trace from perf is as below:
> + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> - 53.54% 0x654278696e552f34
> main
> __execve
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_execve
> do_execveat_common.isra.47
> alloc_bprm
> mm_init
> __percpu_counter_init
> pcpu_alloc
> - __mutex_lock.isra.17
>
> In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
> last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> to let ‘base_addr’ locate in a new cacheline.
>
> With this change, on Intel Sapphire Rapids 112C/224T platform,
> based on v6.4-rc4, the 160 parallel score improves by 24%.
Can we have a comment somewhere around this structure to avoid someone
reverting this change by accident?
>
> Reviewed-by: Tim Chen <[email protected]>
> Signed-off-by: Yu Ma <[email protected]>
> ---
> mm/percpu-internal.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index f9847c131998..981eeb2ad0a9 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -41,10 +41,10 @@ struct pcpu_chunk {
> struct list_head list; /* linked to pcpu_slot lists */
> int free_bytes; /* free bytes in the chunk */
> struct pcpu_block_md chunk_md;
> + unsigned long *bound_map; /* boundary map */
> void *base_addr; /* base address of this chunk */
>
> unsigned long *alloc_map; /* allocation map */
> - unsigned long *bound_map; /* boundary map */
> struct pcpu_block_md *md_blocks; /* metadata blocks */
>
> void *data; /* chunk data */
> --
> 2.39.3
>
Hello,
On Tue, Jun 06, 2023 at 03:21:27PM -0400, Liam R. Howlett wrote:
> * Yu Ma <[email protected]> [230606 08:27]:
> > When running UnixBench/Execl throughput case, false sharing is observed
> > due to frequent read on base_addr and write on free_bytes, chunk_md.
> >
> > UnixBench/Execl represents a class of workload where bash scripts
> > are spawned frequently to do some short jobs. It will do system call on
> > execl frequently, and execl will call mm_init to initialize mm_struct
> > of the process. mm_init will call __percpu_counter_init for
> > percpu_counters initialization. Then pcpu_alloc is called to read
> > the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> > it will call pcpu_alloc_area to allocate memory from a specified chunk.
> > This function will update "free_bytes" and "chunk_md" to record the
> > rest free bytes and other meta data for this chunk. Correspondingly,
> > pcpu_free_area will also update these 2 members when free memory.
> > Call trace from perf is as below:
> > + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> > + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> > - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> > - 53.54% 0x654278696e552f34
> > main
> > __execve
> > entry_SYSCALL_64_after_hwframe
> > do_syscall_64
> > __x64_sys_execve
> > do_execveat_common.isra.47
> > alloc_bprm
> > mm_init
> > __percpu_counter_init
> > pcpu_alloc
> > - __mutex_lock.isra.17
> >
> > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
> > last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> > to let ‘base_addr’ locate in a new cacheline.
> >
> > With this change, on Intel Sapphire Rapids 112C/224T platform,
> > based on v6.4-rc4, the 160 parallel score improves by 24%.
>
> Can we have a comment somewhere around this structure to avoid someone
> reverting this change by accident?
>
I agree with Liam. Percpu counters were only recently added to
mm_struct, so this wasn't originally on the hot path. It's probably
worth reshuffling pcpu_chunk because, as you point out, base_addr is
read-only after init. In general there aren't that many of these
structs on any particular host, so it's probably good to just annotate
with ____cacheline_aligned_in_smp and potentially reshuffle a few
other variables.
Another optimization here is batch allocation, which hasn't been done
yet (allocate essentially an array of percpu variables all at once,
but allow their lifetimes to be independent).
PS - I know I'm not super active, but please cc me on percpu changes.
Thanks,
Dennis
> >
> > Reviewed-by: Tim Chen <[email protected]>
> > Signed-off-by: Yu Ma <[email protected]>
> > ---
> > mm/percpu-internal.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> > index f9847c131998..981eeb2ad0a9 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -41,10 +41,10 @@ struct pcpu_chunk {
> > struct list_head list; /* linked to pcpu_slot lists */
> > int free_bytes; /* free bytes in the chunk */
> > struct pcpu_block_md chunk_md;
> > + unsigned long *bound_map; /* boundary map */
> > void *base_addr; /* base address of this chunk */
> >
> > unsigned long *alloc_map; /* allocation map */
> > - unsigned long *bound_map; /* boundary map */
> > struct pcpu_block_md *md_blocks; /* metadata blocks */
> >
> > void *data; /* chunk data */
> > --
> > 2.39.3
> >
>
> Hello,
>
> On Tue, Jun 06, 2023 at 03:21:27PM -0400, Liam R. Howlett wrote:
> > * Yu Ma <[email protected]> [230606 08:27]:
> > > When running UnixBench/Execl throughput case, false sharing is
> > > observed due to frequent read on base_addr and write on free_bytes,
> chunk_md.
> > >
> > > UnixBench/Execl represents a class of workload where bash scripts
> > > are spawned frequently to do some short jobs. It will do system call
> > > on execl frequently, and execl will call mm_init to initialize
> > > mm_struct of the process. mm_init will call __percpu_counter_init
> > > for percpu_counters initialization. Then pcpu_alloc is called to
> > > read the base_addr of pcpu_chunk for memory allocation. Inside
> > > pcpu_alloc, it will call pcpu_alloc_area to allocate memory from a
> specified chunk.
> > > This function will update "free_bytes" and "chunk_md" to record the
> > > rest free bytes and other meta data for this chunk. Correspondingly,
> > > pcpu_free_area will also update these 2 members when free memory.
> > > Call trace from perf is as below:
> > > + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> > > + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> > > - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> > > - 53.54% 0x654278696e552f34
> > > main
> > > __execve
> > > entry_SYSCALL_64_after_hwframe
> > > do_syscall_64
> > > __x64_sys_execve
> > > do_execveat_common.isra.47
> > > alloc_bprm
> > > mm_init
> > > __percpu_counter_init
> > > pcpu_alloc
> > > - __mutex_lock.isra.17
> > >
> > > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8
> > > bytes. This patch moves ‘bound_map’ up to ‘base_addr’, to let
> > > ‘base_addr’ locate in a new cacheline.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > > on v6.4-rc4, the 160 parallel score improves by 24%.
> >
> > Can we have a comment somewhere around this structure to avoid
> someone
> > reverting this change by accident?
> >
>
> I agree with Liam. It was only recently percpu was added to the mm_struct so
> this wasn't originally on the hot path. It's probably worth reshuffling around
> pcpu_chunk because as you point out base_addr is read_only after init.
> There in general aren't that many of these structs on any particular host, so
> its probably good to just annotate with ____cacheline_aligned_in_smp and
> potentially reshuffle around a few other variables.
>
Thanks Liam and Dennis for the quick feedback, I'll send out an updated patch with a comment added.
> Another optimization here is a batch allocation which hasn't been done yet
> (allocate essentially an array of percpu variables all at once, but allow for
> their lifetimes to be independent).
>
> PS - I know I'm not super active, but please cc me on percpu changes.
>
LOL, sure :)
> Thanks,
> Dennis
>
> > >
> > > Reviewed-by: Tim Chen <[email protected]>
> > > Signed-off-by: Yu Ma <[email protected]>
> > > ---
> > > mm/percpu-internal.h | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > > f9847c131998..981eeb2ad0a9 100644
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -41,10 +41,10 @@ struct pcpu_chunk {
> > > struct list_head list; /* linked to pcpu_slot lists */
> > > int free_bytes; /* free bytes in the chunk */
> > > struct pcpu_block_md chunk_md;
> > > + unsigned long *bound_map; /* boundary map */
> > > void *base_addr; /* base address of this chunk
> */
> > >
> > > unsigned long *alloc_map; /* allocation map */
> > > - unsigned long *bound_map; /* boundary map */
> > > struct pcpu_block_md *md_blocks; /* metadata blocks */
> > >
> > > void *data; /* chunk data */
> > > --
> > > 2.39.3
> > >
> >
When running the UnixBench/Execl throughput case, false sharing is
observed due to frequent reads of base_addr and writes to free_bytes
and chunk_md.

UnixBench/Execl represents a class of workloads in which bash scripts
are spawned frequently to do short jobs. Such a workload calls execl
frequently, and execl calls mm_init to initialize the mm_struct of the
process. mm_init calls __percpu_counter_init to initialize percpu
counters, which then calls pcpu_alloc, reading the base_addr of
pcpu_chunk for memory allocation. Inside pcpu_alloc, pcpu_alloc_area
allocates memory from a specified chunk and updates "free_bytes" and
"chunk_md" to record the remaining free bytes and other metadata for
that chunk. Correspondingly, pcpu_free_area also updates these two
members when freeing memory.
Call trace from perf is as below:
+ 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
+ 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
- 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
- 53.54% 0x654278696e552f34
main
__execve
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_execve
do_execveat_common.isra.47
alloc_bprm
mm_init
__percpu_counter_init
pcpu_alloc
- __mutex_lock.isra.17
In the current pcpu_chunk layout, 'base_addr' is in the same cache
line as 'free_bytes' and 'chunk_md', occupying the last 8 bytes of
that line. This patch moves 'bound_map' up before 'base_addr', so
that 'base_addr' starts on a new cacheline.
With this change, on Intel Sapphire Rapids 112C/224T platform,
based on v6.4-rc4, the 160 parallel score improves by 24%.
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
mm/percpu-internal.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..ecc7be1ec876 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -41,10 +41,16 @@ struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
int free_bytes; /* free bytes in the chunk */
struct pcpu_block_md chunk_md;
+ unsigned long *bound_map; /* boundary map */
+
+ /*
+ * To reduce false sharing, current layout is optimized to make sure
+ * base_addr locate in the different cacheline with free_bytes and
+ * chunk_md.
+ */
void *base_addr; /* base address of this chunk */

unsigned long *alloc_map; /* allocation map */
- unsigned long *bound_map; /* boundary map */
struct pcpu_block_md *md_blocks; /* metadata blocks */

void *data; /* chunk data */
--
2.39.3
Thanks Liam and Dennis for the review, here is the updated patch with the comment added:
> When running UnixBench/Execl throughput case, false sharing is observed
> due to frequent read on base_addr and write on free_bytes, chunk_md.
>
> UnixBench/Execl represents a class of workload where bash scripts are
> spawned frequently to do some short jobs. It will do system call on execl
> frequently, and execl will call mm_init to initialize mm_struct of the process.
> mm_init will call __percpu_counter_init for percpu_counters initialization.
> Then pcpu_alloc is called to read the base_addr of pcpu_chunk for memory
> allocation. Inside pcpu_alloc, it will call pcpu_alloc_area to allocate memory
> from a specified chunk.
> This function will update "free_bytes" and "chunk_md" to record the rest
> free bytes and other meta data for this chunk. Correspondingly,
> pcpu_free_area will also update these 2 members when free memory.
> Call trace from perf is as below:
> + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> - 53.54% 0x654278696e552f34
> main
> __execve
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_execve
> do_execveat_common.isra.47
> alloc_bprm
> mm_init
> __percpu_counter_init
> pcpu_alloc
> - __mutex_lock.isra.17
>
> In current pcpu_chunk layout, ‘base_addr’ is in the same cache line with
> ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8 bytes. This
> patch moves ‘bound_map’ up to ‘base_addr’, to let ‘base_addr’ locate in a
> new cacheline.
>
> With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> v6.4-rc4, the 160 parallel score improves by 24%.
>
> Reviewed-by: Tim Chen <[email protected]>
> Signed-off-by: Yu Ma <[email protected]>
> ---
> mm/percpu-internal.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> f9847c131998..ecc7be1ec876 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -41,10 +41,16 @@ struct pcpu_chunk {
> struct list_head list; /* linked to pcpu_slot lists */
> int free_bytes; /* free bytes in the chunk */
> struct pcpu_block_md chunk_md;
> + unsigned long *bound_map; /* boundary map */
> +
> + /*
> + * To reduce false sharing, current layout is optimized to make sure
> + * base_addr locate in the different cacheline with free_bytes and
> + * chunk_md.
> + */
> void *base_addr; /* base address of this chunk
> */
>
> unsigned long *alloc_map; /* allocation map */
> - unsigned long *bound_map; /* boundary map */
> struct pcpu_block_md *md_blocks; /* metadata blocks */
>
> void *data; /* chunk data */
> --
> 2.39.3
Hi Yu,
On Wed, Jun 07, 2023 at 03:02:32PM +0000, Ma, Yu wrote:
> Thanks Liam and Dennis for review, this is updated patch with comment around:
>
> > When running UnixBench/Execl throughput case, false sharing is observed
> > due to frequent read on base_addr and write on free_bytes, chunk_md.
> >
> > UnixBench/Execl represents a class of workload where bash scripts are
> > spawned frequently to do some short jobs. It will do system call on execl
> > frequently, and execl will call mm_init to initialize mm_struct of the process.
> > mm_init will call __percpu_counter_init for percpu_counters initialization.
> > Then pcpu_alloc is called to read the base_addr of pcpu_chunk for memory
> > allocation. Inside pcpu_alloc, it will call pcpu_alloc_area to allocate memory
> > from a specified chunk.
> > This function will update "free_bytes" and "chunk_md" to record the rest
> > free bytes and other meta data for this chunk. Correspondingly,
> > pcpu_free_area will also update these 2 members when free memory.
> > Call trace from perf is as below:
> > + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> > + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> > - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> > - 53.54% 0x654278696e552f34
> > main
> > __execve
> > entry_SYSCALL_64_after_hwframe
> > do_syscall_64
> > __x64_sys_execve
> > do_execveat_common.isra.47
> > alloc_bprm
> > mm_init
> > __percpu_counter_init
> > pcpu_alloc
> > - __mutex_lock.isra.17
> >
> > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line with
> > ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8 bytes. This
> > patch moves ‘bound_map’ up to ‘base_addr’, to let ‘base_addr’ locate in a
> > new cacheline.
> >
> > With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> > v6.4-rc4, the 160 parallel score improves by 24%.
> >
> > Reviewed-by: Tim Chen <[email protected]>
> > Signed-off-by: Yu Ma <[email protected]>
> > ---
> > mm/percpu-internal.h | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > f9847c131998..ecc7be1ec876 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -41,10 +41,16 @@ struct pcpu_chunk {
> > struct list_head list; /* linked to pcpu_slot lists */
> > int free_bytes; /* free bytes in the chunk */
> > struct pcpu_block_md chunk_md;
> > + unsigned long *bound_map; /* boundary map */
> > +
> > + /*
> > + * To reduce false sharing, current layout is optimized to make sure
> > + * base_addr locate in the different cacheline with free_bytes and
> > + * chunk_md.
> > + */
> > void *base_addr; /* base address of this chunk
> > */
> >
> > unsigned long *alloc_map; /* allocation map */
> > - unsigned long *bound_map; /* boundary map */
> > struct pcpu_block_md *md_blocks; /* metadata blocks */
> >
> > void *data; /* chunk data */
> > --
> > 2.39.3
>
Thanks for adding the comment, but would you mind adding
____cacheline_aligned_in_smp? Unless that's something we're trying to
avoid, I think this is a good use case for it both on the pcpu_chunk and
specifically on base_addr as that's what we're accessing without a lock.
Thanks,
Dennis
> Hi Yu,
>
> On Wed, Jun 07, 2023 at 03:02:32PM +0000, Ma, Yu wrote:
> > Thanks Liam and Dennis for review, this is updated patch with comment
> around:
> >
> > > When running UnixBench/Execl throughput case, false sharing is
> > > observed due to frequent read on base_addr and write on free_bytes,
> chunk_md.
> > >
> > > UnixBench/Execl represents a class of workload where bash scripts
> > > are spawned frequently to do some short jobs. It will do system call
> > > on execl frequently, and execl will call mm_init to initialize mm_struct of
> the process.
> > > mm_init will call __percpu_counter_init for percpu_counters initialization.
> > > Then pcpu_alloc is called to read the base_addr of pcpu_chunk for
> > > memory allocation. Inside pcpu_alloc, it will call pcpu_alloc_area
> > > to allocate memory from a specified chunk.
> > > This function will update "free_bytes" and "chunk_md" to record the
> > > rest free bytes and other meta data for this chunk. Correspondingly,
> > > pcpu_free_area will also update these 2 members when free memory.
> > > Call trace from perf is as below:
> > > + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> > > + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> > > - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> > > - 53.54% 0x654278696e552f34
> > > main
> > > __execve
> > > entry_SYSCALL_64_after_hwframe
> > > do_syscall_64
> > > __x64_sys_execve
> > > do_execveat_common.isra.47
> > > alloc_bprm
> > > mm_init
> > > __percpu_counter_init
> > > pcpu_alloc
> > > - __mutex_lock.isra.17
> > >
> > > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8
> > > bytes. This patch moves ‘bound_map’ up to ‘base_addr’, to let
> > > ‘base_addr’ locate in a new cacheline.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > > on v6.4-rc4, the 160 parallel score improves by 24%.
> > >
> > > Reviewed-by: Tim Chen <[email protected]>
> > > Signed-off-by: Yu Ma <[email protected]>
> > > ---
> > > mm/percpu-internal.h | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > > f9847c131998..ecc7be1ec876 100644
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -41,10 +41,16 @@ struct pcpu_chunk {
> > > struct list_head list; /* linked to pcpu_slot lists */
> > > int free_bytes; /* free bytes in the chunk */
> > > struct pcpu_block_md chunk_md;
> > > + unsigned long *bound_map; /* boundary map */
> > > +
> > > + /*
> > > + * To reduce false sharing, current layout is optimized to make sure
> > > + * base_addr locate in the different cacheline with free_bytes and
> > > + * chunk_md.
> > > + */
> > > void *base_addr; /* base address of this chunk
> > > */
> > >
> > > unsigned long *alloc_map; /* allocation map */
> > > - unsigned long *bound_map; /* boundary map */
> > > struct pcpu_block_md *md_blocks; /* metadata blocks */
> > >
> > > void *data; /* chunk data */
> > > --
> > > 2.39.3
> >
>
> Thanks for adding the comment, but would you mind adding
> ____cacheline_aligned_in_smp? Unless that's something we're trying to
> avoid, I think this is a good use case for it both on the pcpu_chunk and
> specifically on base_addr as that's what we're accessing without a lock.
>
Thanks Dennis, I'll send out the updated patch with
____cacheline_aligned_in_smp on base_addr :)
> Thanks,
> Dennis
Regards
Yu
When running the UnixBench/Execl throughput case, false sharing is
observed due to frequent reads of base_addr and writes to free_bytes
and chunk_md.

UnixBench/Execl represents a class of workloads in which bash scripts
are spawned frequently to do short jobs. Such a workload calls execl
frequently, and execl calls mm_init to initialize the mm_struct of the
process. mm_init calls __percpu_counter_init to initialize percpu
counters, which then calls pcpu_alloc, reading the base_addr of
pcpu_chunk for memory allocation. Inside pcpu_alloc, pcpu_alloc_area
allocates memory from a specified chunk and updates "free_bytes" and
"chunk_md" to record the remaining free bytes and other metadata for
that chunk. Correspondingly, pcpu_free_area also updates these two
members when freeing memory.
Call trace from perf is as below:
+ 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
+ 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
- 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
- 53.54% 0x654278696e552f34
main
__execve
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_execve
do_execveat_common.isra.47
alloc_bprm
mm_init
__percpu_counter_init
pcpu_alloc
- __mutex_lock.isra.17
In the current pcpu_chunk layout, 'base_addr' is in the same cache
line as 'free_bytes' and 'chunk_md', occupying the last 8 bytes of
that line. This patch moves 'bound_map' up before 'base_addr', so
that 'base_addr' starts on a new cacheline.
With this change, on Intel Sapphire Rapids 112C/224T platform,
based on v6.4-rc4, the 160 parallel score improves by 24%.
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
mm/percpu-internal.h | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..7f108b25bb93 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -41,10 +41,17 @@ struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
int free_bytes; /* free bytes in the chunk */
struct pcpu_block_md chunk_md;
- void *base_addr; /* base address of this chunk */
+ unsigned long *bound_map; /* boundary map */
+
+ /*
+ * base_addr is the base address of this chunk.
+ * To reduce false sharing, current layout is optimized to make sure
+ * base_addr locate in the different cacheline with free_bytes and
+ * chunk_md.
+ */
+ void *base_addr ____cacheline_aligned_in_smp;

unsigned long *alloc_map; /* allocation map */
- unsigned long *bound_map; /* boundary map */
struct pcpu_block_md *md_blocks; /* metadata blocks */

void *data; /* chunk data */
--
2.39.3
On Fri, 9 Jun 2023 23:07:30 -0400 Yu Ma <[email protected]> wrote:
> When running UnixBench/Execl throughput case, false sharing is observed
> due to frequent read on base_addr and write on free_bytes, chunk_md.
>
> UnixBench/Execl represents a class of workload where bash scripts
> are spawned frequently to do some short jobs. It will do system call on
> execl frequently, and execl will call mm_init to initialize mm_struct
> of the process. mm_init will call __percpu_counter_init for
> percpu_counters initialization. Then pcpu_alloc is called to read
> the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> it will call pcpu_alloc_area to allocate memory from a specified chunk.
> This function will update "free_bytes" and "chunk_md" to record the
> rest free bytes and other meta data for this chunk. Correspondingly,
> pcpu_free_area will also update these 2 members when free memory.
> Call trace from perf is as below:
> + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> - 53.54% 0x654278696e552f34
> main
> __execve
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_execve
> do_execveat_common.isra.47
> alloc_bprm
> mm_init
> __percpu_counter_init
> pcpu_alloc
> - __mutex_lock.isra.17
>
> In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
> last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> to let ‘base_addr’ locate in a new cacheline.
>
> With this change, on Intel Sapphire Rapids 112C/224T platform,
> based on v6.4-rc4, the 160 parallel score improves by 24%.
Well that's nice.
>
> ...
>
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -41,10 +41,17 @@ struct pcpu_chunk {
> struct list_head list; /* linked to pcpu_slot lists */
> int free_bytes; /* free bytes in the chunk */
> struct pcpu_block_md chunk_md;
> - void *base_addr; /* base address of this chunk */
> + unsigned long *bound_map; /* boundary map */
> +
> + /*
> + * base_addr is the base address of this chunk.
> + * To reduce false sharing, current layout is optimized to make sure
> + * base_addr locate in the different cacheline with free_bytes and
> + * chunk_md.
> + */
> + void *base_addr ____cacheline_aligned_in_smp;
>
> unsigned long *alloc_map; /* allocation map */
> - unsigned long *bound_map; /* boundary map */
> struct pcpu_block_md *md_blocks; /* metadata blocks */
>
> void *data; /* chunk data */
This will of course consume more memory. Do we have a feel for the
worst-case impact of this?
Hi Andrew,
On Mon, Jun 12, 2023 at 02:43:31PM -0700, Andrew Morton wrote:
> On Fri, 9 Jun 2023 23:07:30 -0400 Yu Ma <[email protected]> wrote:
>
> > When running UnixBench/Execl throughput case, false sharing is observed
> > due to frequent read on base_addr and write on free_bytes, chunk_md.
> >
> > UnixBench/Execl represents a class of workload where bash scripts
> > are spawned frequently to do some short jobs. It will do system call on
> > execl frequently, and execl will call mm_init to initialize mm_struct
> > of the process. mm_init will call __percpu_counter_init for
> > percpu_counters initialization. Then pcpu_alloc is called to read
> > the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc,
> > it will call pcpu_alloc_area to allocate memory from a specified chunk.
> > This function will update "free_bytes" and "chunk_md" to record the
> > rest free bytes and other meta data for this chunk. Correspondingly,
> > pcpu_free_area will also update these 2 members when free memory.
> > Call trace from perf is as below:
> > + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> > + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> > - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> > - 53.54% 0x654278696e552f34
> > main
> > __execve
> > entry_SYSCALL_64_after_hwframe
> > do_syscall_64
> > __x64_sys_execve
> > do_execveat_common.isra.47
> > alloc_bprm
> > mm_init
> > __percpu_counter_init
> > pcpu_alloc
> > - __mutex_lock.isra.17
> >
> > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the
> > last 8 bytes. This patch moves ‘bound_map’ up to ‘base_addr’,
> > to let ‘base_addr’ locate in a new cacheline.
> >
> > With this change, on Intel Sapphire Rapids 112C/224T platform,
> > based on v6.4-rc4, the 160 parallel score improves by 24%.
>
> Well that's nice.
>
> >
> > ...
> >
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -41,10 +41,17 @@ struct pcpu_chunk {
> > struct list_head list; /* linked to pcpu_slot lists */
> > int free_bytes; /* free bytes in the chunk */
> > struct pcpu_block_md chunk_md;
> > - void *base_addr; /* base address of this chunk */
> > + unsigned long *bound_map; /* boundary map */
> > +
> > + /*
> > + * base_addr is the base address of this chunk.
> > + * To reduce false sharing, the current layout is optimized to
> > + * make sure base_addr is located in a different cacheline from
> > + * free_bytes and chunk_md.
> > + */
> > + void *base_addr ____cacheline_aligned_in_smp;
> >
> > unsigned long *alloc_map; /* allocation map */
> > - unsigned long *bound_map; /* boundary map */
> > struct pcpu_block_md *md_blocks; /* metadata blocks */
> >
> > void *data; /* chunk data */
>
> This will of course consume more memory. Do we have a feel for the
> worst-case impact of this?
>
The pcpu_chunk struct is a backing data structure per chunk, so the
additional memory should not be dramatic. A chunk covers ballpark
between 64KB and 512KB of memory depending on some config and boot time
stuff, so I believe the additional memory used here is nominal at best.

Working the #s on my desktop:
Percpu: 58624 kB
28 cores -> ~2.1MB of percpu memory.
At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
Adding alignment might bump the chunk size by ~64 bytes, so ~2KB of
overhead in total?

I believe we can do a little better to avoid eating that full padding,
so likely less than that.
Acked-by: Dennis Zhou <[email protected]>
Thanks,
Dennis
> Hi Andrew,
>
> On Mon, Jun 12, 2023 at 02:43:31PM -0700, Andrew Morton wrote:
> > On Fri, 9 Jun 2023 23:07:30 -0400 Yu Ma <[email protected]> wrote:
> > > [...]
Thanks Andrew and Dennis for the agreement on the patch.
The layout of this structure (printed by pahole) before and after this
patch is below for reference. The default size is 136 bytes spanning 3
cachelines; with patch v3 it is 192 bytes, with 56 bytes of extra
padding. As for "____cacheline_aligned_in_smp", it was initially left
out with the same concern about memory, since the same performance gain
can be obtained by reshuffling base_addr alone. Thanks to Dennis'
expertise on the overall usage, it is added to make the intent clearer
and to be more convenient for future changes.
--default v6.4-rc4--
struct pcpu_chunk {
struct list_head list; /* 0 16 */
int free_bytes; /* 16 4 */
struct pcpu_block_md chunk_md; /* 20 32 */
/* XXX 4 bytes hole, try to pack */
void * base_addr; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
long unsigned int * alloc_map; /* 64 8 */
long unsigned int * bound_map; /* 72 8 */
struct pcpu_block_md * md_blocks; /* 80 8 */
void * data; /* 88 8 */
bool immutable; /* 96 1 */
bool isolated; /* 97 1 */
/* XXX 2 bytes hole, try to pack */
int start_offset; /* 100 4 */
int end_offset; /* 104 4 */
/* XXX 4 bytes hole, try to pack */
struct obj_cgroup * * obj_cgroups; /* 112 8 */
int nr_pages; /* 120 4 */
int nr_populated; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
int nr_empty_pop_pages; /* 128 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int populated[]; /* 136 0 */
/* size: 136, cachelines: 3, members: 17 */
/* sum members: 122, holes: 4, sum holes: 14 */
/* last cacheline: 8 bytes */
};
--with patch v3--
struct pcpu_chunk {
struct list_head list; /* 0 16 */
int free_bytes; /* 16 4 */
struct pcpu_block_md chunk_md; /* 20 32 */
/* XXX 4 bytes hole, try to pack */
long unsigned int * bound_map; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
void * base_addr; /* 64 8 */
long unsigned int * alloc_map; /* 72 8 */
struct pcpu_block_md * md_blocks; /* 80 8 */
void * data; /* 88 8 */
bool immutable; /* 96 1 */
bool isolated; /* 97 1 */
/* XXX 2 bytes hole, try to pack */
int start_offset; /* 100 4 */
int end_offset; /* 104 4 */
/* XXX 4 bytes hole, try to pack */
struct obj_cgroup * * obj_cgroups; /* 112 8 */
int nr_pages; /* 120 4 */
int nr_populated; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
int nr_empty_pop_pages; /* 128 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int populated[]; /* 136 0 */
/* size: 192, cachelines: 3, members: 17 */
/* sum members: 122, holes: 4, sum holes: 14 */
/* padding: 56 */
};
Regards
Yu
> Thanks,
> Dennis