In the syscall test of UnixBench, a performance regression occurred
due to false sharing.
The lock and atomic members, including file::f_lock, file::f_count
and file::f_pos_lock, are highly contended and frequently updated
in the high-concurrency test scenarios. perf c2c identified one
affected read access, file::f_op.
To prevent false sharing, the layout of struct file is changed as
follows:
(A) f_lock, f_count and f_pos_lock are put together to share the
same cache line.
(B) The read mostly members, including f_path, f_inode, f_op are
put into a separate cache line.
(C) f_mode is put together with f_count, since they are used
frequently at the same time.
The optimization has been validated in the syscall test of
UnixBench. The performance gain is 30~50% when the number of
parallel jobs is 16.
Signed-off-by: chenzhiyin <[email protected]>
---
include/linux/fs.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..01c55e3a1b96 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -962,23 +962,23 @@ struct file {
struct rcu_head f_rcuhead;
unsigned int f_iocb_flags;
};
- struct path f_path;
- struct inode *f_inode; /* cached value */
- const struct file_operations *f_op;
/*
* Protects f_ep, f_flags.
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
- atomic_long_t f_count;
- unsigned int f_flags;
fmode_t f_mode;
+ atomic_long_t f_count;
struct mutex f_pos_lock;
+ unsigned int f_flags;
loff_t f_pos;
struct fown_struct f_owner;
const struct cred *f_cred;
struct file_ra_state f_ra;
+ struct path f_path;
+ struct inode *f_inode; /* cached value */
+ const struct file_operations *f_op;
u64 f_version;
#ifdef CONFIG_SECURITY
--
2.39.1
On Mon, May 29, 2023 at 10:06:26PM -0400, chenzhiyin wrote:
> In the syscall test of UnixBench, performance regression occurred
> due to false sharing.
>
> The lock and atomic members, including file::f_lock, file::f_count
> and file::f_pos_lock are highly contended and frequently updated
> in the high-concurrency test scenarios. perf c2c indentified one
> affected read access, file::f_op.
> To prevent false sharing, the layout of file struct is changed as
> following
> (A) f_lock, f_count and f_pos_lock are put together to share the
> same cache line.
> (B) The read mostly members, including f_path, f_inode, f_op are
> put into a separate cache line.
> (C) f_mode is put together with f_count, since they are used
> frequently at the same time.
>
> The optimization has been validated in the syscall test of
> UnixBench. performance gain is 30~50%, when the number of parallel
> jobs is 16.
>
> Signed-off-by: chenzhiyin <[email protected]>
> ---
Sounds interesting, but can we see the actual numbers, please?
So struct file is marked with __randomize_layout which seems to make
this whole reordering pointless or at least only useful if the
structure randomization Kconfig is turned off. Is there any precedence
to optimizing structures that are marked as randomizable?
On Tue, May 30, 2023 at 12:31 PM Christian Brauner <[email protected]> wrote:
>
> On Mon, May 29, 2023 at 10:06:26PM -0400, chenzhiyin wrote:
> > In the syscall test of UnixBench, performance regression occurred
> > due to false sharing.
> >
> > The lock and atomic members, including file::f_lock, file::f_count
> > and file::f_pos_lock are highly contended and frequently updated
> > in the high-concurrency test scenarios. perf c2c indentified one
> > affected read access, file::f_op.
> > To prevent false sharing, the layout of file struct is changed as
> > following
> > (A) f_lock, f_count and f_pos_lock are put together to share the
> > same cache line.
> > (B) The read mostly members, including f_path, f_inode, f_op are
> > put into a separate cache line.
> > (C) f_mode is put together with f_count, since they are used
> > frequently at the same time.
> >
> > The optimization has been validated in the syscall test of
> > UnixBench. performance gain is 30~50%, when the number of parallel
> > jobs is 16.
> >
> > Signed-off-by: chenzhiyin <[email protected]>
> > ---
>
> Sounds interesting, but can we see the actual numbers, please?
> So struct file is marked with __randomize_layout which seems to make
> this whole reordering pointless or at least only useful if the
> structure randomization Kconfig is turned off. Is there any precedence
> to optimizing structures that are marked as randomizable?
Good question!
Also, is the impressive improvement gained only with (A)+(B)+(C)?
(A) and (B) make sense, but something about claim (C) does not sit right.
Can you explain this claim?
Putting the read mostly f_mode with frequently updated f_count seems
counter to the goal of your patch.
Aren't f_mode and f_flags just as frequently accessed as f_op?
Shouldn't f_mode belong with the read-mostly members?
What am I missing?
Thanks,
Amir.
On Tue, May 30, 2023 at 10:50:42AM +0200, Christian Brauner wrote:
> On Mon, May 29, 2023 at 10:06:26PM -0400, chenzhiyin wrote:
> > In the syscall test of UnixBench, performance regression occurred
> > due to false sharing.
> >
> > The lock and atomic members, including file::f_lock, file::f_count
> > and file::f_pos_lock are highly contended and frequently updated
> > in the high-concurrency test scenarios. perf c2c indentified one
> > affected read access, file::f_op.
> > To prevent false sharing, the layout of file struct is changed as
> > following
> > (A) f_lock, f_count and f_pos_lock are put together to share the
> > same cache line.
> > (B) The read mostly members, including f_path, f_inode, f_op are
> > put into a separate cache line.
> > (C) f_mode is put together with f_count, since they are used
> > frequently at the same time.
> >
> > The optimization has been validated in the syscall test of
> > UnixBench. performance gain is 30~50%, when the number of parallel
> > jobs is 16.
> >
> > Signed-off-by: chenzhiyin <[email protected]>
> > ---
>
> Sounds interesting, but can we see the actual numbers, please?
> So struct file is marked with __randomize_layout which seems to make
> this whole reordering pointless or at least only useful if the
> structure randomization Kconfig is turned off. Is there any precedence
> to optimizing structures that are marked as randomizable?
Most people don't use CONFIG_RANDSTRUCT. So it's still worth optimizing struct
layouts for everyone else.
- Eric
On Tue, May 30, 2023 at 06:55:49PM -0700, Eric Biggers wrote:
> On Tue, May 30, 2023 at 10:50:42AM +0200, Christian Brauner wrote:
> > On Mon, May 29, 2023 at 10:06:26PM -0400, chenzhiyin wrote:
> > > In the syscall test of UnixBench, performance regression occurred
> > > due to false sharing.
> > >
> > > The lock and atomic members, including file::f_lock, file::f_count
> > > and file::f_pos_lock are highly contended and frequently updated
> > > in the high-concurrency test scenarios. perf c2c indentified one
> > > affected read access, file::f_op.
> > > To prevent false sharing, the layout of file struct is changed as
> > > following
> > > (A) f_lock, f_count and f_pos_lock are put together to share the
> > > same cache line.
> > > (B) The read mostly members, including f_path, f_inode, f_op are
> > > put into a separate cache line.
> > > (C) f_mode is put together with f_count, since they are used
> > > frequently at the same time.
> > >
> > > The optimization has been validated in the syscall test of
> > > UnixBench. performance gain is 30~50%, when the number of parallel
> > > jobs is 16.
> > >
> > > Signed-off-by: chenzhiyin <[email protected]>
> > > ---
> >
> > Sounds interesting, but can we see the actual numbers, please?
> > So struct file is marked with __randomize_layout which seems to make
> > this whole reordering pointless or at least only useful if the
> > structure randomization Kconfig is turned off. Is there any precedence
> > to optimizing structures that are marked as randomizable?
>
> Most people don't use CONFIG_RANDSTRUCT. So it's still worth optimizing struct
> layouts for everyone else.
Ok, good to know.
We should still see actual numbers and the commit message should mention
that this interacts with __randomize_layout and why it's still useful.
As Eric said, CONFIG_RANDSTRUCT_NONE is set in the default config
and some production environments, including Ali Cloud. Therefore, it
is worthwhile to optimize the file struct layout.
Here are the syscall test results of unixbench.
Command: numactl -C 3-18 ./Run -c 16 syscall
Without patch
------------------------
224 CPUs in system; running 16 parallel copies of tests
System Call Overhead 5611223.7 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 5611223.7 3740.8
========
System Benchmarks Index Score (Partial Only) 3740.8
With patch
------------------------------------------------------------------------
224 CPUs in system; running 16 parallel copies of tests
System Call Overhead 7567076.6 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 7567076.6 5044.7
========
System Benchmarks Index Score (Partial Only) 5044.7
> -----Original Message-----
> From: Eric Biggers <[email protected]>
> Sent: Wednesday, May 31, 2023 9:56 AM
> To: Christian Brauner <[email protected]>
> Cc: Chen, Zhiyin <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; Zou, Nanhai
> <[email protected]>
> Subject: Re: [PATCH] fs.h: Optimize file struct to prevent false sharing
>
> On Tue, May 30, 2023 at 10:50:42AM +0200, Christian Brauner wrote:
> > On Mon, May 29, 2023 at 10:06:26PM -0400, chenzhiyin wrote:
> > > In the syscall test of UnixBench, performance regression occurred
> > > due to false sharing.
> > >
> > > The lock and atomic members, including file::f_lock, file::f_count
> > > and file::f_pos_lock are highly contended and frequently updated in
> > > the high-concurrency test scenarios. perf c2c indentified one
> > > affected read access, file::f_op.
> > > To prevent false sharing, the layout of file struct is changed as
> > > following
> > > (A) f_lock, f_count and f_pos_lock are put together to share the
> > > same cache line.
> > > (B) The read mostly members, including f_path, f_inode, f_op are put
> > > into a separate cache line.
> > > (C) f_mode is put together with f_count, since they are used
> > > frequently at the same time.
> > >
> > > The optimization has been validated in the syscall test of
> > > UnixBench. performance gain is 30~50%, when the number of parallel
> > > jobs is 16.
> > >
> > > Signed-off-by: chenzhiyin <[email protected]>
> > > ---
> >
> > Sounds interesting, but can we see the actual numbers, please?
> > So struct file is marked with __randomize_layout which seems to make
> > this whole reordering pointless or at least only useful if the
> > structure randomization Kconfig is turned off. Is there any precedence
> > to optimizing structures that are marked as randomizable?
>
> Most people don't use CONFIG_RANDSTRUCT. So it's still worth optimizing
> struct layouts for everyone else.
>
> - Eric
On Tue, May 30, 2023 at 01:02:06PM +0300, Amir Goldstein wrote:
> On Tue, May 30, 2023 at 12:31 PM Christian Brauner <[email protected]> wrote:
> >
> > On Mon, May 29, 2023 at 10:06:26PM -0400, chenzhiyin wrote:
> > > In the syscall test of UnixBench, performance regression occurred
> > > due to false sharing.
> > >
> > > The lock and atomic members, including file::f_lock, file::f_count
> > > and file::f_pos_lock are highly contended and frequently updated
> > > in the high-concurrency test scenarios. perf c2c indentified one
> > > affected read access, file::f_op.
> > > To prevent false sharing, the layout of file struct is changed as
> > > following
> > > (A) f_lock, f_count and f_pos_lock are put together to share the
> > > same cache line.
> > > (B) The read mostly members, including f_path, f_inode, f_op are
> > > put into a separate cache line.
> > > (C) f_mode is put together with f_count, since they are used
> > > frequently at the same time.
> > >
> > > The optimization has been validated in the syscall test of
> > > UnixBench. performance gain is 30~50%, when the number of parallel
> > > jobs is 16.
> > >
> > > Signed-off-by: chenzhiyin <[email protected]>
> > > ---
> >
> > Sounds interesting, but can we see the actual numbers, please?
> > So struct file is marked with __randomize_layout which seems to make
> > this whole reordering pointless or at least only useful if the
> > structure randomization Kconfig is turned off. Is there any precedence
> > to optimizing structures that are marked as randomizable?
>
> Good question!
>
> Also does the impressive improvement is gained only with (A)+(B)+(C)?
>
> (A) and (B) make sense, but something about the claim (C) does not sit right.
> Can you explain this claim?
>
> Putting the read mostly f_mode with frequently updated f_count seems
> counter to the goal of your patch.
> Aren't f_mode and f_flags just as frequently accessed as f_op?
> Shouldn't f_mode belong with the read-mostly members?
>
> What am I missing?
I think that f_mode will be more heavily used because it's checked
every time you call fget variants. For example, f_mode is used to check
whether the file you're about to get a reference to is an O_PATH file
and, depending on the fget variant that the caller used, denies or
allows the caller to get a reference on that file depending on whether
FMODE_PATH is or isn't set. So you have
if (unlikely(file->f_mode & mask))
if (unlikely(!get_file_rcu(file))) // this is just try to bump f_count
every time you call an fget variant, which should be substantial. Other
places are fdget_pos(), where f_mode is also checked right after an
fdget()...
On Wed, May 31, 2023 at 10:31:09AM +0000, Chen, Zhiyin wrote:
> As Eric said, CONFIG_RANDSTRUCT_NONE is set in the default config
> and some production environments, including Ali Cloud. Therefore, it
> is worthful to optimize the file struct layout.
>
> Here are the syscall test results of unixbench.
Results look good, but the devil is in the detail....
> Command: numactl -C 3-18 ./Run -c 16 syscall
So the test is restricted to a set of adjacent cores within a single
CPU socket, so all the cachelines are typically being shared within
a single socket's CPU caches. IOWs, the fact there are 224 CPUs in
the machine is largely irrelevant for this microbenchmark.
i.e. is this a microbenchmark that is going faster simply because
the working set for the specific benchmark now fits in L2 or L3
cache when it didn't before?
Does this same result occur for different CPU types, cache sizes
and architectures? What about when the cores used by the benchmark
are spread across multiple sockets so the cost of remote cacheline
access is taken into account? If this is actually a real benefit,
then we should see similar or even larger gains between CPU cores
that are further apart because the cost of false cacheline sharing
is higher in those systems....
> Without patch
> ------------------------
> 224 CPUs in system; running 16 parallel copies of tests
> System Call Overhead 5611223.7 lps (10.0 s, 7 samples)
> System Benchmarks Partial Index BASELINE RESULT INDEX
> System Call Overhead 15000.0 5611223.7 3740.8
> ========
> System Benchmarks Index Score (Partial Only) 3740.8
>
> With patch
> ------------------------------------------------------------------------
> 224 CPUs in system; running 16 parallel copies of tests
> System Call Overhead 7567076.6 lps (10.0 s, 7 samples)
> System Benchmarks Partial Index BASELINE RESULT INDEX
> System Call Overhead 15000.0 7567076.6 5044.7
> ========
> System Benchmarks Index Score (Partial Only) 5044.7
Where is all this CPU time being saved? Do you have a profile
showing what functions in the kernel are running far more
efficiently now?
Yes, the results look good, but if all this change is doing is
micro-optimising a single code path, it's much less impressive and
far more likely that it has no impact on real-world performance...
More information, please!
-Dave.
--
Dave Chinner
[email protected]
In the syscall test of UnixBench, a performance regression occurred due
to false sharing.
The lock and atomic members, including file::f_lock, file::f_count and
file::f_pos_lock, are highly contended and frequently updated in the
high-concurrency test scenarios. perf c2c identified one affected
read access, file::f_op.
To prevent false sharing, the layout of struct file is changed as
follows:
(A) f_lock, f_count and f_pos_lock are put together to share the same
cache line.
(B) The read mostly members, including f_path, f_inode, f_op are put
into a separate cache line.
(C) f_mode is put together with f_count, since they are used frequently
at the same time.
Due to the '__randomize_layout' attribute of struct file, the updated
layout can only take effect when CONFIG_RANDSTRUCT_NONE is 'y'.
The optimization has been validated in the syscall test of UnixBench.
The performance gain is 30~50%. Furthermore, to confirm the
optimization's effectiveness on other code paths, the results of fsdisk,
fsbuffer and fstime are also shown.
Here are the detailed test results of unixbench.
Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk
Without Patch
------------------------------------------------------------------------
File Copy 1024 bufsize 2000 maxblocks 875052.1 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 235484.0 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 2815153.5 KBps (30.0 s, 2 samples)
System Call Overhead 5772268.3 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 875052.1 2209.7
File Copy 256 bufsize 500 maxblocks 1655.0 235484.0 1422.9
File Copy 4096 bufsize 8000 maxblocks 5800.0 2815153.5 4853.7
System Call Overhead 15000.0 5772268.3 3848.2
========
System Benchmarks Index Score (Partial Only) 2768.3
With Patch
------------------------------------------------------------------------
File Copy 1024 bufsize 2000 maxblocks 1009977.2 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 264765.9 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 3052236.0 KBps (30.0 s, 2 samples)
System Call Overhead 8237404.4 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 1009977.2 2550.4
File Copy 256 bufsize 500 maxblocks 1655.0 264765.9 1599.8
File Copy 4096 bufsize 8000 maxblocks 5800.0 3052236.0 5262.5
System Call Overhead 15000.0 8237404.4 5491.6
========
System Benchmarks Index Score (Partial Only) 3295.3
Signed-off-by: chenzhiyin <[email protected]>
---
include/linux/fs.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 133f0640fb24..cf1388e4dad0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -962,23 +962,23 @@ struct file {
struct rcu_head f_rcuhead;
unsigned int f_iocb_flags;
};
- struct path f_path;
- struct inode *f_inode; /* cached value */
- const struct file_operations *f_op;
/*
* Protects f_ep, f_flags.
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
- atomic_long_t f_count;
- unsigned int f_flags;
fmode_t f_mode;
+ atomic_long_t f_count;
struct mutex f_pos_lock;
loff_t f_pos;
+ unsigned int f_flags;
struct fown_struct f_owner;
const struct cred *f_cred;
struct file_ra_state f_ra;
+ struct path f_path;
+ struct inode *f_inode; /* cached value */
+ const struct file_operations *f_op;
u64 f_version;
#ifdef CONFIG_SECURITY
--
2.39.1
On Thu, Jun 01, 2023 at 05:24:00AM -0400, chenzhiyin wrote:
> In the syscall test of UnixBench, performance regression occurred due
> to false sharing.
>
> The lock and atomic members, including file::f_lock, file::f_count and
> file::f_pos_lock are highly contended and frequently updated in the
> high-concurrency test scenarios. perf c2c indentified one affected
> read access, file::f_op.
> To prevent false sharing, the layout of file struct is changed as
> following
> (A) f_lock, f_count and f_pos_lock are put together to share the same
> cache line.
> (B) The read mostly members, including f_path, f_inode, f_op are put
> into a separate cache line.
> (C) f_mode is put together with f_count, since they are used frequently
> at the same time.
> Due to '__randomize_layout' attribute of file struct, the updated layout
> only can be effective when CONFIG_RANDSTRUCT_NONE is 'y'.
>
> The optimization has been validated in the syscall test of UnixBench.
> performance gain is 30~50%. Furthermore, to confirm the optimization
> effectiveness on the other codes path, the results of fsdisk, fsbuffer
> and fstime are also shown.
>
> Here are the detailed test results of unixbench.
>
> Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk
>
> Without Patch
> ------------------------------------------------------------------------
> File Copy 1024 bufsize 2000 maxblocks 875052.1 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 235484.0 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 2815153.5 KBps (30.0 s, 2 samples)
> System Call Overhead 5772268.3 lps (10.0 s, 7 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> File Copy 1024 bufsize 2000 maxblocks 3960.0 875052.1 2209.7
> File Copy 256 bufsize 500 maxblocks 1655.0 235484.0 1422.9
> File Copy 4096 bufsize 8000 maxblocks 5800.0 2815153.5 4853.7
> System Call Overhead 15000.0 5772268.3 3848.2
> ========
> System Benchmarks Index Score (Partial Only) 2768.3
>
> With Patch
> ------------------------------------------------------------------------
> File Copy 1024 bufsize 2000 maxblocks 1009977.2 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 264765.9 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 3052236.0 KBps (30.0 s, 2 samples)
> System Call Overhead 8237404.4 lps (10.0 s, 7 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1009977.2 2550.4
> File Copy 256 bufsize 500 maxblocks 1655.0 264765.9 1599.8
> File Copy 4096 bufsize 8000 maxblocks 5800.0 3052236.0 5262.5
> System Call Overhead 15000.0 8237404.4 5491.6
> ========
> System Benchmarks Index Score (Partial Only) 3295.3
>
> Signed-off-by: chenzhiyin <[email protected]>
> ---
Dave had some more concerns and perf analysis requests for this. So this
will be put on hold until these are addressed.
On 6/1/23 11:24, chenzhiyin wrote:
> In the syscall test of UnixBench, performance regression occurred due
> to false sharing.
>
> The lock and atomic members, including file::f_lock, file::f_count and
> file::f_pos_lock are highly contended and frequently updated in the
> high-concurrency test scenarios. perf c2c indentified one affected
> read access, file::f_op.
> To prevent false sharing, the layout of file struct is changed as
> following
> (A) f_lock, f_count and f_pos_lock are put together to share the same
> cache line.
> (B) The read mostly members, including f_path, f_inode, f_op are put
> into a separate cache line.
> (C) f_mode is put together with f_count, since they are used frequently
> at the same time.
> Due to '__randomize_layout' attribute of file struct, the updated layout
> only can be effective when CONFIG_RANDSTRUCT_NONE is 'y'.
>
> The optimization has been validated in the syscall test of UnixBench.
> performance gain is 30~50%. Furthermore, to confirm the optimization
> effectiveness on the other codes path, the results of fsdisk, fsbuffer
> and fstime are also shown.
>
> Here are the detailed test results of unixbench.
>
> Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk
>
> Without Patch
> ------------------------------------------------------------------------
> File Copy 1024 bufsize 2000 maxblocks 875052.1 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 235484.0 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 2815153.5 KBps (30.0 s, 2 samples)
> System Call Overhead 5772268.3 lps (10.0 s, 7 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> File Copy 1024 bufsize 2000 maxblocks 3960.0 875052.1 2209.7
> File Copy 256 bufsize 500 maxblocks 1655.0 235484.0 1422.9
> File Copy 4096 bufsize 8000 maxblocks 5800.0 2815153.5 4853.7
> System Call Overhead 15000.0 5772268.3 3848.2
> ========
> System Benchmarks Index Score (Partial Only) 2768.3
>
> With Patch
> ------------------------------------------------------------------------
> File Copy 1024 bufsize 2000 maxblocks 1009977.2 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 264765.9 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 3052236.0 KBps (30.0 s, 2 samples)
> System Call Overhead 8237404.4 lps (10.0 s, 7 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1009977.2 2550.4
> File Copy 256 bufsize 500 maxblocks 1655.0 264765.9 1599.8
> File Copy 4096 bufsize 8000 maxblocks 5800.0 3052236.0 5262.5
> System Call Overhead 15000.0 8237404.4 5491.6
> ========
> System Benchmarks Index Score (Partial Only) 3295.3
>
> Signed-off-by: chenzhiyin <[email protected]>
> ---
> include/linux/fs.h | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 133f0640fb24..cf1388e4dad0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -962,23 +962,23 @@ struct file {
> struct rcu_head f_rcuhead;
> unsigned int f_iocb_flags;
> };
> - struct path f_path;
> - struct inode *f_inode; /* cached value */
> - const struct file_operations *f_op;
>
> /*
> * Protects f_ep, f_flags.
> * Must not be taken from IRQ context.
> */
> spinlock_t f_lock;
> - atomic_long_t f_count;
> - unsigned int f_flags;
> fmode_t f_mode;
> + atomic_long_t f_count;
> struct mutex f_pos_lock;
> loff_t f_pos;
> + unsigned int f_flags;
> struct fown_struct f_owner;
> const struct cred *f_cred;
> struct file_ra_state f_ra;
> + struct path f_path;
> + struct inode *f_inode; /* cached value */
> + const struct file_operations *f_op;
>
> u64 f_version;
> #ifdef CONFIG_SECURITY
Maybe add a comment to the struct noting that the layout is cache line
optimized? I.e. any change to the structure that does not check cache
line placement might/will invalidate your optimization - your patch adds
maintenance overhead without giving a hint about that.
Thanks,
Bernd
Good questions.
perf has been applied to analyze the performance. In the syscall test, the patch can
reduce the CPU cycles for filp_close. Besides, the HITM count is also reduced from
43182 to 33146.
The test is not restricted to a set of adjacent cores. The numactl command is only
used to limit the number of processing cores. In most situations, only 8/16/32 CPU
cores are used. Performance improvement is still obvious, even if non-adjacent
CPU cores are used.
No matter the CPU type, cache size, or architecture, false sharing is
always detrimental to performance, and the read-mostly members should be
put together.
To further prove the updated layout's effectiveness on some other code
paths, results of fsdisk, fsbuffer, and fstime are also shown in the new
commit message.
Actually, the new layout can only reduce false sharing in high-contention situations.
The performance gain is not obvious, if there are some other bottlenecks. For
instance, if the cores are spread across multiple sockets, memory access may be
the new bottleneck due to NUMA.
Here are the results across NUMA nodes. The patch has no negative effect on the
performance result.
Command: numactl -C 0-3,16-19,63-66,72-75 ./Run -c 16 syscall fstime fsdisk fsbuffer
With Patch
Benchmark Run: Thu Jun 01 2023 03:13:52 - 03:23:15
224 CPUs in system; running 16 parallel copies of tests
File Copy 1024 bufsize 2000 maxblocks 589958.6 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 148779.2 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1968023.8 KBps (30.0 s, 2 samples)
System Call Overhead 5804316.1 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 589958.6 1489.8
File Copy 256 bufsize 500 maxblocks 1655.0 148779.2 899.0
File Copy 4096 bufsize 8000 maxblocks 5800.0 1968023.8 3393.1
System Call Overhead 15000.0 5804316.1 3869.5
========
System Benchmarks Index Score (Partial Only) 2047.8
Without Patch
Benchmark Run: Thu Jun 01 2023 02:11:45 - 02:21:08
224 CPUs in system; running 16 parallel copies of tests
File Copy 1024 bufsize 2000 maxblocks 571829.9 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 147693.8 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1938854.5 KBps (30.0 s, 2 samples)
System Call Overhead 5791936.3 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 571829.9 1444.0
File Copy 256 bufsize 500 maxblocks 1655.0 147693.8 892.4
File Copy 4096 bufsize 8000 maxblocks 5800.0 1938854.5 3342.9
System Call Overhead 15000.0 5791936.3 3861.3
========
System Benchmarks Index Score (Partial Only) 2019.5
> -----Original Message-----
> From: Dave Chinner <[email protected]>
> Sent: Thursday, June 1, 2023 6:31 AM
> To: Chen, Zhiyin <[email protected]>
> Cc: Eric Biggers <[email protected]>; Christian Brauner
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; Zou, Nanhai
> <[email protected]>; Feng, Xiaotian <[email protected]>
> Subject: Re: [PATCH] fs.h: Optimize file struct to prevent false sharing
>
> On Wed, May 31, 2023 at 10:31:09AM +0000, Chen, Zhiyin wrote:
> > As Eric said, CONFIG_RANDSTRUCT_NONE is set in the default config and
> > some production environments, including Ali Cloud. Therefore, it is
> > worthful to optimize the file struct layout.
> >
> > Here are the syscall test results of unixbench.
>
> Results look good, but the devil is in the detail....
>
> > Command: numactl -C 3-18 ./Run -c 16 syscall
>
> So the test is restricted to a set of adjacent cores within a single CPU socket,
> so all the cachelines are typically being shared within a single socket's CPU
> caches. IOWs, the fact there are 224 CPUs in the machine is largely irrelevant
> for this microbenchmark.
>
> i.e. is this a microbenchmark that is going faster simply because the working
> set for the specific benchmark now fits in L2 or L3 cache when it didn't before?
>
> Does this same result occur for different CPUs types, cache sizes and
> architectures? What about when the cores used by the benchmark are
> spread across mulitple sockets so the cost of remote cacheline access is taken
> into account? If this is actually a real benefit, then we should see similar or
> even larger gains between CPU cores that are further apart because the cost
> of false cacheline sharing are higher in those systems....
>
> > Without patch
> > ------------------------
> > 224 CPUs in system; running 16 parallel copies of tests
> > System Call Overhead 5611223.7 lps (10.0 s, 7 samples)
> > System Benchmarks Partial Index BASELINE RESULT INDEX
> > System Call Overhead 15000.0 5611223.7 3740.8
> > ========
> > System Benchmarks Index Score (Partial Only) 3740.8
> >
> > With patch
> > ----------------------------------------------------------------------
> > --
> > 224 CPUs in system; running 16 parallel copies of tests
> > System Call Overhead 7567076.6 lps (10.0 s, 7 samples)
> > System Benchmarks Partial Index BASELINE RESULT INDEX
> > System Call Overhead 15000.0 7567076.6 5044.7
> > ========
> > System Benchmarks Index Score (Partial Only) 5044.7
>
> Where is all this CPU time being saved? Do you have a profile showing what
> functions in the kernel are running far more efficiently now?
>
> Yes, the results look good, but if all this change is doing is micro-optimising a
> single code path, it's much less impressive and far more likley that it has no
> impact on real-world performance...
>
> More information, please!
>
> -Dave.
>
> --
> Dave Chinner
> [email protected]
On Thu, Jun 01, 2023 at 10:47:53AM +0000, Chen, Zhiyin wrote:
> Good questions.
> perf has been applied to analyze the performance. In the syscall test, the patch can
> reduce the CPU cycles for filp_close. Besides, the HITM count is also reduced from
> 43182 to 33146.
> The test is not restricted to a set of adjacent cores. The numactl command is only
> used to limit the number of processing cores.
And, in doing so, it limits the physical locality of the cores being
used to 3-18. That effectively puts them all on the same socket,
because the test is only using 16 CPUs and the scheduler tends to put
all related tasks on the same socket if there are enough idle CPUs to
do so....
> In most situations, only 8/16/32 CPU
> cores are used. Performance improvement is still obvious, even if non-adjacent
> CPU cores are used.
>
> No matter what CPU type, cache size, or architecture, false sharing is always
> negative on performance. And the read mostly members should be put together.
>
> To further prove the updated layout's effectiveness on some other code paths,
> results of fsdisk, fsbuffer, and fstime are also shown in the new commit message.
>
> Actually, the new layout can only reduce false sharing in high-contention situations.
> The performance gain is not obvious, if there are some other bottlenecks. For
> instance, if the cores are spread across multiple sockets, memory access may be
> the new bottleneck due to NUMA.
>
> Here are the results across NUMA nodes. The patch has no negative effect on the
> performance result.
>
> Command: numactl -C 0-3,16-19,63-66,72-75 ./Run -c 16 syscall fstime fsdisk fsbuffer
> With Patch
> Benchmark Run: Thu Jun 01 2023 03:13:52 - 03:23:15
> 224 CPUs in system; running 16 parallel copies of tests
>
> File Copy 1024 bufsize 2000 maxblocks 589958.6 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 148779.2 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 1968023.8 KBps (30.0 s, 2 samples)
> System Call Overhead 5804316.1 lps (10.0 s, 7 samples)
Ok, so very small data buffers and file sizes which means the
working set of the benchmark is almost certainly going to be CPU
cache resident.
This is a known problem with old IO benchmarks on modern CPUs - the
data set is small enough that it often fits mostly in the CPU cache
and so small variations in code layout can make 20-30%
difference in performance for file copy benchmarks. Use a different
compiler, or even a different filesystem, and the amazing gain
goes away and may even result in a regression....
For example, this has been a known problem with IOZone for at least
15 years now, making it largely unreliable as a benchmarking tool.
Unless, of course, you know exactly what you are doing and can avoid
all the tests that are susceptible to CPU cache residency
variations....
> System Benchmarks Partial Index BASELINE RESULT INDEX
> File Copy 1024 bufsize 2000 maxblocks 3960.0 589958.6 1489.8
> File Copy 256 bufsize 500 maxblocks 1655.0 148779.2 899.0
> File Copy 4096 bufsize 8000 maxblocks 5800.0 1968023.8 3393.1
> System Call Overhead 15000.0 5804316.1 3869.5
> ========
> System Benchmarks Index Score (Partial Only) 2047.8
>
> Without Patch
> Benchmark Run: Thu Jun 01 2023 02:11:45 - 02:21:08
> 224 CPUs in system; running 16 parallel copies of tests
>
> File Copy 1024 bufsize 2000 maxblocks 571829.9 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 147693.8 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 1938854.5 KBps (30.0 s, 2 samples)
> System Call Overhead 5791936.3 lps (10.0 s, 7 samples)
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> File Copy 1024 bufsize 2000 maxblocks 3960.0 571829.9 1444.0
> File Copy 256 bufsize 500 maxblocks 1655.0 147693.8 892.4
> File Copy 4096 bufsize 8000 maxblocks 5800.0 1938854.5 3342.9
> System Call Overhead 15000.0 5791936.3 3861.3
> ========
> System Benchmarks Index Score (Partial Only) 2019.5
Yeah, that's what I thought we'd see. i.e. as soon as we go
off-socket, there's no actual performance change. This generally
means there is no difference in cacheline sharing across CPUs
between the two tests. You can likely use `perf stat` to confirm
this from the hardware l1/l2/llc data cache miss counters; I'd guess
they are nearly identical with/without the patch.
If this truly was a false cacheline sharing situation, the
cross-socket test results should measurably increase in performance
as the frequently accessed read-only data cacheline is shared across
all CPU caches instead of being bounced exclusively between CPUs.
The amount of l1/l2/llc data cache misses during the workload should
reduce measurably if this is happening.
As a technical note, if you want to split data out into different
cachelines, you should be using annotations like
'____cacheline_aligned_in_smp' to align structures and variables
inside structures to the start of a new cacheline. Not only is this
self documenting, it will pad the structure appropriately to ensure
that the update-heavy variable(s) you want isolated to a new
cacheline are actually on a separate cacheline. It may be that the
manual cacheline separation isn't quite good enough to show
improvement on multi-socket machines, so improving the layout via
explicit alignment directives may show further improvement.
FYI, here's an example of how avoiding false sharing should improve
performance when we go off-socket. Here's a comparison of the same
16-way workload, one on a 2x8p dual socket machine (machine A), the
other running on a single-socket 16p machine (machine B). The workload
used 99% of all available CPU doing bulk file removal.
commit b0dff466c00975a3e3ec97e6b0266bfd3e4805d6
Author: Dave Chinner <[email protected]>
Date: Wed May 20 13:17:11 2020 -0700
xfs: separate read-only variables in struct xfs_mount
Seeing massive cpu usage from xfs_agino_range() on one machine;
instruction level profiles look similar to another machine running
the same workload, only one machine is consuming 10x as much CPU as
the other and going much slower. The only real difference between
the two machines is core count per socket. Both are running
identical 16p/16GB virtual machine configurations
Machine A:
25.83% [k] xfs_agino_range
12.68% [k] __xfs_dir3_data_check
6.95% [k] xfs_verify_ino
6.78% [k] xfs_dir2_data_entry_tag_p
3.56% [k] xfs_buf_find
2.31% [k] xfs_verify_dir_ino
2.02% [k] xfs_dabuf_map.constprop.0
1.65% [k] xfs_ag_block_count
And takes around 13 minutes to remove 50 million inodes.
Machine B:
13.90% [k] __pv_queued_spin_lock_slowpath
3.76% [k] do_raw_spin_lock
2.83% [k] xfs_dir3_leaf_check_int
2.75% [k] xfs_agino_range
2.51% [k] __raw_callee_save___pv_queued_spin_unlock
2.18% [k] __xfs_dir3_data_check
2.02% [k] xfs_log_commit_cil
And takes around 5m30s to remove 50 million inodes.
Suspect is cacheline contention on m_sectbb_log which is used in one
of the macros in xfs_agino_range. This is a read-only variable but
shares a cacheline with m_active_trans which is a global atomic that
gets bounced all around the machine.
The workload is trying to run hundreds of thousands of transactions
per second and hence cacheline contention will be occurring on this
atomic counter. Hence xfs_agino_range() is likely just an
innocent bystander as the cache coherency protocol fights over the
cacheline between CPU cores and sockets.
On machine A, this rearrangement of the struct xfs_mount
results in the profile changing to:
9.77% [kernel] [k] xfs_agino_range
6.27% [kernel] [k] __xfs_dir3_data_check
5.31% [kernel] [k] __pv_queued_spin_lock_slowpath
4.54% [kernel] [k] xfs_buf_find
3.79% [kernel] [k] do_raw_spin_lock
3.39% [kernel] [k] xfs_verify_ino
2.73% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
of machine B and still runs substantially slower than it should.
Current rm -rf of 50 million files:
vanilla patched
machine A 13m20s 6m42s
machine B 5m30s 5m02s
It's an improvement, hence indicating that separation and further
optimisation of read-only global filesystem data is worthwhile, but
it clearly isn't the underlying issue causing this specific
performance degradation.
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Notice how much of an improvement occurred on the 2x8p system vs the
single-socket 16p machine when the false sharing was removed? The 16p
machine showed ~10% reduction in CPU time, whilst the 2x8p showed a 50%
reduction in CPU time. That's the sort of gain I'd expect if false
sharing was an issue for this workload. The lack of multi-socket
performance improvement tends to indicate that false sharing is not
occurring and that something else has resulted in the single socket
performance increases....
Cheers,
Dave.
--
Dave Chinner
[email protected]
> [ full quote of the preceding reply trimmed ]
Thanks for your questions and comments.
Certainly, both the data buffers and the file sizes in UnixBench are very small.
The file copy results are shown only to demonstrate that the patch has no
negative effect on IO performance or other code paths. In the real world,
different performance problems have different bottlenecks and need different
solutions. This patch can only reduce false sharing in high-contention
situations where different threads and processes operate on the same file, so
there is no obvious performance improvement when the bottleneck is something
else. However, a reasonable layout of the file struct is still worthwhile.
As is well known, '____cacheline_aligned_in_smp' can be used to separate data
into different cache lines. However, it also grows the struct, which costs more
memory and could cause performance regressions in other cases. Therefore, this
patch only rearranges the layout of the file struct carefully, without adding
any padding.
To dig into the performance, perf c2c was used instead of perf stat, since the
count of HITM events (loads that hit a modified cacheline in another core's
cache) helps identify false sharing. The perf c2c results are attached as
plain text.
From the perf c2c results:
(1) In all four cases, more than 99% of the HITM events hit the same address,
which is in fact the cacheline holding file::f_op.
(2) Comparing runs on a single NUMA node against runs across NUMA nodes, both
remote HITM and remote DRAM loads increase substantially, which causes the
performance regression. In UnixBench, the syscall test completes fewer
iterations in the fixed duration (10s) when running across NUMA nodes, and the
HITM count is correspondingly lower even with the same kernel.
(3) Comparing the kernel with and without the patch, the number of HITM events
decreases, whether the test runs on a single NUMA node or across multiple NUMA
nodes.
In summary, this patch reduces the false sharing caused by the file struct in
high-contention situations. However, if false sharing is not the bottleneck
(e.g. when running across NUMA nodes, remote DRAM access may become the new
bottleneck), the patch cannot bring an obvious performance improvement.
If any other performance data would help, please feel free to let me know.
Cheers.
(a) With Patch
(a.1) Single NUMA Node
=================================================
Trace Event Information
=================================================
Total records : 12591573
Locked Load/Store Operations : 371850
Load Operations : 594693
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 0
Loads - no mapping : 0
Load Fill Buffer Hit : 136631
Load L1D hit : 346615
Load L2D hit : 113
Load LLC hit : 111273
Load Local HITM : 86212
Load Remote HITM : 16
Load Remote HIT : 0
Load Local DRAM : 16
Load Remote DRAM : 45
Load MESI State Exclusive : 45
Load MESI State Shared : 16
Load LLC Misses : 77
Load access blocked by data : 320
Load access blocked by address : 359024
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline --------------- Tot ------- Load Hitm ------- Total
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... .......
#
0 0xff2dfd8dfd4cd200 1 86964 99.80% 86058 86058 0
(a.2) Multiple NUMA Nodes
=================================================
Trace Event Information
=================================================
Total records : 12168050
Locked Load/Store Operations : 293778
Load Operations : 422976
Loads - uncacheable : 2
Loads - IO : 0
Loads - Miss : 0
Loads - no mapping : 0
Load Fill Buffer Hit : 99324
Load L1D hit : 248307
Load L2D hit : 130
Load LLC hit : 58779
Load Local HITM : 48665
Load Remote HITM : 13471
Load Remote HIT : 0
Load Local DRAM : 2658
Load Remote DRAM : 13776
Load MESI State Exclusive : 13776
Load MESI State Shared : 2658
Load LLC Misses : 29905
Load access blocked by data : 373
Load access blocked by address : 258037
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline --------------- Tot ------- Load Hitm -------
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... .......
#
0 0xff2dfd8dfd4cd200 1 59737 99.43% 61780 48409 13371
(b) Without Patch
(b.1) Single NUMA Node
=================================================
Trace Event Information
=================================================
Total records : 12919299
Locked Load/Store Operations : 373015
Load Operations : 764465
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 0
Loads - no mapping : 0
Load Fill Buffer Hit : 269056
Load L1D hit : 395309
Load L2D hit : 75
Load LLC hit : 100013
Load Local HITM : 95001
Load Remote HITM : 3
Load Remote HIT : 0
Load Local DRAM : 7
Load Remote DRAM : 5
Load MESI State Exclusive : 5
Load MESI State Shared : 7
Load LLC Misses : 15
Load access blocked by data : 335
Load access blocked by address : 545957
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline --------------- Tot ------- Load Hitm -------
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... .......
#
0 0xff33448e58b68400 0 220190 99.79% 94801 94801 0
(b.2) Multiple NUMA Nodes
=================================================
Trace Event Information
=================================================
Total records : 12409259
Locked Load/Store Operations : 325557
Load Operations : 625900
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 18
Loads - no mapping : 0
Load Fill Buffer Hit : 219323
Load L1D hit : 324152
Load L2D hit : 87
Load LLC hit : 64345
Load Local HITM : 61334
Load Remote HITM : 16979
Load Remote HIT : 0
Load Local DRAM : 756
Load Remote DRAM : 17219
Load MESI State Exclusive : 17219
Load MESI State Shared : 756
Load LLC Misses : 34954
Load access blocked by data : 387
Load access blocked by address : 443196
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline --------------- Tot ------- Load Hitm -------
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... .......
#
0 0xff33448e58b68400 0 190537 99.71% 78088 61176 16912
Best Regards
Zhiyin
On Thu, 01 Jun 2023 05:24:00 -0400, chenzhiyin wrote:
> In the syscall test of UnixBench, performance regression occurred due
> to false sharing.
>
> The lock and atomic members, including file::f_lock, file::f_count and
> file::f_pos_lock are highly contended and frequently updated in the
> high-concurrency test scenarios. perf c2c identified one affected
> read access, file::f_op.
> To prevent false sharing, the layout of file struct is changed as
> follows:
> (A) f_lock, f_count and f_pos_lock are put together to share the same
> cache line.
> (B) The read mostly members, including f_path, f_inode, f_op are put
> into a separate cache line.
> (C) f_mode is put together with f_count, since they are used frequently
> at the same time.
> Due to the '__randomize_layout' attribute of file struct, the updated layout
> can only take effect when CONFIG_RANDSTRUCT_NONE is 'y'.
>
> [...]
Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc
[1/1] fs.h: Optimize file struct to prevent false sharing
https://git.kernel.org/vfs/vfs/c/b63bfcf3c65d