2024-02-02 09:36:55

by JonasZhou

Subject: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In struct address_space, there is a 32-byte gap between i_mmap
and i_mmap_rwsem. Because struct address_space is only aligned to
8 bytes, in certain placements i_mmap and i_mmap_rwsem can end up
in the same cache line.

While running UnixBench/execl, we observe heavy false sharing
between accesses to i_mmap and i_mmap_rwsem. Move i_mmap_rwsem
below i_private_list so that there is a 64-byte gap between i_mmap
and i_mmap_rwsem, keeping the two on separate cache lines.
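
(Illustrative aside, not part of the fix: the placement arithmetic can be
checked with a small userspace sketch. The structs gap32/gap64 and the
same_line() helper below are invented mocks, not struct address_space;
the real offsets can be inspected with e.g. "pahole -C address_space
vmlinux".)

#include <stdio.h>
#include <stddef.h>

/* Mock layouts: two 8-byte members separated by a 32-byte vs a 64-byte gap. */
struct gap32 { void *a; char pad[24]; void *b; };   /* offsetof(b) - offsetof(a) == 32 */
struct gap64 { void *a; char pad[56]; void *b; };   /* offsetof(b) - offsetof(a) == 64 */

/* True if both members fall into the same 64-byte cache line when the
 * enclosing struct starts at (base mod 64) within a line. */
static int same_line(size_t base, size_t off_a, size_t off_b)
{
        return (base + off_a) / 64 == (base + off_b) / 64;
}

int main(void)
{
        /* With only sizeof(long) == 8 alignment, the struct may start at
         * any multiple of 8 within a cache line. */
        for (size_t base = 0; base < 64; base += 8)
                printf("base %2zu: 32-byte gap shares a line: %d, 64-byte gap: %d\n",
                       base,
                       same_line(base, offsetof(struct gap32, a), offsetof(struct gap32, b)),
                       same_line(base, offsetof(struct gap64, a), offsetof(struct gap64, b)));
        return 0;
}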

On Intel Silver machines (2 sockets) running kernel v6.8-rc2, the
UnixBench/execl score improves by ~3.94% and the UnixBench/shell
score improves by ~3.26%.

Baseline:
-------------------------------------------------------------
162 546 748 11374 21 0xffff92e266af90c0
-------------------------------------------------------------
46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1

With this change:
-------------------------------------------------------------
360 12 300 57 35 0xffff982cdae76400
-------------------------------------------------------------
50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0

Signed-off-by: JonasZhou-oc <[email protected]>
---
include/linux/fs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..2d6ccde5d1be 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -482,10 +482,10 @@ struct address_space {
pgoff_t writeback_index;
const struct address_space_operations *a_ops;
unsigned long flags;
- struct rw_semaphore i_mmap_rwsem;
errseq_t wb_err;
spinlock_t i_private_lock;
struct list_head i_private_list;
+ struct rw_semaphore i_mmap_rwsem;
void * i_private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
/*
--
2.25.1



2024-02-02 10:21:29

by Christian Brauner

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

[That needs a review from Willy.]

On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> In the struct address_space, there is a 32-byte gap between i_mmap
> and i_mmap_rwsem. Due to the alignment of struct address_space
> variables to 8 bytes, in certain situations, i_mmap and
> i_mmap_rwsem may end up in the same CACHE line.
>
> While running Unixbench/execl, we observe high false sharing issues
> when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> after i_private_list, ensuring a 64-byte gap between i_mmap and
> i_mmap_rwsem.
>
> For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the
> score of Unixbench/execl improves by ~3.94%, and the score of
> Unixbench/shell improves by ~3.26%.
>
> Baseline:
> -------------------------------------------------------------
> 162 546 748 11374 21 0xffff92e266af90c0
> -------------------------------------------------------------
> 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
> 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
> 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
> 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
> 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
> 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
> 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
> 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
> 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
> 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
> 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
> 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
> 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
> 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1
>
> with this change:
> -------------------------------------------------------------
> 360 12 300 57 35 0xffff982cdae76400
> -------------------------------------------------------------
> 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
> 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
> 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
> 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
> 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
> 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
> 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
> 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0
>
> Signed-off-by: JonasZhou-oc <[email protected]>
> ---
> include/linux/fs.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ed5966a70495..2d6ccde5d1be 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -482,10 +482,10 @@ struct address_space {
> pgoff_t writeback_index;
> const struct address_space_operations *a_ops;
> unsigned long flags;
> - struct rw_semaphore i_mmap_rwsem;
> errseq_t wb_err;
> spinlock_t i_private_lock;
> struct list_head i_private_list;
> + struct rw_semaphore i_mmap_rwsem;
> void * i_private_data;
> } __attribute__((aligned(sizeof(long)))) __randomize_layout;
> /*
> --
> 2.25.1
>

2024-02-02 15:04:11

by Matthew Wilcox

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> In the struct address_space, there is a 32-byte gap between i_mmap
> and i_mmap_rwsem. Due to the alignment of struct address_space
> variables to 8 bytes, in certain situations, i_mmap and
> i_mmap_rwsem may end up in the same CACHE line.
>
> While running Unixbench/execl, we observe high false sharing issues
> when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> after i_private_list, ensuring a 64-byte gap between i_mmap and
> i_mmap_rwsem.

I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
and the thing it's protecting in the same cacheline. Why is that not
the case here?


2024-02-03 01:32:21

by Matthew Wilcox

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > In the struct address_space, there is a 32-byte gap between i_mmap
> > and i_mmap_rwsem. Due to the alignment of struct address_space
> > variables to 8 bytes, in certain situations, i_mmap and
> > i_mmap_rwsem may end up in the same CACHE line.
> >
> > While running Unixbench/execl, we observe high false sharing issues
> > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > i_mmap_rwsem.
>
> I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> and the thing it's protecting in the same cacheline. Why is that not
> the case here?

We actually had this seven months ago:

https://lore.kernel.org/all/[email protected]/

Unfortunately, no argumentation was forthcoming about *why* this was
the right approach. All we got was a different patch and an assertion
that it still improved performance.

We need to understand what's going on! Please don't do the same thing
as the other submitter and just assert that it does.

2024-02-05 03:22:33

by Dave Chinner

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Fri, Feb 02, 2024 at 07:32:36PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > variables to 8 bytes, in certain situations, i_mmap and
> > > i_mmap_rwsem may end up in the same CACHE line.
> > >
> > > While running Unixbench/execl, we observe high false sharing issues
> > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > i_mmap_rwsem.
> >
> > I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> > and the thing it's protecting in the same cacheline.

You are correct in the case that there is never any significant
contention on the lock. i.e. gaining the lock will also pull the
cacheline for the object it protects and so avoid an extra memory
fetch.

However....

> > Why is that not
> > the case here?
>
> We actually had this seven months ago:
>
> https://lore.kernel.org/all/[email protected]/
>
> Unfortunately, no argumentation was forthcoming about *why* this was
> the right approach. All we got was a different patch and an assertion
> that it still improved performance.
>
> We need to understand what's going on! Please don't do the same thing
> as the other submitter and just assert that it does.

Intuition tells me that what the OP is seeing is the opposite case
to above: there is significant contention on the lock. In that case,
optimal "contention performance" comes from separating the lock and
the objects it protects into different cachelines.

The reason for this is that if the lock and objects it protects are
on the same cacheline, lock contention affects both the lock and the
objects being manipulated inside the critical section. i.e. attempts
to grab the lock pull the cacheline away from the CPU that holds the
lock, and then accesses to the object that are protected by the lock
then have to pull the cacheline back.

i.e. the cost of the extra memory fetch from an uncontended
cacheline is less than the cost of having to repeatedly fetch the
memory inside a critical section on a contended cacheline.
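
(A minimal layout sketch of the two cases, using invented names
(same_cacheline, separate_cachelines, protected_tree) and a userspace
atomic as the stand-in lock, just to make the trade-off concrete:)

#include <stdatomic.h>

struct protected_tree;                  /* stand-in for the guarded object */

/* Case 1: lock word and object pointer share one 64-byte line. Waiters
 * spinning/queueing on 'lock' keep stealing the line from the holder,
 * which is dereferencing 'root' inside the critical section. */
struct same_cacheline {
        atomic_int lock;
        struct protected_tree *root;
};

/* Case 2: the object is pushed onto its own line. Contention on 'lock'
 * no longer evicts the data the holder is working on; the cost is one
 * extra line fetch in the uncontended case. */
struct separate_cachelines {
        atomic_int lock;
        struct protected_tree *root __attribute__((aligned(64)));
};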

I consider optimisation attempts like this the canary in the mine:
it won't be long before these or similar workloads report
catastrophic lock contention on the lock in question. Moving items
in the structure is equivalent to re-arranging the deck chairs
whilst the ship sinks - we might keep our heads above water a
little longer, but the ship is still sinking and we're still going
to have to fix the leak sooner rather than later...

-Dave.
--
Dave Chinner
[email protected]

2024-02-05 06:22:46

by JonasZhou

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

> On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > variables to 8 bytes, in certain situations, i_mmap and
> > > i_mmap_rwsem may end up in the same CACHE line.
> > >
> > > While running Unixbench/execl, we observe high false sharing issues
> > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > i_mmap_rwsem.
> >
> > I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> > and the thing it's protecting in the same cacheline. Why is that not
> > the case here?
>
> We actually had this seven months ago:
>
> https://lore.kernel.org/all/[email protected]/
>
> Unfortunately, no argumentation was forthcoming about *why* this was
> the right approach. All we got was a different patch and an assertion
> that it still improved performance.
>
> We need to understand what's going on! Please don't do the same thing
> as the other submitter and just assert that it does.

When running UnixBench/execl, each execl process repeatedly performs
i_mmap_lock_write -> vma_interval_tree_remove/insert ->
i_mmap_unlock_write. As the tables below show, when i_mmap and
i_mmap_rwsem are in the same cache line, there are more HITM events.

Func0: i_mmap_lock_write
Func1: vma_interval_tree_remove/insert
Func2: i_mmap_unlock_write

In the same cache line:
Process A | Process B | Process C | Process D | CACHE Line state
----------+-----------+-----------+-----------+-----------------
Func0 | | | | I->M
| Func0 | | | HITM M->S
Func1 | | | | may change to M
| | Func0 | | HITM M->S
Func2 | | | | S->M
| | | Func0 | HITM M->S

In different cache lines:
Process A | Process B | Process C | Process D | CACHE Line state
----------+-----------+-----------+-----------+-----------------
Func0 | | | | I->M
| Func0 | | | HITM M->S
Func1 | | | |
| | Func0 | | S->S
Func2 | | | | S->M
| | | Func0 | HITM M->S

The same issue occurs in UnixBench/shell, because the shell launches
a lot of shell commands that load executable files and dynamic
libraries into memory, execute, and exit.

Yes, that commit has been merged into the Linux kernel, but there
is still an issue. After moving i_mmap_rwsem below flags, there is
a 32-byte gap between i_mmap_rwsem and i_mmap. However, struct
address_space is only aligned to sizeof(long), which is 8 on the
x86-64 architecture. As a result, i_mmap_rwsem and i_mmap may be
placed in the same cache line, causing a false sharing problem.
This has been observed with the perf c2c tool.
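
(For anyone who wants to see the effect in isolation, here is a minimal
userspace sketch of the same false-sharing pattern; struct counters,
bump_a/bump_b and ITERS are invented for the example, not taken from the
kernel. Running it under "perf c2c record" and comparing "perf c2c report"
with and without the padding shows the HITM difference.)

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

/* Two counters, each written by its own thread. Without the padding they
 * share one 64-byte cache line, so every increment from one core hits a
 * Modified line owned by the other core (the HITM events above); with the
 * padding uncommented the line ping-pong disappears. */
struct counters {
        volatile unsigned long a;
        /* char pad[56]; */             /* uncomment to split the cache line */
        volatile unsigned long b;
} __attribute__((aligned(64)));

static struct counters c;

static void *bump_a(void *unused)
{
        (void)unused;
        for (unsigned long i = 0; i < ITERS; i++)
                c.a++;
        return NULL;
}

static void *bump_b(void *unused)
{
        (void)unused;
        for (unsigned long i = 0; i < ITERS; i++)
                c.b++;
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%lu %lu\n", c.a, c.b);
        return 0;
}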

2024-02-05 23:16:22

by Dave Chinner

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Mon, Feb 05, 2024 at 02:22:29PM +0800, JonasZhou wrote:
> > On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > > variables to 8 bytes, in certain situations, i_mmap and
> > > > i_mmap_rwsem may end up in the same CACHE line.
> > > >
> > > > While running Unixbench/execl, we observe high false sharing issues
> > > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > > i_mmap_rwsem.
> > >
> > > I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> > > and the thing it's protecting in the same cacheline. Why is that not
> > > the case here?
> >
> > We actually had this seven months ago:
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > Unfortunately, no argumentation was forthcoming about *why* this was
> > the right approach. All we got was a different patch and an assertion
> > that it still improved performance.
> >
> > We need to understand what's going on! Please don't do the same thing
> > as the other submitter and just assert that it does.
>
> When running UnixBench/execl, each execl process repeatedly performs
> i_mmap_lock_write -> vma_interval_tree_remove/insert ->
> i_mmap_unlock_write. As indicated below, when i_mmap and i_mmap_rwsem
> are in the same CACHE Line, there will be more HITM.

As I expected, your test is exercising the contention case rather
than the single, uncontended case. As such, your patch is simply
optimising the structure layout for the contended case at the
expense of an extra cacheline miss in the uncontended case.

I'm not an mm expert, so I don't know which case we should optimise
for.

However, the existing code is not obviously wrong; it's just that
your micro-benchmark exercises the pathological worst case for the
optimisation choices made for this structure. Whether the contention
case is worth optimising is the first decision that needs to be
made; then people can decide whether hacking minor optimisations
into the code is good enough or whether reworking the locking and/or
algorithm to avoid the contention altogether is a better direction...

-Dave.
--
Dave Chinner
[email protected]

2024-02-05 23:28:45

by Matthew Wilcox

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Mon, Feb 05, 2024 at 02:22:18PM +1100, Dave Chinner wrote:
> Intuition tells me that what the OP is seeing is the opposite case
> to above: there is significant contention on the lock. In that case,
> optimal "contention performance" comes from separating the lock and
> the objects it protects into different cachelines.
>
> The reason for this is that if the lock and objects it protects are
> on the same cacheline, lock contention affects both the lock and the
> objects being manipulated inside the critical section. i.e. attempts
> to grab the lock pull the cacheline away from the CPU that holds the
> lock, and then accesses to the object that are protected by the lock
> then have to pull the cacheline back.
>
> i.e. the cost of the extra memory fetch from an uncontended
> cacheline is less than the cost of having to repeatedly fetch the
> memory inside a critical section on a contended cacheline.
>
> I consider optimisation attempts like this the canary in the mine:
> it won't be long before these or similar workloads report
> catastrophic lock contention on the lock in question. Moving items
> in the structure is equivalent to re-arranging the deck chairs
> whilst the ship sinks - we might keep our heads above water a
> little longer, but the ship is still sinking and we're still going
> to have to fix the leak sooner rather than later...

So the fundamental problem is our data structure. It's actually two
problems wrapped up in one bad idea.

i_mmap is a struct rb_root_cached:

struct rb_root_cached {
struct rb_root rb_root;
struct rb_node *rb_leftmost;
};

struct rb_root {
struct rb_node *rb_node;
};

so it's two pointers into the tree; one to the leftmost node, one
to the topmost node. That means we're modifying one or both of these
pointers frequently. I imagine it's the rb_root, which is being modified
frequently because we're always ... appending to the right? And so
we're rotating frequently. Worst case would be if we're appending to
the left and modifying both pointers.

There are things we could do to address this by making rotations less
frequent for the common case, which I suspect is mapping the entire file.
And perhaps we should do these things as a stopgap.

The larger problem is that rbtrees suck. They suck the same way that
linked lists suck; the CPU can't prefetch very far ahead (or in this
case down). It can do a little more work in that it can prefetch both
the left & right pointers, but it can't fetch the grandchildren until the
children fetches have finished, so the dependent reads have us stumped.

The solution to this problem is to change the interval tree data structure
from a Red-Black tree to a B-tree, or something similar where we use
an array of pointers instead of a single pointer. Not the Maple Tree;
that is designed for non-overlapping ranges. One could take inspiration
from the Maple Tree and design a very similar data structure that can
store and iterate over overlapping ranges. I can't understand the macros
this late at night, so I don't fully understand what the interval tree
is doing, but I can probably make a more fully informed observation by
the end of the week.
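
(A rough userspace sketch of the access-pattern difference, with invented
types and helpers (rb_like/find_rb, btree_like/find_in_node) rather than
the kernel's interval tree macros:)

#include <stddef.h>

/* Binary-tree style node: descending is a chain of dependent loads, up
 * to one serialized cache miss per level. */
struct rb_like {
        unsigned long key;
        struct rb_like *left, *right;
};

struct rb_like *find_rb(struct rb_like *n, unsigned long key)
{
        while (n && n->key != key)
                n = key < n->key ? n->left : n->right;
        return n;
}

/* B-tree style node: the keys sit in one contiguous array spanning a
 * couple of cache lines, so lookups within a node are independent,
 * predictable loads the CPU can prefetch. */
#define NODE_KEYS 16

struct btree_like {
        unsigned long key[NODE_KEYS];
        void *slot[NODE_KEYS];          /* child node or value */
        int nr;                         /* number of used slots */
};

void *find_in_node(struct btree_like *n, unsigned long key)
{
        for (int i = 0; i < n->nr; i++)
                if (n->key[i] >= key)
                        return n->slot[i];
        return NULL;
}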

2024-02-06 01:34:35

by Matthew Wilcox

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Mon, Feb 05, 2024 at 02:22:29PM +0800, JonasZhou wrote:
> When running UnixBench/execl, each execl process repeatedly performs
> i_mmap_lock_write -> vma_interval_tree_remove/insert ->
> i_mmap_unlock_write. As indicated below, when i_mmap and i_mmap_rwsem
> are in the same CACHE Line, there will be more HITM.

(I wasn't familiar with the term HITM. For anyone else who's
unfamiliar, this appears to mean a HIT in another core's cache, which
has the cacheline in the Modified state)

> Func0: i_mmap_lock_write
> Func1: vma_interval_tree_remove/insert
> Func2: i_mmap_unlock_write
> In the same CACHE Line
> Process A | Process B | Process C | Process D | CACHE Line state
> ----------+-----------+-----------+-----------+-----------------
> Func0 | | | | I->M
> | Func0 | | | HITM M->S
> Func1 | | | | may change to M
> | | Func0 | | HITM M->S
> Func2 | | | | S->M
> | | | Func0 | HITM M->S
>
> In different CACHE Lines
> Process A | Process B | Process C | Process D | CACHE Line state
> ----------+-----------+-----------+-----------+-----------------
> Func0 | | | | I->M
> | Func0 | | | HITM M->S
> Func1 | | | |
> | | Func0 | | S->S
> Func2 | | | | S->M
> | | | Func0 | HITM M->S
>
> The same issue will occur in Unixbench/shell because the shell
> launches a lot of shell commands, loads executable files and dynamic
> libraries into memory, execute, and exit.

OK, I see.

> Yes, his commit has been merged into the Linux kernel, but there
> is an issue. After moving i_mmap_rwsem below flags, there is a
> 32-byte gap between i_mmap_rwsem and i_mmap. However, the struct
> address_space is aligned to sizeof(long), which is 8 on the x86-64
> architecture. As a result, i_mmap_rwsem and i_mmap may be placed on
> the same CACHE Line, causing a false sharing problem. This issue has
> been observed using the perf c2c tool.

Got it. OK, let's put this patch in. It's a stopgap measure, clearly.
I'll reply to Dave's email with a longer term solution.

2024-02-06 13:07:00

by Christian Brauner

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

> Got it. OK, let's put this patch in. It's a stopgap measure, clearly.
> I'll reply to Dave's email with a longer term solution.

Fyi, I picked this up yesterday. You should've gotten a notification,
hopefully.

2024-02-06 21:36:15

by Dave Chinner

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Mon, Feb 05, 2024 at 11:28:29PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 05, 2024 at 02:22:18PM +1100, Dave Chinner wrote:
> > Intuition tells me that what the OP is seeing is the opposite case
> > to above: there is significant contention on the lock. In that case,
> > optimal "contention performance" comes from separating the lock and
> > the objects it protects into different cachelines.
> >
> > The reason for this is that if the lock and objects it protects are
> > on the same cacheline, lock contention affects both the lock and the
> > objects being manipulated inside the critical section. i.e. attempts
> > to grab the lock pull the cacheline away from the CPU that holds the
> > lock, and then accesses to the object that are protected by the lock
> > then have to pull the cacheline back.
> >
> > i.e. the cost of the extra memory fetch from an uncontended
> > cacheline is less than the cost of having to repeatedly fetch the
> > memory inside a critical section on a contended cacheline.
> >
> > I consider optimisation attempts like this the canary in the mine:
> > it won't be long before these or similar workloads report
> > catastrophic lock contention on the lock in question. Moving items
> > in the structure is equivalent to re-arranging the deck chairs
> > whilst the ship sinks - we might keep our heads above water a
> > little longer, but the ship is still sinking and we're still going
> > to have to fix the leak sooner rather than later...
>
> So the fundamental problem is our data structure. It's actually two
> problems wrapped up in one bad idea.
>
> i_mmap is a struct rb_root_cached:
>
> struct rb_root_cached {
> struct rb_root rb_root;
> struct rb_node *rb_leftmost;
> };
>
> struct rb_root {
> struct rb_node *rb_node;
> };
>
> so it's two pointers into the tree; one to the leftmost node, one
> to the topmost node. That means we're modifying one or both of these
> pointers frequently. I imagine it's the rb_root, which is being modified
> frequently because we're always ... appending to the right? And so
> we're rotating frequently. Worst case would be if we're appending to
> the left and modifying both pointers.
>
> There are things we could do to address this by making rotations less
> frequent for the common case, which I suspect is mapping the entire file.
> And perhaps we should do these things as a stopgap.
>
> The larger problem is that rbtrees suck. They suck the same way that
> linked lists suck; the CPU can't prefetch very far ahead (or in this
> case down). It can do a little more work in that it can prefetch both
> the left & right pointers, but it can't fetch the grandchildren until the
> children fetches have finished, so the dependent reads have us stumped.

Yes, pretty much all balanced binary search trees have this problem.
I pointed out this problem with rbtrees many years ago when someone
tried to implement range locks with an rbtree to track locked
ranges. On modern CPUs, trees with array-based nodes (btrees, the
Linux radix tree, etc.) are far more efficient. But...

> The solution to this problem is to change the interval tree data structure
> from a Red-Black tree to a B-tree, or something similar where we use
> an array of pointers instead of a single pointer.

... B-trees are not immune to pointer-chasing problems, either.
Most use binary searches within the nodes, so we have the problem
of unpredictable cacheline misses within the nodes, as well as
being unable to do depth-based prefetch, just as with rbtrees.

Perhaps we should be looking at something like this:

https://lore.kernel.org/linux-fsdevel/[email protected]/

-Dave.
--
Dave Chinner
[email protected]

2024-02-07 02:40:36

by Matthew Wilcox

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Wed, Feb 07, 2024 at 08:35:59AM +1100, Dave Chinner wrote:
> > The solution to this problem is to change the interval tree data structure
> > from a Red-Black tree to a B-tree, or something similar where we use
> > an array of pointers instead of a single pointer.
>
> .... B-trees are not immune to pointer chasing problems, either.
> Most use binary searches within the nodes, and so we have the
> problem of unpredictable cacheline misses within the nodes as well
> as being unable to do depth based prefetch similar to rbtrees.
>
> Perhaps we should be looking at something like this:
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/

I need more data (and maybe Kent has it!)

Without any data, I believe that Eytzinger layout is an idea that was
a really good one in the 1990s/2000s and has now passed its expiration
date because of the changed nature of hardware.

In the mid-90s, we could do O(10) instructions in the time it took to
service an LLC miss. Today, it's more like O(2500). That means it is
far more important to be predictable than it is to execute the minimum
number of instructions. If your B-tree nodes are 256kB in size (I believe
that's what bcachefs is using?), then Eytzinger layout may make some
sense, but if you're using smaller nodes (which I further believe is
appropriate for an in-memory B-tree), then a straight 'load each index
and compare it' may outperform a search that jumps around inside a node.
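
(A sketch of the two in-node search strategies being compared; the
helpers linear_lower_bound/binary_lower_bound are invented for
illustration and assume a small sorted key array per node:)

#include <stddef.h>

/* Linear lower-bound: every iteration touches the next sequential key,
 * so both the hardware prefetcher and the branch predictor do well. */
size_t linear_lower_bound(const unsigned long *keys, size_t n, unsigned long key)
{
        size_t i = 0;

        while (i < n && keys[i] < key)
                i++;
        return i;
}

/* Binary lower-bound: each probe lands at a data-dependent offset, so
 * for nodes larger than a few cache lines the misses and branches are
 * hard to predict, even though fewer comparisons are executed. */
size_t binary_lower_bound(const unsigned long *keys, size_t n, unsigned long key)
{
        size_t lo = 0, hi = n;

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;

                if (keys[mid] < key)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo;
}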

I'm pontificating and will happily yield to someone who has data.
I've tried to mark my assumptions clearly above.


Something else I'm not aware of (and that filesystem B-trees have no
experience of) is what research exists on efficiently adding new
entries to a balanced tree so as to minimise rebalancing. Filesystems
are like the Maple Tree in that for every logical offset inside a file,
there is precisely one answer to "what physical block does this map to".

For the i_mmap tree, we want instead to answer the question "Which VMAs
have an intersection with this range of the file", and for the benchmark
in question, we will have a large number of entries that compare equal to
each other (they have the same start, and same end, but different values).
So they could be inserted at many different points in the tree. We'd like
to choose the point which causes the least amount of rebalancing.

2024-03-06 06:19:00

by JonasZhou

Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

> On Mon, Feb 05, 2024 at 02:22:29PM +0800, JonasZhou wrote:
>
> As I expected, your test is exercising the contention case rather
> than the single, uncontended case. As such, your patch is simply
> optimising the structure layout for the contended case at the
> expense of an extra cacheline miss in the uncontended case.
>
> I'm not an mm expert, so I don't know which case we should optimise
> for.
>
> However, the existing code is not obviously wrong; it's just that
> your micro-benchmark exercises the pathological worst case for the
> optimisation choices made for this structure. Whether the contention
> case is worth optimising is the first decision that needs to be
> made; then people can decide whether hacking minor optimisations
> into the code is good enough or whether reworking the locking and/or
> algorithm to avoid the contention altogether is a better direction...

According to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=unixbench,
many people use UnixBench and submit optimization patches to Linux based
on its scores, so this is not just a micro-benchmark of my own.

When multiple processes open the same file, for example when they all
map libc.so, there will be contention.

> -Dave.
> --
> Dave Chinner
> [email protected]