2021-06-11 01:58:57

by Feng Tang

Subject: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

0day robot reported a 9.2% regression for the will-it-scale mmap1 test
case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast
from racing with COW during fork").

Further debugging shows the regression is due to that commit changing
the offset of the hot field 'mmap_lock' inside 'struct mm_struct',
which alters its cacheline alignment.

From the perf data, the contention on 'mmap_lock' is very severe and
takes around 95% of the cpu cycles; 'mmap_lock' is a rw_semaphore:

struct rw_semaphore {
	atomic_long_t count;			/* 8 bytes */
	atomic_long_t owner;			/* 8 bytes */
	struct optimistic_spin_queue osq;	/* spinner MCS lock */
	...

Before commit 57efa1fe5957 added 'write_protect_seq', the structure
happened to have a near-optimal cache alignment layout, as
Linus explained:

"and before the addition of the 'write_protect_seq' field, the
mmap_sem was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines,
and then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind
of layout you want.

Because first the rwsem_write_trylock() will do a cmpxchg on the
first cacheline (for the optimistic fast-path), and then in the
case of contention, rwsem_down_write_slowpath() will just access
the second cacheline.

Which is probably just optimal for a load that spends a lot of
time contended - new waiters touch that first cacheline, and then
they queue themselves up on the second cacheline."

After the commit, the rw_semaphore is at offset 128, which means
the 'count' and 'owner' fields now sit in the same cacheline,
causing more cache bouncing.
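
To illustrate the arithmetic, here is a small userspace sketch (not
kernel code; it just assumes 64-byte cachelines and the two offsets
discussed above, with 'count' at the start of the rwsem and 'owner'
8 bytes after it):

#include <stdio.h>

int main(void)
{
	/* mmap_lock offsets before/after commit 57efa1fe5957 */
	unsigned int offsets[] = { 120, 128 };

	for (int i = 0; i < 2; i++) {
		unsigned int base = offsets[i];

		printf("mmap_lock at %3u: count in cacheline %u, owner in cacheline %u\n",
		       base, base / 64, (base + 8) / 64);
	}
	return 0;
}

At offset 120 the two fields land in cachelines 1 and 2; at offset
128 they both land in cacheline 2.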

Currently there are 3 "#ifdef CONFIG_XXX" blocks before 'mmap_lock'
whose settings affect its offset:

CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

The layout above is for a 64-bit system with 0day's default kernel
config (similar to RHEL-8.3's config), in which all 3 of these options
are 'y'. The layout can vary with different kernel configs.

Re-laying-out a structure is usually a double-edged sword, as it can
help one case but hurt others. For this case there is a simple option:
since the newly added 'write_protect_seq' is a 4-byte seqcount_t (when
CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4-byte hole in
'mm_struct' does not change the alignment of any other field, while
fixing the regression.
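
As a quick sanity check of the "fill an existing hole" idea, here is a
userspace sketch with a made-up toy struct (not the real 'mm_struct';
assumes a 64-bit ABI where 'long' is 8-byte aligned):

#include <stdio.h>
#include <stddef.h>

struct toy_before {		/* 4-byte hole after 'a', because  */
	int a;			/* of the 8-byte alignment of 'b'  */
	long b;
	long c;
};

struct toy_after {
	int a;
	int filler;		/* dropped into the former hole    */
	long b;
	long c;
};

int main(void)
{
	printf("b: %zu -> %zu, c: %zu -> %zu, size: %zu -> %zu\n",
	       offsetof(struct toy_before, b), offsetof(struct toy_after, b),
	       offsetof(struct toy_before, c), offsetof(struct toy_after, c),
	       sizeof(struct toy_before), sizeof(struct toy_after));
	return 0;
}

The offsets of 'b' and 'c' and the total size stay the same, which is
why filling the hole should not disturb any other field.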

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Xu <[email protected]>
---
include/linux/mm_types.h | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..cba6022 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -460,6 +453,18 @@ struct mm_struct {
 		spinlock_t page_table_lock; /* Protects page tables and some
 					     * counters
 					     */
+		/*
+		 * With some kernel configs, the current mmap_lock's offset
+		 * inside 'mm_struct' is 120, which is nearly optimal, as
+		 * its two hot fields 'count' and 'owner' sit in 2 different
+		 * cachelines, and when mmap_lock is highly contended, both
+		 * fields will be accessed frequently; the current layout
+		 * helps to reduce cache bouncing.
+		 *
+		 * So please be careful when adding new fields before
+		 * mmap_lock, which can easily push the 2 fields into one
+		 * cacheline.
+		 */
 		struct rw_semaphore mmap_lock;
 
 		struct list_head mmlist; /* List of maybe swapped mm's. These
@@ -480,7 +485,15 @@ struct mm_struct {
 		unsigned long stack_vm;	   /* VM_STACK */
 		unsigned long def_flags;
 
+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
+
 		spinlock_t arg_lock; /* protect the below fields */
+
 		unsigned long start_code, end_code, start_data, end_data;
 		unsigned long start_brk, brk, start_stack;
 		unsigned long arg_start, arg_end, env_start, env_end;
--
2.7.4


2021-06-11 17:11:18

by Jason Gunthorpe

Subject: Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

On Fri, Jun 11, 2021 at 09:54:42AM +0800, Feng Tang wrote:
> 0day robot reported a 9.2% regression for the will-it-scale mmap1 test
> case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast
> from racing with COW during fork").
>
> Further debugging shows the regression is due to that commit changing
> the offset of the hot field 'mmap_lock' inside 'struct mm_struct',
> which alters its cacheline alignment.
>
> From the perf data, the contention on 'mmap_lock' is very severe and
> takes around 95% of the cpu cycles; 'mmap_lock' is a rw_semaphore:
>
> struct rw_semaphore {
> 	atomic_long_t count;			/* 8 bytes */
> 	atomic_long_t owner;			/* 8 bytes */
> 	struct optimistic_spin_queue osq;	/* spinner MCS lock */
> 	...
>
> Before commit 57efa1fe5957 added 'write_protect_seq', the structure
> happened to have a near-optimal cache alignment layout, as
> Linus explained:
>
> "and before the addition of the 'write_protect_seq' field, the
> mmap_sem was at offset 120 in 'struct mm_struct'.
>
> Which meant that count and owner were in two different cachelines,
> and then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind
> of layout you want.
>
> Because first the rwsem_write_trylock() will do a cmpxchg on the
> first cacheline (for the optimistic fast-path), and then in the
> case of contention, rwsem_down_write_slowpath() will just access
> the second cacheline.
>
> Which is probably just optimal for a load that spends a lot of
> time contended - new waiters touch that first cacheline, and then
> they queue themselves up on the second cacheline."
>
> After the commit, the rw_semaphore is at offset 128, which means
> the 'count' and 'owner' fields now sit in the same cacheline,
> causing more cache bouncing.
>
> Currently there are 3 "#ifdef CONFIG_XXX" blocks before 'mmap_lock'
> whose settings affect its offset:
>
> CONFIG_MMU
> CONFIG_MEMBARRIER
> CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
>
> The layout above is for a 64-bit system with 0day's default kernel
> config (similar to RHEL-8.3's config), in which all 3 of these options
> are 'y'. The layout can vary with different kernel configs.
>
> Re-laying-out a structure is usually a double-edged sword, as it can
> help one case but hurt others. For this case there is a simple option:
> since the newly added 'write_protect_seq' is a 4-byte seqcount_t (when
> CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4-byte hole in
> 'mm_struct' does not change the alignment of any other field, while
> fixing the regression.
>
> [1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> Reviewed-by: John Hubbard <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Xu <[email protected]>
> ---
> include/linux/mm_types.h | 27 ++++++++++++++++++++-------
> 1 file changed, 20 insertions(+), 7 deletions(-)

It seems OK to me, but didn't we earlier add has_pinned, which
would have changed the layout too? Are we chasing performance deltas
nobody cares about?

Still it is mechanically fine, so:

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2021-06-14 03:35:12

by Feng Tang

Subject: Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

Hi Jason,

On Fri, Jun 11, 2021 at 02:09:17PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 11, 2021 at 09:54:42AM +0800, Feng Tang wrote:
> > 0day robot reported a 9.2% regression for the will-it-scale mmap1 test
> > case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast
> > from racing with COW during fork").
> >
> > Further debugging shows the regression is due to that commit changing
> > the offset of the hot field 'mmap_lock' inside 'struct mm_struct',
> > which alters its cacheline alignment.
> >
> > From the perf data, the contention on 'mmap_lock' is very severe and
> > takes around 95% of the cpu cycles; 'mmap_lock' is a rw_semaphore:
> >
> > struct rw_semaphore {
> > 	atomic_long_t count;			/* 8 bytes */
> > 	atomic_long_t owner;			/* 8 bytes */
> > 	struct optimistic_spin_queue osq;	/* spinner MCS lock */
> > 	...
> >
> > Before commit 57efa1fe5957 added 'write_protect_seq', the structure
> > happened to have a near-optimal cache alignment layout, as
> > Linus explained:
> >
> > "and before the addition of the 'write_protect_seq' field, the
> > mmap_sem was at offset 120 in 'struct mm_struct'.
> >
> > Which meant that count and owner were in two different cachelines,
> > and then when you have contention and spend time in
> > rwsem_down_write_slowpath(), this is probably *exactly* the kind
> > of layout you want.
> >
> > Because first the rwsem_write_trylock() will do a cmpxchg on the
> > first cacheline (for the optimistic fast-path), and then in the
> > case of contention, rwsem_down_write_slowpath() will just access
> > the second cacheline.
> >
> > Which is probably just optimal for a load that spends a lot of
> > time contended - new waiters touch that first cacheline, and then
> > they queue themselves up on the second cacheline."
> >
> > After the commit, the rw_semaphore is at offset 128, which means
> > the 'count' and 'owner' fields now sit in the same cacheline,
> > causing more cache bouncing.
> >
> > Currently there are 3 "#ifdef CONFIG_XXX" blocks before 'mmap_lock'
> > whose settings affect its offset:
> >
> > CONFIG_MMU
> > CONFIG_MEMBARRIER
> > CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
> >
> > The layout above is for a 64-bit system with 0day's default kernel
> > config (similar to RHEL-8.3's config), in which all 3 of these options
> > are 'y'. The layout can vary with different kernel configs.
> >
> > Re-laying-out a structure is usually a double-edged sword, as it can
> > help one case but hurt others. For this case there is a simple option:
> > since the newly added 'write_protect_seq' is a 4-byte seqcount_t (when
> > CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4-byte hole in
> > 'mm_struct' does not change the alignment of any other field, while
> > fixing the regression.
> >
> > [1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> > Reported-by: kernel test robot <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > Reviewed-by: John Hubbard <[email protected]>
> > Cc: Jason Gunthorpe <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Peter Xu <[email protected]>
> > ---
> > include/linux/mm_types.h | 27 ++++++++++++++++++++-------
> > 1 file changed, 20 insertions(+), 7 deletions(-)
>
> It seems OK to me, but didn't we earlier add has_pinned, which
> would have changed the layout too? Are we chasing performance deltas
> nobody cares about?

Good point! I checked my email folder for 0day's reports, and haven't
found a report related to Peter's commit 008cfe4418b3 ("mm: Introduce
mm_struct.has_pinned"), which adds the 'has_pinned' field.

Will run the same test for it and report back.

> Still it is mechanically fine, so:
>
> Reviewed-by: Jason Gunthorpe <[email protected]>

Thanks for the review!

- Feng

> Jason

2021-06-15 03:22:55

by Feng Tang

Subject: Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

On Mon, Jun 14, 2021 at 11:27:39AM +0800, Feng Tang wrote:
> >
> > It seems OK to me, but didn't we earlier add has_pinned, which
> > would have changed the layout too? Are we chasing performance deltas
> > nobody cares about?
>
> Good point! I checked my email folder for 0day's reports, and haven't
> found a report related to Peter's commit 008cfe4418b3 ("mm: Introduce
> mm_struct.has_pinned"), which adds the 'has_pinned' field.
>
> Will run the same test for it and report back.

I ran the same will-it-scale/mmap1 case for Peter's commit 008cfe4418b3
and its parent commit, and there is no obvious performance difference:

a1bffa48745afbb5 008cfe4418b3dbda2ff820cdd7b
---------------- ---------------------------

344353 -0.4% 342929 will-it-scale.48.threads
7173 -0.4% 7144 will-it-scale.per_thread_ops

And from the pahole info for the 2 kernels, Peter's commit puts the
added 'has_pinned' into an existing 4-byte hole, so all the following
fields keep their alignment unchanged. Peter may have done it purposely
with the alignment in mind, so no performance change is expected.
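
(The layout dumps below are from pahole; on a vmlinux built with debug
info, something like "pahole -C mm_struct vmlinux" should produce this
kind of output, though the exact formatting may differ between pahole
versions.)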

Pahole info for kernel before 008cfe4418b3:

struct mm_struct {
        ...
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int     task_size;          /*  64   8 */
        long unsigned int     highest_vm_end;     /*  72   8 */
        pgd_t *               pgd;                /*  80   8 */
        atomic_t              membarrier_state;   /*  88   4 */
        atomic_t              mm_users;           /*  92   4 */
        atomic_t              mm_count;           /*  96   4 */

        /* XXX 4 bytes hole, try to pack */

        atomic_long_t         pgtables_bytes;     /* 104   8 */
        int                   map_count;          /* 112   4 */
        spinlock_t            page_table_lock;    /* 116   4 */
        struct rw_semaphore   mmap_lock;          /* 120  40 */
        /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */

Pahole info for kernel with 008cfe4418b3:

struct mm_struct {
        ...
        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int     task_size;          /*  64   8 */
        long unsigned int     highest_vm_end;     /*  72   8 */
        pgd_t *               pgd;                /*  80   8 */
        atomic_t              membarrier_state;   /*  88   4 */
        atomic_t              mm_users;           /*  92   4 */
        atomic_t              mm_count;           /*  96   4 */
        atomic_t              has_pinned;         /* 100   4 */
        atomic_long_t         pgtables_bytes;     /* 104   8 */
        int                   map_count;          /* 112   4 */
        spinlock_t            page_table_lock;    /* 116   4 */
        struct rw_semaphore   mmap_lock;          /* 120  40 */
        /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */

Thanks,
Feng



2021-06-15 18:54:42

by Peter Xu

Subject: Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

On Tue, Jun 15, 2021 at 09:11:03AM +0800, Feng Tang wrote:
> On Mon, Jun 14, 2021 at 11:27:39AM +0800, Feng Tang wrote:
> > >
> > > It seems OK to me, but didn't we earlier add has_pinned, which
> > > would have changed the layout too? Are we chasing performance deltas
> > > nobody cares about?
> >
> > Good point! I checked my email folder for 0day's reports, and haven't
> > found a report related to Peter's commit 008cfe4418b3 ("mm: Introduce
> > mm_struct.has_pinned"), which adds the 'has_pinned' field.
> >
> > Will run the same test for it and report back.
>
> I ran the same will-it-scale/mmap1 case for Peter's commit 008cfe4418b3
> and its parent commit, and there is no obvious performance difference:
>
> a1bffa48745afbb5 008cfe4418b3dbda2ff820cdd7b
> ---------------- ---------------------------
>
> 344353 -0.4% 342929 will-it-scale.48.threads
> 7173 -0.4% 7144 will-it-scale.per_thread_ops
>
> And from the pahole info for the 2 kernels, Peter's commit puts the
> added 'has_pinned' into an existing 4-byte hole, so all the following
> fields keep their alignment unchanged. Peter may have done it purposely
> with the alignment in mind, so no performance change is expected.

Thanks for verifying this. I didn't do it on purpose, at least for the initial
version, but I do remember some comment about filling up that hole, so it may
have got moved around.

Also note that if nothing goes wrong, has_pinned will be gone in the next
release with commit 3c0c4cda6d48 ("mm: gup: pack has_pinned in MMF_HAS_PINNED",
2021-05-26); it's in -mm-next but hasn't reached the main branch yet. So then I
think the 4-byte hole should come back to us again, with no perf loss either.

What I'm thinking is whether we should move some important (and especially
CONFIG_*-independent) fields to the top of the whole struct definition, make
sure they're laid out optimally for the common workload, and keep them static.
Then there'll be less or no possibility that some new field regresses a common
workload by accident. Not sure whether it makes sense to do so.
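
As a toy illustration of the idea (struct and names are made up, not
the real mm_struct; building with and without -DTOY_CONFIG_FOO shows
the hot offsets staying put):

#include <stdio.h>
#include <stddef.h>

struct toy_mm {
	/* hot, config-independent part pinned at the top */
	long lock_count;	/* stands in for mmap_lock.count */
	long lock_owner;	/* stands in for mmap_lock.owner */

	/* config-dependent / colder fields only below this point */
#ifdef TOY_CONFIG_FOO
	long foo_bytes;
#endif
	long colder_stuff;
};

int main(void)
{
	printf("lock_count at %zu, lock_owner at %zu\n",
	       offsetof(struct toy_mm, lock_count),
	       offsetof(struct toy_mm, lock_owner));
	return 0;
}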

Thanks,

--
Peter Xu

2021-06-16 01:52:59

by Feng Tang

Subject: Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

Hi Peter,

On Tue, Jun 15, 2021 at 02:52:49PM -0400, Peter Xu wrote:
> On Tue, Jun 15, 2021 at 09:11:03AM +0800, Feng Tang wrote:
> > On Mon, Jun 14, 2021 at 11:27:39AM +0800, Feng Tang wrote:
> > > >
> > > > It seems OK to me, but didn't we earlier add has_pinned, which
> > > > would have changed the layout too? Are we chasing performance deltas
> > > > nobody cares about?
> > >
> > > Good point! I checked my email folder for 0day's reports, and haven't
> > > found a report related to Peter's commit 008cfe4418b3 ("mm: Introduce
> > > mm_struct.has_pinned"), which adds the 'has_pinned' field.
> > >
> > > Will run the same test for it and report back.
> >
> > I ran the same will-it-scale/mmap1 case for Peter's commit 008cfe4418b3
> > and its parent commit, and there is no obvious performance difference:
> >
> > a1bffa48745afbb5 008cfe4418b3dbda2ff820cdd7b
> > ---------------- ---------------------------
> >
> > 344353 -0.4% 342929 will-it-scale.48.threads
> > 7173 -0.4% 7144 will-it-scale.per_thread_ops
> >
> > And from the pahole info for the 2 kernels, Peter's commit puts the
> > added 'has_pinned' into an existing 4-byte hole, so all the following
> > fields keep their alignment unchanged. Peter may have done it purposely
> > with the alignment in mind, so no performance change is expected.
>
> Thanks for verifying this. I didn't do it on purpose, at least for the initial
> version, but I do remember some comment about filling up that hole, so it may
> have got moved around.
>
> Also note that if nothing goes wrong, has_pinned will be gone in the next
> release with commit 3c0c4cda6d48 ("mm: gup: pack has_pinned in MMF_HAS_PINNED",
> 2021-05-26); it's in -mm-next but hasn't reached the main branch yet. So then I
> think the 4-byte hole should come back to us again, with no perf loss either.

Thanks for the heads up.

> What I'm thinking is whether we should move some important (and especially
> CONFIG_*-independent) fields to the top of the whole struct definition, make
> sure they're laid out optimally for the common workload, and keep them static.
> Then there'll be less or no possibility that some new field regresses a common
> workload by accident. Not sure whether it makes sense to do so.

Yep, it makes sense to me, as it makes the alignment more predictable and
controllable. But usually we dare not move the fields around, as it could
cause improvements to some cases and regressions to others, given that
different benchmarks can see different hotspots. And most of our patches
changing a data structure's layout are regression driven :)

Thanks,
Feng

> Thanks,
>
> --
> Peter Xu