2023-06-06 12:32:16

by Yu Ma

[permalink] [raw]
Subject: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. When running multiple parallel
tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
come from load_elf_binary through the call chain
"execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
call mmap_region to create vma node, initialize it and insert it to vma
maintain structure in mm_struct and i_mmap tree of the mapping file, then
increase map_count to record the number of vma nodes used. The hot osq_lock
is to protect operations on file’s i_mmap tree. For the mm_struct member
change like vma insertion and map_count update, they do not affect i_mmap
tree. Move those operations out of the lock's critical section, to reduce
hold time on the lock.

With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
obvious performance gain on v6.4-rc4 due to regression of this benchmark
from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
mm's rss stats into percpu_counter). Related discussion and conclusion
can be referred at the mail thread initiated by 0day as below:
Link: https://lore.kernel.org/linux-mm/[email protected]/

Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
mm/mmap.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 13678edaa22c..0e694a0433bc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2711,12 +2711,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (vma_iter_prealloc(&vmi))
goto close_and_free_vma;

- if (vma->vm_file)
- i_mmap_lock_write(vma->vm_file->f_mapping);
-
vma_iter_store(&vmi, vma);
mm->map_count++;
if (vma->vm_file) {
+ i_mmap_lock_write(vma->vm_file->f_mapping);
if (vma->vm_flags & VM_SHARED)
mapping_allow_writable(vma->vm_file->f_mapping);

--
2.39.3



2023-06-06 19:30:28

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

* Yu Ma <[email protected]> [230606 08:23]:
> UnixBench/Execl represents a class of workload where bash scripts are
> spawned frequently to do some short jobs. When running multiple parallel
> tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
> come from load_elf_binary through the call chain
> "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
> call mmap_region to create vma node, initialize it and insert it to vma
> maintain structure in mm_struct and i_mmap tree of the mapping file, then
> increase map_count to record the number of vma nodes used. The hot osq_lock
> is to protect operations on file’s i_mmap tree. For the mm_struct member
> change like vma insertion and map_count update, they do not affect i_mmap
> tree. Move those operations out of the lock's critical section, to reduce
> hold time on the lock.
>
> With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> obvious performance gain on v6.4-rc4 due to regression of this benchmark
> from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
> mm's rss stats into percpu_counter).

I didn't think it was safe to insert a VMA into the VMA tree without
holding this write lock? We now have a window of time where a file
mapping doesn't exist for a vma that's in the tree? Is this always
safe? Does the locking order in mm/rmap.c need to change?

>Related discussion and conclusion
> can be referred at the mail thread initiated by 0day as below:
> Link: https://lore.kernel.org/linux-mm/[email protected]/

I don't see a conclusion on that thread talking about changing the
locking order?

>
> Reviewed-by: Tim Chen <[email protected]>
> Signed-off-by: Yu Ma <[email protected]>
> ---
> mm/mmap.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 13678edaa22c..0e694a0433bc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2711,12 +2711,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> if (vma_iter_prealloc(&vmi))
> goto close_and_free_vma;
>
> - if (vma->vm_file)
> - i_mmap_lock_write(vma->vm_file->f_mapping);
> -
> vma_iter_store(&vmi, vma);
> mm->map_count++;
> if (vma->vm_file) {
> + i_mmap_lock_write(vma->vm_file->f_mapping);
> if (vma->vm_flags & VM_SHARED)
> mapping_allow_writable(vma->vm_file->f_mapping);
>
> --
> 2.39.3
>
>

2023-06-06 20:22:12

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

* Liam R. Howlett <[email protected]> [230606 15:20]:
> * Yu Ma <[email protected]> [230606 08:23]:
> > UnixBench/Execl represents a class of workload where bash scripts are
> > spawned frequently to do some short jobs. When running multiple parallel
> > tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
> > come from load_elf_binary through the call chain
> > "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
> > call mmap_region to create vma node, initialize it and insert it to vma
> > maintain structure in mm_struct and i_mmap tree of the mapping file, then
> > increase map_count to record the number of vma nodes used. The hot osq_lock
> > is to protect operations on file’s i_mmap tree. For the mm_struct member
> > change like vma insertion and map_count update, they do not affect i_mmap
> > tree. Move those operations out of the lock's critical section, to reduce
> > hold time on the lock.
> >
> > With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> > v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> > obvious performance gain on v6.4-rc4 due to regression of this benchmark
> > from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
> > mm's rss stats into percpu_counter).
>
> I didn't think it was safe to insert a VMA into the VMA tree without
> holding this write lock? We now have a window of time where a file
> mapping doesn't exist for a vma that's in the tree? Is this always
> safe? Does the locking order in mm/rmap.c need to change?

So I'm pretty sure it's not safe because we've been ensuring that this
lock was taken during vma tree inserts since 2002 [1]. Take a look at
vma_link() in that commit. I still don't have an answer as to why it's
not safe though.

[1] https://github.com/mpe/linux-fullhistory/commit/bbbce8f41d3da0ac968bab7a967e12e2be1a7eb0

>
> >Related discussion and conclusion
> > can be referred at the mail thread initiated by 0day as below:
> > Link: https://lore.kernel.org/linux-mm/[email protected]/
>
> I don't see a conclusion on that thread talking about changing the
> locking order?
>
> >
> > Reviewed-by: Tim Chen <[email protected]>
> > Signed-off-by: Yu Ma <[email protected]>
> > ---
> > mm/mmap.c | 4 +---
> > 1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 13678edaa22c..0e694a0433bc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2711,12 +2711,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > if (vma_iter_prealloc(&vmi))
> > goto close_and_free_vma;
> >
> > - if (vma->vm_file)
> > - i_mmap_lock_write(vma->vm_file->f_mapping);
> > -
> > vma_iter_store(&vmi, vma);
> > mm->map_count++;
> > if (vma->vm_file) {
> > + i_mmap_lock_write(vma->vm_file->f_mapping);
> > if (vma->vm_flags & VM_SHARED)
> > mapping_allow_writable(vma->vm_file->f_mapping);
> >
> > --
> > 2.39.3
> >
> >

2023-06-07 13:15:57

by Yu Ma

[permalink] [raw]
Subject: RE: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

> * Liam R. Howlett <[email protected]> [230606 15:20]:
> > * Yu Ma <[email protected]> [230606 08:23]:
> > > UnixBench/Execl represents a class of workload where bash scripts
> > > are spawned frequently to do some short jobs. When running multiple
> > > parallel tasks, hot osq_lock is observed from do_mmap and exit_mmap.
> > > Both of them come from load_elf_binary through the call chain
> > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In
> > > do_mmap,it will call mmap_region to create vma node, initialize it
> > > and insert it to vma maintain structure in mm_struct and i_mmap tree
> > > of the mapping file, then increase map_count to record the number of
> > > vma nodes used. The hot osq_lock is to protect operations on file’s
> > > i_mmap tree. For the mm_struct member change like vma insertion and
> > > map_count update, they do not affect i_mmap tree. Move those
> > > operations out of the lock's critical section, to reduce hold time on the
> lock.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > > on v6.0-rc6, the 160 parallel score improves by 12%. The patch has
> > > no obvious performance gain on v6.4-rc4 due to regression of this
> > > benchmark from this commit
> f1a7941243c102a44e8847e3b94ff4ff3ec56f25
> > > (mm: convert mm's rss stats into percpu_counter).
> >
> > I didn't think it was safe to insert a VMA into the VMA tree without
> > holding this write lock? We now have a window of time where a file
> > mapping doesn't exist for a vma that's in the tree? Is this always
> > safe? Does the locking order in mm/rmap.c need to change?
>
> So I'm pretty sure it's not safe because we've been ensuring that this lock
> was taken during vma tree inserts since 2002 [1]. Take a look at
> vma_link() in that commit. I still don't have an answer as to why it's not safe
> though.
>
> [1] https://github.com/mpe/linux-
> fullhistory/commit/bbbce8f41d3da0ac968bab7a967e12e2be1a7eb0
>

Thanks Liam for your quick review and digging in related code. I just checked vma_link() in 2002, the file lock is there to protect __vma_link(), and in __vma_link(), there are 3 functions, the first 2 are operations to insert vma to mm_struct, and the last func __vma_link_file() is to insert vma to the file mapping tree. So this file lock around __vma_link() makes sense since it has operations of file mapping tree inside. It still cannot explain why the operations to mm_struct cannot be moved out.

> >
> > >Related discussion and conclusion
> > > can be referred at the mail thread initiated by 0day as below:
> > > Link:
> > >https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-
> 7e8fb108def0
> > >@intel.com/
> >
> > I don't see a conclusion on that thread talking about changing the
> > locking order?
I may not intro this link clear, it is about why no obvious improvement observed on latest kernel for the time being :)

> >
> > >
> > > Reviewed-by: Tim Chen <[email protected]>
> > > Signed-off-by: Yu Ma <[email protected]>
> > > ---
> > > mm/mmap.c | 4 +---
> > > 1 file changed, 1 insertion(+), 3 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 13678edaa22c..0e694a0433bc 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2711,12 +2711,10 @@ unsigned long mmap_region(struct file *file,
> unsigned long addr,
> > > if (vma_iter_prealloc(&vmi))
> > > goto close_and_free_vma;
> > >
> > > - if (vma->vm_file)
> > > - i_mmap_lock_write(vma->vm_file->f_mapping);
> > > -
> > > vma_iter_store(&vmi, vma);
> > > mm->map_count++;
> > > if (vma->vm_file) {
> > > + i_mmap_lock_write(vma->vm_file->f_mapping);
> > > if (vma->vm_flags & VM_SHARED)
> > > mapping_allow_writable(vma->vm_file->f_mapping);
> > >
> > > --
> > > 2.39.3
> > >
> > >

2023-06-21 14:49:36

by Yu Ma

[permalink] [raw]
Subject: RE: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

+ Christoph Hellwig in To list

Hi Chris,
Sorry that it is pretty long ago, do we have any idea about why put vma operations to mm struct in the critical section of file mapping lock (i.e vma_link() as below, the operations to mm struct like __vma_link_list and __vma_link_rb is protected by mapping lock). We'd like to study more here if possible, sincerely appreciate for any hint or feedback.
https://github.com/mpe/linux-fullhistory/commit/bbbce8f41d3da0ac968bab7a967e12e2be1a7eb0


Regards
Yu

> > * Liam R. Howlett <[email protected]> [230606 15:20]:
> > > * Yu Ma <[email protected]> [230606 08:23]:
> > > > UnixBench/Execl represents a class of workload where bash scripts
> > > > are spawned frequently to do some short jobs. When running
> > > > multiple parallel tasks, hot osq_lock is observed from do_mmap and
> exit_mmap.
> > > > Both of them come from load_elf_binary through the call chain
> > > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In
> > > > do_mmap,it will call mmap_region to create vma node, initialize it
> > > > and insert it to vma maintain structure in mm_struct and i_mmap
> > > > tree of the mapping file, then increase map_count to record the
> > > > number of vma nodes used. The hot osq_lock is to protect
> > > > operations on file’s i_mmap tree. For the mm_struct member change
> > > > like vma insertion and map_count update, they do not affect i_mmap
> > > > tree. Move those operations out of the lock's critical section, to
> > > > reduce hold time on the
> > lock.
> > > >
> > > > With this change, on Intel Sapphire Rapids 112C/224T platform,
> > > > based on v6.0-rc6, the 160 parallel score improves by 12%. The
> > > > patch has no obvious performance gain on v6.4-rc4 due to
> > > > regression of this benchmark from this commit
> > f1a7941243c102a44e8847e3b94ff4ff3ec56f25
> > > > (mm: convert mm's rss stats into percpu_counter).
> > >
> > > I didn't think it was safe to insert a VMA into the VMA tree without
> > > holding this write lock? We now have a window of time where a file
> > > mapping doesn't exist for a vma that's in the tree? Is this always
> > > safe? Does the locking order in mm/rmap.c need to change?
> >
> > So I'm pretty sure it's not safe because we've been ensuring that this
> > lock was taken during vma tree inserts since 2002 [1]. Take a look at
> > vma_link() in that commit. I still don't have an answer as to why
> > it's not safe though.
> >
> > [1] https://github.com/mpe/linux-
> > fullhistory/commit/bbbce8f41d3da0ac968bab7a967e12e2be1a7eb0
> >
>
> Thanks Liam for your quick review and digging in related code. I just checked
> vma_link() in 2002, the file lock is there to protect __vma_link(), and in
> __vma_link(), there are 3 functions, the first 2 are operations to insert vma to
> mm_struct, and the last func __vma_link_file() is to insert vma to the file
> mapping tree. So this file lock around __vma_link() makes sense since it has
> operations of file mapping tree inside. It still cannot explain why the
> operations to mm_struct cannot be moved out.
>
> > >
> > > >Related discussion and conclusion
> > > > can be referred at the mail thread initiated by 0day as below:
> > > > Link:
> > > >https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-
> > 7e8fb108def0
> > > >@intel.com/
> > >
> > > I don't see a conclusion on that thread talking about changing the
> > > locking order?
> I may not intro this link clear, it is about why no obvious improvement
> observed on latest kernel for the time being :)
>
> > >
> > > >
> > > > Reviewed-by: Tim Chen <[email protected]>
> > > > Signed-off-by: Yu Ma <[email protected]>
> > > > ---
> > > > mm/mmap.c | 4 +---
> > > > 1 file changed, 1 insertion(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c index
> > > > 13678edaa22c..0e694a0433bc 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2711,12 +2711,10 @@ unsigned long mmap_region(struct file
> > > > *file,
> > unsigned long addr,
> > > > if (vma_iter_prealloc(&vmi))
> > > > goto close_and_free_vma;
> > > >
> > > > - if (vma->vm_file)
> > > > - i_mmap_lock_write(vma->vm_file->f_mapping);
> > > > -
> > > > vma_iter_store(&vmi, vma);
> > > > mm->map_count++;
> > > > if (vma->vm_file) {
> > > > + i_mmap_lock_write(vma->vm_file->f_mapping);
> > > > if (vma->vm_flags & VM_SHARED)
> > > > mapping_allow_writable(vma->vm_file->f_mapping);
> > > >
> > > > --
> > > > 2.39.3
> > > >
> > > >

2023-07-05 17:09:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

On Tue, Jun 06, 2023 at 03:20:13PM -0400, Liam R. Howlett wrote:
> * Yu Ma <[email protected]> [230606 08:23]:
> > UnixBench/Execl represents a class of workload where bash scripts are
> > spawned frequently to do some short jobs. When running multiple parallel
> > tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
> > come from load_elf_binary through the call chain
> > "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
> > call mmap_region to create vma node, initialize it and insert it to vma
> > maintain structure in mm_struct and i_mmap tree of the mapping file, then
> > increase map_count to record the number of vma nodes used. The hot osq_lock
> > is to protect operations on file’s i_mmap tree. For the mm_struct member
> > change like vma insertion and map_count update, they do not affect i_mmap
> > tree. Move those operations out of the lock's critical section, to reduce
> > hold time on the lock.
> >
> > With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> > v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> > obvious performance gain on v6.4-rc4 due to regression of this benchmark
> > from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
> > mm's rss stats into percpu_counter).
>
> I didn't think it was safe to insert a VMA into the VMA tree without
> holding this write lock? We now have a window of time where a file
> mapping doesn't exist for a vma that's in the tree? Is this always
> safe? Does the locking order in mm/rmap.c need to change?

We hold mmap lock on write here, right? Who can observe the VMA until the
lock is released?

It cannot be retrieved from the VMA tree as it requires at least read mmap
lock. And the VMA doesn't exist anywhere else.

I believe the change is safe.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-07-05 18:18:33

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

* Kirill A. Shutemov <[email protected]> [230705 12:54]:
> On Tue, Jun 06, 2023 at 03:20:13PM -0400, Liam R. Howlett wrote:
> > * Yu Ma <[email protected]> [230606 08:23]:
> > > UnixBench/Execl represents a class of workload where bash scripts are
> > > spawned frequently to do some short jobs. When running multiple parallel
> > > tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
> > > come from load_elf_binary through the call chain
> > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
> > > call mmap_region to create vma node, initialize it and insert it to vma
> > > maintain structure in mm_struct and i_mmap tree of the mapping file, then
> > > increase map_count to record the number of vma nodes used. The hot osq_lock
> > > is to protect operations on file’s i_mmap tree. For the mm_struct member
> > > change like vma insertion and map_count update, they do not affect i_mmap
> > > tree. Move those operations out of the lock's critical section, to reduce
> > > hold time on the lock.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> > > v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> > > obvious performance gain on v6.4-rc4 due to regression of this benchmark
> > > from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
> > > mm's rss stats into percpu_counter).
> >
> > I didn't think it was safe to insert a VMA into the VMA tree without
> > holding this write lock? We now have a window of time where a file
> > mapping doesn't exist for a vma that's in the tree? Is this always
> > safe? Does the locking order in mm/rmap.c need to change?
>
> We hold mmap lock on write here, right?

Yes.

>Who can observe the VMA until the
> lock is released?

With CONFIG_PER_VMA_LOCK we can have the VMA read under the rcu
read lock for page faults from the tree. I am not sure if the vma is
initialized to avoid page fault issues - vma_start_write() should either
be taken or initialise the vma as this is the case.

There is also a possibility of a driver mapping a VMA and having entry
points from other locations. It isn't accessed through the tree though
so I don't think this change will introduce new races?

>
> It cannot be retrieved from the VMA tree as it requires at least read mmap
> lock. And the VMA doesn't exist anywhere else.
>
> I believe the change is safe.

I guess insert_vm_struct(), and vma_link() callers should be checked and
updated accordingly?

Thanks,
Liam

2023-07-05 19:43:56

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

On Wed, Jul 05, 2023 at 01:33:48PM -0400, Liam R. Howlett wrote:
> * Kirill A. Shutemov <[email protected]> [230705 12:54]:
> > On Tue, Jun 06, 2023 at 03:20:13PM -0400, Liam R. Howlett wrote:
> > > * Yu Ma <[email protected]> [230606 08:23]:
> > > > UnixBench/Execl represents a class of workload where bash scripts are
> > > > spawned frequently to do some short jobs. When running multiple parallel
> > > > tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
> > > > come from load_elf_binary through the call chain
> > > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
> > > > call mmap_region to create vma node, initialize it and insert it to vma
> > > > maintain structure in mm_struct and i_mmap tree of the mapping file, then
> > > > increase map_count to record the number of vma nodes used. The hot osq_lock
> > > > is to protect operations on file’s i_mmap tree. For the mm_struct member
> > > > change like vma insertion and map_count update, they do not affect i_mmap
> > > > tree. Move those operations out of the lock's critical section, to reduce
> > > > hold time on the lock.
> > > >
> > > > With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> > > > v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> > > > obvious performance gain on v6.4-rc4 due to regression of this benchmark
> > > > from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
> > > > mm's rss stats into percpu_counter).
> > >
> > > I didn't think it was safe to insert a VMA into the VMA tree without
> > > holding this write lock? We now have a window of time where a file
> > > mapping doesn't exist for a vma that's in the tree? Is this always
> > > safe? Does the locking order in mm/rmap.c need to change?
> >
> > We hold mmap lock on write here, right?
>
> Yes.
>
> >Who can observe the VMA until the
> > lock is released?
>
> With CONFIG_PER_VMA_LOCK we can have the VMA read under the rcu
> read lock for page faults from the tree. I am not sure if the vma is
> initialized to avoid page fault issues - vma_start_write() should either
> be taken or initialise the vma as this is the case.

Right, with CONFIG_PER_VMA_LOCK the vma has to be unusable until it is
fully initialized, effectively providing the same guarantees as mmap write
lock. If it is not the case, it is CONFIG_PER_VMA_LOCK bug.

> There is also a possibility of a driver mapping a VMA and having entry
> points from other locations. It isn't accessed through the tree though
> so I don't think this change will introduce new races?

Right.

> > It cannot be retrieved from the VMA tree as it requires at least read mmap
> > lock. And the VMA doesn't exist anywhere else.
> >
> > I believe the change is safe.
>
> I guess insert_vm_struct(), and vma_link() callers should be checked and
> updated accordingly?

Yep.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-07-06 15:43:51

by Yu Ma

[permalink] [raw]
Subject: RE: [PATCH] mm/mmap: move vmKirill A. Shutemov <[email protected]>a operations to mm_struct out of the critical section of file mapping lock

> -----Original Message-----
> From: Kirill A. Shutemov <[email protected]>
> Sent: Thursday, July 6, 2023 3:06 AM
> To: Liam R. Howlett <[email protected]>; Ma, Yu
> <[email protected]>; [email protected]; Chen, Tim C
> <[email protected]>; [email protected]; [email protected];
> Hansen, Dave <[email protected]>; Williams, Dan J
> <[email protected]>; [email protected]; Deng, Pan
> <[email protected]>; Li, Tianyou <[email protected]>; Zhu, Lipeng
> <[email protected]>; [email protected]
> Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of
> the critical section of file mapping lock
>
> On Wed, Jul 05, 2023 at 01:33:48PM -0400, Liam R. Howlett wrote:
> > * Kirill A. Shutemov <[email protected]> [230705 12:54]:
> > > On Tue, Jun 06, 2023 at 03:20:13PM -0400, Liam R. Howlett wrote:
> > > > * Yu Ma <[email protected]> [230606 08:23]:
> > > > > UnixBench/Execl represents a class of workload where bash
> > > > > scripts are spawned frequently to do some short jobs. When
> > > > > running multiple parallel tasks, hot osq_lock is observed from
> > > > > do_mmap and exit_mmap. Both of them come from load_elf_binary
> > > > > through the call chain
> > > > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In
> > > > > do_mmap,it will call mmap_region to create vma node, initialize
> > > > > it and insert it to vma maintain structure in mm_struct and
> > > > > i_mmap tree of the mapping file, then increase map_count to
> > > > > record the number of vma nodes used. The hot osq_lock is to
> > > > > protect operations on file’s i_mmap tree. For the mm_struct
> > > > > member change like vma insertion and map_count update, they do
> not affect i_mmap tree. Move those operations out of the lock's critical
> section, to reduce hold time on the lock.
> > > > >
> > > > > With this change, on Intel Sapphire Rapids 112C/224T platform,
> > > > > based on v6.0-rc6, the 160 parallel score improves by 12%. The
> > > > > patch has no obvious performance gain on v6.4-rc4 due to
> > > > > regression of this benchmark from this commit
> > > > > f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert mm's rss
> stats into percpu_counter).
> > > >
> > > > I didn't think it was safe to insert a VMA into the VMA tree
> > > > without holding this write lock? We now have a window of time
> > > > where a file mapping doesn't exist for a vma that's in the tree?
> > > > Is this always safe? Does the locking order in mm/rmap.c need to
> change?
> > >
> > > We hold mmap lock on write here, right?
> >
> > Yes.
> >
> > >Who can observe the VMA until the
> > > lock is released?
> >
> > With CONFIG_PER_VMA_LOCK we can have the VMA read under the rcu
> read
> > lock for page faults from the tree. I am not sure if the vma is
> > initialized to avoid page fault issues - vma_start_write() should
> > either be taken or initialise the vma as this is the case.
>
> Right, with CONFIG_PER_VMA_LOCK the vma has to be unusable until it is
> fully initialized, effectively providing the same guarantees as mmap write lock.
> If it is not the case, it is CONFIG_PER_VMA_LOCK bug.
>
> > There is also a possibility of a driver mapping a VMA and having entry
> > points from other locations. It isn't accessed through the tree
> > though so I don't think this change will introduce new races?
>
> Right.
>
> > > It cannot be retrieved from the VMA tree as it requires at least
> > > read mmap lock. And the VMA doesn't exist anywhere else.
> > >
> > > I believe the change is safe.
> >
> > I guess insert_vm_struct(), and vma_link() callers should be checked
> > and updated accordingly?
>
> Yep.
>
Thanks Kirill and Liam, I'll update patch and send out for review.


Regards
Yu

2023-07-06 22:22:00

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

On Wed, Jul 5, 2023 at 12:06 PM Kirill A. Shutemov <[email protected]> wrote:
>
> On Wed, Jul 05, 2023 at 01:33:48PM -0400, Liam R. Howlett wrote:
> > * Kirill A. Shutemov <[email protected]> [230705 12:54]:
> > > On Tue, Jun 06, 2023 at 03:20:13PM -0400, Liam R. Howlett wrote:
> > > > * Yu Ma <[email protected]> [230606 08:23]:
> > > > > UnixBench/Execl represents a class of workload where bash scripts are
> > > > > spawned frequently to do some short jobs. When running multiple parallel
> > > > > tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
> > > > > come from load_elf_binary through the call chain
> > > > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
> > > > > call mmap_region to create vma node, initialize it and insert it to vma
> > > > > maintain structure in mm_struct and i_mmap tree of the mapping file, then
> > > > > increase map_count to record the number of vma nodes used. The hot osq_lock
> > > > > is to protect operations on file’s i_mmap tree. For the mm_struct member
> > > > > change like vma insertion and map_count update, they do not affect i_mmap
> > > > > tree. Move those operations out of the lock's critical section, to reduce
> > > > > hold time on the lock.
> > > > >
> > > > > With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> > > > > v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> > > > > obvious performance gain on v6.4-rc4 due to regression of this benchmark
> > > > > from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
> > > > > mm's rss stats into percpu_counter).
> > > >
> > > > I didn't think it was safe to insert a VMA into the VMA tree without
> > > > holding this write lock? We now have a window of time where a file
> > > > mapping doesn't exist for a vma that's in the tree? Is this always
> > > > safe? Does the locking order in mm/rmap.c need to change?
> > >
> > > We hold mmap lock on write here, right?
> >
> > Yes.
> >
> > >Who can observe the VMA until the
> > > lock is released?
> >
> > With CONFIG_PER_VMA_LOCK we can have the VMA read under the rcu
> > read lock for page faults from the tree. I am not sure if the vma is
> > initialized to avoid page fault issues - vma_start_write() should either
> > be taken or initialise the vma as this is the case.
>
> Right, with CONFIG_PER_VMA_LOCK the vma has to be unusable until it is
> fully initialized, effectively providing the same guarantees as mmap write
> lock. If it is not the case, it is CONFIG_PER_VMA_LOCK bug.

Jumping into the conversation.

If we are adding a VMA into the tree before it's fully usable then we
should write-lock it before it becomes visible to page faults. Kirill
is right that there is a problem and we should not rely on
vma->vm_file->f_mapping lock here. Instead we should write-lock the
VMA before adding it into the tree even without this change.
IIUC, the rule with mmap_lock is that VMA can be safely modified if it
is either isolated or if we write-locked the mmap_lock. With
CONFIG_PER_VMA_LOCK the same rule should apply to per-VMA locks - the
VMA should be either isolated or we should write-lock it. Here we are
adding the unlocked VMA into the tree and then we keep modifying it.
This did not bite us because modifications are only done to
file-backed VMAs and we do not handle file-backed page faults under
per-VMA locks yet, however this will become a problem once we start
supporting that.
If we all agree to the above I can post a change to lock the VMA
before adding it into the tree.

>
> > There is also a possibility of a driver mapping a VMA and having entry
> > points from other locations. It isn't accessed through the tree though
> > so I don't think this change will introduce new races?
>
> Right.
>
> > > It cannot be retrieved from the VMA tree as it requires at least read mmap
> > > lock. And the VMA doesn't exist anywhere else.
> > >
> > > I believe the change is safe.
> >
> > I guess insert_vm_struct(), and vma_link() callers should be checked and
> > updated accordingly?
>
> Yep.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
>

2023-07-07 04:53:39

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

On Thu, Jul 6, 2023 at 3:04 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Wed, Jul 5, 2023 at 12:06 PM Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Wed, Jul 05, 2023 at 01:33:48PM -0400, Liam R. Howlett wrote:
> > > * Kirill A. Shutemov <[email protected]> [230705 12:54]:
> > > > On Tue, Jun 06, 2023 at 03:20:13PM -0400, Liam R. Howlett wrote:
> > > > > * Yu Ma <[email protected]> [230606 08:23]:
> > > > > > UnixBench/Execl represents a class of workload where bash scripts are
> > > > > > spawned frequently to do some short jobs. When running multiple parallel
> > > > > > tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
> > > > > > come from load_elf_binary through the call chain
> > > > > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
> > > > > > call mmap_region to create vma node, initialize it and insert it to vma
> > > > > > maintain structure in mm_struct and i_mmap tree of the mapping file, then
> > > > > > increase map_count to record the number of vma nodes used. The hot osq_lock
> > > > > > is to protect operations on file’s i_mmap tree. For the mm_struct member
> > > > > > change like vma insertion and map_count update, they do not affect i_mmap
> > > > > > tree. Move those operations out of the lock's critical section, to reduce
> > > > > > hold time on the lock.
> > > > > >
> > > > > > With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> > > > > > v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> > > > > > obvious performance gain on v6.4-rc4 due to regression of this benchmark
> > > > > > from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
> > > > > > mm's rss stats into percpu_counter).
> > > > >
> > > > > I didn't think it was safe to insert a VMA into the VMA tree without
> > > > > holding this write lock? We now have a window of time where a file
> > > > > mapping doesn't exist for a vma that's in the tree? Is this always
> > > > > safe? Does the locking order in mm/rmap.c need to change?
> > > >
> > > > We hold mmap lock on write here, right?
> > >
> > > Yes.
> > >
> > > >Who can observe the VMA until the
> > > > lock is released?
> > >
> > > With CONFIG_PER_VMA_LOCK we can have the VMA read under the rcu
> > > read lock for page faults from the tree. I am not sure if the vma is
> > > initialized to avoid page fault issues - vma_start_write() should either
> > > be taken or initialise the vma as this is the case.
> >
> > Right, with CONFIG_PER_VMA_LOCK the vma has to be unusable until it is
> > fully initialized, effectively providing the same guarantees as mmap write
> > lock. If it is not the case, it is CONFIG_PER_VMA_LOCK bug.
>
> Jumping into the conversation.
>
> If we are adding a VMA into the tree before it's fully usable then we
> should write-lock it before it becomes visible to page faults. Kirill
> is right that there is a problem and we should not rely on
> vma->vm_file->f_mapping lock here. Instead we should write-lock the
> VMA before adding it into the tree even without this change.
> IIUC, the rule with mmap_lock is that VMA can be safely modified if it
> is either isolated or if we write-locked the mmap_lock. With
> CONFIG_PER_VMA_LOCK the same rule should apply to per-VMA locks - the
> VMA should be either isolated or we should write-lock it. Here we are
> adding the unlocked VMA into the tree and then we keep modifying it.
> This did not bite us because modifications are only done to
> file-backed VMAs and we do not handle file-backed page faults under
> per-VMA locks yet, however this will become a problem once we start
> supporting that.
> If we all agree to the above I can post a change to lock the VMA
> before adding it into the tree.

This patch is addressing the issue with per-VMA locks here:
https://lore.kernel.org/all/[email protected]/

>
> >
> > > There is also a possibility of a driver mapping a VMA and having entry
> > > points from other locations. It isn't accessed through the tree though
> > > so I don't think this change will introduce new races?
> >
> > Right.
> >
> > > > It cannot be retrieved from the VMA tree as it requires at least read mmap
> > > > lock. And the VMA doesn't exist anywhere else.
> > > >
> > > > I believe the change is safe.
> > >
> > > I guess insert_vm_struct(), and vma_link() callers should be checked and
> > > updated accordingly?
> >
> > Yep.
> >
> > --
> > Kiryl Shutsemau / Kirill A. Shutemov
> >

2023-07-11 17:37:40

by Yu Ma

[permalink] [raw]
Subject: [PATCH v2] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. When running multiple parallel
tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
come from load_elf_binary through the call chain
"execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
call mmap_region to create vma node, initialize it and insert it to vma
maintain structure in mm_struct and i_mmap tree of the mapping file, then
increase map_count to record the number of vma nodes used. The hot osq_lock
is to protect operations on file’s i_mmap tree. For the mm_struct member
change like vma insertion and map_count update, they do not affect i_mmap
tree. Move those operations out of the lock's critical section, to reduce
hold time on the lock.

With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
obvious performance gain on v6.4-rc4 due to regression of this benchmark
from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
mm's rss stats into percpu_counter). Related discussion and conclusion
can be referred at the mail thread initiated by 0day as below:
Link: https://lore.kernel.org/linux-mm/[email protected]/

Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
v1 -> v2:
- Update vma_link() to reduce the hold time on file mapping lock as well.
Based on v6.4-rc7, vma_link() is only called by insert_vm_struct ()
and copy_vma(), which are both protected by mmap_lock.
---
mm/mmap.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index d600404580b2..6f42ca2ab84a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -444,14 +444,11 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
if (vma_iter_prealloc(&vmi))
return -ENOMEM;

+ vma_iter_store(&vmi, vma);
+
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
i_mmap_lock_write(mapping);
- }
-
- vma_iter_store(&vmi, vma);
-
- if (mapping) {
__vma_link_file(vma, mapping);
i_mmap_unlock_write(mapping);
}
@@ -2708,12 +2705,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (vma_iter_prealloc(&vmi))
goto close_and_free_vma;

- if (vma->vm_file)
- i_mmap_lock_write(vma->vm_file->f_mapping);
-
vma_iter_store(&vmi, vma);
mm->map_count++;
if (vma->vm_file) {
+ i_mmap_lock_write(vma->vm_file->f_mapping);
if (vma->vm_flags & VM_SHARED)
mapping_allow_writable(vma->vm_file->f_mapping);

--
2.39.3


2023-07-11 18:41:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

On Tue, 11 Jul 2023 13:20:20 -0400 Yu Ma <[email protected]> wrote:

> UnixBench/Execl represents a class of workload where bash scripts are
> spawned frequently to do some short jobs. When running multiple parallel
> tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
> come from load_elf_binary through the call chain
> "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
> call mmap_region to create vma node, initialize it and insert it to vma
> maintain structure in mm_struct and i_mmap tree of the mapping file, then
> increase map_count to record the number of vma nodes used. The hot osq_lock
> is to protect operations on file’s i_mmap tree. For the mm_struct member
> change like vma insertion and map_count update, they do not affect i_mmap
> tree. Move those operations out of the lock's critical section, to reduce
> hold time on the lock.
>
> With this change, on Intel Sapphire Rapids 112C/224T platform, based on
> v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> obvious performance gain on v6.4-rc4 due to regression of this benchmark
> from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
> mm's rss stats into percpu_counter). Related discussion and conclusion
> can be referred at the mail thread initiated by 0day as below:
> Link: https://lore.kernel.org/linux-mm/[email protected]/

Could you please redo/retest this against a kernel which has
1c7873e3364 ("mm: lock newly mapped VMA with corrected ordering")?
mainline, mm-unstable or linux-next.

Thanks.

2023-07-12 00:28:21

by Yu Ma

[permalink] [raw]
Subject: RE: [PATCH v2] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

> -----Original Message-----
> From: Andrew Morton <[email protected]>
> Sent: Wednesday, July 12, 2023 2:28 AM
> To: Ma, Yu <[email protected]>
> Cc: [email protected]; [email protected]; Williams, Dan J
> <[email protected]>; Hansen, Dave <[email protected]>; linux-
> [email protected]; [email protected]; Zhu, Lipeng
> <[email protected]>; Deng, Pan <[email protected]>;
> [email protected]; Li, Tianyou <[email protected]>; Chen, Tim C
> <[email protected]>; [email protected]
> Subject: Re: [PATCH v2] mm/mmap: move vma operations to mm_struct out
> of the critical section of file mapping lock
>
> On Tue, 11 Jul 2023 13:20:20 -0400 Yu Ma <[email protected]> wrote:
>
> > UnixBench/Execl represents a class of workload where bash scripts are
> > spawned frequently to do some short jobs. When running multiple
> > parallel tasks, hot osq_lock is observed from do_mmap and exit_mmap.
> > Both of them come from load_elf_binary through the call chain
> > "execl->do_execveat_common->bprm_execve->load_elf_binary". In
> > do_mmap,it will call mmap_region to create vma node, initialize it and
> > insert it to vma maintain structure in mm_struct and i_mmap tree of
> > the mapping file, then increase map_count to record the number of vma
> > nodes used. The hot osq_lock is to protect operations on file’s i_mmap
> > tree. For the mm_struct member change like vma insertion and map_count
> > update, they do not affect i_mmap tree. Move those operations out of
> > the lock's critical section, to reduce hold time on the lock.
> >
> > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > on v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
> > obvious performance gain on v6.4-rc4 due to regression of this
> > benchmark from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25
> > (mm: convert mm's rss stats into percpu_counter). Related discussion
> > and conclusion can be referred at the mail thread initiated by 0day as
> below:
> > Link:
> > https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-
> 7e8fb108def0@
> > intel.com/
>
> Could you please redo/retest this against a kernel which has
> 1c7873e3364 ("mm: lock newly mapped VMA with corrected ordering")?
> mainline, mm-unstable or linux-next.
>
OK, no problem.


Regards
Yu

2023-07-12 14:48:45

by Yu Ma

[permalink] [raw]
Subject: [PATCH v3] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock

UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. When running multiple parallel
tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
come from load_elf_binary through the call chain
"execl->do_execveat_common->bprm_execve->load_elf_binary". In do_mmap,it will
call mmap_region to create vma node, initialize it and insert it to vma
maintain structure in mm_struct and i_mmap tree of the mapping file, then
increase map_count to record the number of vma nodes used. The hot osq_lock
is to protect operations on file’s i_mmap tree. For the mm_struct member
change like vma insertion and map_count update, they do not affect i_mmap
tree. Move those operations out of the lock's critical section, to reduce
hold time on the lock.

With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
obvious performance gain on v6.5-rc1 due to regression of this benchmark
from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert
mm's rss stats into percpu_counter). Related discussion and conclusion
can be referred at the mail thread initiated by 0day as below:
Link: https://lore.kernel.org/linux-mm/[email protected]/

Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
v2 -> v3: Rebase the patch to v6.5-rc1, which includes 1c7873e3364 (mm:
lock newly mapped VMA with corrected ordering), and update commit
message to status on v6.5-rc1
v1 -> v2: Update vma_link() to reduce the hold time on file mapping lock
as well. Based on v6.5-rc1, vma_link() is only called by
insert_vm_struct () and copy_vma(), which are both protected by mmap_lock.
---
---
mm/mmap.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3eda23c9ebe7..ce31aec82e82 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -412,14 +412,11 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
if (vma_iter_prealloc(&vmi))
return -ENOMEM;

+ vma_iter_store(&vmi, vma);
+
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
i_mmap_lock_write(mapping);
- }
-
- vma_iter_store(&vmi, vma);
-
- if (mapping) {
__vma_link_file(vma, mapping);
i_mmap_unlock_write(mapping);
}
@@ -2811,12 +2808,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write(vma);
- if (vma->vm_file)
- i_mmap_lock_write(vma->vm_file->f_mapping);
-
vma_iter_store(&vmi, vma);
mm->map_count++;
if (vma->vm_file) {
+ i_mmap_lock_write(vma->vm_file->f_mapping);
if (vma->vm_flags & VM_SHARED)
mapping_allow_writable(vma->vm_file->f_mapping);

--
2.39.3