page_table_lock is a lock that for page table, we won't change page
table in __anon_vma_prepare(). As we can see, it works well in
anon_vma_clone(). They do the same operation.
Remove unnecessary page_table_lock in __anon_vma_prepare().
Signed-off-by: Yajun Deng <[email protected]>
---
mm/rmap.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..e894640a9327 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -187,7 +187,6 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
*/
int __anon_vma_prepare(struct vm_area_struct *vma)
{
- struct mm_struct *mm = vma->vm_mm;
struct anon_vma *anon_vma, *allocated;
struct anon_vma_chain *avc;
@@ -208,8 +207,6 @@ int __anon_vma_prepare(struct vm_area_struct *vma)
}
anon_vma_lock_write(anon_vma);
- /* page_table_lock to protect against threads */
- spin_lock(&mm->page_table_lock);
if (likely(!vma->anon_vma)) {
vma->anon_vma = anon_vma;
anon_vma_chain_link(vma, avc, anon_vma);
@@ -217,7 +214,6 @@ int __anon_vma_prepare(struct vm_area_struct *vma)
allocated = NULL;
avc = NULL;
}
- spin_unlock(&mm->page_table_lock);
anon_vma_unlock_write(anon_vma);
if (unlikely(allocated))
--
2.25.1
On 22.04.24 12:52, Yajun Deng wrote:
> page_table_lock is a lock that for page table, we won't change page
> table in __anon_vma_prepare(). As we can see, it works well in
> anon_vma_clone(). They do the same operation.
We are reusing mm->page_table_lock to serialize, not the *actual*
low-level page table locks that really protect PTEs.
With that locking gone, there would be nothing protection vma->anon_vma.
Note that anon_vma_clone() is likely called with the mmap_lock held in
write mode, which is not the case for __anon_vma_prepare() ...
I think this change is wrong.
--
Cheers,
David / dhildenb
On 2024/4/22 18:52, Yajun Deng wrote:
> page_table_lock is a lock that for page table, we won't change page
> table in __anon_vma_prepare(). As we can see, it works well in
> anon_vma_clone(). They do the same operation.
>
> Remove unnecessary page_table_lock in __anon_vma_prepare().
IIUC, the page_table_lock here is not to protect page table, but as
mentioned in the comments, to prevent concurrent modification by
multiple threads.
>
> Signed-off-by: Yajun Deng <[email protected]>
> ---
> mm/rmap.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..e894640a9327 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -187,7 +187,6 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
> */
> int __anon_vma_prepare(struct vm_area_struct *vma)
> {
> - struct mm_struct *mm = vma->vm_mm;
> struct anon_vma *anon_vma, *allocated;
> struct anon_vma_chain *avc;
>
> @@ -208,8 +207,6 @@ int __anon_vma_prepare(struct vm_area_struct *vma)
> }
>
> anon_vma_lock_write(anon_vma);
> - /* page_table_lock to protect against threads */
> - spin_lock(&mm->page_table_lock);
> if (likely(!vma->anon_vma)) {
> vma->anon_vma = anon_vma;
> anon_vma_chain_link(vma, avc, anon_vma);
> @@ -217,7 +214,6 @@ int __anon_vma_prepare(struct vm_area_struct *vma)
> allocated = NULL;
> avc = NULL;
> }
> - spin_unlock(&mm->page_table_lock);
> anon_vma_unlock_write(anon_vma);
>
> if (unlikely(allocated))
April 22, 2024 at 7:24 PM, "David Hildenbrand" <[email protected]> wrote:
>
> On 22.04.24 12:52, Yajun Deng wrote:
>
> >
> > page_table_lock is a lock that for page table, we won't change page
> >
> > table in __anon_vma_prepare(). As we can see, it works well in
> >
> > anon_vma_clone(). They do the same operation.
> >
>
> We are reusing mm->page_table_lock to serialize, not the *actual* low-level page table locks that really protect PTEs.
>
> With that locking gone, there would be nothing protection vma->anon_vma.
>
> Note that anon_vma_clone() is likely called with the mmap_lock held in write mode, which is not the case for __anon_vma_prepare() ...
Yes, anon_vma_clone() is called with the mmap_lock held. I added mmap_assert_write_locked(dst->vm_mm) to prove it.
I added mmap_assert_write_locked(vma->vm_mm) in __anon_vma_prepare() at the same time, it shows __anon_vma_prepare()
is also called with the mmap_lock held too.
>
> I think this change is wrong.
>
> -- Cheers,
>
> David / dhildenb
>
On 23.04.24 09:53, Yajun Deng wrote:
> April 22, 2024 at 7:24 PM, "David Hildenbrand" <[email protected]> wrote:
>
>
>
>>
>> On 22.04.24 12:52, Yajun Deng wrote:
>>
>>>
>>> page_table_lock is a lock that for page table, we won't change page
>>>
>>> table in __anon_vma_prepare(). As we can see, it works well in
>>>
>>> anon_vma_clone(). They do the same operation.
>>>
>>
>> We are reusing mm->page_table_lock to serialize, not the *actual* low-level page table locks that really protect PTEs.
>>
>> With that locking gone, there would be nothing protection vma->anon_vma.
>>
>> Note that anon_vma_clone() is likely called with the mmap_lock held in write mode, which is not the case for __anon_vma_prepare() ...
>
> Yes, anon_vma_clone() is called with the mmap_lock held. I added mmap_assert_write_locked(dst->vm_mm) to prove it.
> I added mmap_assert_write_locked(vma->vm_mm) in __anon_vma_prepare() at the same time, it shows __anon_vma_prepare()
> is also called with the mmap_lock held too.
Make sure you actually have lockdep built in and enabled.
__anon_vma_prepare() is for example called from do_anonymous_page()
where we might only hold the mmap_lock in read mode (or not at all IIRC
with VMA in read mode).
--
Cheers,
David / dhildenb
April 23, 2024 at 4:18 PM, "David Hildenbrand" <[email protected]> wrote:
>
> On 23.04.24 09:53, Yajun Deng wrote:
>
> >
> > April 22, 2024 at 7:24 PM, "David Hildenbrand" <[email protected]> wrote:
> >
> > > >>
> >
> > >
> > > On 22.04.24 12:52, Yajun Deng wrote:
> > >
> >
> > page_table_lock is a lock that for page table, we won't change page
> >
> > table in __anon_vma_prepare(). As we can see, it works well in
> >
> > anon_vma_clone(). They do the same operation.
> >
> > >
> > > We are reusing mm->page_table_lock to serialize, not the *actual* low-level page table locks that really protect PTEs.
> > >
> > > With that locking gone, there would be nothing protection vma->anon_vma.
> > >
> > > Note that anon_vma_clone() is likely called with the mmap_lock held in write mode, which is not the case for __anon_vma_prepare() ...
> > >
> >
> > Yes, anon_vma_clone() is called with the mmap_lock held. I added mmap_assert_write_locked(dst->vm_mm) to prove it.
> >
> > I added mmap_assert_write_locked(vma->vm_mm) in __anon_vma_prepare() at the same time, it shows __anon_vma_prepare()
> >
> > is also called with the mmap_lock held too.
> >
>
> Make sure you actually have lockdep built in and enabled.
>
This is my config.
CONFIG_LOCKDEP=n
CONFIG_DEBUG_VM=y
I did another test.
I put mmap_assert_write_locked(mm) before 'set_bit(MMF_OOM_SKIP, &mm->flags)' in mmap.c, it's outside the lock.
It will crash when on boot. I think mmap_assert_write_locked() works.
> __anon_vma_prepare() is for example called from do_anonymous_page() where we might only hold the mmap_lock in read mode (or not at all IIRC with VMA in read mode).
>
> -- Cheers,
>
> David / dhildenb
>
* Yajun Deng <[email protected]> [240423 04:35]:
> April 23, 2024 at 4:18 PM, "David Hildenbrand" <[email protected]> wrote:
>
>
>
> >
> > On 23.04.24 09:53, Yajun Deng wrote:
> >
> > >
> > > April 22, 2024 at 7:24 PM, "David Hildenbrand" <[email protected]> wrote:
> > >
> > > > >>
> > >
> > > >
> > > > On 22.04.24 12:52, Yajun Deng wrote:
> > > >
> > >
> > > page_table_lock is a lock that for page table, we won't change page
> > >
> > > table in __anon_vma_prepare(). As we can see, it works well in
> > >
> > > anon_vma_clone(). They do the same operation.
> > >
> > > >
> > > > We are reusing mm->page_table_lock to serialize, not the *actual* low-level page table locks that really protect PTEs.
> > > >
> > > > With that locking gone, there would be nothing protection vma->anon_vma.
> > > >
> > > > Note that anon_vma_clone() is likely called with the mmap_lock held in write mode, which is not the case for __anon_vma_prepare() ...
> > > >
> > >
> > > Yes, anon_vma_clone() is called with the mmap_lock held. I added mmap_assert_write_locked(dst->vm_mm) to prove it.
> > >
> > > I added mmap_assert_write_locked(vma->vm_mm) in __anon_vma_prepare() at the same time, it shows __anon_vma_prepare()
> > >
> > > is also called with the mmap_lock held too.
> > >
> >
> > Make sure you actually have lockdep built in and enabled.
> >
>
> This is my config.
> CONFIG_LOCKDEP=n
> CONFIG_DEBUG_VM=y
>
> I did another test.
> I put mmap_assert_write_locked(mm) before 'set_bit(MMF_OOM_SKIP, &mm->flags)' in mmap.c, it's outside the lock.
> It will crash when on boot. I think mmap_assert_write_locked() works.
If you are changing locks, then please test with lockdep on.
>
>
> > __anon_vma_prepare() is for example called from do_anonymous_page() where we might only hold the mmap_lock in read mode (or not at all IIRC with VMA in read mode).
Consider two concurrent readers getting to this function with the same
vma. There is no mergeable anon vma, so both create a new anon_vma.
You take the anon_vma lock to parallelize the linking to the vma - but
they are different locks because they are both new anon_vma structs
allocated by concurrent readers.
You now need a lock that you know cannot allow this to happen. Looking
at the top of mm/rmap.c and see which one works. The next in line is
the page table lock, so that one is used here.
What if we reverse the locks? We can deadlock.
What if we don't take the anon_vma lock? We can have two writers to the
anon_vma.
Thanks,
Liam
April 24, 2024 at 1:11 AM, "Liam R. Howlett" <[email protected]> wrote:
>
> * Yajun Deng <[email protected]> [240423 04:35]:
>
> >
> > April 23, 2024 at 4:18 PM, "David Hildenbrand" <[email protected]> wrote:
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On 23.04.24 09:53, Yajun Deng wrote:
> >
> >
> >
> > >
> >
> > > April 22, 2024 at 7:24 PM, "David Hildenbrand" <[email protected]> wrote:
> >
> > >
> >
> > > > >>
> >
> > >
> >
> > > >
> >
> > > > On 22.04.24 12:52, Yajun Deng wrote:
> >
> > > >
> >
> > >
> >
> > > page_table_lock is a lock that for page table, we won't change page
> >
> > >
> >
> > > table in __anon_vma_prepare(). As we can see, it works well in
> >
> > >
> >
> > > anon_vma_clone(). They do the same operation.
> >
> > >
> >
> > > >
> >
> > > > We are reusing mm->page_table_lock to serialize, not the *actual* low-level page table locks that really protect PTEs.
> >
> > > >
> >
> > > > With that locking gone, there would be nothing protection vma->anon_vma.
> >
> > > >
> >
> > > > Note that anon_vma_clone() is likely called with the mmap_lock held in write mode, which is not the case for __anon_vma_prepare() ...
> >
> > > >
> >
> > >
> >
> > > Yes, anon_vma_clone() is called with the mmap_lock held. I added mmap_assert_write_locked(dst->vm_mm) to prove it.
> >
> > >
> >
> > > I added mmap_assert_write_locked(vma->vm_mm) in __anon_vma_prepare() at the same time, it shows __anon_vma_prepare()
> >
> > >
> >
> > > is also called with the mmap_lock held too.
> >
> > >
> >
> >
> >
> > Make sure you actually have lockdep built in and enabled.
> >
> >
> >
> >
> >
> > This is my config.
> >
> > CONFIG_LOCKDEP=n
> >
> > CONFIG_DEBUG_VM=y
> >
> >
> >
> > I did another test.
> >
> > I put mmap_assert_write_locked(mm) before 'set_bit(MMF_OOM_SKIP, &mm->flags)' in mmap.c, it's outside the lock.
> >
> > It will crash when on boot. I think mmap_assert_write_locked() works.
> >
>
> If you are changing locks, then please test with lockdep on.
>
It's my fault. It shows a warning with lockdep on.
> >
> > __anon_vma_prepare() is for example called from do_anonymous_page() where we might only hold the mmap_lock in read mode (or not at all IIRC with VMA in read mode).
> >
>
> Consider two concurrent readers getting to this function with the same
>
> vma. There is no mergeable anon vma, so both create a new anon_vma.
>
> You take the anon_vma lock to parallelize the linking to the vma - but
>
> they are different locks because they are both new anon_vma structs
>
> allocated by concurrent readers.
>
> You now need a lock that you know cannot allow this to happen. Looking
>
> at the top of mm/rmap.c and see which one works. The next in line is
>
> the page table lock, so that one is used here.
>
> What if we reverse the locks? We can deadlock.
>
> What if we don't take the anon_vma lock? We can have two writers to the
>
> anon_vma.
>
> Thanks,
>
> Liam
>