2023-03-07 21:00:33

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH] mm/ksm: Fix race with ksm_exit() in VMA iteration

ksm_exit() may remove the mm from the ksm_scan between the unlocking of
the ksm_mmlist and the start of the VMA iteration. This results in the
mmap_read_lock() not being taken and a report from lockdep that the mm
isn't locked in the maple tree code.

Fix the race by checking if this mm has been removed before iterating
the VMAs. __ksm_exit() uses the mmap lock to synchronize the freeing of
an mm, so it is safe to keep iterating over the VMAs when it is going to
be freed.

This change will slow down the mm exit during the race condition, but
will speed up the non-race scenarios iteration over the VMA list, which
should be much more common.

Reported-by: Pengfei Xu <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Reported-by: [email protected]
Link: https://syzkaller.appspot.com/bug?id=64a3e95957cd3deab99df7cd7b5a9475af92c93e
Cc: [email protected]
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: <[email protected]>
Fixes: a5f18ba07276 ("mm/ksm: use vma iterators instead of vma linked list")
Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/ksm.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 525c3306e78b..723ddbe6ea97 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1044,9 +1044,10 @@ static int unmerge_and_remove_all_rmap_items(void)

mm = mm_slot->slot.mm;
mmap_read_lock(mm);
+ if (ksm_test_exit(mm))
+ goto mm_exiting;
+
for_each_vma(vmi, vma) {
- if (ksm_test_exit(mm))
- break;
if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
continue;
err = unmerge_ksm_pages(vma,
@@ -1055,6 +1056,7 @@ static int unmerge_and_remove_all_rmap_items(void)
goto error;
}

+mm_exiting:
remove_trailing_rmap_items(&mm_slot->rmap_list);
mmap_read_unlock(mm);

--
2.39.2



2023-03-08 09:43:06

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/ksm: Fix race with ksm_exit() in VMA iteration

On 07.03.23 21:59, Liam R. Howlett wrote:
> ksm_exit() may remove the mm from the ksm_scan between the unlocking of
> the ksm_mmlist and the start of the VMA iteration. This results in the
> mmap_read_lock() not being taken and a report from lockdep that the mm
> isn't locked in the maple tree code.

I'm confused. The code does

mmap_read_lock(mm);
...
for_each_vma(vmi, vma) {
mmap_read_unlock(mm);

How can we not take the mmap_read_lock() ? Or am I staring at the wrong
mmap_read_lock() ?

>
> Fix the race by checking if this mm has been removed before iterating
> the VMAs. __ksm_exit() uses the mmap lock to synchronize the freeing of
> an mm, so it is safe to keep iterating over the VMAs when it is going to
> be freed.
>
> This change will slow down the mm exit during the race condition, but
> will speed up the non-race scenarios iteration over the VMA list, which
> should be much more common.

Would leaving the existing check in help to just stop scanning faster in
that case?

>
> Reported-by: Pengfei Xu <[email protected]>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Reported-by: [email protected]
> Link: https://syzkaller.appspot.com/bug?id=64a3e95957cd3deab99df7cd7b5a9475af92c93e
> Cc: [email protected]
> Cc: [email protected]
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: <[email protected]>
> Fixes: a5f18ba07276 ("mm/ksm: use vma iterators instead of vma linked list")
> Signed-off-by: Liam R. Howlett <[email protected]>
> ---
> mm/ksm.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 525c3306e78b..723ddbe6ea97 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1044,9 +1044,10 @@ static int unmerge_and_remove_all_rmap_items(void)
>
> mm = mm_slot->slot.mm;
> mmap_read_lock(mm);

Better add a comment:

/*
* Don't iterate any VMAs if we might be racing against ksm_exit(),
* just exit early.
*/

> + if (ksm_test_exit(mm))
> + goto mm_exiting;
> +
> for_each_vma(vmi, vma) {
> - if (ksm_test_exit(mm))
> - break;
> if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
> continue;
> err = unmerge_ksm_pages(vma,
> @@ -1055,6 +1056,7 @@ static int unmerge_and_remove_all_rmap_items(void)
> goto error;
> }
>
> +mm_exiting:
> remove_trailing_rmap_items(&mm_slot->rmap_list);
> mmap_read_unlock(mm);
>



--
Thanks,

David / dhildenb


2023-03-08 16:20:51

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH] mm/ksm: Fix race with ksm_exit() in VMA iteration

* David Hildenbrand <[email protected]> [230308 04:41]:
> On 07.03.23 21:59, Liam R. Howlett wrote:
> > ksm_exit() may remove the mm from the ksm_scan between the unlocking of
> > the ksm_mmlist and the start of the VMA iteration. This results in the
> > mmap_read_lock() not being taken and a report from lockdep that the mm
> > isn't locked in the maple tree code.
>
> I'm confused.

Thanks for looking at this. My explanation is incorrect.

>The code does
>
> mmap_read_lock(mm);
> ...
> for_each_vma(vmi, vma) {
> mmap_read_unlock(mm);
>
> How can we not take the mmap_read_lock() ? Or am I staring at the wrong
> mmap_read_lock() ?

That's the right one. The mmap lock is taken, but the one we are
checking is not the correct one. Let me try again.

Checking the mm struct against the one in the vmi confirms they are the
same, so lockdep is telling us the lock we took doesn't match what it
expected. I verified that the lock is the same before the
'for_each_vma()' call by inserting a BUG_ON() which is never triggered
with the reproducer.

ksm_test_exit() uses the mm->mm_users atomic to detect an mm exit. This
is usually done in mmget(), mmput(), and friends.

__ksm_exit() and unmerge_and_remove_all_rmap_items() handle freeing by
use of the mm->mm_count atomic. This is usually via mmgrab() and mmdrop().

mmput() will call __mmput() if mm_users is decremented to zero.
__mmput() calls mmdrop() after the ksm_exit() and then continue with
teardown.

So, I believe what is happening is that the external lock flag is being
cleared from the maple tree (the one lockdep checks) before we call the
iterator.

task 1 task 2
unmerge_and_remove_all_rmap_items()
spin_lock(&ksm_mmlist_lock);
ksm_scan.mm_slot is set
spin_unlock(&ksm_mmlist_lock);

=======================================================================
At this point mm->mm_users is 0, but mm_count is not as it will
be decremented at the end of __mmput().
=======================================================================

__mmput()
ksm_exit()
__ksm_exit()
spin_lock(&ksm_mmlist_lock);
mm_slot is set
spin_unlock(&ksm_mmlist_lock)
mm_slot == ksm_scan.mm_slot
mmap_write_lock();
mmap_write_unlock();
return
exit_mmap()
...
mmap_write_lock();
__mt_destory()
Free all maple tree nodes
mt->flags = 0;
mmap_write_unlock();
...

mmap_read_lock()
for_each_vma()
lockdep checks *internal* spinlock


This was fine before the change as the previous for loop would not have
checked the locking and would have hit the ksm_test_exit() test before
any problem arose.

Now we are getting a lockdep warning because the maple tree flag for the
external lock is cleared.

How about this as the start to the commit message:

The VMA iterator may trigger a lockdep warning if the mm is in the
process of being cleaned up before obtaining the mmap_read_lock().

>
> >
> > Fix the race by checking if this mm has been removed before iterating
> > the VMAs. __ksm_exit() uses the mmap lock to synchronize the freeing of
> > an mm, so it is safe to keep iterating over the VMAs when it is going to
> > be freed.
> >
> > This change will slow down the mm exit during the race condition, but
> > will speed up the non-race scenarios iteration over the VMA list, which
> > should be much more common.
>
> Would leaving the existing check in help to just stop scanning faster in
> that case?

Yes. But why? We would stop the scanning faster in the race condition
case, but slow the normal case down.

This check was here to ensure that the mm isn't being torn down while
it's iterating over the loop. Hugh (Cc'ed) added this in 2009, but the
fundamental problem he specifies in his commit message in 9ba692948008
("ksm: fix oom deadlock") is that exit_mmap() does not take the
mmap_lock() - which is no longer the case. We are safe to iterate the
VMAs with the mmap_read_lock() as the mmap_write_lock() is taken during
tear down of the VMA tree today.

>
> >
> > Reported-by: Pengfei Xu <[email protected]>
> > Link: https://lore.kernel.org/lkml/[email protected]/
> > Reported-by: [email protected]
> > Link: https://syzkaller.appspot.com/bug?id=64a3e95957cd3deab99df7cd7b5a9475af92c93e
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: Andrew Morton <[email protected]>
> > Cc: Matthew Wilcox (Oracle) <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: <[email protected]>
> > Fixes: a5f18ba07276 ("mm/ksm: use vma iterators instead of vma linked list")
> > Signed-off-by: Liam R. Howlett <[email protected]>
> > ---
> > mm/ksm.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 525c3306e78b..723ddbe6ea97 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1044,9 +1044,10 @@ static int unmerge_and_remove_all_rmap_items(void)
> > mm = mm_slot->slot.mm;
> > mmap_read_lock(mm);
>
> Better add a comment:
>
> /*
> * Don't iterate any VMAs if we might be racing against ksm_exit(),
> * just exit early.
> */
>
> > + if (ksm_test_exit(mm))
> > + goto mm_exiting;
> > +
> > for_each_vma(vmi, vma) {
> > - if (ksm_test_exit(mm))
> > - break;
> > if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
> > continue;
> > err = unmerge_ksm_pages(vma,
> > @@ -1055,6 +1056,7 @@ static int unmerge_and_remove_all_rmap_items(void)
> > goto error;
> > }
> > +mm_exiting:
> > remove_trailing_rmap_items(&mm_slot->rmap_list);
> > mmap_read_unlock(mm);
>
>
>
> --
> Thanks,
>
> David / dhildenb
>

2023-03-08 16:47:37

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/ksm: Fix race with ksm_exit() in VMA iteration

On 08.03.23 17:19, Liam R. Howlett wrote:
> * David Hildenbrand <[email protected]> [230308 04:41]:
>> On 07.03.23 21:59, Liam R. Howlett wrote:
>>> ksm_exit() may remove the mm from the ksm_scan between the unlocking of
>>> the ksm_mmlist and the start of the VMA iteration. This results in the
>>> mmap_read_lock() not being taken and a report from lockdep that the mm
>>> isn't locked in the maple tree code.
>>
>> I'm confused.
>
> Thanks for looking at this. My explanation is incorrect.
>

Heh, so that explains my confusion :)

>> The code does
>>
>> mmap_read_lock(mm);
>> ...
>> for_each_vma(vmi, vma) {
>> mmap_read_unlock(mm);
>>
>> How can we not take the mmap_read_lock() ? Or am I staring at the wrong
>> mmap_read_lock() ?
>
> That's the right one. The mmap lock is taken, but the one we are
> checking is not the correct one. Let me try again.
>
> Checking the mm struct against the one in the vmi confirms they are the
> same, so lockdep is telling us the lock we took doesn't match what it
> expected. I verified that the lock is the same before the
> 'for_each_vma()' call by inserting a BUG_ON() which is never triggered
> with the reproducer.
>
> ksm_test_exit() uses the mm->mm_users atomic to detect an mm exit. This
> is usually done in mmget(), mmput(), and friends.
>
> __ksm_exit() and unmerge_and_remove_all_rmap_items() handle freeing by
> use of the mm->mm_count atomic. This is usually via mmgrab() and mmdrop().
>
> mmput() will call __mmput() if mm_users is decremented to zero.
> __mmput() calls mmdrop() after the ksm_exit() and then continue with
> teardown.
>
> So, I believe what is happening is that the external lock flag is being
> cleared from the maple tree (the one lockdep checks) before we call the
> iterator.
>

Thanks for the explanation.

So, IIUC, we are really only fixing a lockdep issue, assuming that the
maple tree cleanup code leaves the maple tree in a state where an
iterator essentially exits right away. Further, I assume this wasn't a
problem before the maple tree: there would simply be no VMAs to iterate.

> task 1 task 2
> unmerge_and_remove_all_rmap_items()
> spin_lock(&ksm_mmlist_lock);
> ksm_scan.mm_slot is set
> spin_unlock(&ksm_mmlist_lock);
>
> =======================================================================
> At this point mm->mm_users is 0, but mm_count is not as it will
> be decremented at the end of __mmput().
> =======================================================================
>
> __mmput()
> ksm_exit()
> __ksm_exit()
> spin_lock(&ksm_mmlist_lock);
> mm_slot is set
> spin_unlock(&ksm_mmlist_lock)
> mm_slot == ksm_scan.mm_slot
> mmap_write_lock();
> mmap_write_unlock();
> return
> exit_mmap()
> ...
> mmap_write_lock();
> __mt_destory()
> Free all maple tree nodes
> mt->flags = 0;
> mmap_write_unlock();
> ...
>
> mmap_read_lock()
> for_each_vma()
> lockdep checks *internal* spinlock
>
>
> This was fine before the change as the previous for loop would not have
> checked the locking and would have hit the ksm_test_exit() test before
> any problem arose.
>
> Now we are getting a lockdep warning because the maple tree flag for the
> external lock is cleared.
>
> How about this as the start to the commit message:
>
> The VMA iterator may trigger a lockdep warning if the mm is in the
> process of being cleaned up before obtaining the mmap_read_lock().

Maybe something like the following (matches my understanding, as an
inspiration):

"
exit_mmap() will tear down the VMAs (maple tree) with the mmap_lock held
in write mode. Once we take the mmap_lock in read mode in
unmerge_and_remove_all_rmap_items(), we are protected against such
concurrent teardown, however, the teardown might already have happened
just the way KSM slot registration machinery works.

Without the VMA iterator, we didn't care. But with the VMA iterator,
lockdep will now complain when stumbling over a the destroyed maple tree.

Let's check for the teardown by relying on ksm_test_exit() earlier,
before working on a torn down maple tree.
"

>
>>
>>>
>>> Fix the race by checking if this mm has been removed before iterating
>>> the VMAs. __ksm_exit() uses the mmap lock to synchronize the freeing of
>>> an mm, so it is safe to keep iterating over the VMAs when it is going to
>>> be freed.
>>>
>>> This change will slow down the mm exit during the race condition, but
>>> will speed up the non-race scenarios iteration over the VMA list, which
>>> should be much more common.
>>
>> Would leaving the existing check in help to just stop scanning faster in
>> that case?
>
> Yes. But why? We would stop the scanning faster in the race condition
> case, but slow the normal case down.
>
> This check was here to ensure that the mm isn't being torn down while
> it's iterating over the loop. Hugh (Cc'ed) added this in 2009, but the
> fundamental problem he specifies in his commit message in 9ba692948008
> ("ksm: fix oom deadlock") is that exit_mmap() does not take the
> mmap_lock() - which is no longer the case. We are safe to iterate the
> VMAs with the mmap_read_lock() as the mmap_write_lock() is taken during
> tear down of the VMA tree today.
>

Right. I just spotted that we have a ksm_test_exit() already in
unmerge_ksm_pages(), so that should be sufficient to make us stop
scanning in case ksm_exit() is waiting for the mmap lock.


Adding a comment summarizing why that's required before iterating would
be nice. Like

/* Exit right away if the maple tree might have been torn down. */


With a better description, feel free to add

Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb


2023-03-08 17:27:02

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH] mm/ksm: Fix race with ksm_exit() in VMA iteration

* David Hildenbrand <[email protected]> [230308 11:46]:
> On 08.03.23 17:19, Liam R. Howlett wrote:
> > * David Hildenbrand <[email protected]> [230308 04:41]:
> > > On 07.03.23 21:59, Liam R. Howlett wrote:
> > > > ksm_exit() may remove the mm from the ksm_scan between the unlocking of
> > > > the ksm_mmlist and the start of the VMA iteration. This results in the
> > > > mmap_read_lock() not being taken and a report from lockdep that the mm
> > > > isn't locked in the maple tree code.
> > >
> > > I'm confused.
> >
> > Thanks for looking at this. My explanation is incorrect.
> >
>
> Heh, so that explains my confusion :)
>
> > > The code does
> > >
> > > mmap_read_lock(mm);
> > > ...
> > > for_each_vma(vmi, vma) {
> > > mmap_read_unlock(mm);
> > >
> > > How can we not take the mmap_read_lock() ? Or am I staring at the wrong
> > > mmap_read_lock() ?
> >
> > That's the right one. The mmap lock is taken, but the one we are
> > checking is not the correct one. Let me try again.
> >
> > Checking the mm struct against the one in the vmi confirms they are the
> > same, so lockdep is telling us the lock we took doesn't match what it
> > expected. I verified that the lock is the same before the
> > 'for_each_vma()' call by inserting a BUG_ON() which is never triggered
> > with the reproducer.
> >
> > ksm_test_exit() uses the mm->mm_users atomic to detect an mm exit. This
> > is usually done in mmget(), mmput(), and friends.
> >
> > __ksm_exit() and unmerge_and_remove_all_rmap_items() handle freeing by
> > use of the mm->mm_count atomic. This is usually via mmgrab() and mmdrop().
> >
> > mmput() will call __mmput() if mm_users is decremented to zero.
> > __mmput() calls mmdrop() after the ksm_exit() and then continue with
> > teardown.
> >
> > So, I believe what is happening is that the external lock flag is being
> > cleared from the maple tree (the one lockdep checks) before we call the
> > iterator.
> >
>
> Thanks for the explanation.
>
> So, IIUC, we are really only fixing a lockdep issue, assuming that the
> maple tree cleanup code leaves the maple tree in a state where an iterator
> essentially exits right away. Further, I assume this wasn't a problem before
> the maple tree: there would simply be no VMAs to iterate.

Yes, the tree is empty so it will be a noop after the dereference.

This is really just a lockdep fix so I don't think it mattered before.

>
> > task 1 task 2
> > unmerge_and_remove_all_rmap_items()
> > spin_lock(&ksm_mmlist_lock);
> > ksm_scan.mm_slot is set
> > spin_unlock(&ksm_mmlist_lock);
> >
> > =======================================================================
> > At this point mm->mm_users is 0, but mm_count is not as it will
> > be decremented at the end of __mmput().
> > =======================================================================
> >
> > __mmput()
> > ksm_exit()
> > __ksm_exit()
> > spin_lock(&ksm_mmlist_lock);
> > mm_slot is set
> > spin_unlock(&ksm_mmlist_lock)
> > mm_slot == ksm_scan.mm_slot
> > mmap_write_lock();
> > mmap_write_unlock();
> > return
> > exit_mmap()
> > ...
> > mmap_write_lock();
> > __mt_destory()
> > Free all maple tree nodes
> > mt->flags = 0;
> > mmap_write_unlock();
> > ...
> >
> > mmap_read_lock()
> > for_each_vma()
> > lockdep checks *internal* spinlock
> >
> >
> > This was fine before the change as the previous for loop would not have
> > checked the locking and would have hit the ksm_test_exit() test before
> > any problem arose.
> >
> > Now we are getting a lockdep warning because the maple tree flag for the
> > external lock is cleared.
> >
> > How about this as the start to the commit message:
> >
> > The VMA iterator may trigger a lockdep warning if the mm is in the
> > process of being cleaned up before obtaining the mmap_read_lock().
>
> Maybe something like the following (matches my understanding, as an
> inspiration):
>
> "
> exit_mmap() will tear down the VMAs (maple tree) with the mmap_lock held in
> write mode. Once we take the mmap_lock in read mode in
> unmerge_and_remove_all_rmap_items(), we are protected against such
> concurrent teardown, however, the teardown might already have happened just
> the way KSM slot registration machinery works.
>
> Without the VMA iterator, we didn't care. But with the VMA iterator, lockdep
> will now complain when stumbling over a the destroyed maple tree.
>
> Let's check for the teardown by relying on ksm_test_exit() earlier, before
> working on a torn down maple tree.
> "

I'll give it a shot.

>
> >
> > >
> > > >
> > > > Fix the race by checking if this mm has been removed before iterating
> > > > the VMAs. __ksm_exit() uses the mmap lock to synchronize the freeing of
> > > > an mm, so it is safe to keep iterating over the VMAs when it is going to
> > > > be freed.
> > > >
> > > > This change will slow down the mm exit during the race condition, but
> > > > will speed up the non-race scenarios iteration over the VMA list, which
> > > > should be much more common.
> > >
> > > Would leaving the existing check in help to just stop scanning faster in
> > > that case?
> >
> > Yes. But why? We would stop the scanning faster in the race condition
> > case, but slow the normal case down.
> >
> > This check was here to ensure that the mm isn't being torn down while
> > it's iterating over the loop. Hugh (Cc'ed) added this in 2009, but the
> > fundamental problem he specifies in his commit message in 9ba692948008
> > ("ksm: fix oom deadlock") is that exit_mmap() does not take the
> > mmap_lock() - which is no longer the case. We are safe to iterate the
> > VMAs with the mmap_read_lock() as the mmap_write_lock() is taken during
> > tear down of the VMA tree today.
> >
>
> Right. I just spotted that we have a ksm_test_exit() already in
> unmerge_ksm_pages(), so that should be sufficient to make us stop scanning
> in case ksm_exit() is waiting for the mmap lock.

Yeah, I don't think it's necessary in this case but that function is
used elsewhere.

>
>
> Adding a comment summarizing why that's required before iterating would be
> nice. Like
>
> /* Exit right away if the maple tree might have been torn down. */

Ack.

>
>
> With a better description, feel free to add
>
> Acked-by: David Hildenbrand <[email protected]>

Will do. I'll give Hugh some time to look at this before sending out a
v2.

Thanks,
Liam