2013-04-16 07:03:38

by Xiao Guangrong

[permalink] [raw]
Subject: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
and multiple ->release()) breaks the fix:
3ad3d901bbcfb15a5e4690e55350db0899095a68
(mm: mmu_notifier: fix freed page still mapped in secondary MMU)

This patch reverts the commit and simply fix the bug spotted
by that patch

This bug is spotted by commit 751efd8610d3:
======
There is a race condition between mmu_notifier_unregister() and
__mmu_notifier_release().

Assume two tasks, one calling mmu_notifier_unregister() as a result of a
filp_close() ->flush() callout (task A), and the other calling
mmu_notifier_release() from an mmput() (task B).

A B
t1 srcu_read_lock()
t2 if (!hlist_unhashed())
t3 srcu_read_unlock()
t4 srcu_read_lock()
t5 hlist_del_init_rcu()
t6 synchronize_srcu()
t7 srcu_read_unlock()
t8 hlist_del_rcu() <--- NULL pointer deref.
======

This can be fixed by using hlist_del_init_rcu instead of hlist_del_rcu.

The another issue spotted in the commit is
"multiple ->release() callouts", we needn't care it too much because
it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
after exit_mmap()) and the later call of multiple ->release should be
fast since all the pages have already been released by the first call.

Signed-off-by: Xiao Guangrong <[email protected]>
---
mm/mmu_notifier.c | 81 +++++++++++++++++++++++++++--------------------------
1 files changed, 41 insertions(+), 40 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index be04122..606777a 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -40,48 +40,45 @@ void __mmu_notifier_release(struct mm_struct *mm)
int id;

/*
- * srcu_read_lock() here will block synchronize_srcu() in
- * mmu_notifier_unregister() until all registered
- * ->release() callouts this function makes have
- * returned.
+ * SRCU here will block mmu_notifier_unregister until
+ * ->release returns.
*/
id = srcu_read_lock(&srcu);
+ hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
+ /*
+ * if ->release runs before mmu_notifier_unregister it
+ * must be handled as it's the only way for the driver
+ * to flush all existing sptes and stop the driver
+ * from establishing any more sptes before all the
+ * pages in the mm are freed.
+ */
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ srcu_read_unlock(&srcu, id);
+
spin_lock(&mm->mmu_notifier_mm->lock);
while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
mn = hlist_entry(mm->mmu_notifier_mm->list.first,
struct mmu_notifier,
hlist);
-
/*
- * Unlink. This will prevent mmu_notifier_unregister()
- * from also making the ->release() callout.
+ * We arrived before mmu_notifier_unregister so
+ * mmu_notifier_unregister will do nothing other than
+ * to wait ->release to finish and
+ * mmu_notifier_unregister to return.
*/
hlist_del_init_rcu(&mn->hlist);
- spin_unlock(&mm->mmu_notifier_mm->lock);
-
- /*
- * Clear sptes. (see 'release' description in mmu_notifier.h)
- */
- if (mn->ops->release)
- mn->ops->release(mn, mm);
-
- spin_lock(&mm->mmu_notifier_mm->lock);
}
spin_unlock(&mm->mmu_notifier_mm->lock);

/*
- * All callouts to ->release() which we have done are complete.
- * Allow synchronize_srcu() in mmu_notifier_unregister() to complete
- */
- srcu_read_unlock(&srcu, id);
-
- /*
- * mmu_notifier_unregister() may have unlinked a notifier and may
- * still be calling out to it. Additionally, other notifiers
- * may have been active via vmtruncate() et. al. Block here
- * to ensure that all notifier callouts for this mm have been
- * completed and the sptes are really cleaned up before returning
- * to exit_mmap().
+ * synchronize_srcu here prevents mmu_notifier_release to
+ * return to exit_mmap (which would proceed freeing all pages
+ * in the mm) until the ->release method returns, if it was
+ * invoked by mmu_notifier_unregister.
+ *
+ * The mmu_notifier_mm can't go away from under us because one
+ * mm_count is hold by exit_mmap.
*/
synchronize_srcu(&srcu);
}
@@ -292,31 +289,35 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
BUG_ON(atomic_read(&mm->mm_count) <= 0);

- spin_lock(&mm->mmu_notifier_mm->lock);
if (!hlist_unhashed(&mn->hlist)) {
+ /*
+ * SRCU here will force exit_mmap to wait ->release to finish
+ * before freeing the pages.
+ */
int id;

+ id = srcu_read_lock(&srcu);
/*
- * Ensure we synchronize up with __mmu_notifier_release().
+ * exit_mmap will block in mmu_notifier_release to
+ * guarantee ->release is called before freeing the
+ * pages.
*/
- id = srcu_read_lock(&srcu);
-
- hlist_del_rcu(&mn->hlist);
- spin_unlock(&mm->mmu_notifier_mm->lock);
-
if (mn->ops->release)
mn->ops->release(mn, mm);
+ srcu_read_unlock(&srcu, id);

+ spin_lock(&mm->mmu_notifier_mm->lock);
/*
- * Allow __mmu_notifier_release() to complete.
+ * Can not use list_del_rcu() since __mmu_notifier_release
+ * can delete it before we hold the lock.
*/
- srcu_read_unlock(&srcu, id);
- } else
+ hlist_del_init_rcu(&mn->hlist);
spin_unlock(&mm->mmu_notifier_mm->lock);
+ }

/*
- * Wait for any running method to finish, including ->release() if it
- * was run by __mmu_notifier_release() instead of us.
+ * Wait any running method to finish, of course including
+ * ->release if it was run by mmu_notifier_relase instead of us.
*/
synchronize_srcu(&srcu);

--
1.7.7.6


2013-04-16 09:31:34

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
> The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
> and multiple ->release()) breaks the fix:
> 3ad3d901bbcfb15a5e4690e55350db0899095a68
> (mm: mmu_notifier: fix freed page still mapped in secondary MMU)

Can you describe how the page is still mapped? I thought I had all
cases covered. Whichever call hits first, I thought we had one callout
to the registered notifiers. Are you saying we need multiple callouts?

Also, shouldn't you be asking for a revert commit and then supply a
subsequent commit for the real fix? I thought that was the process for
doing a revert.

Thanks,
Robin Holt

>
> This patch reverts the commit and simply fix the bug spotted
> by that patch
>
> This bug is spotted by commit 751efd8610d3:
> ======
> There is a race condition between mmu_notifier_unregister() and
> __mmu_notifier_release().
>
> Assume two tasks, one calling mmu_notifier_unregister() as a result of a
> filp_close() ->flush() callout (task A), and the other calling
> mmu_notifier_release() from an mmput() (task B).
>
> A B
> t1 srcu_read_lock()
> t2 if (!hlist_unhashed())
> t3 srcu_read_unlock()
> t4 srcu_read_lock()
> t5 hlist_del_init_rcu()
> t6 synchronize_srcu()
> t7 srcu_read_unlock()
> t8 hlist_del_rcu() <--- NULL pointer deref.
> ======
>
> This can be fixed by using hlist_del_init_rcu instead of hlist_del_rcu.
>
> The another issue spotted in the commit is
> "multiple ->release() callouts", we needn't care it too much because
> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
> after exit_mmap()) and the later call of multiple ->release should be
> fast since all the pages have already been released by the first call.
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> mm/mmu_notifier.c | 81 +++++++++++++++++++++++++++--------------------------
> 1 files changed, 41 insertions(+), 40 deletions(-)
>
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index be04122..606777a 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -40,48 +40,45 @@ void __mmu_notifier_release(struct mm_struct *mm)
> int id;
>
> /*
> - * srcu_read_lock() here will block synchronize_srcu() in
> - * mmu_notifier_unregister() until all registered
> - * ->release() callouts this function makes have
> - * returned.
> + * SRCU here will block mmu_notifier_unregister until
> + * ->release returns.
> */
> id = srcu_read_lock(&srcu);
> + hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> + /*
> + * if ->release runs before mmu_notifier_unregister it
> + * must be handled as it's the only way for the driver
> + * to flush all existing sptes and stop the driver
> + * from establishing any more sptes before all the
> + * pages in the mm are freed.
> + */
> + if (mn->ops->release)
> + mn->ops->release(mn, mm);
> + srcu_read_unlock(&srcu, id);
> +
> spin_lock(&mm->mmu_notifier_mm->lock);
> while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> struct mmu_notifier,
> hlist);
> -
> /*
> - * Unlink. This will prevent mmu_notifier_unregister()
> - * from also making the ->release() callout.
> + * We arrived before mmu_notifier_unregister so
> + * mmu_notifier_unregister will do nothing other than
> + * to wait ->release to finish and
> + * mmu_notifier_unregister to return.
> */
> hlist_del_init_rcu(&mn->hlist);
> - spin_unlock(&mm->mmu_notifier_mm->lock);
> -
> - /*
> - * Clear sptes. (see 'release' description in mmu_notifier.h)
> - */
> - if (mn->ops->release)
> - mn->ops->release(mn, mm);
> -
> - spin_lock(&mm->mmu_notifier_mm->lock);
> }
> spin_unlock(&mm->mmu_notifier_mm->lock);
>
> /*
> - * All callouts to ->release() which we have done are complete.
> - * Allow synchronize_srcu() in mmu_notifier_unregister() to complete
> - */
> - srcu_read_unlock(&srcu, id);
> -
> - /*
> - * mmu_notifier_unregister() may have unlinked a notifier and may
> - * still be calling out to it. Additionally, other notifiers
> - * may have been active via vmtruncate() et. al. Block here
> - * to ensure that all notifier callouts for this mm have been
> - * completed and the sptes are really cleaned up before returning
> - * to exit_mmap().
> + * synchronize_srcu here prevents mmu_notifier_release to
> + * return to exit_mmap (which would proceed freeing all pages
> + * in the mm) until the ->release method returns, if it was
> + * invoked by mmu_notifier_unregister.
> + *
> + * The mmu_notifier_mm can't go away from under us because one
> + * mm_count is hold by exit_mmap.
> */
> synchronize_srcu(&srcu);
> }
> @@ -292,31 +289,35 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> {
> BUG_ON(atomic_read(&mm->mm_count) <= 0);
>
> - spin_lock(&mm->mmu_notifier_mm->lock);
> if (!hlist_unhashed(&mn->hlist)) {
> + /*
> + * SRCU here will force exit_mmap to wait ->release to finish
> + * before freeing the pages.
> + */
> int id;
>
> + id = srcu_read_lock(&srcu);
> /*
> - * Ensure we synchronize up with __mmu_notifier_release().
> + * exit_mmap will block in mmu_notifier_release to
> + * guarantee ->release is called before freeing the
> + * pages.
> */
> - id = srcu_read_lock(&srcu);
> -
> - hlist_del_rcu(&mn->hlist);
> - spin_unlock(&mm->mmu_notifier_mm->lock);
> -
> if (mn->ops->release)
> mn->ops->release(mn, mm);
> + srcu_read_unlock(&srcu, id);
>
> + spin_lock(&mm->mmu_notifier_mm->lock);
> /*
> - * Allow __mmu_notifier_release() to complete.
> + * Can not use list_del_rcu() since __mmu_notifier_release
> + * can delete it before we hold the lock.
> */
> - srcu_read_unlock(&srcu, id);
> - } else
> + hlist_del_init_rcu(&mn->hlist);
> spin_unlock(&mm->mmu_notifier_mm->lock);
> + }
>
> /*
> - * Wait for any running method to finish, including ->release() if it
> - * was run by __mmu_notifier_release() instead of us.
> + * Wait any running method to finish, of course including
> + * ->release if it was run by mmu_notifier_relase instead of us.
> */
> synchronize_srcu(&srcu);
>
> --
> 1.7.7.6

2013-04-16 10:37:13

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On 04/16/2013 05:31 PM, Robin Holt wrote:
> On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
>> The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
>> and multiple ->release()) breaks the fix:
>> 3ad3d901bbcfb15a5e4690e55350db0899095a68
>> (mm: mmu_notifier: fix freed page still mapped in secondary MMU)
>
> Can you describe how the page is still mapped? I thought I had all
> cases covered. Whichever call hits first, I thought we had one callout
> to the registered notifiers. Are you saying we need multiple callouts?

No.

You patch did this:

hlist_del_init_rcu(&mn->hlist); 1 <======
+ spin_unlock(&mm->mmu_notifier_mm->lock);
+
+ /*
+ * Clear sptes. (see 'release' description in mmu_notifier.h)
+ */
+ if (mn->ops->release)
+ mn->ops->release(mn, mm); 2 <======
+
+ spin_lock(&mm->mmu_notifier_mm->lock);

At point 1, you delete the notify, but the page is still on LRU. Other
cpu can reclaim the page but without call ->invalid_page().

At point 2, you call ->release(), the secondary MMU make page Accessed/Dirty
but that page has already been on the free-list of page-alloctor.

>
> Also, shouldn't you be asking for a revert commit and then supply a
> subsequent commit for the real fix? I thought that was the process for
> doing a revert.

Can not do that pure reversion since your patch moved hlist_for_each_entry_rcu
which has been modified now.

Should i do pure-eversion + hlist_for_each_entry_rcu update first?

2013-04-16 11:25:56

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On Tue, Apr 16, 2013 at 06:26:36PM +0800, Xiao Guangrong wrote:
> On 04/16/2013 05:31 PM, Robin Holt wrote:
> > On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
> >> The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
> >> and multiple ->release()) breaks the fix:
> >> 3ad3d901bbcfb15a5e4690e55350db0899095a68
> >> (mm: mmu_notifier: fix freed page still mapped in secondary MMU)
> >
> > Can you describe how the page is still mapped? I thought I had all
> > cases covered. Whichever call hits first, I thought we had one callout
> > to the registered notifiers. Are you saying we need multiple callouts?
>
> No.
>
> You patch did this:
>
> hlist_del_init_rcu(&mn->hlist); 1 <======
> + spin_unlock(&mm->mmu_notifier_mm->lock);
> +
> + /*
> + * Clear sptes. (see 'release' description in mmu_notifier.h)
> + */
> + if (mn->ops->release)
> + mn->ops->release(mn, mm); 2 <======
> +
> + spin_lock(&mm->mmu_notifier_mm->lock);
>
> At point 1, you delete the notify, but the page is still on LRU. Other
> cpu can reclaim the page but without call ->invalid_page().
>
> At point 2, you call ->release(), the secondary MMU make page Accessed/Dirty
> but that page has already been on the free-list of page-alloctor.

That expectation on srcu _REALLY_ needs to be documented better.
Maybe I missed it in the comments, but there is an expectation beyond
the synchronize_srcu(). This code has been extremely poorly described
and I think it is time we fix that up.

I do see that in comments for mmu_notifier_unregister, there is an
expectation upon already having all the spte's removed prior to making
this call. I think that is also a stale comment as it mentions a lock
which I am not sure ever really existed.

> > Also, shouldn't you be asking for a revert commit and then supply a
> > subsequent commit for the real fix? I thought that was the process for
> > doing a revert.
>
> Can not do that pure reversion since your patch moved hlist_for_each_entry_rcu
> which has been modified now.
>
> Should i do pure-eversion + hlist_for_each_entry_rcu update first?

Let's not go off without considering this first.

It looks like what we really need to do is ensure there is a method
for ensuring that the mmu_notifier remains on the list while callouts
invalidate_page() callouts are being made and also a means of ensuring
that only one ->release() callout is made.

First, is it the case that when kvm calls mmu_notifier_unregister(),
it has already cleared the spte's? (what does spte stand for anyway)?
If so, then we really need to close the hole in __mmu_notifier_release().
I think we would need to modify code in both _unregister and _release,
but the issue is really _release.


I need to get ready and drive into work. If you want to float something
out there, that is fine. Otherwise, I will try to work something up
when I get to the office.

Thanks,
Robin

2013-04-16 11:43:25

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

Argh. Taking a step back helped clear my head.

For the -stable releases, I agree we should just go with your
revert-plus-hlist_del_init_rcu patch. I will give it a test
when I am in the office.

For the v3.10 release, we should work on making this more
correct and completely documented.

Robin

On Tue, Apr 16, 2013 at 06:25:53AM -0500, Robin Holt wrote:
> On Tue, Apr 16, 2013 at 06:26:36PM +0800, Xiao Guangrong wrote:
> > On 04/16/2013 05:31 PM, Robin Holt wrote:
> > > On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
> > >> The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
> > >> and multiple ->release()) breaks the fix:
> > >> 3ad3d901bbcfb15a5e4690e55350db0899095a68
> > >> (mm: mmu_notifier: fix freed page still mapped in secondary MMU)
> > >
> > > Can you describe how the page is still mapped? I thought I had all
> > > cases covered. Whichever call hits first, I thought we had one callout
> > > to the registered notifiers. Are you saying we need multiple callouts?
> >
> > No.
> >
> > You patch did this:
> >
> > hlist_del_init_rcu(&mn->hlist); 1 <======
> > + spin_unlock(&mm->mmu_notifier_mm->lock);
> > +
> > + /*
> > + * Clear sptes. (see 'release' description in mmu_notifier.h)
> > + */
> > + if (mn->ops->release)
> > + mn->ops->release(mn, mm); 2 <======
> > +
> > + spin_lock(&mm->mmu_notifier_mm->lock);
> >
> > At point 1, you delete the notify, but the page is still on LRU. Other
> > cpu can reclaim the page but without call ->invalid_page().
> >
> > At point 2, you call ->release(), the secondary MMU make page Accessed/Dirty
> > but that page has already been on the free-list of page-alloctor.
>
> That expectation on srcu _REALLY_ needs to be documented better.
> Maybe I missed it in the comments, but there is an expectation beyond
> the synchronize_srcu(). This code has been extremely poorly described
> and I think it is time we fix that up.
>
> I do see that in comments for mmu_notifier_unregister, there is an
> expectation upon already having all the spte's removed prior to making
> this call. I think that is also a stale comment as it mentions a lock
> which I am not sure ever really existed.
>
> > > Also, shouldn't you be asking for a revert commit and then supply a
> > > subsequent commit for the real fix? I thought that was the process for
> > > doing a revert.
> >
> > Can not do that pure reversion since your patch moved hlist_for_each_entry_rcu
> > which has been modified now.
> >
> > Should i do pure-eversion + hlist_for_each_entry_rcu update first?
>
> Let's not go off without considering this first.
>
> It looks like what we really need to do is ensure there is a method
> for ensuring that the mmu_notifier remains on the list while callouts
> invalidate_page() callouts are being made and also a means of ensuring
> that only one ->release() callout is made.
>
> First, is it the case that when kvm calls mmu_notifier_unregister(),
> it has already cleared the spte's? (what does spte stand for anyway)?
> If so, then we really need to close the hole in __mmu_notifier_release().
> I think we would need to modify code in both _unregister and _release,
> but the issue is really _release.
>
>
> I need to get ready and drive into work. If you want to float something
> out there, that is fine. Otherwise, I will try to work something up
> when I get to the office.
>
> Thanks,
> Robin

2013-04-16 13:07:36

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On 04/16/2013 07:43 PM, Robin Holt wrote:
> Argh. Taking a step back helped clear my head.
>
> For the -stable releases, I agree we should just go with your
> revert-plus-hlist_del_init_rcu patch. I will give it a test
> when I am in the office.

Okay. Wait for your test report. Thank you in advance.

>
> For the v3.10 release, we should work on making this more
> correct and completely documented.

Better document is always welcomed.

Double call ->release is not bad, like i mentioned it in the changelog:

it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
after exit_mmap()) and the later call of multiple ->release should be
fast since all the pages have already been released by the first call.

But, of course, it's great if you have a _light_ way to avoid this.

2013-04-16 18:08:38

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On Tue, Apr 16, 2013 at 09:07:20PM +0800, Xiao Guangrong wrote:
> On 04/16/2013 07:43 PM, Robin Holt wrote:
> > Argh. Taking a step back helped clear my head.
> >
> > For the -stable releases, I agree we should just go with your
> > revert-plus-hlist_del_init_rcu patch. I will give it a test
> > when I am in the office.
>
> Okay. Wait for your test report. Thank you in advance.
>
> >
> > For the v3.10 release, we should work on making this more
> > correct and completely documented.
>
> Better document is always welcomed.
>
> Double call ->release is not bad, like i mentioned it in the changelog:
>
> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
> after exit_mmap()) and the later call of multiple ->release should be
> fast since all the pages have already been released by the first call.
>
> But, of course, it's great if you have a _light_ way to avoid this.

Getting my test environment set back up took longer than I would have liked.

Your patch passed. I got no NULL-pointer derefs.

How would you feel about adding the following to your patch?

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..ff2fd5f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -157,6 +157,7 @@ struct mmu_notifier_ops {
struct mmu_notifier {
struct hlist_node hlist;
const struct mmu_notifier_ops *ops;
+ int released;
};

static inline int mm_has_notifiers(struct mm_struct *mm)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 606777a..949704b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -44,7 +44,8 @@ void __mmu_notifier_release(struct mm_struct *mm)
* ->release returns.
*/
id = srcu_read_lock(&srcu);
- hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
+ hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+ int released;
/*
* if ->release runs before mmu_notifier_unregister it
* must be handled as it's the only way for the driver
@@ -52,8 +53,10 @@ void __mmu_notifier_release(struct mm_struct *mm)
* from establishing any more sptes before all the
* pages in the mm are freed.
*/
- if (mn->ops->release)
+ released = xchg(&mn->released, 1);
+ if (mn->ops->release && !released)
mn->ops->release(mn, mm);
+ }
srcu_read_unlock(&srcu, id);

spin_lock(&mm->mmu_notifier_mm->lock);
@@ -214,6 +217,7 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
mm->mmu_notifier_mm = mmu_notifier_mm;
mmu_notifier_mm = NULL;
}
+ mn->released = 0;
atomic_inc(&mm->mm_count);

/*
@@ -295,6 +299,7 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
* before freeing the pages.
*/
int id;
+ int released;

id = srcu_read_lock(&srcu);
/*
@@ -302,7 +307,8 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
* guarantee ->release is called before freeing the
* pages.
*/
- if (mn->ops->release)
+ released = xchg(&mn->released, 1);
+ if (mn->ops->release && !released)
mn->ops->release(mn, mm);
srcu_read_unlock(&srcu, id);

2013-04-17 02:56:14

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On 04/17/2013 02:08 AM, Robin Holt wrote:
> On Tue, Apr 16, 2013 at 09:07:20PM +0800, Xiao Guangrong wrote:
>> On 04/16/2013 07:43 PM, Robin Holt wrote:
>>> Argh. Taking a step back helped clear my head.
>>>
>>> For the -stable releases, I agree we should just go with your
>>> revert-plus-hlist_del_init_rcu patch. I will give it a test
>>> when I am in the office.
>>
>> Okay. Wait for your test report. Thank you in advance.
>>
>>>
>>> For the v3.10 release, we should work on making this more
>>> correct and completely documented.
>>
>> Better document is always welcomed.
>>
>> Double call ->release is not bad, like i mentioned it in the changelog:
>>
>> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
>> after exit_mmap()) and the later call of multiple ->release should be
>> fast since all the pages have already been released by the first call.
>>
>> But, of course, it's great if you have a _light_ way to avoid this.
>
> Getting my test environment set back up took longer than I would have liked.
>
> Your patch passed. I got no NULL-pointer derefs.

Thanks for your test again.

>
> How would you feel about adding the following to your patch?

I prefer to make these changes as a separate patch, this change is the
improvement, please do not mix it with bugfix.

You can make a patchset (comments improvement and this change) based on
my fix.

2013-04-17 14:10:40

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On Wed, Apr 17, 2013 at 10:55:26AM +0800, Xiao Guangrong wrote:
> On 04/17/2013 02:08 AM, Robin Holt wrote:
> > On Tue, Apr 16, 2013 at 09:07:20PM +0800, Xiao Guangrong wrote:
> >> On 04/16/2013 07:43 PM, Robin Holt wrote:
> >>> Argh. Taking a step back helped clear my head.
> >>>
> >>> For the -stable releases, I agree we should just go with your
> >>> revert-plus-hlist_del_init_rcu patch. I will give it a test
> >>> when I am in the office.
> >>
> >> Okay. Wait for your test report. Thank you in advance.
> >>
> >>>
> >>> For the v3.10 release, we should work on making this more
> >>> correct and completely documented.
> >>
> >> Better document is always welcomed.
> >>
> >> Double call ->release is not bad, like i mentioned it in the changelog:
> >>
> >> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
> >> after exit_mmap()) and the later call of multiple ->release should be
> >> fast since all the pages have already been released by the first call.
> >>
> >> But, of course, it's great if you have a _light_ way to avoid this.
> >
> > Getting my test environment set back up took longer than I would have liked.
> >
> > Your patch passed. I got no NULL-pointer derefs.
>
> Thanks for your test again.
>
> >
> > How would you feel about adding the following to your patch?
>
> I prefer to make these changes as a separate patch, this change is the
> improvement, please do not mix it with bugfix.

I think your "improvement" classification is a bit deceiving. My previous
patch fixed the bug in calling release multiple times. Your patch without
this will reintroduce that buggy behavior. Just because the bug is already
worked around by KVM does not mean it is not a bug.

Robin

2013-04-17 18:41:45

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On 04/17/2013 10:10 PM, Robin Holt wrote:
> On Wed, Apr 17, 2013 at 10:55:26AM +0800, Xiao Guangrong wrote:
>> On 04/17/2013 02:08 AM, Robin Holt wrote:
>>> On Tue, Apr 16, 2013 at 09:07:20PM +0800, Xiao Guangrong wrote:
>>>> On 04/16/2013 07:43 PM, Robin Holt wrote:
>>>>> Argh. Taking a step back helped clear my head.
>>>>>
>>>>> For the -stable releases, I agree we should just go with your
>>>>> revert-plus-hlist_del_init_rcu patch. I will give it a test
>>>>> when I am in the office.
>>>>
>>>> Okay. Wait for your test report. Thank you in advance.
>>>>
>>>>>
>>>>> For the v3.10 release, we should work on making this more
>>>>> correct and completely documented.
>>>>
>>>> Better document is always welcomed.
>>>>
>>>> Double call ->release is not bad, like i mentioned it in the changelog:
>>>>
>>>> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
>>>> after exit_mmap()) and the later call of multiple ->release should be
>>>> fast since all the pages have already been released by the first call.
>>>>
>>>> But, of course, it's great if you have a _light_ way to avoid this.
>>>
>>> Getting my test environment set back up took longer than I would have liked.
>>>
>>> Your patch passed. I got no NULL-pointer derefs.
>>
>> Thanks for your test again.
>>
>>>
>>> How would you feel about adding the following to your patch?
>>
>> I prefer to make these changes as a separate patch, this change is the
>> improvement, please do not mix it with bugfix.
>
> I think your "improvement" classification is a bit deceiving. My previous
> patch fixed the bug in calling release multiple times. Your patch without
> this will reintroduce that buggy behavior. Just because the bug is already
> worked around by KVM does not mean it is not a bug.

As your tested, calling ->release() multiple times can work, but just make your
testcase more _slower_. So your changes is trying to speed it up - it is a
improvement.

Well, _if_ it is really a bug, could you please do not fix two bugs in one patch?

Thanks!

2013-04-17 18:45:27

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On Thu, Apr 18, 2013 at 02:41:31AM +0800, Xiao Guangrong wrote:
> On 04/17/2013 10:10 PM, Robin Holt wrote:
> > On Wed, Apr 17, 2013 at 10:55:26AM +0800, Xiao Guangrong wrote:
> >> On 04/17/2013 02:08 AM, Robin Holt wrote:
> >>> On Tue, Apr 16, 2013 at 09:07:20PM +0800, Xiao Guangrong wrote:
> >>>> On 04/16/2013 07:43 PM, Robin Holt wrote:
> >>>>> Argh. Taking a step back helped clear my head.
> >>>>>
> >>>>> For the -stable releases, I agree we should just go with your
> >>>>> revert-plus-hlist_del_init_rcu patch. I will give it a test
> >>>>> when I am in the office.
> >>>>
> >>>> Okay. Wait for your test report. Thank you in advance.
> >>>>
> >>>>>
> >>>>> For the v3.10 release, we should work on making this more
> >>>>> correct and completely documented.
> >>>>
> >>>> Better document is always welcomed.
> >>>>
> >>>> Double call ->release is not bad, like i mentioned it in the changelog:
> >>>>
> >>>> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
> >>>> after exit_mmap()) and the later call of multiple ->release should be
> >>>> fast since all the pages have already been released by the first call.
> >>>>
> >>>> But, of course, it's great if you have a _light_ way to avoid this.
> >>>
> >>> Getting my test environment set back up took longer than I would have liked.
> >>>
> >>> Your patch passed. I got no NULL-pointer derefs.
> >>
> >> Thanks for your test again.
> >>
> >>>
> >>> How would you feel about adding the following to your patch?
> >>
> >> I prefer to make these changes as a separate patch, this change is the
> >> improvement, please do not mix it with bugfix.
> >
> > I think your "improvement" classification is a bit deceiving. My previous
> > patch fixed the bug in calling release multiple times. Your patch without
> > this will reintroduce that buggy behavior. Just because the bug is already
> > worked around by KVM does not mean it is not a bug.
>
> As your tested, calling ->release() multiple times can work, but just make your
> testcase more _slower_. So your changes is trying to speed it up - it is a
> improvement.
>
> Well, _if_ it is really a bug, could you please do not fix two bugs in one patch?

The code, as is, does not call ->release() multiple times. Your code
changes the behavior to call it multiple times. You are introducing the
bug by your code changes. Why not fix the bug you create in the patch
which creates it?

Robin

2013-04-17 18:52:42

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On 04/18/2013 02:45 AM, Robin Holt wrote:

>>>>>>> For the v3.10 release, we should work on making this more
>>>>>>> correct and completely documented.
>>>>>>
>>>>>> Better document is always welcomed.
>>>>>>
>>>>>> Double call ->release is not bad, like i mentioned it in the changelog:
>>>>>>
>>>>>> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
>>>>>> after exit_mmap()) and the later call of multiple ->release should be
>>>>>> fast since all the pages have already been released by the first call.
>>>>>>
>>>>>> But, of course, it's great if you have a _light_ way to avoid this.
>>>>>
>>>>> Getting my test environment set back up took longer than I would have liked.
>>>>>
>>>>> Your patch passed. I got no NULL-pointer derefs.
>>>>
>>>> Thanks for your test again.
>>>>
>>>>>
>>>>> How would you feel about adding the following to your patch?
>>>>
>>>> I prefer to make these changes as a separate patch, this change is the
>>>> improvement, please do not mix it with bugfix.
>>>
>>> I think your "improvement" classification is a bit deceiving. My previous
>>> patch fixed the bug in calling release multiple times. Your patch without
>>> this will reintroduce that buggy behavior. Just because the bug is already
>>> worked around by KVM does not mean it is not a bug.
>>
>> As your tested, calling ->release() multiple times can work, but just make your
>> testcase more _slower_. So your changes is trying to speed it up - it is a
>> improvement.
>>
>> Well, _if_ it is really a bug, could you please do not fix two bugs in one patch?
>
> The code, as is, does not call ->release() multiple times. Your code
> changes the behavior to call it multiple times. You are introducing the
> bug by your code changes. Why not fix the bug you create in the patch
> which creates it?

Andrew, your thought?

2013-04-17 23:39:09

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

Hi Robin,
On 04/16/2013 05:31 PM, Robin Holt wrote:
> On Tue, Apr 16, 2013 at 02:39:49PM +0800, Xiao Guangrong wrote:
>> The commit 751efd8610d3 (mmu_notifier_unregister NULL Pointer deref
>> and multiple ->release()) breaks the fix:
>> 3ad3d901bbcfb15a5e4690e55350db0899095a68
>> (mm: mmu_notifier: fix freed page still mapped in secondary MMU)
> Can you describe how the page is still mapped? I thought I had all
> cases covered. Whichever call hits first, I thought we had one callout
> to the registered notifiers. Are you saying we need multiple callouts?
>
> Also, shouldn't you be asking for a revert commit and then supply a
> subsequent commit for the real fix? I thought that was the process for
> doing a revert.

mmu_notifier is used for sync normal pte and spte, correct?

>
> Thanks,
> Robin Holt
>
>> This patch reverts the commit and simply fix the bug spotted
>> by that patch
>>
>> This bug is spotted by commit 751efd8610d3:
>> ======
>> There is a race condition between mmu_notifier_unregister() and
>> __mmu_notifier_release().
>>
>> Assume two tasks, one calling mmu_notifier_unregister() as a result of a
>> filp_close() ->flush() callout (task A), and the other calling
>> mmu_notifier_release() from an mmput() (task B).
>>
>> A B
>> t1 srcu_read_lock()
>> t2 if (!hlist_unhashed())
>> t3 srcu_read_unlock()
>> t4 srcu_read_lock()
>> t5 hlist_del_init_rcu()
>> t6 synchronize_srcu()
>> t7 srcu_read_unlock()
>> t8 hlist_del_rcu() <--- NULL pointer deref.
>> ======
>>
>> This can be fixed by using hlist_del_init_rcu instead of hlist_del_rcu.
>>
>> The another issue spotted in the commit is
>> "multiple ->release() callouts", we needn't care it too much because
>> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
>> after exit_mmap()) and the later call of multiple ->release should be
>> fast since all the pages have already been released by the first call.
>>
>> Signed-off-by: Xiao Guangrong <[email protected]>
>> ---
>> mm/mmu_notifier.c | 81 +++++++++++++++++++++++++++--------------------------
>> 1 files changed, 41 insertions(+), 40 deletions(-)
>>
>> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
>> index be04122..606777a 100644
>> --- a/mm/mmu_notifier.c
>> +++ b/mm/mmu_notifier.c
>> @@ -40,48 +40,45 @@ void __mmu_notifier_release(struct mm_struct *mm)
>> int id;
>>
>> /*
>> - * srcu_read_lock() here will block synchronize_srcu() in
>> - * mmu_notifier_unregister() until all registered
>> - * ->release() callouts this function makes have
>> - * returned.
>> + * SRCU here will block mmu_notifier_unregister until
>> + * ->release returns.
>> */
>> id = srcu_read_lock(&srcu);
>> + hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
>> + /*
>> + * if ->release runs before mmu_notifier_unregister it
>> + * must be handled as it's the only way for the driver
>> + * to flush all existing sptes and stop the driver
>> + * from establishing any more sptes before all the
>> + * pages in the mm are freed.
>> + */
>> + if (mn->ops->release)
>> + mn->ops->release(mn, mm);
>> + srcu_read_unlock(&srcu, id);
>> +
>> spin_lock(&mm->mmu_notifier_mm->lock);
>> while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
>> mn = hlist_entry(mm->mmu_notifier_mm->list.first,
>> struct mmu_notifier,
>> hlist);
>> -
>> /*
>> - * Unlink. This will prevent mmu_notifier_unregister()
>> - * from also making the ->release() callout.
>> + * We arrived before mmu_notifier_unregister so
>> + * mmu_notifier_unregister will do nothing other than
>> + * to wait ->release to finish and
>> + * mmu_notifier_unregister to return.
>> */
>> hlist_del_init_rcu(&mn->hlist);
>> - spin_unlock(&mm->mmu_notifier_mm->lock);
>> -
>> - /*
>> - * Clear sptes. (see 'release' description in mmu_notifier.h)
>> - */
>> - if (mn->ops->release)
>> - mn->ops->release(mn, mm);
>> -
>> - spin_lock(&mm->mmu_notifier_mm->lock);
>> }
>> spin_unlock(&mm->mmu_notifier_mm->lock);
>>
>> /*
>> - * All callouts to ->release() which we have done are complete.
>> - * Allow synchronize_srcu() in mmu_notifier_unregister() to complete
>> - */
>> - srcu_read_unlock(&srcu, id);
>> -
>> - /*
>> - * mmu_notifier_unregister() may have unlinked a notifier and may
>> - * still be calling out to it. Additionally, other notifiers
>> - * may have been active via vmtruncate() et. al. Block here
>> - * to ensure that all notifier callouts for this mm have been
>> - * completed and the sptes are really cleaned up before returning
>> - * to exit_mmap().
>> + * synchronize_srcu here prevents mmu_notifier_release to
>> + * return to exit_mmap (which would proceed freeing all pages
>> + * in the mm) until the ->release method returns, if it was
>> + * invoked by mmu_notifier_unregister.
>> + *
>> + * The mmu_notifier_mm can't go away from under us because one
>> + * mm_count is hold by exit_mmap.
>> */
>> synchronize_srcu(&srcu);
>> }
>> @@ -292,31 +289,35 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
>> {
>> BUG_ON(atomic_read(&mm->mm_count) <= 0);
>>
>> - spin_lock(&mm->mmu_notifier_mm->lock);
>> if (!hlist_unhashed(&mn->hlist)) {
>> + /*
>> + * SRCU here will force exit_mmap to wait ->release to finish
>> + * before freeing the pages.
>> + */
>> int id;
>>
>> + id = srcu_read_lock(&srcu);
>> /*
>> - * Ensure we synchronize up with __mmu_notifier_release().
>> + * exit_mmap will block in mmu_notifier_release to
>> + * guarantee ->release is called before freeing the
>> + * pages.
>> */
>> - id = srcu_read_lock(&srcu);
>> -
>> - hlist_del_rcu(&mn->hlist);
>> - spin_unlock(&mm->mmu_notifier_mm->lock);
>> -
>> if (mn->ops->release)
>> mn->ops->release(mn, mm);
>> + srcu_read_unlock(&srcu, id);
>>
>> + spin_lock(&mm->mmu_notifier_mm->lock);
>> /*
>> - * Allow __mmu_notifier_release() to complete.
>> + * Can not use list_del_rcu() since __mmu_notifier_release
>> + * can delete it before we hold the lock.
>> */
>> - srcu_read_unlock(&srcu, id);
>> - } else
>> + hlist_del_init_rcu(&mn->hlist);
>> spin_unlock(&mm->mmu_notifier_mm->lock);
>> + }
>>
>> /*
>> - * Wait for any running method to finish, including ->release() if it
>> - * was run by __mmu_notifier_release() instead of us.
>> + * Wait any running method to finish, of course including
>> + * ->release if it was run by mmu_notifier_relase instead of us.
>> */
>> synchronize_srcu(&srcu);
>>
>> --
>> 1.7.7.6
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2013-05-01 09:03:51

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH] mm: mmu_notifier: re-fix freed page still mapped in secondary MMU

On Thu, Apr 18, 2013 at 02:52:31AM +0800, Xiao Guangrong wrote:
> On 04/18/2013 02:45 AM, Robin Holt wrote:
>
> >>>>>>> For the v3.10 release, we should work on making this more
> >>>>>>> correct and completely documented.
> >>>>>>
> >>>>>> Better document is always welcomed.
> >>>>>>
> >>>>>> Double call ->release is not bad, like i mentioned it in the changelog:
> >>>>>>
> >>>>>> it is really rare (e.g, can not happen on kvm since mmu-notify is unregistered
> >>>>>> after exit_mmap()) and the later call of multiple ->release should be
> >>>>>> fast since all the pages have already been released by the first call.
> >>>>>>
> >>>>>> But, of course, it's great if you have a _light_ way to avoid this.
> >>>>>
> >>>>> Getting my test environment set back up took longer than I would have liked.
> >>>>>
> >>>>> Your patch passed. I got no NULL-pointer derefs.
> >>>>
> >>>> Thanks for your test again.
> >>>>
> >>>>>
> >>>>> How would you feel about adding the following to your patch?
> >>>>
> >>>> I prefer to make these changes as a separate patch, this change is the
> >>>> improvement, please do not mix it with bugfix.
> >>>
> >>> I think your "improvement" classification is a bit deceiving. My previous
> >>> patch fixed the bug in calling release multiple times. Your patch without
> >>> this will reintroduce that buggy behavior. Just because the bug is already
> >>> worked around by KVM does not mean it is not a bug.
> >>
> >> As your tested, calling ->release() multiple times can work, but just make your
> >> testcase more _slower_. So your changes is trying to speed it up - it is a
> >> improvement.
> >>
> >> Well, _if_ it is really a bug, could you please do not fix two bugs in one patch?
> >
> > The code, as is, does not call ->release() multiple times. Your code
> > changes the behavior to call it multiple times. You are introducing the
> > bug by your code changes. Why not fix the bug you create in the patch
> > which creates it?
>
> Andrew, your thought?
>

What ever happened with this?

Robin