2019-01-28 12:16:43

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

While debugging some crashes related to virtio-balloon deflation that
happened under the old balloon migration code, I stumbled over a possible
race I think.

What we experienced:

drivers/virtio/virtio_balloon.c:release_pages_balloon():
- WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
- list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100

Turns out after having added the page to a local list when dequeuing,
the page would suddenly be moved to an LRU list before we would free it
via the local list, corrupting both lists. So a page we own and that is
!LRU was moved to an LRU list.

For us, this was triggered by backporting 195a8c43e93d8 ("virtio-balloon:
deflate via a page list") onto old balloon compaction code, but I think
this only made the BUG become visible and the race still exists in
new !LRU migraton code.

My theory:

In __unmap_and_move(), we lock the old and newpage and perform the
migration. In case of vitio-balloon, the new page will become
movable, the old page will no longer be movable.

However, after unlocking newpage, I think there is nothing stopping
the newpage from getting dequeued and freed by virtio-balloon. This
will result in the newpage
1. No longer having PageMovable()
2. Getting moved to the local list before finally freeing it (using
page->lru)

Back in the migration thread in __unmap_and_move(), we would after
unlocking the newpage suddenly no longer have PageMovable(newpage) and
will therefore call putback_lru_page(newpage), modifying page->lru
although that list is still in use by virtio-balloon.

To summarize, we have a race between migrating the newpage and checking
for PageMovable(newpage). Instead of checking __PageMovable(newpage), we
can simply rely on is_lru. Because if the old page was not an LRU page, the
new page also shouldn't be. If there is a problem with that, we could check
for __PageMovable(newpage) before unlocking the new page.

As I am not yet sure if this race actually exists, especially also
upstream, I am sending this as RFC.

Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dominik Brodowski <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vratislav Bendel <[email protected]>
Cc: Rafael Aquini <[email protected]>
Reported-by: Vratislav Bendel <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/migrate.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 4512afab46ac..31e002270b05 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1135,10 +1135,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
* If migration is successful, decrease refcount of the newpage
* which will not free the page because new page owner increased
* refcounter. As well, if it is LRU page, add the page to LRU
- * list in here.
+ * list in here. Don't rely on PageMovable(newpage), as that could
+ * already have changed after unlocking newpage (e.g.
+ * virtio-balloon deflation).
*/
if (rc == MIGRATEPAGE_SUCCESS) {
- if (unlikely(__PageMovable(newpage)))
+ if (unlikely(!is_lru))
put_page(newpage);
else
putback_lru_page(newpage);
--
2.17.2



2019-01-28 13:07:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
[...]
> My theory:
>
> In __unmap_and_move(), we lock the old and newpage and perform the
> migration. In case of vitio-balloon, the new page will become
> movable, the old page will no longer be movable.
>
> However, after unlocking newpage, I think there is nothing stopping
> the newpage from getting dequeued and freed by virtio-balloon. This
> will result in the newpage
> 1. No longer having PageMovable()
> 2. Getting moved to the local list before finally freeing it (using
> page->lru)

Does that mean that the virtio-balloon can change the Movable state
while there are other users of the page? Can you point to the code that
does it? How come this can be safe at all? Or is the PageMovable stable
only under the page lock?

--
Michal Hocko
SUSE Labs

2019-01-28 13:16:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On 28.01.19 14:07, Michal Hocko wrote:
> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
> [...]
>> My theory:
>>
>> In __unmap_and_move(), we lock the old and newpage and perform the
>> migration. In case of vitio-balloon, the new page will become
>> movable, the old page will no longer be movable.
>>
>> However, after unlocking newpage, I think there is nothing stopping
>> the newpage from getting dequeued and freed by virtio-balloon. This
>> will result in the newpage
>> 1. No longer having PageMovable()
>> 2. Getting moved to the local list before finally freeing it (using
>> page->lru)
>
> Does that mean that the virtio-balloon can change the Movable state
> while there are other users of the page? Can you point to the code that
> does it? How come this can be safe at all? Or is the PageMovable stable
> only under the page lock?
>

PageMovable is stable under the lock. The relevant instructions are in

mm/balloon_compaction.c and include/linux/balloon_compaction.h

balloon_page_insert() and balloon_page_delete() modify PageMovable and
are only called with both, the page locked and b_dev_info->pages_lock
locked.

Especially:
1. balloon_page_dequeue() which tries to find an unlocked page that is
not isolated. and deletes it via balloon_page_delete().

2. balloon_page_migrate() will effectively call balloon_page_delete() on
old page and balloon_page_insert() newpage. Both pages are locked before
calling balloon_page_migrate() and the b_dev_info->pages_lock is taken
internally.


So what's left is a very small race window when migrating to a new page,
after dropping the lock of the newpage.

--

Thanks,

David / dhildenb

2019-01-28 13:22:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
> On 28.01.19 14:07, Michal Hocko wrote:
> > On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
> > [...]
> >> My theory:
> >>
> >> In __unmap_and_move(), we lock the old and newpage and perform the
> >> migration. In case of vitio-balloon, the new page will become
> >> movable, the old page will no longer be movable.
> >>
> >> However, after unlocking newpage, I think there is nothing stopping
> >> the newpage from getting dequeued and freed by virtio-balloon. This
> >> will result in the newpage
> >> 1. No longer having PageMovable()
> >> 2. Getting moved to the local list before finally freeing it (using
> >> page->lru)
> >
> > Does that mean that the virtio-balloon can change the Movable state
> > while there are other users of the page? Can you point to the code that
> > does it? How come this can be safe at all? Or is the PageMovable stable
> > only under the page lock?
> >
>
> PageMovable is stable under the lock. The relevant instructions are in
>
> mm/balloon_compaction.c and include/linux/balloon_compaction.h

OK, I have just checked __ClearPageMovable and it indeed requires
PageLock. Then we also have to move is_lru = __PageMovable(page) after
the page lock.
--
Michal Hocko
SUSE Labs

2019-01-28 13:23:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On 28.01.19 14:21, Michal Hocko wrote:
> On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
>> On 28.01.19 14:07, Michal Hocko wrote:
>>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
>>> [...]
>>>> My theory:
>>>>
>>>> In __unmap_and_move(), we lock the old and newpage and perform the
>>>> migration. In case of vitio-balloon, the new page will become
>>>> movable, the old page will no longer be movable.
>>>>
>>>> However, after unlocking newpage, I think there is nothing stopping
>>>> the newpage from getting dequeued and freed by virtio-balloon. This
>>>> will result in the newpage
>>>> 1. No longer having PageMovable()
>>>> 2. Getting moved to the local list before finally freeing it (using
>>>> page->lru)
>>>
>>> Does that mean that the virtio-balloon can change the Movable state
>>> while there are other users of the page? Can you point to the code that
>>> does it? How come this can be safe at all? Or is the PageMovable stable
>>> only under the page lock?
>>>
>>
>> PageMovable is stable under the lock. The relevant instructions are in
>>
>> mm/balloon_compaction.c and include/linux/balloon_compaction.h
>
> OK, I have just checked __ClearPageMovable and it indeed requires
> PageLock. Then we also have to move is_lru = __PageMovable(page) after
> the page lock.
>

I assume that is fine as is as the page is isolated? (yes, it will be
modified later when moving but we are interested in the original state)

--

Thanks,

David / dhildenb

2019-01-28 13:35:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On Mon 28-01-19 14:22:52, David Hildenbrand wrote:
> On 28.01.19 14:21, Michal Hocko wrote:
> > On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
> >> On 28.01.19 14:07, Michal Hocko wrote:
> >>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
> >>> [...]
> >>>> My theory:
> >>>>
> >>>> In __unmap_and_move(), we lock the old and newpage and perform the
> >>>> migration. In case of vitio-balloon, the new page will become
> >>>> movable, the old page will no longer be movable.
> >>>>
> >>>> However, after unlocking newpage, I think there is nothing stopping
> >>>> the newpage from getting dequeued and freed by virtio-balloon. This
> >>>> will result in the newpage
> >>>> 1. No longer having PageMovable()
> >>>> 2. Getting moved to the local list before finally freeing it (using
> >>>> page->lru)
> >>>
> >>> Does that mean that the virtio-balloon can change the Movable state
> >>> while there are other users of the page? Can you point to the code that
> >>> does it? How come this can be safe at all? Or is the PageMovable stable
> >>> only under the page lock?
> >>>
> >>
> >> PageMovable is stable under the lock. The relevant instructions are in
> >>
> >> mm/balloon_compaction.c and include/linux/balloon_compaction.h
> >
> > OK, I have just checked __ClearPageMovable and it indeed requires
> > PageLock. Then we also have to move is_lru = __PageMovable(page) after
> > the page lock.
> >
>
> I assume that is fine as is as the page is isolated? (yes, it will be
> modified later when moving but we are interested in the original state)

OK, I've missed that the page is indeed isolated. Then the patch makes
sense to me.
--
Michal Hocko
SUSE Labs

2019-01-28 14:39:45

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On 28.01.19 14:35, Michal Hocko wrote:
> On Mon 28-01-19 14:22:52, David Hildenbrand wrote:
>> On 28.01.19 14:21, Michal Hocko wrote:
>>> On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
>>>> On 28.01.19 14:07, Michal Hocko wrote:
>>>>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
>>>>> [...]
>>>>>> My theory:
>>>>>>
>>>>>> In __unmap_and_move(), we lock the old and newpage and perform the
>>>>>> migration. In case of vitio-balloon, the new page will become
>>>>>> movable, the old page will no longer be movable.
>>>>>>
>>>>>> However, after unlocking newpage, I think there is nothing stopping
>>>>>> the newpage from getting dequeued and freed by virtio-balloon. This
>>>>>> will result in the newpage
>>>>>> 1. No longer having PageMovable()
>>>>>> 2. Getting moved to the local list before finally freeing it (using
>>>>>> page->lru)
>>>>>
>>>>> Does that mean that the virtio-balloon can change the Movable state
>>>>> while there are other users of the page? Can you point to the code that
>>>>> does it? How come this can be safe at all? Or is the PageMovable stable
>>>>> only under the page lock?
>>>>>
>>>>
>>>> PageMovable is stable under the lock. The relevant instructions are in
>>>>
>>>> mm/balloon_compaction.c and include/linux/balloon_compaction.h
>>>
>>> OK, I have just checked __ClearPageMovable and it indeed requires
>>> PageLock. Then we also have to move is_lru = __PageMovable(page) after
>>> the page lock.
>>>
>>
>> I assume that is fine as is as the page is isolated? (yes, it will be
>> modified later when moving but we are interested in the original state)
>
> OK, I've missed that the page is indeed isolated. Then the patch makes
> sense to me.
>

Thanks Michal. I assume this has broken ever since balloon compaction
was introduced. I'll wait a little more and then resend as !RFC with a
cc-stable tag.

--

Thanks,

David / dhildenb

2019-01-28 14:55:56

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On Mon 28-01-19 15:38:38, David Hildenbrand wrote:
> On 28.01.19 14:35, Michal Hocko wrote:
> > On Mon 28-01-19 14:22:52, David Hildenbrand wrote:
> >> On 28.01.19 14:21, Michal Hocko wrote:
> >>> On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
> >>>> On 28.01.19 14:07, Michal Hocko wrote:
> >>>>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
> >>>>> [...]
> >>>>>> My theory:
> >>>>>>
> >>>>>> In __unmap_and_move(), we lock the old and newpage and perform the
> >>>>>> migration. In case of vitio-balloon, the new page will become
> >>>>>> movable, the old page will no longer be movable.
> >>>>>>
> >>>>>> However, after unlocking newpage, I think there is nothing stopping
> >>>>>> the newpage from getting dequeued and freed by virtio-balloon. This
> >>>>>> will result in the newpage
> >>>>>> 1. No longer having PageMovable()
> >>>>>> 2. Getting moved to the local list before finally freeing it (using
> >>>>>> page->lru)
> >>>>>
> >>>>> Does that mean that the virtio-balloon can change the Movable state
> >>>>> while there are other users of the page? Can you point to the code that
> >>>>> does it? How come this can be safe at all? Or is the PageMovable stable
> >>>>> only under the page lock?
> >>>>>
> >>>>
> >>>> PageMovable is stable under the lock. The relevant instructions are in
> >>>>
> >>>> mm/balloon_compaction.c and include/linux/balloon_compaction.h
> >>>
> >>> OK, I have just checked __ClearPageMovable and it indeed requires
> >>> PageLock. Then we also have to move is_lru = __PageMovable(page) after
> >>> the page lock.
> >>>
> >>
> >> I assume that is fine as is as the page is isolated? (yes, it will be
> >> modified later when moving but we are interested in the original state)
> >
> > OK, I've missed that the page is indeed isolated. Then the patch makes
> > sense to me.
> >
>
> Thanks Michal. I assume this has broken ever since balloon compaction
> was introduced. I'll wait a little more and then resend as !RFC with a
> cc-stable tag.

Please make sure to CC Minchan when reposting.

Btw.
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs

2019-01-28 15:02:40

by Rafael Aquini

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On Mon, Jan 28, 2019 at 03:38:38PM +0100, David Hildenbrand wrote:
> On 28.01.19 14:35, Michal Hocko wrote:
> > On Mon 28-01-19 14:22:52, David Hildenbrand wrote:
> >> On 28.01.19 14:21, Michal Hocko wrote:
> >>> On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
> >>>> On 28.01.19 14:07, Michal Hocko wrote:
> >>>>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
> >>>>> [...]
> >>>>>> My theory:
> >>>>>>
> >>>>>> In __unmap_and_move(), we lock the old and newpage and perform the
> >>>>>> migration. In case of vitio-balloon, the new page will become
> >>>>>> movable, the old page will no longer be movable.
> >>>>>>
> >>>>>> However, after unlocking newpage, I think there is nothing stopping
> >>>>>> the newpage from getting dequeued and freed by virtio-balloon. This
> >>>>>> will result in the newpage
> >>>>>> 1. No longer having PageMovable()
> >>>>>> 2. Getting moved to the local list before finally freeing it (using
> >>>>>> page->lru)
> >>>>>
> >>>>> Does that mean that the virtio-balloon can change the Movable state
> >>>>> while there are other users of the page? Can you point to the code that
> >>>>> does it? How come this can be safe at all? Or is the PageMovable stable
> >>>>> only under the page lock?
> >>>>>
> >>>>
> >>>> PageMovable is stable under the lock. The relevant instructions are in
> >>>>
> >>>> mm/balloon_compaction.c and include/linux/balloon_compaction.h
> >>>
> >>> OK, I have just checked __ClearPageMovable and it indeed requires
> >>> PageLock. Then we also have to move is_lru = __PageMovable(page) after
> >>> the page lock.
> >>>
> >>
> >> I assume that is fine as is as the page is isolated? (yes, it will be
> >> modified later when moving but we are interested in the original state)
> >
> > OK, I've missed that the page is indeed isolated. Then the patch makes
> > sense to me.
> >
>
> Thanks Michal. I assume this has broken ever since balloon compaction
> was introduced. I'll wait a little more and then resend as !RFC with a
> cc-stable tag.
>

Yes, balloon deflation could always race against migration
This race was a problem, initially, and was dealt with, via:

commit 117aad1e9e4d97448d1df3f84b08bd65811e6d6a
Author: Rafael Aquini <[email protected]>
Date: Mon Sep 30 13:45:16 2013 -0700

mm: avoid reinserting isolated balloon pages into LRU lists



I think this upstream patch has re-introduced it, in a more subtle way,
as we're stumbling on it now, again:

commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
Author: Konstantin Khlebnikov <[email protected]>
Date: Thu Oct 9 15:29:27 2014 -0700

mm/balloon_compaction: redesign ballooned pages management



On this particular race against migration case, virtio ballon deflation would
not see it before

commit b1123ea6d3b3da25af5c8a9d843bd07ab63213f4
Author: Minchan Kim <[email protected]>
Date: Tue Jul 26 15:23:09 2016 -0700

mm: balloon: use general non-lru movable page feature

as the recently released balloon page would be post-processed
without the page->lru list handling, which for migration stability
purposes must be done under the protection of page_lock.


get rid of balloon reference count.


-- Rafael

2019-01-28 15:05:19

by Rafael Aquini

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On Mon, Jan 28, 2019 at 10:01:56AM -0500, Rafael Aquini wrote:
> On Mon, Jan 28, 2019 at 03:38:38PM +0100, David Hildenbrand wrote:
> > On 28.01.19 14:35, Michal Hocko wrote:
> > > On Mon 28-01-19 14:22:52, David Hildenbrand wrote:
> > >> On 28.01.19 14:21, Michal Hocko wrote:
> > >>> On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
> > >>>> On 28.01.19 14:07, Michal Hocko wrote:
> > >>>>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
> > >>>>> [...]
> > >>>>>> My theory:
> > >>>>>>
> > >>>>>> In __unmap_and_move(), we lock the old and newpage and perform the
> > >>>>>> migration. In case of vitio-balloon, the new page will become
> > >>>>>> movable, the old page will no longer be movable.
> > >>>>>>
> > >>>>>> However, after unlocking newpage, I think there is nothing stopping
> > >>>>>> the newpage from getting dequeued and freed by virtio-balloon. This
> > >>>>>> will result in the newpage
> > >>>>>> 1. No longer having PageMovable()
> > >>>>>> 2. Getting moved to the local list before finally freeing it (using
> > >>>>>> page->lru)
> > >>>>>
> > >>>>> Does that mean that the virtio-balloon can change the Movable state
> > >>>>> while there are other users of the page? Can you point to the code that
> > >>>>> does it? How come this can be safe at all? Or is the PageMovable stable
> > >>>>> only under the page lock?
> > >>>>>
> > >>>>
> > >>>> PageMovable is stable under the lock. The relevant instructions are in
> > >>>>
> > >>>> mm/balloon_compaction.c and include/linux/balloon_compaction.h
> > >>>
> > >>> OK, I have just checked __ClearPageMovable and it indeed requires
> > >>> PageLock. Then we also have to move is_lru = __PageMovable(page) after
> > >>> the page lock.
> > >>>
> > >>
> > >> I assume that is fine as is as the page is isolated? (yes, it will be
> > >> modified later when moving but we are interested in the original state)
> > >
> > > OK, I've missed that the page is indeed isolated. Then the patch makes
> > > sense to me.
> > >
> >
> > Thanks Michal. I assume this has broken ever since balloon compaction
> > was introduced. I'll wait a little more and then resend as !RFC with a
> > cc-stable tag.
> >
>
> Yes, balloon deflation could always race against migration
> This race was a problem, initially, and was dealt with, via:
>
> commit 117aad1e9e4d97448d1df3f84b08bd65811e6d6a
> Author: Rafael Aquini <[email protected]>
> Date: Mon Sep 30 13:45:16 2013 -0700
>
> mm: avoid reinserting isolated balloon pages into LRU lists
>
>
>
> I think this upstream patch has re-introduced it, in a more subtle way,
> as we're stumbling on it now, again:
>
> commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
> Author: Konstantin Khlebnikov <[email protected]>
> Date: Thu Oct 9 15:29:27 2014 -0700
>
> mm/balloon_compaction: redesign ballooned pages management
>
>
>
> On this particular race against migration case, virtio ballon deflation would
> not see it before
>
> commit b1123ea6d3b3da25af5c8a9d843bd07ab63213f4
> Author: Minchan Kim <[email protected]>
> Date: Tue Jul 26 15:23:09 2016 -0700
>
> mm: balloon: use general non-lru movable page feature
>
> as the recently released balloon page would be post-processed
> without the page->lru list handling, which for migration stability
> purposes must be done under the protection of page_lock.
>
>

missing part here:

I think your patch adresses this new case.


Acked-by: Rafael Aquini <[email protected]>


> get rid of balloon reference count.

^^ this was a left over (sorry about my fat-fingers)
>
>
> -- Rafael

2019-01-28 15:13:42

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC] mm: migrate: don't rely on PageMovable() of newpage after unlocking it

On 28.01.19 16:04, Rafael Aquini wrote:
> On Mon, Jan 28, 2019 at 10:01:56AM -0500, Rafael Aquini wrote:
>> On Mon, Jan 28, 2019 at 03:38:38PM +0100, David Hildenbrand wrote:
>>> On 28.01.19 14:35, Michal Hocko wrote:
>>>> On Mon 28-01-19 14:22:52, David Hildenbrand wrote:
>>>>> On 28.01.19 14:21, Michal Hocko wrote:
>>>>>> On Mon 28-01-19 14:14:28, David Hildenbrand wrote:
>>>>>>> On 28.01.19 14:07, Michal Hocko wrote:
>>>>>>>> On Mon 28-01-19 13:16:09, David Hildenbrand wrote:
>>>>>>>> [...]
>>>>>>>>> My theory:
>>>>>>>>>
>>>>>>>>> In __unmap_and_move(), we lock the old and newpage and perform the
>>>>>>>>> migration. In case of vitio-balloon, the new page will become
>>>>>>>>> movable, the old page will no longer be movable.
>>>>>>>>>
>>>>>>>>> However, after unlocking newpage, I think there is nothing stopping
>>>>>>>>> the newpage from getting dequeued and freed by virtio-balloon. This
>>>>>>>>> will result in the newpage
>>>>>>>>> 1. No longer having PageMovable()
>>>>>>>>> 2. Getting moved to the local list before finally freeing it (using
>>>>>>>>> page->lru)
>>>>>>>>
>>>>>>>> Does that mean that the virtio-balloon can change the Movable state
>>>>>>>> while there are other users of the page? Can you point to the code that
>>>>>>>> does it? How come this can be safe at all? Or is the PageMovable stable
>>>>>>>> only under the page lock?
>>>>>>>>
>>>>>>>
>>>>>>> PageMovable is stable under the lock. The relevant instructions are in
>>>>>>>
>>>>>>> mm/balloon_compaction.c and include/linux/balloon_compaction.h
>>>>>>
>>>>>> OK, I have just checked __ClearPageMovable and it indeed requires
>>>>>> PageLock. Then we also have to move is_lru = __PageMovable(page) after
>>>>>> the page lock.
>>>>>>
>>>>>
>>>>> I assume that is fine as is as the page is isolated? (yes, it will be
>>>>> modified later when moving but we are interested in the original state)
>>>>
>>>> OK, I've missed that the page is indeed isolated. Then the patch makes
>>>> sense to me.
>>>>
>>>
>>> Thanks Michal. I assume this has broken ever since balloon compaction
>>> was introduced. I'll wait a little more and then resend as !RFC with a
>>> cc-stable tag.
>>>
>>
>> Yes, balloon deflation could always race against migration
>> This race was a problem, initially, and was dealt with, via:
>>
>> commit 117aad1e9e4d97448d1df3f84b08bd65811e6d6a
>> Author: Rafael Aquini <[email protected]>
>> Date: Mon Sep 30 13:45:16 2013 -0700
>>
>> mm: avoid reinserting isolated balloon pages into LRU lists
>>
>>
>>
>> I think this upstream patch has re-introduced it, in a more subtle way,
>> as we're stumbling on it now, again:
>>
>> commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
>> Author: Konstantin Khlebnikov <[email protected]>
>> Date: Thu Oct 9 15:29:27 2014 -0700
>>
>> mm/balloon_compaction: redesign ballooned pages management
>>
>>
>>
>> On this particular race against migration case, virtio ballon deflation would
>> not see it before
>>
>> commit b1123ea6d3b3da25af5c8a9d843bd07ab63213f4
>> Author: Minchan Kim <[email protected]>
>> Date: Tue Jul 26 15:23:09 2016 -0700
>>
>> mm: balloon: use general non-lru movable page feature
>>
>> as the recently released balloon page would be post-processed
>> without the page->lru list handling, which for migration stability
>> purposes must be done under the protection of page_lock.
>>
>>
>
> missing part here:
>
> I think your patch adresses this new case.
>
>
> Acked-by: Rafael Aquini <[email protected]>
>
>
>> get rid of balloon reference count.
>
> ^^ this was a left over (sorry about my fat-fingers)

:D

Thanks! I'll resend with

Cc: [email protected] # 3.12+
Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages
management")

--

Thanks,

David / dhildenb