2014-01-03 20:18:56

by Sasha Levin

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On 12/18/2013 04:43 PM, Andrew Morton wrote:
> On Wed, 18 Dec 2013 17:23:03 +0800 Wanpeng Li <[email protected]> wrote:
>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 55c8b8d..1e24813 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1347,6 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>>>>  	unsigned long end;
>>>>  	int ret = SWAP_AGAIN;
>>>>  	int locked_vma = 0;
>>>> +	int we_locked = 0;
>>>>
>>>>  	address = (vma->vm_start + cursor) & CLUSTER_MASK;
>>>>  	end = address + CLUSTER_SIZE;
>>>> @@ -1385,9 +1386,15 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>>>>  		BUG_ON(!page || PageAnon(page));
>>>>
>>>>  		if (locked_vma) {
>>>> -			mlock_vma_page(page);	/* no-op if already mlocked */
>>>> -			if (page == check_page)
>>>> +			if (page != check_page) {
>>>> +				we_locked = trylock_page(page);
>>>
>>> If it's not us who has the page already locked, but somebody else, he
>>> might unlock it at this point and then the BUG_ON in mlock_vma_page()
>>> will trigger again.
>
> yes, this patch is pretty weak.
>
>> Any better idea is appreciated. ;-)
>
> Remove the BUG_ON() from mlock_vma_page()? Why was it added?
> isolate_lru_page() and putback_lru_page() *might* require
> the page to be locked, but I don't immediately see any issues?

Ping? This BUG() is triggerable in 3.13-rc6 right now.


Thanks,
Sasha


2014-01-03 20:52:23

by Linus Torvalds

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <[email protected]> wrote:
>
> Ping? This BUG() is triggerable in 3.13-rc6 right now.

So Andrew suggested just removing the BUG_ON(), but it's been there
for a *long* time.

And I detest the patch that was sent out that said "Should I check?"

Maybe we should just remove that mlock_vma_page() thing instead in
try_to_unmap_cluster()? Or maybe actually lock the page around calling
it?
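
(Concretely, "lock the page around calling it" would mean something like
this sketch in try_to_unmap_cluster() -- a trylock rather than a plain
lock_page(), since the caller already holds check_page locked:)

	if (locked_vma) {
		if (page == check_page || trylock_page(page)) {
			mlock_vma_page(page);	/* page lock now held */
			if (page == check_page)
				ret = SWAP_MLOCK;
			else
				unlock_page(page);
		}
		continue;	/* don't unmap */
	}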

Linus

2014-01-03 23:36:26

by Vlastimil Babka

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On 01/03/2014 09:52 PM, Linus Torvalds wrote:
> On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <[email protected]> wrote:
>>
>> Ping? This BUG() is triggerable in 3.13-rc6 right now.
>
> So Andrew suggested just removing the BUG_ON(), but it's been there
> for a *long* time.

Yes, Andrew also merged this patch for that:
http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch

But there wasn't enough confidence in the fix to send it to you yet, I guess.

The related thread: http://www.spinics.net/lists/linux-mm/msg66972.html

> And I detest the patch that was sent out that said "Should I check?"
>
> Maybe we should just remove that mlock_vma_page() thing instead in

You mean that it's already nondeterministic because it can already be skipped when
mmap_sem can't be acquired for read? I think the assumption for this case is that mmap_sem
is already held for write, which means VM_LOCKED is unset anyway (per the comments at
try_to_unmap_file(), which calls try_to_unmap_cluster()). However, I'm not sure how it is
protected from somebody else holding the semaphore...

> try_to_unmap_cluster()? Or maybe actually lock the page around calling
> it?

check_page is already locked; see try_to_munlock(), which calls try_to_unmap_file(). So
this might smell of a potential deadlock?

I'm for going with the removal of BUG_ON. The TestSetPageMlocked should provide enough
race protection.

> Linus
>

2014-01-03 23:56:04

by Andrew Morton

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On Sat, 04 Jan 2014 00:36:18 +0100 Vlastimil Babka <[email protected]> wrote:

> On 01/03/2014 09:52 PM, Linus Torvalds wrote:
> > On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <[email protected]> wrote:
> >>
> >> Ping? This BUG() is triggerable in 3.13-rc6 right now.
> >
> > So Andrew suggested just removing the BUG_ON(), but it's been there
> > for a *long* time.
>
> Yes, Andrew also merged this patch for that:
> http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch
>
> But there wasn't enough confidence in the fix to send it to you yet, I guess.
>
> The related thread: http://www.spinics.net/lists/linux-mm/msg66972.html

Yes, I'd taken the cowardly approach of scheduling it for 3.14, with a
3.13.x backport.

Nobody answered my question! Is this a new bug or is it a
five-year-old bug which we only just discovered?

I guess it doesn't matter much - we should fix it in 3.13. I'll
include it in the next for-3.13 batch.

2014-01-04 00:18:09

by Linus Torvalds

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <[email protected]> wrote:
>
> I'm for going with the removal of BUG_ON. The TestSetPageMlocked should provide enough
> race protection.

Maybe. But dammit, that's subtle, and I don't think you're even right.

It basically depends on mlock_vma_page() and munlock_vma_page() being
able to run CONCURRENTLY on the same page. In particular, you could
have a mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
immediately clearing it on another, and then the rest of those
functions could run with a totally arbitrary interleaving when working
with the exact same page.

They both do basically

	if (!isolate_lru_page(page))
		putback_lru_page(page);

but one or the other would randomly win the race (it's internally
protected by the lru lock), and *if* the munlock_vma_page() wins it,
it would also do

try_to_munlock(page);

but if mlock_vma_page() wins it, that wouldn't happen. That looks
entirely broken - you end up with the PageMlocked bit clear, but
try_to_munlock() was never called on that page, because
mlock_vma_page() got to the page isolation before the "subsequent"
munlock_vma_page().
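
In skeleton form (a simplified sketch: the PageLocked assertion,
statistics and error handling omitted), the two racing paths are:

	void mlock_vma_page(struct page *page)
	{
		if (!TestSetPageMlocked(page)) {
			if (!isolate_lru_page(page))
				putback_lru_page(page);
		}
	}

	void munlock_vma_page(struct page *page)
	{
		if (TestClearPageMlocked(page)) {
			if (!isolate_lru_page(page)) {
				/* never reached if mlock_vma_page()
				 * won the isolation race above */
				try_to_munlock(page);
				putback_lru_page(page);
			}
		}
	}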

And this is very much what the page lock serialization would prevent.
So no, the PageMlocked in *no* way gives serialization. It's an atomic
bit op, yes, but that only "serializes" in one direction, not when you
can have a mix of bit setting and clearing.

So quite frankly, I think you're wrong. The BUG_ON() is correct, or at
least enforces some kind of ordering. And try_to_unmap_cluster() is
just broken in calling that without the page being locked. That's my
opinion. There may be some *other* reason why it all happens to work,
but no, "TestSetPageMlocked should provide enough race protection" is
simply not true, and even if it were, it's way too subtle and odd to
be a good rule.

So I really object to just removing the BUG_ON(). Not with a *lot*
more explanation as to why these kinds of issues wouldn't matter.

Linus

2014-01-04 03:04:46

by Sasha Levin

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On 01/03/2014 06:56 PM, Andrew Morton wrote:
> Nobody answered my question! Is this a new bug or is it a
> five-year-old bug which we only just discovered?

I've rolled trinity back and can no longer reproduce this issue. This
seems to be a 5-year-old bug.


Thanks,
Sasha

2014-01-04 03:32:22

by Bob Liu

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs


On 01/04/2014 04:52 AM, Linus Torvalds wrote:
> On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <[email protected]> wrote:
>>
>> Ping? This BUG() is triggerable in 3.13-rc6 right now.
>
> So Andrew suggested just removing the BUG_ON(), but it's been there
> for a *long* time.
>
> And I detest the patch that was sent out that said "Should I check?"
>
> Maybe we should just remove that mlock_vma_page() thing instead in
> try_to_unmap_cluster()? Or maybe actually lock the page around calling

I don't get the reason why we have to call mlock_vma_page() in
try_to_unmap_cluster(), and I agree with just removing it.

> it?
>
> Linus
>

--
Regards,
-Bob

2014-01-04 08:09:24

by Vlastimil Babka

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On 01/04/2014 01:18 AM, Linus Torvalds wrote:
> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <[email protected]> wrote:
>>
>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked should provide enough
>> race protection.
>
> Maybe. But dammit, that's subtle, and I don't think you're even right.
>
> It basically depends on mlock_vma_page() and munlock_vma_page() being
> able to run CONCURRENTLY on the same page. In particular, you could
> have a mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
> immediately clearing it on another, and then the rest of those
> functions could run with a totally arbitrary interleaving when working
> with the exact same page.
>
> They both do basically
>
> 	if (!isolate_lru_page(page))
> 		putback_lru_page(page);
>
> but one or the other would randomly win the race (it's internally
> protected by the lru lock), and *if* the munlock_vma_page() wins it,
> it would also do
>
> try_to_munlock(page);
>
> but if mlock_vma_page() wins it, that wouldn't happen. That looks
> entirely broken - you end up with the PageMlocked bit clear, but
> try_to_munlock() was never called on that page, because
> mlock_vma_page() got to the page isolation before the "subsequent"
> munlock_vma_page().

I got the impression (see e.g. the munlock_vma_page() comments) that the
whole thing is designed with this possibility in mind. isolate_lru_page()
may fail (presumably also in scenarios other than this one), and if
try_to_munlock() was not called here, then yes, the page might lose the
PageMlocked bit and go to the LRU instead of the unevictable list, but
try_to_unmap() should catch and fix this. That would also explain why
mlock_vma_page() is called from try_to_unmap_cluster().
So if I understand correctly, the PageMlocked bit is not something that has
to be correctly set 100% of the time; as long as it's set correctly most
of the time, most of these pages will go to the unevictable list and spare
vmscan's time.
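
The catch-and-fix path is roughly the tail of try_to_unmap_one()
(a simplified sketch; the actual code differs in details):

	out_mlock:
		pte_unmap_unlock(pte, ptl);

		if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
			if (vma->vm_flags & VM_LOCKED) {
				/* restore the missed PG_mlocked */
				mlock_vma_page(page);
				ret = SWAP_MLOCK;
			}
			up_read(&vma->vm_mm->mmap_sem);
		}
		return ret;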

> And this is very much what the page lock serialization would prevent.
> So no, the PageMlocked in *no* way gives serialization. It's an atomic
> bit op, yes, but that only "serializes" in one direction, not when you
> can have a mix of bit setting and clearing.
>
> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at
> least enforces some kind of ordering. And try_to_unmap_cluster() is
> just broken in calling that without the page being locked. That's my
> opinion. There may be some *other* reason why it all happens to work,
> but no, "TestSetPageMlocked should provide enough race protection" is
> simply not true, and even if it were, it's way too subtle and odd to
> be a good rule.

Right, it was stupid of me to write such a strong statement without any
details. I wanted to review that patch when back at work next week, but
since it came up now, I just wanted to point out that it's in the pipeline
for this bug.

> So I really object to just removing the BUG_ON(). Not with a *lot*
> more explanation as to why these kinds of issues wouldn't matter.
>
> Linus
>

2014-01-06 17:27:18

by Motohiro Kosaki

Subject: RE: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs



> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Linus
> Torvalds
> Sent: Friday, January 03, 2014 7:18 PM
> To: Vlastimil Babka
> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu;
> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman;
> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing
> List
> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear
> VMAs
>
> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <[email protected]> wrote:
> >
> > I'm for going with the removal of BUG_ON. The TestSetPageMlocked
> > should provide enough race protection.
>
> Maybe. But dammit, that's subtle, and I don't think you're even right.
>
> It basically depends on mlock_vma_page() and munlock_vma_page() being
> able to run CONCURRENTLY on the same page. In particular, you could have a
> mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
> immediately clearing it on another, and then the rest of those functions
> could run with a totally arbitrary interleaving when working with the exact
> same page.
>
> They both do basically
>
> 	if (!isolate_lru_page(page))
> 		putback_lru_page(page);
>
> but one or the other would randomly win the race (it's internally protected
> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do
>
> try_to_munlock(page);
>
> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely
> broken - you end up with the PageMlocked bit clear, but
> try_to_munlock() was never called on that page, because
> mlock_vma_page() got to the page isolation before the "subsequent"
> munlock_vma_page().
>
> And this is very much what the page lock serialization would prevent.
> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op,
> yes, but that only "serializes" in one direction, not when you can have a mix
> of bit setting and clearing.
>
> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least
> enforces some kind of ordering. And try_to_unmap_cluster() is just broken
> in calling that without the page being locked. That's my opinion. There may
> be some *other* reason why it all happens to work, but no,
> "TestSetPageMlocked should provide enough race protection" is simply not
> true, and even if it were, it's way too subtle and odd to be a good rule.
>
> So I really object to just removing the BUG_ON(). Not with a *lot* more
> explanation as to why these kinds of issues wouldn't matter.

I don't have a perfect answer. But I can explain a bit of history. Let me try.

First off, 5 years ago, Lee's original putback_lru_page() implementation required
the page lock, but I removed that restriction a few months later. That's why we see
this strange BUG_ON here.

5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by
mmap_sem (write mode). So mlock and munlock had no race.
Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However,
that is still enough to protect against munlock.

Next, in the case of mlock vs reclaim, the key is that mlock(2) is a two-step operation: 1) turn on
VM_LOCKED under mmap_sem write-mode, 2) turn on PG_mlocked under mmap_sem read-mode. If reclaim
races against step (1), reclaim must lose because it uses trylock. On the other hand, if reclaim races
against step (2), reclaim must detect VM_LOCKED because both the VM_LOCKED modifier and the observer
take mmap_sem.
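
In code, the two steps correspond roughly to this shape of the mlock(2)
entry point (a simplified sketch; rlimit checks and length alignment omitted):

	SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
	{
		int error;

		/* step 1: mark the vmas VM_LOCKED */
		down_write(&current->mm->mmap_sem);
		error = do_mlock(start, len, 1);
		up_write(&current->mm->mmap_sem);

		/* step 2: fault the pages in and mark them PG_mlocked,
		 * under mmap_sem held for read */
		if (!error)
			error = __mm_populate(start, len, 0);
		return error;
	}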

By the way, page isolation is still necessary because we need to protect against other page modifications such as page migration.


My memory has been almost completely flushed and I might have lost some technical concerns and past
discussion. Please point it out if I am overlooking something.

Thanks.


2014-01-06 22:01:29

by KOSAKI Motohiro

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On Mon, Jan 6, 2014 at 11:47 AM, Motohiro Kosaki
<[email protected]> wrote:
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Linus
>> Torvalds
>> Sent: Friday, January 03, 2014 7:18 PM
>> To: Vlastimil Babka
>> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu;
>> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman;
>> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing
>> List
>> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear
>> VMAs
>>
>> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <[email protected]> wrote:
>> >
>> > I'm for going with the removal of BUG_ON. The TestSetPageMlocked
>> > should provide enough race protection.
>>
>> Maybe. But dammit, that's subtle, and I don't think you're even right.
>>
>> It basically depends on mlock_vma_page() and munlock_vma_page() being
>> able to run CONCURRENTLY on the same page. In particular, you could have a
>> mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
>> immediately clearing it on another, and then the rest of those functions
>> could run with a totally arbitrary interleaving when working with the exact
>> same page.
>>
>> They both do basically
>>
>> 	if (!isolate_lru_page(page))
>> 		putback_lru_page(page);
>>
>> but one or the other would randomly win the race (it's internally protected
>> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do
>>
>> try_to_munlock(page);
>>
>> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely
>> broken - you end up with the PageMlocked bit clear, but
>> try_to_munlock() was never called on that page, because
>> mlock_vma_page() got to the page isolation before the "subsequent"
>> munlock_vma_page().
>>
>> And this is very much what the page lock serialization would prevent.
>> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op,
>> yes, but that only "serializes" in one direction, not when you can have a mix
>> of bit setting and clearing.
>>
>> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least
>> enforces some kind of ordering. And try_to_unmap_cluster() is just broken
>> in calling that without the page being locked. That's my opinion. There may
>> be some *other* reason why it all happens to work, but no,
>> "TestSetPageMlocked should provide enough race protection" is simply not
>> true, and even if it were, it's way too subtle and odd to be a good rule.
>>
>> So I really object to just removing the BUG_ON(). Not with a *lot* more
>> explanation as to why these kinds of issues wouldn't matter.
>
> I don't have a perfect answer. But I can explain a bit of history. Let me try.
>
> First off, 5 years ago, Lee's original putback_lru_page() implementation required
> the page lock, but I removed that restriction a few months later. That's why we see
> this strange BUG_ON here.
>
> 5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by
> mmap_sem (write mode). So mlock and munlock had no race.
> Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However,
> that is still enough to protect against munlock.
>
> Next, in the case of mlock vs reclaim, the key is that mlock(2) is a two-step operation: 1) turn on
> VM_LOCKED under mmap_sem write-mode, 2) turn on PG_mlocked under mmap_sem read-mode. If reclaim
> races against step (1), reclaim must lose because it uses trylock. On the other hand, if reclaim races
> against step (2), reclaim must detect VM_LOCKED because both the VM_LOCKED modifier and the observer
> take mmap_sem.
>
> By the way, page isolation is still necessary because we need to protect against other page modifications such as page migration.
>
>
> My memory has been almost completely flushed and I might have lost some technical concerns and past
> discussion. Please point it out if I am overlooking something.

No, I was talking about a completely different issue. My memory is
completely broken, as I said. I need to read the latest code and dig through
the past discussion. Sorry again; please ignore my last mail.

2014-01-07 15:02:06

by Vlastimil Babka

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On 01/06/2014 05:47 PM, Motohiro Kosaki wrote:
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Linus
>> Torvalds
>> Sent: Friday, January 03, 2014 7:18 PM
>> To: Vlastimil Babka
>> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu;
>> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman;
>> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing
>> List
>> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear
>> VMAs
>>
>> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <[email protected]> wrote:
>>>
>>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked
>>> should provide enough race protection.
>>
>> Maybe. But dammit, that's subtle, and I don't think you're even right.
>>
>> It basically depends on mlock_vma_page() and munlock_vma_page() being
>> able to run CONCURRENTLY on the same page. In particular, you could have a
>> mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
>> immediately clearing it on another, and then the rest of those functions
>> could run with a totally arbitrary interleaving when working with the exact
>> same page.
>>
>> They both do basically
>>
>> 	if (!isolate_lru_page(page))
>> 		putback_lru_page(page);
>>
>> but one or the other would randomly win the race (it's internally protected
>> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do
>>
>> try_to_munlock(page);
>>
>> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely
>> broken - you end up with the PageMlocked bit clear, but
>> try_to_munlock() was never called on that page, because
>> mlock_vma_page() got to the page isolation before the "subsequent"
>> munlock_vma_page().
>>
>> And this is very much what the page lock serialization would prevent.
>> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op,
>> yes, but that only "serializes" in one direction, not when you can have a mix
>> of bit setting and clearing.
>>
>> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least
>> enforces some kind of ordering. And try_to_unmap_cluster() is just broken
>> in calling that without the page being locked. That's my opinion. There may
>> be some *other* reason why it all happens to work, but no,
>> "TestSetPageMlocked should provide enough race protection" is simply not
>> true, and even if it were, it's way too subtle and odd to be a good rule.
>>
>> So I really object to just removing the BUG_ON(). Not with a *lot* more
>> explanation as to why these kinds of issues wouldn't matter.
>
> I don't have a perfect answer. But I can explain a bit of history. Let me try.
>
> First off, 5 years ago, Lee's original putback_lru_page() implementation required
> the page lock, but I removed that restriction a few months later. That's why we see
> this strange BUG_ON here.
>
> 5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by
> mmap_sem (write mode). So mlock and munlock had no race.
> Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However,
> that is still enough to protect against munlock.
>
> Next, in the case of mlock vs reclaim, the key is that mlock(2) is a two-step operation: 1) turn on
> VM_LOCKED under mmap_sem write-mode, 2) turn on PG_mlocked under mmap_sem read-mode. If reclaim
> races against step (1), reclaim must lose because it uses trylock. On the other hand, if reclaim races
> against step (2), reclaim must detect VM_LOCKED because both the VM_LOCKED modifier and the observer
> take mmap_sem.
>
> By the way, page isolation is still necessary because we need to protect against other page modifications such as page migration.

I guess you meant page locking, not (lru) isolation. Indeed, the documentation
at Documentation/vm/unevictable-lru.txt also discusses races with page migration.

I've checked and it really seems to be the case that mlock_migrate_page()
could race with mlock_vma_page(), so that PG_mlocked is set on the
old page again after migration has decided that the new page will be without
the flag (or after the flag was transferred to the new page).
That would not be fatal, but it would distort accounting anyway.

So here's a patch that, if accepted, should replace the removal-of-BUG_ON patch in the
-mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch

The idea is that try_to_unmap_cluster() will try locking the page
for mlock, and just leave it alone if the lock cannot be obtained. Again,
that's not fatal, as eventually something will encounter and mlock the page.

-----8<-----

From: Vlastimil Babka <[email protected]>
Date: Tue, 7 Jan 2014 14:59:58 +0100
Subject: [PATCH] mm: try_to_unmap_cluster() should lock_page() before mlocking

A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing
with trinity. The call site try_to_unmap_cluster() does not lock the pages
other than its check_page parameter (which is already locked).

The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat
unclear, but apparently it serializes against page migration, which could
otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the
page would be eventually encountered again, but NR_MLOCK accounting would
become distorted nevertheless. This patch adds a comment to the BUG_ON in
mlock_vma_page() and munlock_vma_page() to that effect.

The call site try_to_unmap_cluster() is fixed so that for page != check_page,
trylock_page() is attempted (to avoid possible deadlocks as we already have
check_page locked) and mlock_vma_page() is performed only upon success. If the
page lock cannot be obtained, the page is left without PG_mlocked, which is
again not a problem in the whole unevictable memory design.

Reported-by: Sasha Levin <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mlock.c | 2 ++
mm/rmap.c | 14 ++++++++++++--
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 192e6ee..1b12dfa 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -79,6 +79,7 @@ void clear_page_mlock(struct page *page)
  */
 void mlock_vma_page(struct page *page)
 {
+	/* Serialize with page migration */
 	BUG_ON(!PageLocked(page));
 
 	if (!TestSetPageMlocked(page)) {
@@ -153,6 +154,7 @@ unsigned int munlock_vma_page(struct page *page)
 {
 	unsigned int nr_pages;
 
+	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
 
 	if (TestClearPageMlocked(page)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 068522d..b99c742 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		BUG_ON(!page || PageAnon(page));
 
 		if (locked_vma) {
-			mlock_vma_page(page);	/* no-op if already mlocked */
-			if (page == check_page)
+			if (page == check_page) {
+				/* we know we have check_page locked */
+				mlock_vma_page(page);
 				ret = SWAP_MLOCK;
+			} else if (trylock_page(page)) {
+				/*
+				 * If we can lock the page, perform mlock.
+				 * Otherwise leave the page alone, it will be
+				 * eventually encountered again later.
+				 */
+				mlock_vma_page(page);
+				unlock_page(page);
+			}
 			continue;	/* don't unmap */
 		}

--
1.8.4


2014-01-08 01:07:46

by Bob Liu

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs


On 01/07/2014 11:01 PM, Vlastimil Babka wrote:
> On 01/06/2014 05:47 PM, Motohiro Kosaki wrote:
>>
>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Linus
>>> Torvalds
>>> Sent: Friday, January 03, 2014 7:18 PM
>>> To: Vlastimil Babka
>>> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu;
>>> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman;
>>> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing
>>> List
>>> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear
>>> VMAs
>>>
>>> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <[email protected]> wrote:
>>>>
>>>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked
>>>> should provide enough race protection.
>>>
>>> Maybe. But dammit, that's subtle, and I don't think you're even right.
>>>
>>> It basically depends on mlock_vma_page() and munlock_vma_page() being
>>> able to run CONCURRENTLY on the same page. In particular, you could have a
>>> mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
>>> immediately clearing it on another, and then the rest of those functions
>>> could run with a totally arbitrary interleaving when working with the exact
>>> same page.
>>>
>>> They both do basically
>>>
>>> 	if (!isolate_lru_page(page))
>>> 		putback_lru_page(page);
>>>
>>> but one or the other would randomly win the race (it's internally protected
>>> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do
>>>
>>> try_to_munlock(page);
>>>
>>> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely
>>> broken - you end up with the PageMlocked bit clear, but
>>> try_to_munlock() was never called on that page, because
>>> mlock_vma_page() got to the page isolation before the "subsequent"
>>> munlock_vma_page().
>>>
>>> And this is very much what the page lock serialization would prevent.
>>> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op,
>>> yes, but that only "serializes" in one direction, not when you can have a mix
>>> of bit setting and clearing.
>>>
>>> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least
>>> enforces some kind of ordering. And try_to_unmap_cluster() is just broken
>>> in calling that without the page being locked. That's my opinion. There may
>>> be some *other* reason why it all happens to work, but no,
>>> "TestSetPageMlocked should provide enough race protection" is simply not
>>> true, and even if it were, it's way too subtle and odd to be a good rule.
>>>
>>> So I really object to just removing the BUG_ON(). Not with a *lot* more
>>> explanation as to why these kinds of issues wouldn't matter.
>>
>> I don't have a perfect answer. But I can explain a bit of history. Let me try.
>>
>> First off, 5 years ago, Lee's original putback_lru_page() implementation required
>> the page lock, but I removed that restriction a few months later. That's why we see
>> this strange BUG_ON here.
>>
>> 5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by
>> mmap_sem (write mode). So mlock and munlock had no race.
>> Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However,
>> that is still enough to protect against munlock.
>>
>> Next, in the case of mlock vs reclaim, the key is that mlock(2) is a two-step operation: 1) turn on
>> VM_LOCKED under mmap_sem write-mode, 2) turn on PG_mlocked under mmap_sem read-mode. If reclaim
>> races against step (1), reclaim must lose because it uses trylock. On the other hand, if reclaim races
>> against step (2), reclaim must detect VM_LOCKED because both the VM_LOCKED modifier and the observer
>> take mmap_sem.
>>
>> By the way, page isolation is still necessary because we need to protect against other page modifications such as page migration.
>
> I guess you meant page locking, not (lru) isolation. Indeed, the documentation
> at Documentation/vm/unevictable-lru.txt also discusses races with page migration.
>
> I've checked and it really seems to be the case that mlock_migrate_page()
> could race with mlock_vma_page(), so that PG_mlocked is set on the
> old page again after migration has decided that the new page will be without
> the flag (or after the flag was transferred to the new page).
> That would not be fatal, but it would distort accounting anyway.
>
> So here's a patch that, if accepted, should replace the removal-of-BUG_ON patch in the
> -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch
>

Makes sense to me.

> The idea is that try_to_unmap_cluster() will try locking the page
> for mlock, and just leave it alone if the lock cannot be obtained. Again,
> that's not fatal, as eventually something will encounter and mlock the page.
>
> -----8<-----
>
> From: Vlastimil Babka <[email protected]>
> Date: Tue, 7 Jan 2014 14:59:58 +0100
> Subject: [PATCH] mm: try_to_unmap_cluster() should lock_page() before mlocking
>
> A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing
> with trinity. The call site try_to_unmap_cluster() does not lock the pages
> other than its check_page parameter (which is already locked).
>
> The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat
> unclear, but apparently it serializes against page migration, which could
> otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the
> page would be eventually encountered again, but NR_MLOCK accounting would
> become distorted nevertheless. This patch adds a comment to the BUG_ON in
> mlock_vma_page() and munlock_vma_page() to that effect.
>
> The call site try_to_unmap_cluster() is fixed so that for page != check_page,
> trylock_page() is attempted (to avoid possible deadlocks as we already have
> check_page locked) and mlock_vma_page() is performed only upon success. If the
> page lock cannot be obtained, the page is left without PG_mlocked, which is
> again not a problem in the whole unevictable memory design.
>
> Reported-by: Sasha Levin <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> Cc: Michel Lespinasse <[email protected]>
> Cc: Bob Liu <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>

Reviewed-by: Bob Liu <[email protected]>

> ---
> mm/mlock.c | 2 ++
> mm/rmap.c | 14 ++++++++++++--
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 192e6ee..1b12dfa 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -79,6 +79,7 @@ void clear_page_mlock(struct page *page)
>   */
>  void mlock_vma_page(struct page *page)
>  {
> +	/* Serialize with page migration */
>  	BUG_ON(!PageLocked(page));
> 
>  	if (!TestSetPageMlocked(page)) {
> @@ -153,6 +154,7 @@ unsigned int munlock_vma_page(struct page *page)
>  {
>  	unsigned int nr_pages;
> 
> +	/* For try_to_munlock() and to serialize with page migration */
>  	BUG_ON(!PageLocked(page));
> 
>  	if (TestClearPageMlocked(page)) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 068522d..b99c742 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		BUG_ON(!page || PageAnon(page));
> 
>  		if (locked_vma) {
> -			mlock_vma_page(page);	/* no-op if already mlocked */
> -			if (page == check_page)
> +			if (page == check_page) {
> +				/* we know we have check_page locked */
> +				mlock_vma_page(page);
>  				ret = SWAP_MLOCK;
> +			} else if (trylock_page(page)) {
> +				/*
> +				 * If we can lock the page, perform mlock.
> +				 * Otherwise leave the page alone, it will be
> +				 * eventually encountered again later.
> +				 */
> +				mlock_vma_page(page);
> +				unlock_page(page);
> +			}
>  			continue;	/* don't unmap */
>  		}
>
>

--
Regards,
-Bob

2014-01-08 02:44:38

by Linus Torvalds

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On Tue, Jan 7, 2014 at 11:01 PM, Vlastimil Babka <[email protected]> wrote:
>
> So here's a patch that, if accepted, should replace the removal-of-BUG_ON patch in the
> -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch
>
> The idea is that try_to_unmap_cluster() will try locking the page
> for mlock, and just leave it alone if the lock cannot be obtained. Again,
> that's not fatal, as eventually something will encounter and mlock the page.

This looks sane to me. Andrew?

Linus

2014-01-10 17:49:25

by Motohiro Kosaki

Subject: RE: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

> diff --git a/mm/rmap.c b/mm/rmap.c
> index 068522d..b99c742 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		BUG_ON(!page || PageAnon(page));
> 
>  		if (locked_vma) {
> -			mlock_vma_page(page);	/* no-op if already mlocked */
> -			if (page == check_page)
> +			if (page == check_page) {
> +				/* we know we have check_page locked */
> +				mlock_vma_page(page);
>  				ret = SWAP_MLOCK;
> +			} else if (trylock_page(page)) {
> +				/*
> +				 * If we can lock the page, perform mlock.
> +				 * Otherwise leave the page alone, it will be
> +				 * eventually encountered again later.
> +				 */
> +				mlock_vma_page(page);
> +				unlock_page(page);
> +			}
>  			continue;	/* don't unmap */
>  		}

I audited all the related mm code. However, I couldn't find any race that it can close.

First off, the current munlock code is crazy tricky.

munlock
  down_write(mmap_sem)
  do_mlock()
    mlock_fixup
      munlock_vma_pages_range
        __munlock_pagevec
          spin_lock_irq(zone->lru_lock)
          TestClearPageMlocked(page)
          del_page_from_lru_list
          spin_unlock_irq(zone->lru_lock)
          lock_page
          __munlock_isolated_page
          unlock_page

  up_write(mmap_sem)

Look, TestClearPageMlocked(page) is not protected by lock_page. But this is still
safe, because PageMlocked means one or more vmas are marked VM_LOCKED, so we
only need to care about turning down the final VM_LOCKED; i.e., mmap_sem protects them.

And,

spin_lock_irq(zone->lru_lock)
del_page_from_lru_list
spin_unlock_irq(zone->lru_lock)

This idiom ensures that either I or someone else isolates the page from the LRU, and
isolated pages will be put back by putback_lru_page() in any case. So the pages will
move to the right LRU eventually.

And then, taking the page lock doesn't help to close the race vs munlock.

On the other hand, page migration has the following call stack.

some-caller [isolate_lru_page]
  unmap_and_move
    __unmap_and_move
      trylock_page
      try_to_unmap
      move_to_new_page
        migrate_page
          migrate_page_copy
      unlock_page

The critical part (i.e. migrate_page_copy) is protected by both page isolation and the page lock.
The page fault path takes the page lock and doesn't use page isolation. This is correct.
try_to_unmap_cluster() doesn't take the page lock, but it ensures the page is isolated. This is correct too.

Plus, do_wp_page() has the following comment, but it is wrong: the lock is necessary to protect against
page migration, not LRU manipulation.

	if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) {
		lock_page(old_page);	/* LRU manipulation */
		munlock_vma_page(old_page);
		unlock_page(old_page);
	}


But, you know, this is a crazy ugly lock design. I'm OK with changing the rule so that PG_mlocked must be
protected by the page lock. If so, I propose adding a PageMlocked() check like this:

	} else if (!PageMlocked(page) && trylock_page(page)) {
		/*
		 * If we can lock the page, perform mlock.
		 * Otherwise leave the page alone, it will be
		 * eventually encountered again later.
		 */
		mlock_vma_page(page);
		unlock_page(page);
	}

This is safe. Even if the race happens and vmscan fails to mark PG_mlocked, that's no problem. The next
vmscan pass may do the right job and fix the mistake.

Anyway, I'm also sure this is not a recent issue. It has lived for 5 years. There is no reason to rush.


2014-01-13 14:04:01

by Vlastimil Babka

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On 01/10/2014 06:48 PM, Motohiro Kosaki wrote:
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 068522d..b99c742 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>>  		BUG_ON(!page || PageAnon(page));
>> 
>>  		if (locked_vma) {
>> -			mlock_vma_page(page);	/* no-op if already mlocked */
>> -			if (page == check_page)
>> +			if (page == check_page) {
>> +				/* we know we have check_page locked */
>> +				mlock_vma_page(page);
>>  				ret = SWAP_MLOCK;
>> +			} else if (trylock_page(page)) {
>> +				/*
>> +				 * If we can lock the page, perform mlock.
>> +				 * Otherwise leave the page alone, it will be
>> +				 * eventually encountered again later.
>> +				 */
>> +				mlock_vma_page(page);
>> +				unlock_page(page);
>> +			}
>>  			continue;	/* don't unmap */
>>  		}
>
> I audited all the related mm code. However, I couldn't find any race that it can close.

Well, I would say the lock here closes the race with page migration, no? (As discussed below)

> First off, the current munlock code is crazy tricky.

Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12).

> munlock
>   down_write(mmap_sem)
>   do_mlock()
>     mlock_fixup
>       munlock_vma_pages_range
>         __munlock_pagevec
>           spin_lock_irq(zone->lru_lock)
>           TestClearPageMlocked(page)
>           del_page_from_lru_list
>           spin_unlock_irq(zone->lru_lock)
>           lock_page
>           __munlock_isolated_page
>           unlock_page
>
>   up_write(mmap_sem)
>
> Look, TestClearPageMlocked(page) is not protected by lock_page.

Right :( That's my fault; when developing the patch series I didn't see the page
migration race, and it seemed that the lock was only needed to protect the rmap operations
in __munlock_isolated_page().

> But this is still
> safe, because PageMlocked means one or more vmas are marked VM_LOCKED, so we
> only need to care about turning down the final VM_LOCKED; i.e., mmap_sem protects them.
>
> And,
>
> spin_lock_irq(zone->lru_lock)
> del_page_from_lru_list
> spin_unlock_irq(zone->lru_lock)
>
> This idiom ensures that either I or someone else isolates the page from the LRU, and
> isolated pages will be put back by putback_lru_page() in any case. So the pages will
> move to the right LRU eventually.
>
> And then, taking the page lock doesn't help to close the race vs munlock.
>
> On the other hand, page migration has the following call stack.
>
> some-caller [isolate_lru_page]
>   unmap_and_move
>     __unmap_and_move
>       trylock_page
>       try_to_unmap
>       move_to_new_page
>         migrate_page
>           migrate_page_copy
>       unlock_page
>
> The critical part (i.e. migrate_page_copy) is protected by both page isolation and the page lock.

However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(),
or clearing PG_mlocked in __munlock_pagevec().

The race I have in mind for munlock is:

CPU1 does page migration
some-caller [isolate_lru_page]
unmap_and_move
__unmap_and_move
trylock_page
try_to_unmap
move_to_new_page
migrate_page
migrate_page_copy
mlock_migrate_page - transfers PG_mlocked from old page to new page

CPU2 does munlock:
munlock
down_write(mmap_sem)
do_mlock()
mlock_fixup
munlock_vma_pages_range
__munlock_pagevec
spin_lock_irq(zone->lru_lock)
TestClearPageMlocked(page) - here it finds PG_mlocked already cleared,
so it stops; but meanwhile the new page already has PG_mlocked set and will
stay unevictable

> The page fault path takes the page lock and doesn't use page isolation. This is correct.
> try_to_unmap_cluster() doesn't take the page lock, but it ensures the page is isolated. This is correct too.

I don't see where try_to_unmap_cluster() guarantees that pages other than check_page are isolated
(I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster()
and set PG_mlocked on the old page, with no effect on the new page. Not fatal for the design of lazy mlocking,
but distorted accounting anyway.

> Plus, do_wp_page() has the following comment, but it is wrong: the lock is necessary to protect against
> page migration, not LRU manipulation.
>
> 	if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) {
> 		lock_page(old_page);	/* LRU manipulation */
> 		munlock_vma_page(old_page);
> 		unlock_page(old_page);
> 	}
>
>
> But, you know, this is a crazy ugly lock design. I'm OK with changing the rule so that PG_mlocked must be
> protected by the page lock. If so, I propose adding a PageMlocked() check like this:
>
> 	} else if (!PageMlocked(page) && trylock_page(page)) {
> 		/*
> 		 * If we can lock the page, perform mlock.
> 		 * Otherwise leave the page alone, it will be
> 		 * eventually encountered again later.
> 		 */
> 		mlock_vma_page(page);
> 		unlock_page(page);
> 	}
>
> This is safe. Even if the race happens and vmscan fails to mark PG_mlocked, that's no problem. The next
> vmscan pass may do the right job and fix the mistake.
>
> Anyway, I'm also sure this is not a recent issue. It has lived for 5 years. There is no reason to rush.

Yeah, but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one, as it could leave pages unevictable
when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking.
Another option would be that page migration would somehow deal with the race, or just leave the target pages without
PG_mlocked and let them be dealt with later.
But if we go with the rule that the page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider.
And finally, we could then at least replace the atomic test-and-set with faster non-atomic variants.

2014-01-14 11:05:13

by Vlastimil Babka

Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On 01/13/2014 03:03 PM, Vlastimil Babka wrote:
> On 01/10/2014 06:48 PM, Motohiro Kosaki wrote:
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 068522d..b99c742 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>>>  		BUG_ON(!page || PageAnon(page));
>>> 
>>>  		if (locked_vma) {
>>> -			mlock_vma_page(page);	/* no-op if already mlocked */
>>> -			if (page == check_page)
>>> +			if (page == check_page) {
>>> +				/* we know we have check_page locked */
>>> +				mlock_vma_page(page);
>>>  				ret = SWAP_MLOCK;
>>> +			} else if (trylock_page(page)) {
>>> +				/*
>>> +				 * If we can lock the page, perform mlock.
>>> +				 * Otherwise leave the page alone, it will be
>>> +				 * eventually encountered again later.
>>> +				 */
>>> +				mlock_vma_page(page);
>>> +				unlock_page(page);
>>> +			}
>>>  			continue;	/* don't unmap */
>>>  		}
>>
>> I audited all the related mm code. However, I couldn't find any race that it can close.
>
> Well, I would say the lock here closes the race with page migration, no? (As discussed below)
>
>> First off, the current munlock code is crazy tricky.
>
> Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12).
>
>> munlock
>>   down_write(mmap_sem)
>>   do_mlock()
>>     mlock_fixup
>>       munlock_vma_pages_range
>>         __munlock_pagevec
>>           spin_lock_irq(zone->lru_lock)
>>           TestClearPageMlocked(page)
>>           del_page_from_lru_list
>>           spin_unlock_irq(zone->lru_lock)
>>           lock_page
>>           __munlock_isolated_page
>>           unlock_page
>>
>>   up_write(mmap_sem)
>>
>> Look, TestClearPageMlocked(page) is not protected by lock_page.
>
> Right :( That's my fault; when developing the patch series I didn't see the page
> migration race, and it seemed that the lock was only needed to protect the rmap operations
> in __munlock_isolated_page().
>
>> But this is still
>> safe, because PageMlocked means one or more vmas are marked VM_LOCKED, so we
>> only need to care about turning down the final VM_LOCKED; i.e., mmap_sem protects them.
>>
>> And,
>>
>> spin_lock_irq(zone->lru_lock)
>> del_page_from_lru_list
>> spin_unlock_irq(zone->lru_lock)
>>
>> This idiom ensures that either I or someone else isolates the page from the LRU, and
>> isolated pages will be put back by putback_lru_page() in any case. So the pages will
>> move to the right LRU eventually.
>>
>> And then, taking the page lock doesn't help to close the race vs munlock.
>>
>> On the other hand, page migration has the following call stack.
>>
>> some-caller [isolate_lru_page]
>>   unmap_and_move
>>     __unmap_and_move
>>       trylock_page
>>       try_to_unmap
>>       move_to_new_page
>>         migrate_page
>>           migrate_page_copy
>>       unlock_page
>>
>> The critical part (i.e. migrate_page_copy) is protected by both page isolation and the page lock.
>
> However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(),
> or clearing PG_mlocked in __munlock_pagevec().
>
> The race I have in mind for munlock is:
>
> CPU1 does page migration
> some-caller [isolate_lru_page]
> unmap_and_move
> __unmap_and_move
> trylock_page
> try_to_unmap
> move_to_new_page
> migrate_page
> migrate_page_copy
> mlock_migrate_page - transfers PG_mlocked from old page to new page
>
> CPU2 does munlock:
> munlock
> down_write(mmap_sem)
> do_mlock()
> mlock_fixup
> munlock_vma_pages_range
> __munlock_pagevec
> spin_lock_irq(zone->lru_lock)
> TestClearPageMlocked(page) - here it finds PG_mlocked already cleared,
> so it stops; but meanwhile the new page already has PG_mlocked set and will
> stay unevictable

As Mel pointed out to me, page lock in munlock alone would not help here
anyway, because munlock would just wait for migration to unlock the old
page and then still fail to clear a flag that has been transferred by
migration to the new page. So if there was a race, it would be older
than __munlock_pagevec() removing TestClearPageMlocked(page) from the
page_lock protected section.

But then I noticed that migration bails out for pages that have elevated
pins, and this is done after making the page unreachable for
mlock/munlock via pte migration entries. So this alone should prevent a
race with m(un)lock, with three possible scenarios:
- m(un)lock sees the migration entry and waits for migration to
complete (via follow_page_mask), operates on the new page
- m(un)lock gets the page reference before migration entry is set,
finishes before migration checks the page pin count, migration
transfers the new PG_mlocked state to the new page
- m(un)lock doesn't finish that quickly, migration bails out, m(un)lock
changes the old page that is mapped back
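
The bail-out referred to above is, roughly, the reference count check in
migrate_page_move_mapping() (a simplified sketch; the exact expected-count
bookkeeping is approximate):

	int expected_count = 2 + page_has_private(page);

	spin_lock_irq(&mapping->tree_lock);
	if (page_count(page) != expected_count) {
		/* extra pin, e.g. held by a concurrent m(un)lock:
		 * give up and let the caller restore the old ptes */
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}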


>> The page fault path takes the page lock and doesn't use page isolation. This is correct.
>> try_to_unmap_cluster() doesn't take the page lock, but it ensures the page is isolated. This is correct too.
>
> I don't see where try_to_unmap_cluster() guarantees that pages other than check_page are isolated
> (I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster()
> and set PG_mlocked on the old page, with no effect on the new page. Not fatal for the design of lazy mlocking,
> but distorted accounting anyway.
>
>> Plus, do_wp_page() has the following comment, but it is wrong: the lock is necessary to protect against
>> page migration, not LRU manipulation.
>>
>> 	if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) {
>> 		lock_page(old_page);	/* LRU manipulation */
>> 		munlock_vma_page(old_page);
>> 		unlock_page(old_page);
>> 	}
>>

So the protection is actually for rmap operations in try_to_munlock.
The comment could be removed from the caller here and appear just at the
PageLocked assertion in munlock_vma_page().

>> But, you know, this is a crazy ugly lock design. I'm OK with changing the rule so that PG_mlocked must be
>> protected by the page lock. If so, I propose adding a PageMlocked() check like this:
>>
>> 	} else if (!PageMlocked(page) && trylock_page(page)) {
>> 		/*
>> 		 * If we can lock the page, perform mlock.
>> 		 * Otherwise leave the page alone, it will be
>> 		 * eventually encountered again later.
>> 		 */
>> 		mlock_vma_page(page);
>> 		unlock_page(page);
>> 	}
>>
>> This is safe. Even if the race happens and vmscan fails to mark PG_mlocked, that's no problem. The next
>> vmscan pass may do the right job and fix the mistake.
>>
>> Anyway, I'm also sure this is not a recent issue. It has lived for 5 years. There is no reason to rush.

If we agree that page migration is safe as explained above, then
removing the PageLocked assertion from mlock_vma_page is again an
option, instead of making sure that all callers lock?

> Yeah, but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one, as it could leave pages unevictable
> when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking.
> Another option would be that page migration would somehow deal with the race, or just leave the target pages without
> PG_mlocked and let them be dealt with later.
> But if we go with the rule that the page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider.
> And finally, we could then at least replace the atomic test-and-set with faster non-atomic variants.
>