Function isolate_migratepages_block() runs some checks outside of lru_lock
when choosing pages for migration. After checking PageLRU() it checks for
extra page references by comparing page_count() and page_mapcount(). Between
these two checks the page could be removed from the LRU, freed, and taken
by slab.
As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount().
The race window is tiny. For certain workloads this happens around once a year.
page:ffffea0105ca9380 count:1 mapcount:0 mapping:ffff88ff7712c180 index:0x0 compound_mapcount: 0
flags: 0x500000000008100(slab|head)
raw: 0500000000008100 dead000000000100 dead000000000200 ffff88ff7712c180
raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(PageSlab(page))
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:628!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 77 PID: 504 Comm: kcompactd1 Tainted: G W 4.19.109-27 #1
Hardware name: Yandex T175-N41-Y3N/MY81-EX0-Y3N, BIOS R05 06/20/2019
RIP: 0010:isolate_migratepages_block+0x986/0x9b0
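In outline, the interleaving looks roughly like this (a sketch of the
4.19-era code flow, not verbatim kernel code):

	/*
	 * compaction (kcompactd)             interrupting irq/bh context
	 * ----------------------             ---------------------------
	 * if (!PageLRU(page))
	 *         goto isolate_fail;         ... page still on the LRU ...
	 *
	 *                                    put_page(page);   [last reference]
	 *                                    page freed, reused by slab
	 *
	 * if (!page_mapping(page) &&
	 *     page_count(page) > page_mapcount(page))
	 *         goto isolate_fail;         page_mapcount() now trips
	 *                                    VM_BUG_ON_PAGE(PageSlab(page))
	 */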
To fix this, just open-code page_mapcount() in the racy check for the 0-order
case and recheck carefully under lru_lock, where the page cannot escape from
the LRU.
Also add checks for extra references from file pages and swap cache.
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Fixes: 119d6d59dcc0 ("mm, compaction: avoid isolating pinned pages")
Fixes: 1d148e218a0d ("mm: add VM_BUG_ON_PAGE() to page_mapcount()")
---
mm/compaction.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 46f0fcc93081..91bb87fd9420 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -935,12 +935,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
}
/*
- * Migration will fail if an anonymous page is pinned in memory,
+ * Migration will fail if a page is pinned in memory,
* so avoid taking lru_lock and isolating it unnecessarily in an
- * admittedly racy check.
+ * admittedly racy check, in the simplest case of 0-order pages.
+ *
+ * Open code page_mapcount() to avoid VM_BUG_ON(PageSlab(page)).
+ * The page could have extra references from its mapping or swap cache.
*/
- if (!page_mapping(page) &&
- page_count(page) > page_mapcount(page))
+ if (!PageCompound(page) &&
+ page_count(page) > atomic_read(&page->_mapcount) + 1 +
+ (!PageAnon(page) || PageSwapCache(page)))
goto isolate_fail;
/*
@@ -975,6 +979,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
low_pfn += compound_nr(page) - 1;
goto isolate_fail;
}
+
+ /* Recheck page extra references under lock */
+ if (page_count(page) > page_mapcount(page) +
+ (!PageAnon(page) || PageSwapCache(page)))
+ goto isolate_fail;
}
lruvec = mem_cgroup_page_lruvec(page, pgdat);
On Wed, 13 May 2020 17:05:25 +0300 Konstantin Khlebnikov <[email protected]> wrote:
> Function isolate_migratepages_block() runs some checks outside of lru_lock
> when choosing pages for migration. After checking PageLRU() it checks for
> extra page references by comparing page_count() and page_mapcount(). Between
> these two checks the page could be removed from the LRU, freed, and taken
> by slab.
>
> As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount().
> The race window is tiny. For certain workloads this happens around once a year.
>
> To fix this, just open-code page_mapcount() in the racy check for the 0-order
> case and recheck carefully under lru_lock, where the page cannot escape from
> the LRU.
>
> Also add checks for extra references from file pages and swap cache.
It sounds like a cc:stable is appropriate?
On 13/05/2020 21.32, Andrew Morton wrote:
> It sounds like a cc:stable is appropriate?
Yep, but probably I'm missing something.
It seems the bug has been there for a long time and nobody has seen it.
Am I the only one using CONFIG_DEBUG_VM=y everywhere? =)
On 5/13/20 9:28 PM, Konstantin Khlebnikov wrote:
> On 13/05/2020 21.32, Andrew Morton wrote:
>>
>> It sounds like a cc:stable is appropriate?
>
> Yep, but probably I'm missing something.
>
> It seems the bug has been there for a long time and nobody has seen it.
> Am I the only one using CONFIG_DEBUG_VM=y everywhere? =)
Unless things changed, Fedora kernels had it enabled, AFAIK
On Wed, 13 May 2020 17:05:25 +0300 Konstantin Khlebnikov <[email protected]> wrote:
> Function isolate_migratepages_block() runs some checks outside of lru_lock
> when choosing pages for migration. After checking PageLRU() it checks for
> extra page references by comparing page_count() and page_mapcount(). Between
> these two checks the page could be removed from the LRU, freed, and taken
> by slab.
>
> As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount().
> The race window is tiny. For certain workloads this happens around once a year.
>
> To fix this, just open-code page_mapcount() in the racy check for the 0-order
> case and recheck carefully under lru_lock, where the page cannot escape from
> the LRU.
>
> Also add checks for extra references from file pages and swap cache.
I dunno, this code looks quite nasty. I'm more thinking we should
revert and rethink David's 119d6d59dcc0980dcd58 ("mm, compaction: avoid
isolating pinned pages").
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -935,12 +935,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> }
>
> /*
> - * Migration will fail if an anonymous page is pinned in memory,
> + * Migration will fail if a page is pinned in memory,
> * so avoid taking lru_lock and isolating it unnecessarily in an
> - * admittedly racy check.
> + * admittedly racy check, in the simplest case of 0-order pages.
> + *
> + * Open code page_mapcount() to avoid VM_BUG_ON(PageSlab(page)).
> + * The page could have extra references from its mapping or swap cache.
> */
> - if (!page_mapping(page) &&
> - page_count(page) > page_mapcount(page))
> + if (!PageCompound(page) &&
> + page_count(page) > atomic_read(&page->_mapcount) + 1 +
> + (!PageAnon(page) || PageSwapCache(page)))
> goto isolate_fail;
OK, we happened to notice this because we happened to trigger a
!PageSlab assertion. But if this page has been freed and reused for
slab, it could have been reused for *anything*? Perhaps it was reused
as a migratable page which we'll go ahead and migrate even though we no
longer should. There are various whacky parts of the kernel which
(ab)use surprising struct page fields in surprising ways - how do we
know it isn't one of those which now happens to look like a migratable
page?
I also worry about the next test:
/*
* Only allow to migrate anonymous pages in GFP_NOFS context
* because those do not depend on fs locks.
*/
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;
This page isn't PageLocked(), is it? It could be a recycled page which
will be getting its ->mapping set one nanosecond hence.
On Sat, May 23, 2020 at 4:34 AM Andrew Morton <[email protected]> wrote:
>
> I dunno, this code looks quite nasty. I'm more thinking we should
> revert and rethink David's 119d6d59dcc0980dcd58 ("mm, compaction: avoid
> isolating pinned pages").
>
> OK, we happened to notice this because we happened to trigger a
> !PageSlab assertion. But if this page has been freed and reused for
> slab, it could have been reused for *anything*? Perhaps it was reused
> as a migratable page which we'll go ahead and migrate even though we no
> longer should. There are various whacky parts of the kernel which
> (ab)use surprising struct page fields in surprising ways - how do we
> know it isn't one of those which now happens to look like a migratable
> page?
Here we just optimistically skip as many unwanted pages as possible.
This code rechecks PageLRU and the other tests later, under lru_lock.
lru_lock blocks the freeing path, which must acquire it to remove a page from the LRU.
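In outline, the pattern being defended here is the usual optimistic-check /
recheck-under-lock scheme. A minimal sketch, with a hypothetical
page_looks_pinned() helper standing in for the reference-count test
(neither the helper nor the exact control flow is quoted from the patch):

	/* Racy, best-effort filter: no lock held, result not trusted. */
	if (!PageLRU(page) || page_looks_pinned(page))
		goto isolate_fail;

	spin_lock_irqsave(&pgdat->lru_lock, flags);

	/*
	 * The freeing path must take lru_lock to pull a page off the
	 * LRU, so if PageLRU still holds here the page cannot have been
	 * freed and recycled under us.
	 */
	if (!PageLRU(page)) {
		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
		goto isolate_fail;
	}

	/* Now the earlier checks can be repeated and trusted. */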
>
> I also worry about the next test:
>
> /*
> * Only allow to migrate anonymous pages in GFP_NOFS context
> * because those do not depend on fs locks.
> */
> if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
> goto isolate_fail;
>
> This page isn't PageLocked(), is it? It could be a recycled page which
> will be getting its ->mapping set one nanosecond hence.
Yes, it's racy. I don't see how compaction rechecks this later.
So it could try to unmap and migrate a file page if it races with recycling here.
Probably nobody starts direct compaction without GFP_FS.
On Wed, 13 May 2020, Konstantin Khlebnikov wrote:
> Function isolate_migratepages_block() runs some checks outside of lru_lock
> when choosing pages for migration. After checking PageLRU() it checks for
> extra page references by comparing page_count() and page_mapcount(). Between
> these two checks the page could be removed from the LRU, freed, and taken
> by slab.
>
> As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount().
> The race window is tiny. For certain workloads this happens around once a year.
Around once a year, that was my guess too. I have no record of us ever
hitting this, but yes it could happen when you have CONFIG_DEBUG_VM=y
(which I too like to run with, but would not recommend for users).
>
> To fix this, just open-code page_mapcount() in the racy check for the 0-order
> case and recheck carefully under lru_lock, where the page cannot escape from
> the LRU.
>
> Also add checks for extra references from file pages and swap cache.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Fixes: 119d6d59dcc0 ("mm, compaction: avoid isolating pinned pages")
Not really, that commit was correct at the time it went in.
> Fixes: 1d148e218a0d ("mm: add VM_BUG_ON_PAGE() to page_mapcount()")
Exactly, that commit was well-intentioned, but did not allow for this
(admittedly very exceptional) usage. How many developers actually
make the mistake of applying page_mapcount() to their slab pages?
None, I expect. That VM_BUG_ON_PAGE() is there for documentation,
and could just be replaced by a comment - and Linus would be happy
with that.
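For concreteness, a sketch of what that replacement might look like,
assuming the v5.7-era definition of page_mapcount() in linux/mm.h (this
is the direction the eventual fix took, not a quote of it):

	static inline int page_mapcount(struct page *page)
	{
		/*
		 * Slab pages have no sensible mapcount.  Compaction's
		 * racy, unlocked check may call this on a page that was
		 * just freed and reused by slab; tolerate that here and
		 * let callers recheck once the page is stable.
		 */
		if (unlikely(PageCompound(page)))
			return __page_mapcount(page);
		return atomic_read(&page->_mapcount) + 1;
	}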
> ---
> mm/compaction.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 46f0fcc93081..91bb87fd9420 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -935,12 +935,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> }
>
> /*
> - * Migration will fail if an anonymous page is pinned in memory,
> + * Migration will fail if a page is pinned in memory,
> * so avoid taking lru_lock and isolating it unnecessarily in an
> - * admittedly racy check.
> + * admittedly racy check, in the simplest case of 0-order pages.
> + *
> + * Open code page_mapcount() to avoid VM_BUG_ON(PageSlab(page)).
But open coding page_mapcount() is not all that you did. You have
(understandably) chosen to avoid calling page_mapping(page), but...
> + * The page could have extra references from its mapping or swap cache.
> */
> - if (!page_mapping(page) &&
> - page_count(page) > page_mapcount(page))
> + if (!PageCompound(page) &&
> + page_count(page) > atomic_read(&page->_mapcount) + 1 +
> + (!PageAnon(page) || PageSwapCache(page)))
> goto isolate_fail;
Isn't that test going to send all the file cache pages with buffer heads
in page->private, off to isolate_fail when they're actually great
candidates for migration?
Given that the actual bug spotted was with the VM_BUG_ON_PAGE(PageSlab),
and nobody has reported any crash from the use of page_mapping() there
(and we only need the test to be right most of the time: all of this
knowingly racy, as you explain in other mail): I'd go for just replacing
the VM_BUG_ON_PAGE in page_mapcount() by a comment about this case.
But if you think developers are really in danger of coding page_mapcount()
on their slab pages, then you could add a _page_mapcount() to linux/mm.h,
which omits the VM_BUG_ON_PAGE, for use here only.
Then we wouldn't have to think so hard about the counting above!
>
> /*
> @@ -975,6 +979,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> low_pfn += compound_nr(page) - 1;
> goto isolate_fail;
> }
> +
> + /* Recheck page extra references under lock */
> + if (page_count(page) > page_mapcount(page) +
> + (!PageAnon(page) || PageSwapCache(page)))
> + goto isolate_fail;
Well, that lru_lock (and the intervening PageLRU check after getting it)
may restrict PageAnon and PageSwapCache transitions to some extent, but
it certainly has no effect on page_count and page_mapcount: so I think
such an additional check here is rather superfluous, and we should just
rely on the final checks in migrate_page_move_mapping(), as before.
On 24/05/2020 04.01, Hugh Dickins wrote:
> On Wed, 13 May 2020, Konstantin Khlebnikov wrote:
>
>> Function isolate_migratepages_block() runs some checks outside of lru_lock
>> when choosing pages for migration. After checking PageLRU() it checks for
>> extra page references by comparing page_count() and page_mapcount(). Between
>> these two checks the page could be removed from the LRU, freed, and taken
>> by slab.
>>
>> As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount().
>> The race window is tiny. For certain workloads this happens around once a year.
>
> Around once a year, that was my guess too. I have no record of us ever
> hitting this, but yes it could happen when you have CONFIG_DEBUG_VM=y
> (which I too like to run with, but would not recommend for users).
Yep, but for a large cluster and a pinpointed workload this happens surprisingly
frequently =) I only believed in this race after seeing statistics for the
count of compactions and how it correlates with incidents.
Probably the key component is a slab allocation from network irq/bh context
which interrupts compaction exactly at this spot.
>
>> To fix this, just open-code page_mapcount() in the racy check for the 0-order
>> case and recheck carefully under lru_lock, where the page cannot escape from
>> the LRU.
>>
>> Also add checks for extra references from file pages and swap cache.
>>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>> Fixes: 119d6d59dcc0 ("mm, compaction: avoid isolating pinned pages")
>
> Not really, that commit was correct at the time it went in.
>
>> Fixes: 1d148e218a0d ("mm: add VM_BUG_ON_PAGE() to page_mapcount()")
>
> Exactly, that commit was well-intentioned, but did not allow for this
> (admittedly very exceptional) usage. How many developers actually
> make the mistake of applying page_mapcount() to their slab pages?
> None, I expect. That VM_BUG_ON_PAGE() is there for documentation,
> and could just be replaced by a comment - and Linus would be happy
> with that.
Ok, I'll redo the fix in this way.
>
>> ---
>> mm/compaction.c | 17 +++++++++++++----
>> 1 file changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 46f0fcc93081..91bb87fd9420 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -935,12 +935,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>> }
>>
>> /*
>> - * Migration will fail if an anonymous page is pinned in memory,
>> + * Migration will fail if a page is pinned in memory,
>> * so avoid taking lru_lock and isolating it unnecessarily in an
>> - * admittedly racy check.
>> + * admittedly racy check, in the simplest case of 0-order pages.
>> + *
>> + * Open code page_mapcount() to avoid VM_BUG_ON(PageSlab(page)).
>
> But open coding page_mapcount() is not all that you did. You have
> (understandably) chosen to avoid calling page_mapping(page), but...
>
>> + * The page could have extra references from its mapping or swap cache.
>> */
>> - if (!page_mapping(page) &&
>> - page_count(page) > page_mapcount(page))
>> + if (!PageCompound(page) &&
>> + page_count(page) > atomic_read(&page->_mapcount) + 1 +
>> + (!PageAnon(page) || PageSwapCache(page)))
>> goto isolate_fail;
>
> Isn't that test going to send all the file cache pages with buffer heads
> in page->private, off to isolate_fail when they're actually great
> candidates for migration?
Yes. What a shame. Adding page_has_private() could fix that?
Kind of
    page_count(page) > page_mapcount(page) +
        (PageAnon(page) ? PageSwapCache(page) : (1 + page_has_private(page)))
or probably something like this:
    page_count(page) > page_mapcount(page) +
        (PageAnon(page) ? PageSwapCache(page) : GUP_PIN_COUNTING_BIAS)
I.e., skip only file pages pinned by DMA or something slower.
I see some movement in this direction in recent changes.
Of course, that's an independent matter.
On Sun, 24 May 2020, Konstantin Khlebnikov wrote:
> On 24/05/2020 04.01, Hugh Dickins wrote:
> > On Wed, 13 May 2020, Konstantin Khlebnikov wrote:
> >
> > > Function isolate_migratepages_block() runs some checks outside of lru_lock
> > > when choosing pages for migration. After checking PageLRU() it checks for
> > > extra page references by comparing page_count() and page_mapcount(). Between
> > > these two checks the page could be removed from the LRU, freed, and taken
> > > by slab.
> > >
> > > As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount().
> > > The race window is tiny. For certain workloads this happens around once a year.
> >
> > Around once a year, that was my guess too. I have no record of us ever
> > hitting this, but yes it could happen when you have CONFIG_DEBUG_VM=y
> > (which I too like to run with, but would not recommend for users).
>
> Yep, but for a large cluster and a pinpointed workload this happens surprisingly
> frequently =) I only believed in this race after seeing statistics for the
> count of compactions and how it correlates with incidents.
>
> Probably the key component is a slab allocation from network irq/bh context
> which interrupts compaction exactly at this spot.
Yes, I bet you're right.
>
> >
> > Exactly, that commit was well-intentioned, but did not allow for this
> > (admittedly very exceptional) usage. How many developers actually
> > make the mistake of applying page_mapcount() to their slab pages?
> > None, I expect. That VM_BUG_ON_PAGE() is there for documentation,
> > and could just be replaced by a comment - and Linus would be happy
> > with that.
>
> Ok, I'll redo the fix in this way.
Thanks.
> > > - if (!page_mapping(page) &&
> > > - page_count(page) > page_mapcount(page))
> > > + if (!PageCompound(page) &&
> > > + page_count(page) > atomic_read(&page->_mapcount) + 1 +
> > > + (!PageAnon(page) || PageSwapCache(page)))
> > > goto isolate_fail;
> >
> > Isn't that test going to send all the file cache pages with buffer heads
> > in page->private, off to isolate_fail when they're actually great
> > candidates for migration?
>
> Yes. What a shame. Adding page_has_private() could fix that?
>
> Kind of
>
> page_count(page) > page_mapcount(page) +
> (PageAnon(page) ? PageSwapCache(page) : (1 + page_has_private(page)))
Certainly it was fixable, but I'm too lazy to want to think through
the correct answer; and though I'm often out of sympathy with helper
functions (why do people want an inline bool function for every simple
flag test?!?!), here is a place that cries out for a helper, if you
complicate it beyond page_count > page_mapcount (especially when
driven into that detail of adding 1 to _mapcount).
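For illustration, the kind of helper meant here might fold Konstantin's
page_has_private() variant into a single predicate. A sketch only: the
name, placement, and exact accounting below are hypothetical, not taken
from the thread:

	/* Does the page carry references beyond mapping/cache bookkeeping? */
	static inline bool page_looks_pinned(struct page *page)
	{
		int extra;

		if (PageAnon(page))
			extra = PageSwapCache(page) ? 1 : 0;
		else	/* file page: one ref for ->mapping, maybe buffer heads */
			extra = 1 + page_has_private(page);

		return page_count(page) > page_mapcount(page) + extra;
	}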
>
> or probably something like this:
>
> page_count(page) > page_mapcount(page) +
> (PageAnon(page) ? PageSwapCache(page) : GUP_PIN_COUNTING_BIAS)
>
> I.e., skip only file pages pinned by DMA or something slower.
> I see some movement in this direction in recent changes.
>
> Of course, that's an independent matter.
Yes, once the gup/pin conversion is widespread, I expect that it will
allow a better implementation of this compaction test, one not limited
to the anonymous pages. (We do internally use a patch extending the
current test to file pages, which in practice has saved a lot of time
wasted on failing compactions: but, last I looked anyway, it gets some
cases wrong - cases we happen not to care about ourselves, but would
be unacceptable upstream. So I hope the distinction of pinned pages
will work out well here later.)
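For a flavor of that direction: with the FOLL_PIN work in place, one could
imagine the racy filter keying off the pin count instead. A hypothetical
sketch, assuming page_maybe_dma_pinned() (a heuristic from the FOLL_PIN
series) is available; its use here is not from the thread:

	/*
	 * Best-effort: pages pinned via FOLL_PIN carry a large refcount
	 * bias, so they can be skipped for both the anon and file cases
	 * without counting mapping/cache references.
	 */
	if (page_maybe_dma_pinned(page))
		goto isolate_fail;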
Hugh
Andrew, I've noticed that this buggy
mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch
was still in Friday's mmotm 2020-05-29-16-09, despite its replacement
6988f31d558a ("mm: remove VM_BUG_ON(PageSlab()) from page_mapcount()")
getting into 5.7, thanks to your "incoming" to Linus on that day.
Please be sure to remove this patch to mm/compaction.c from your tree
and queue to Linus for 5.8: it imposes an unintended and significant
limitation on the current behavior of compaction. (And in some loads,
some of that additional limitation may actually be beneficial: but if
so, must be argued separately, not as page_mapcount BUG avoidance).
Cc'ing Alex Shi, because I noticed this when trying his v11 per-memcg
lru_lock series (which appears to be a big improvement over earlier
versions, thanks in particular to Johannes's memcg swap simplifications);
and Alex's 12/16 makes a change on top of Konstantin's latter check,
which will now just be reverted. I'm not yet confident in Alex's
isolate_migratepages_block(), in part because of this muddle.
Thanks,
Hugh
On Mon, 1 Jun 2020 21:05:25 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
> Andrew, I've noticed that this buggy
> mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch
> was still in Friday's mmotm 2020-05-29-16-09, despite its replacement
> 6988f31d558a ("mm: remove VM_BUG_ON(PageSlab()) from page_mapcount()")
> getting into 5.7, thanks to your "incoming" to Linus on that day.
Thanks, gone.
On 2020/6/2 12:05 PM, Hugh Dickins wrote:
> Cc'ing Alex Shi, because I noticed this when trying his v11 per-memcg
> lru_lock series (which appears to be a big improvement over earlier
> versions, thanks in particular to Johannes's memcg swap simplifications);
> and Alex's 12/16 makes a change on top of Konstantin's latter check,
> which will now just be reverted. I'm not yet confident in Alex's
> isolate_migratepages_block(), in part because this muddle.
Hi Hugh,
Yes, this could make a very tricky change to compaction behavior. I will update
the patchset in the next revision.
Thanks a lot for noticing this! And looking forward to more comments on the
per-memcg lru_lock patchset! :)
Thanks
Alex