Hi Kirill,
I had a nasty thought this morning.
Andrew had prodded me gently to re-examine my concerns with your
page-flags rework in mmotm. I still dislike the bloat (my mm/built-in.o
text goes up from 478513 to 490183 bytes on a non-DEBUG_VM build); but I
was hoping to set that aside, to let us move forward.
But looking into the bloat led me to what seems a more serious issue
with it. I'd tacked a little function on to the end of mm/filemap.c:
bool page_is_locked(struct page *page)
{
	return !!PageLocked(page);
}
which came out as:
0000000000003a60 <page_is_locked>:
3a60: 48 8b 07 mov (%rdi),%rax
3a63: 55 push %rbp
3a64: 48 89 e5 mov %rsp,%rbp
[instructions above same as without your patches; those below added by them]
3a67: f6 c4 80 test $0x80,%ah
3a6a: 74 10 je 3a7c <page_is_locked+0x1c>
3a6c: 48 8b 47 30 mov 0x30(%rdi),%rax
3a70: 48 8b 17 mov (%rdi),%rdx
3a73: 80 e6 80 and $0x80,%dh
3a76: 48 0f 44 c7 cmove %rdi,%rax
3a7a: eb 03 jmp 3a7f <page_is_locked+0x1f>
3a7c: 48 89 f8 mov %rdi,%rax
3a7f: 48 8b 00 mov (%rax),%rax
[instructions above added by your patches; those below same as before]
3a82: 5d pop %rbp
3a83: 83 e0 01 and $0x1,%eax
3a86: c3 retq
The "and $0x80,%dh" looked superfluous at first, but of course it isn't:
it's from the smp_rmb() in David's 668f9abbd433 "mm: close PageTail race"
(a later commit refactors compound_head() but doesn't change the story).
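For reference, the source behind that sequence looks roughly like this in
the v4.2-rc tree (paraphrased from include/linux/mm.h, comments mine):
static inline struct page *compound_head_by_tail(struct page *tail)
{
	struct page *head = tail->first_page;
	/*
	 * ->first_page may already be stale by the time we read it, so
	 * pair with the writer's smp_wmb() and recheck PageTail() before
	 * trusting it: that recheck is the "and $0x80,%dh" + cmove above.
	 */
	smp_rmb();
	if (likely(PageTail(tail)))
		return head;
	return tail;
}
static inline struct page *compound_head(struct page *page)
{
	if (unlikely(PageTail(page)))
		return compound_head_by_tail(page);
	return page;
}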
And it's that race, or a worse race of that kind, that now worries me.
Relying on smp_wmb() and smp_rmb() may be all that was needed in the
case that David was fixing; and (I dare not look at them to audit!)
all uses of compound_head() in our current v4.2-rc tree may well be
safe, for this or that contingent reason in each place that it's used.
But there is no locking within compound_head(page) to make it safe
everywhere, yet your page-flags rework is changing a large number
of PageWhatever()s and SetPageWhatever()s and ClearPageWhatever()s
now to do a hidden compound_head(page) beneath the covers.
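(To make the effect concrete, a test like PageLocked() now behaves roughly
as below - a sketch of the behaviour, not the exact macro expansion in
mmotm:)
static inline int PageLocked(struct page *page)
{
	page = compound_head(page);	/* the hidden head lookup */
	return test_bit(PG_locked, &page->flags);
}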
To be more specific: if preemption, or an interrupt, or entry to SMM
mode, or whatever, delays this thread somewhere in that compound_head()
sequence of instructions, how can we be sure that the "head" returned
by compound_head() is good? We know the page was PageTail just before
looking up page->first_page, and we know it was PageTail just after,
but we don't know that it was PageTail throughout, and we don't know
whether page->first_page is even a good page pointer, or something
else from the private/ptl/slab_cache union.
Of course it would be very rare for it to go wrong; and most callsites
will obviously be safe for this or that reason; though, sadly, none of
them made safe by holding a reference to the tail page in question, since
its count is frozen at 0 and cannot be grabbed by get_page_unless_zero.
But I don't see how it can be safe to rely on compound_head() inside
a general purpose page-flag function, that we're all accustomed to
think of as a simple bitop, that can be applied without great care.
Hugh
On Wed, Aug 05, 2015 at 09:15:57PM -0700, Hugh Dickins wrote:
> Hi Kirill,
>
> I had a nasty thought this morning.
Tough day.
I'm trying to wrap my head around this mail and I'm not sure I'm
succeeding much. :-|
> Andrew had prodded me gently to re-examine my concerns with your
> page-flags rework in mmotm. I still dislike the bloat (my mm/built-in.o
> text goes up from 478513 to 490183 bytes on a non-DEBUG_VM build); but I
> was hoping to set that aside, to let us move forward.
>
> But looking into the bloat led me to what seems a more serious issue
> with it. I'd tacked a little function on to the end of mm/filemap.c:
>
> bool page_is_locked(struct page *page)
> {
> return !!PageLocked(page);
> }
>
> which came out as:
>
> 0000000000003a60 <page_is_locked>:
> 3a60: 48 8b 07 mov (%rdi),%rax
> 3a63: 55 push %rbp
> 3a64: 48 89 e5 mov %rsp,%rbp
>
> [instructions above same as without your patches; those below added by them]
>
> 3a67: f6 c4 80 test $0x80,%ah
> 3a6a: 74 10 je 3a7c <page_is_locked+0x1c>
> 3a6c: 48 8b 47 30 mov 0x30(%rdi),%rax
> 3a70: 48 8b 17 mov (%rdi),%rdx
> 3a73: 80 e6 80 and $0x80,%dh
> 3a76: 48 0f 44 c7 cmove %rdi,%rax
> 3a7a: eb 03 jmp 3a7f <page_is_locked+0x1f>
> 3a7c: 48 89 f8 mov %rdi,%rax
> 3a7f: 48 8b 00 mov (%rax),%rax
>
> [instructions above added by your patches; those below same as before]
>
> 3a82: 5d pop %rbp
> 3a83: 83 e0 01 and $0x1,%eax
> 3a86: c3 retq
>
> The "and $0x80,%dh" looked superfluous at first, but of course it isn't:
> it's from the smp_rmb() in David's 668f9abbd433 "mm: close PageTail race"
> (a later commit refactors compound_head() but doesn't change the story).
>
> And it's that race, or a worse race of that kind, that now worries me.
> Relying on smp_wmb() and smp_rmb() may be all that was needed in the
> case that David was fixing; and (I dare not look at them to audit!)
> all uses of compound_head() in our current v4.2-rc tree may well be
> safe, for this or that contingent reason in each place that it's used.
>
> But there is no locking within compound_head(page) to make it safe
> everywhere, yet your page-flags rework is changing a large number
> of PageWhatever()s and SetPageWhatever()s and ClearPageWhatever()s
> now to do a hidden compound_head(page) beneath the covers.
>
> To be more specific: if preemption, or an interrupt, or entry to SMM
> mode, or whatever, delays this thread somewhere in that compound_head()
> sequence of instructions, how can we be sure that the "head" returned
> by compound_head() is good? We know the page was PageTail just before
> looking up page->first_page, and we know it was PageTail just after,
> but we don't know that it was PageTail throughout, and we don't know
> whether page->first_page is even a good page pointer, or something
> else from the private/ptl/slab_cache union.
That looks like a very valid worry to me, for the current -mm tree.
But let's take my refcounting rework into the picture.
One thing it simplifies is protection against splitting. Once you've got a
reference to a page, it cannot be split under you. It makes PageTail() and
->first_page stable for most callsites.
We can access the page's flags under ptl, without holding a reference to
the page. And that's fine: ptl protects against splitting too.
Fast GUP also has a way to protect against splits.
IIUC, the only potentially problematic callsites left are physical memory
scanners. This code requires audit. I'll do that.
Do I miss something else?
> Of course it would be very rare for it to go wrong; and most callsites
> will obviously be safe for this or that reason; though, sadly, none of
> them safe from holding a reference to the tail page in question, since
> its count is frozen at 0 and cannot be grabbed by get_page_unless_zero.
Do you mean that grabbing head page's ->_count is not enough to protect
against splitting and freeing tail page under you?
I know a patchset which solves this! ;)
> But I don't see how it can be safe to rely on compound_head() inside
> a general purpose page-flag function, that we're all accustomed to
> think of as a simple bitop, that can be applied without great care.
>
> Hugh
>
--
Kirill A. Shutemov
On Thu, 6 Aug 2015, Kirill A. Shutemov wrote:
> On Wed, Aug 05, 2015 at 09:15:57PM -0700, Hugh Dickins wrote:
> > Hi Kirill,
> >
> > I had a nasty thought this morning.
>
> Tough day.
>
> I'm trying to wrap my head around this mail and not sure if I succeed
> much. :-|
Sorry for not being clearer.
> > To be more specific: if preemption, or an interrupt, or entry to SMM
> > mode, or whatever, delays this thread somewhere in that compound_head()
> > sequence of instructions, how can we be sure that the "head" returned
> > by compound_head() is good? We know the page was PageTail just before
> > looking up page->first_page, and we know it was PageTail just after,
> > but we don't know that it was PageTail throughout, and we don't know
> > whether page->first_page is even a good page pointer, or something
> > else from the private/ptl/slab_cache union.
>
> That looks like a very valid worry to me. For current -mm tree.
>
> But let's take my refcounting rework into picture.
Okay, let's do so. I get very confused trying to think based on two
alternative schemes at the same time, so I'm happy to assume your
THP refcounting rework (which certainly has plenty to like in it
from the point of view of cleanup - though at present I think the
mlock splitting leaves it with a net regression in functionality).
That does say that this page-flags rework should not go to Linus
separately from your refcounting rework: I don't think the issues
here are ever likely to break someone's bisection, so it's fine for
the one series to precede the other, but any release should contain
both or neither.
>
> One thing it simplifies is protection against splitting. Once you've got a
> reference to a page, it cannot be split under you. It makes PageTail() and
> ->first_page stable for most callsites.
Yes, but since you cannot acquire a reference to a tail page itself
(since it has count 0 throughout), don't you mean there that you
already hold a reference to the head?
In which case, why bother to make all the PageFlags operations on
tails redirect to the head: if the caller must hold a reference to
the head, then the caller should apply PageFlags to that head, with
no need for compound_head() redirection inside the operation, just a
VM_BUG_ON(PageTail).
Or so it seems from the outside: perhaps that becomes unworkable
somehow once you try to implement it.
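(Something like this, say - just a sketch of the alternative:)
static inline int PageLocked(struct page *page)
{
	/* caller must already know, and pass in, the head page */
	VM_BUG_ON_PAGE(PageTail(page), page);
	return test_bit(PG_locked, &page->flags);
}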
>
> We can access the page's flags under ptl, without having reference the
> page. And that's fine: ptl protects against splitting too.
>
> Fast GUP also have a way to protect against split.
Yes and yes. Perhaps it's those accesses under ptl which took you
in this compound_head-inside-PageFlags direction. Fast GUP is easy
to do something special in, but there's probably a lot of scattered
PageFlags operations under ptl, which were tiresome to fiddle with
when you came to allow pte mappings of THP subpages.
>
> IIUC, the only potentially problematic callsites left are physical memory
> scanners. This code requires audit. I'll do that.
Please.
>
> Do I miss something else?
Probably not; but please check - and I'm afraid you've set things up
so that every use of a PageFlags operation needs to be thought about,
if only briefly.
It's certainly the physical approaches to a page (isolation, compaction,
formerly lumpy reclaim, are there others? /proc things?) which have
always been very tricky to get right.
I think it was for those that David added the barriered double PageTail
checking. I wonder if something extra special should be done just there,
in the physical scans; and the barriered double PageTail checking avoided
elsewhere, in the normal places that you reckon are safe already.
Mind you, shifting the unlikely PageTail handling out of line to a
called function would reduce the bloat considerably, then maybe it
wouldn't matter how complicated it gets for the general case.
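(Along these lines, say, with __compound_head_tail() a made-up name for
the out-of-line part:)
/* out of line, somewhere in mm/: the barriers and recheck live here */
struct page *__compound_head_tail(struct page *tail);
static inline struct page *compound_head(struct page *page)
{
	if (likely(!PageTail(page)))
		return page;
	return __compound_head_tail(page);
}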
>
> > Of course it would be very rare for it to go wrong; and most callsites
> > will obviously be safe for this or that reason; though, sadly, none of
> > them safe from holding a reference to the tail page in question, since
> > its count is frozen at 0 and cannot be grabbed by get_page_unless_zero.
>
> Do you mean that grabbing head page's ->_count is not enough to protect
> against splitting and freeing tail page under you?
No, I mean that if you know head already then why are you bothering with
tail; and if you only have tail, then locating head in all the cases where
the PageFlags operation might be called may be unsafe in a few of them.
And that it's not possible to acquire a reference to the tail page to
make this safe. But I accept your point above, that the existence of
a pte in a locked page table amounts to a stable reference, even though
it does not contribute to that tail page's reference count.
>
> I know a patchset which solves this! ;)
Oh, and I know a patchset which avoids these problems completely,
by not using compound pages at all ;)
Hugh
On Thu, 6 Aug 2015, Hugh Dickins wrote:
> > I know a patchset which solves this! ;)
>
> Oh, and I know a patchset which avoids these problems completely,
> by not using compound pages at all ;)
Another dumb idea: Stop the insanity of splitting pages on the fly?
Splitting pages should work like page migration: Lock everything down and
ensure no one is using the page and then do it. That way the compound page
and its metadata are as stable as a regular page.
On Thu, Aug 06, 2015 at 12:24:22PM -0700, Hugh Dickins wrote:
> > I'm trying to wrap my head around this mail and not sure if I succeed
> > much. :-|
>
> Sorry for not being clearer.
Not your fault.
The problem you've pointed to is on the edge of my understanding of concurrency.
> On Thu, 6 Aug 2015, Kirill A. Shutemov wrote:
> > > To be more specific: if preemption, or an interrupt, or entry to SMM
> > > mode, or whatever, delays this thread somewhere in that compound_head()
> > > sequence of instructions, how can we be sure that the "head" returned
> > > by compound_head() is good? We know the page was PageTail just before
> > > looking up page->first_page, and we know it was PageTail just after,
> > > but we don't know that it was PageTail throughout, and we don't know
> > > whether page->first_page is even a good page pointer, or something
> > > else from the private/ptl/slab_cache union.
> >
> > That looks like a very valid worry to me. For current -mm tree.
> >
> > But let's take my refcounting rework into picture.
>
> Okay, let's do so. I get very confused trying to think based on two
> alternative schemes at the same time, so I'm happy to assume your
> THP refcounting rework (which certainly has plenty to like in it
> from the point of view of cleanup - though at present I think the
> mlock splitting leaves it with a net regression in functionality).
The plan is to bring it a bit later. The refcounting patchset is huge
enough as it is.
> That does say that this page-flags rework should not go to Linus
> separately from your refcounting rework: I don't think the issues
> here are ever likely to break someone's bisection, so it's fine for
> the one series to precede the other, but any release should contain
> both or neither.
Agreed.
> > One thing it simplifies is protection against splitting. Once you've got a
> > reference to a page, it cannot be split under you. It makes PageTail() and
> > ->first_page stable for most callsites.
>
> Yes, but since you cannot acquire a reference to a tail page itself
> (since it has count 0 throughout), don't you mean there that you
> already hold a reference to the head?
>
> In which case, why bother to make all the PageFlags operations on
> tails redirect to the head: if the caller must hold a reference to
> the head, then the caller should apply PageFlags to that head, with
> no need for compound_head() redirection inside the operation, just a
> VM_BUG_ON(PageTail).
get_page() and put_page() hide the fact that refcounting is applied to the
head page. And that's handy. Otherwise we would need to drag pointers to
two pages around on the caller side, instead of the one we have now.
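(Roughly, as a sketch of the intent rather than of the exact code in the
patchset:)
static inline void get_page(struct page *page)
{
	/* all references, including those taken on tails, land on the head */
	page = compound_head(page);
	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
	atomic_inc(&page->_count);
}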
The only special case is again get_page_unless_zero() users. They have to
deal with head vs. tail pages on their own. We have only a few such places,
and I believe it's manageable.
> Or so it seems from the outside: perhaps that becomes unworkable
> somehow once you try to implement it.
>
> >
> > We can access the page's flags under ptl, without having reference the
> > page. And that's fine: ptl protects against splitting too.
> >
> > Fast GUP also have a way to protect against split.
>
> Yes and yes. Perhaps it's those accesses under ptl which took you
> in this compound_head-inside-PageFlags direction. Fast GUP is easy
> to do something special in, but there's probably a lot of scattered
> PageFlags operations under ptl, which were tiresome to fiddle with
> when you came to allow pte mappings of THP subpages.
Right.
> > IIUC, the only potentially problematic callsites left are physical memory
> > scanners. This code requires audit. I'll do that.
>
> Please.
I'll bring a report on this next week.
> > Do I miss something else?
>
> Probably not; but please check - and I'm afraid you've set things up
> so that every use of a PageFlags operation needs to be thought about,
> if only briefly.
>
> It's certainly the physical approaches to a page (isolation, compaction,
> formerly lumpy reclaim, are there others? /proc things?) which have
> always been very tricky to get right.
I didn't think about /proc as a potential issue. Thanks.
> I think it was for those that David added the barriered double PageTail
> checking. I wonder if something extra special should be done just there,
> in the physical scans; and the barriered double PageTail checking avoided
> elsewhere, in the normal places that you reckon are safe already.
>
> Mind you, shifting the unlikely PageTail handling out of line to a
> called function would reduce the bloat considerably, then maybe it
> wouldn't matter how complicated it gets for the general case.
I'll try that.
> > > Of course it would be very rare for it to go wrong; and most callsites
> > > will obviously be safe for this or that reason; though, sadly, none of
> > > them safe from holding a reference to the tail page in question, since
> > > its count is frozen at 0 and cannot be grabbed by get_page_unless_zero.
> >
> > Do you mean that grabbing head page's ->_count is not enough to protect
> > against splitting and freeing tail page under you?
>
> No, I mean that if you know head already then why are you bothering with
> tail; and if you only have tail, then locating head in all the cases where
> the PageFlags operation might be called may be unsafe in a few of them.
See above.
> And that it's not possible to acquire a reference to the tail page to
> make this safe. But I accept your point above, that the existence of
> a pte in a locked page table amounts to a stable reference, even though
> it does not contribute to that tail page's reference count.
>
> >
> > I know a patchset which solves this! ;)
>
> Oh, and I know a patchset which avoids these problems completely,
> by not using compound pages at all ;)
BTW, I haven't heard anything about the patchset for a while.
What's the status?
Optimizing rmap operations in my patchset (see PG_double_map), I found
that it would be very tricky to expand team pages to anon-THP without a
performance regression on the rmap side, due to the number of atomic ops
it requires.
Is there any clever approach to the issue?
Team pages are probably fine for file mappings due to the different
performance baseline. I'm less optimistic about anon-THP.
--
Kirill A. Shutemov
On Thu, Aug 06, 2015 at 03:45:31PM -0500, Christoph Lameter wrote:
> On Thu, 6 Aug 2015, Hugh Dickins wrote:
>
> > > I know a patchset which solves this! ;)
> >
> > Oh, and I know a patchset which avoids these problems completely,
> > by not using compound pages at all ;)
>
> Another dumb idea: Stop the insanity of splitting pages on the fly?
> Splitting pages should work like page migration: Lock everything down and
> ensure no one is using the page and then do it. That way the compound pages
> and its metadata are as stable as a regular page.
That's what I do in the refcounting patchset.
--
Kirill A. Shutemov
On Fri, 7 Aug 2015, Kirill A. Shutemov wrote:
> On Thu, Aug 06, 2015 at 03:45:31PM -0500, Christoph Lameter wrote:
> > On Thu, 6 Aug 2015, Hugh Dickins wrote:
> >
> > > > I know a patchset which solves this! ;)
> > >
> > > Oh, and I know a patchset which avoids these problems completely,
> > > by not using compound pages at all ;)
> >
> > Another dumb idea: Stop the insanity of splitting pages on the fly?
> > Splitting pages should work like page migration: Lock everything down and
> > ensure no one is using the page and then do it. That way the compound pages
> > and its metadata are as stable as a regular page.
>
> That's what I do in refcounting patchset.
Looks like you make refcounting easier and avoid splitting in some cases,
maybe only splitting the pmd. But the fundamental issue still remains.
Complexity is high since individual pages of a compound can be mapped and
unmapped in multiple processes.
The compound would need to be always treated as a single order N entity
in order to really get things simplified and make the code cleaner.
Either all pages are mapped or none. Otherwise you have to manage a
schizophrenic view of pages: sometimes an order N sized entity is
managed and sometimes a base page sized page which is a fraction of the
whole. Such a view of a memory object is pretty difficult to manage.
On Fri, Aug 07, 2015 at 10:28:49AM -0500, Christoph Lameter wrote:
> On Fri, 7 Aug 2015, Kirill A. Shutemov wrote:
>
> > On Thu, Aug 06, 2015 at 03:45:31PM -0500, Christoph Lameter wrote:
> > > On Thu, 6 Aug 2015, Hugh Dickins wrote:
> > >
> > > > > I know a patchset which solves this! ;)
> > > >
> > > > Oh, and I know a patchset which avoids these problems completely,
> > > > by not using compound pages at all ;)
> > >
> > > Another dumb idea: Stop the insanity of splitting pages on the fly?
> > > Splitting pages should work like page migration: Lock everything down and
> > > ensure no one is using the page and then do it. That way the compound pages
> > > and its metadata are as stable as a regular page.
> >
> > That's what I do in refcounting patchset.
>
> Looks like you make refcounting easier and avoid splitting in some cases
> maybe only splitting the pmd. But the fundamental issue still remains.
> Complexity is high since individual pages of a compound can be mapped and
> unmapped in multiple processes.
>
> The compound would need to be always treated as a single order N entity
> in order to really get things simplified and make code cleaner.
>
> Either all pages are mapped or none. Otherwise you have to manage the
> a schizoprenic view of pages. Sometimes an order N size entity is
> managed and sometimes a base page size page which is a fraction of the
> whole. Such a view of a memory object is pretty difficult to manage.
I don't see anything actionable here. Your wish list doesn't cope with
reality. Compound pages have been mapped with PTEs for almost ten years
and I don't see why we should stop the practice.
--
Kirill A. Shutemov
On Mon, 10 Aug 2015, Kirill A. Shutemov wrote:
> I don't see anything actionable here. Your wish list doesn't cope with
> reality. Compound pages are mapped with PTEs for almost ten years and I
> don't see why we should stop the practice.
Well, they have to be if they are smaller than huge pages. Treating each
PTE as having its own state, instead of having the whole compound mapped
completely, causes the problem. Refcounting in tail pages is not necessary
if the whole compound is either mapped or not mapped at all by a process.
Refcounting in tail pages is only necessary if you allow 4k slices to be
mapped.
On Thu, Aug 06, 2015 at 12:24:22PM -0700, Hugh Dickins wrote:
> > IIUC, the only potentially problematic callsites left are physical memory
> > scanners. This code requires audit. I'll do that.
>
> Please.
I haven't finished the exercise yet. But here's an issue I believe is
present in the current *Linus* tree:
From e78eec7d7a8c4cba8b5952a997973f7741e704f4 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Wed, 12 Aug 2015 17:09:16 +0300
Subject: [PATCH] mm: fix potential race in isolate_migratepages_block()
Hugh has pointed out that a compound_head() call can be unsafe in some
contexts. Here's one example:
CPU0                                        CPU1

isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
      head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
      !!PageTail() == true
      <head == NULL dereferencing>
The race is purely theoretical. I don't think it's possible to trigger it
in practice. But who knows.
This can be fixed by avoiding compound_head() in the unsafe context.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Vlastimil Babka <[email protected]>
---
mm/compaction.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 86f04e556f96..bec727b700d3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,7 +787,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 * admittedly racy check.
 		 */
 		if (!page_mapping(page) &&
-		    page_count(page) > page_mapcount(page))
+		    atomic_read(&page->_count) > page_mapcount(page))
 			continue;
 
 		/* If we already hold the lock, we can skip some rechecking */
--
Kirill A. Shutemov
On 08/12/2015 04:35 PM, Kirill A. Shutemov wrote:
> On Thu, Aug 06, 2015 at 12:24:22PM -0700, Hugh Dickins wrote:
>>> IIUC, the only potentially problematic callsites left are physical memory
>>> scanners. This code requires audit. I'll do that.
>>
>> Please.
>
> I haven't finished the exercise yet. But here's an issue I believe present
> in current *Linus* tree:
>
> From e78eec7d7a8c4cba8b5952a997973f7741e704f4 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Wed, 12 Aug 2015 17:09:16 +0300
> Subject: [PATCH] mm: fix potential race in isolate_migratepages_block()
>
> Hugh has pointed that compound_head() call can be unsafe in some context.
> There's one example:
>
> CPU0                                        CPU1
>
> isolate_migratepages_block()
>   page_count()
>     compound_head()
>       !!PageTail() == true
>                                             put_page()
>                                               tail->first_page = NULL
>       head = tail->first_page
>                                             alloc_pages(__GFP_COMP)
>                                               prep_compound_page()
>                                                 tail->first_page = head
>                                                 __SetPageTail(p);
>       !!PageTail() == true
>       <head == NULL dereferencing>
>
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
It's even less probable thanks to the fact that before this check we
determined it's a PageLRU (and thus !PageTail).
>
> This can be fixed by avoiding compound_head() in unsafe context.
This is OK because if the page becomes a tail and we read a zero page
count, it's not fatal.
> Signed-off-by: Kirill A. Shutemov <[email protected]>
Fixes: 119d6d59dc ("mm, compaction: avoid isolating pinned pages")
Potentially stable 3.15+ if theoretical races qualify. They don't per
stable rules, but we seem to be bending that a lot anyway.
> Cc: Hugh Dickins <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> mm/compaction.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 86f04e556f96..bec727b700d3 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -787,7 +787,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> * admittedly racy check.
> */
> if (!page_mapping(page) &&
> - page_count(page) > page_mapcount(page))
> + atomic_read(&page->_count) > page_mapcount(page))
> continue;
>
> /* If we already hold the lock, we can skip some rechecking */
>
On Wed, 12 Aug 2015 17:35:09 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
> On Thu, Aug 06, 2015 at 12:24:22PM -0700, Hugh Dickins wrote:
> > > IIUC, the only potentially problematic callsites left are physical memory
> > > scanners. This code requires audit. I'll do that.
> >
> > Please.
>
> I haven't finished the exercise yet. But here's an issue I believe present
> in current *Linus* tree:
>
> From e78eec7d7a8c4cba8b5952a997973f7741e704f4 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Wed, 12 Aug 2015 17:09:16 +0300
> Subject: [PATCH] mm: fix potential race in isolate_migratepages_block()
>
> Hugh has pointed that compound_head() call can be unsafe in some context.
> There's one example:
>
> CPU0                                        CPU1
>
> isolate_migratepages_block()
>   page_count()
>     compound_head()
>       !!PageTail() == true
>                                             put_page()
>                                               tail->first_page = NULL
>       head = tail->first_page
>                                             alloc_pages(__GFP_COMP)
>                                               prep_compound_page()
>                                                 tail->first_page = head
>                                                 __SetPageTail(p);
>       !!PageTail() == true
>       <head == NULL dereferencing>
>
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
>
> This can be fixed by avoiding compound_head() in unsafe context.
This is nuts :( page_count() should Just Work without us having to
worry about bizarre races against splitting. Sigh.
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -787,7 +787,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> * admittedly racy check.
> */
> if (!page_mapping(page) &&
> - page_count(page) > page_mapcount(page))
> + atomic_read(&page->_count) > page_mapcount(page))
> continue;
If we're going to do this sort of thing, can we please do it in a more
transparent manner? Let's not sprinkle unexplained and
incomprehensible direct accesses to ->_count all over the place.
Create a formal function to do this, with an appropriate name and with
documentation which fully explains what's going on. Then use that
here, and in has_unmovable_pages() (at least).
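Something like the helper below, say (the name is invented here, purely to
illustrate the kind of function being asked for):
/*
 * Read the reference count of this exact struct page, without the
 * compound_head() redirection that page_count() does.  Only for racy
 * physical scanners which may be looking at a tail page whose
 * ->first_page cannot yet be trusted.
 */
static inline int page_count_raw(struct page *page)
{
	return atomic_read(&page->_count);
}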
On Wed, Aug 12, 2015 at 02:16:44PM -0700, Andrew Morton wrote:
> On Wed, 12 Aug 2015 17:35:09 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > On Thu, Aug 06, 2015 at 12:24:22PM -0700, Hugh Dickins wrote:
> > > > IIUC, the only potentially problematic callsites left are physical memory
> > > > scanners. This code requires audit. I'll do that.
> > >
> > > Please.
> >
> > I haven't finished the exercise yet. But here's an issue I believe present
> > in current *Linus* tree:
> >
> > From e78eec7d7a8c4cba8b5952a997973f7741e704f4 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <[email protected]>
> > Date: Wed, 12 Aug 2015 17:09:16 +0300
> > Subject: [PATCH] mm: fix potential race in isolate_migratepages_block()
> >
> > Hugh has pointed that compound_head() call can be unsafe in some context.
> > There's one example:
> >
> > CPU0                                        CPU1
> >
> > isolate_migratepages_block()
> >   page_count()
> >     compound_head()
> >       !!PageTail() == true
> >                                             put_page()
> >                                               tail->first_page = NULL
> >       head = tail->first_page
> >                                             alloc_pages(__GFP_COMP)
> >                                               prep_compound_page()
> >                                                 tail->first_page = head
> >                                                 __SetPageTail(p);
> >       !!PageTail() == true
> >       <head == NULL dereferencing>
> >
> > The race is pure theoretical. I don't it's possible to trigger it in
> > practice. But who knows.
> >
> > This can be fixed by avoiding compound_head() in unsafe context.
>
> This is nuts :( page_count() should Just Work without us having to
> worry about bizarre races against splitting. Sigh.
Splitting is not involved. And this race is present even with THP=n. :(
>
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -787,7 +787,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > * admittedly racy check.
> > */
> > if (!page_mapping(page) &&
> > - page_count(page) > page_mapcount(page))
> > + atomic_read(&page->_count) > page_mapcount(page))
> > continue;
>
> If we're going to do this sort of thing, can we please do it in a more
> transparent manner? Let's not sprinkle unexplained and
> incomprehensible direct accesses to ->_count all over the place.
>
> Create a formal function to do this, with an appropriate name and with
> documentation which fully explains what's going on. Then use that
> here, and in has_unmovable_pages() (at least).
This whole situation is ugly. I'm thinking about a more general solution
for the PageTail() vs. ->first_page race.
We would be able to avoid the race in the first place if we encoded
PageTail() and the position of the head page within the same word in
struct page. That way we update both things in one shot, without any
possibility of a race.
Details get tricky.
I'm going to try something like this tomorrow: encode the position of the
head as an offset from the tail page and store it as a negative number in
the union with ->mapping and ->s_mem. PageTail() can then be implemented
as a check that the field's value is in the range -1..-MAX_ORDER_NR_PAGES.
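Something along these lines - very much a sketch of the idea rather than
working code:
/*
 * Sketch only: one word, shared with ->mapping / ->s_mem.  A negative
 * value means "tail page", and its magnitude is the offset back to the
 * head page, so the flag and the head location are read in one shot.
 */
static inline struct page *compound_head(struct page *page)
{
	long off = (long)READ_ONCE(page->mapping);

	if (off < 0 && off >= -MAX_ORDER_NR_PAGES)
		return page + off;	/* off is negative */
	return page;
}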
I'm not sure at all that it's going to work, especially looking at the
ridiculously high CONFIG_FORCE_MAX_ZONEORDER some architectures allow.
We could also try to encode the page order instead (again as a negative
number) and calculate the head page position based on alignment...
Any other ideas are welcome.
--
Kirill A. Shutemov
On Thu, 13 Aug 2015, Kirill A. Shutemov wrote:
>
> All this situation is ugly. I'm thinking on more general solution for
> PageTail() vs. ->first_page race.
>
> We would be able to avoid the race in first place if we encode PageTail()
> and position of head page within the same word in struct page. This way we
> update both thing in one shot without possibility of race.
>
> Details get tricky.
>
> I'm going to try tomorrow something like this: encode the position of head
> as offset from the tail page and store it as negative number in the union
> with ->mapping and ->s_mem. PageTail() can be implemented as check value
> of the field to be in range -1..-MAX_ORDER_NR_PAGES.
>
> I'm not sure at all if it's going to work, especially looking on
> ridiculously high CONFIG_FORCE_MAX_ZONEORDER some architectures allow.
>
> We could also try to encode page order instead (again as negative number)
> and calculate head page position based on alignment...
>
> Any other ideas are welcome.
Good luck, I've not given it any thought, but hope it works out:
my reasoning was the same when I put the PageAnon bit into
page->mapping instead of page->flags.
Something to beware of though: although exceedingly unlikely to be a
problem, page->mapping always contained a pointer to or into a relevant
structure, or else something that could not possibly be a kernel pointer,
when I was working on KSM swapping: see comment above get_ksm_page() in
mm/ksm.c. It is best to keep page->mapping for pointers if possible
(and probably avoid having the PageAnon bit set unless really Anon).
I've only just read your mail, and I'm too slow a thinker to have
worked through your isolate_migratepages_block() race yet. But, given
the timing, cannot resist sending you a code fragment I wrote earlier
today for our v3.11-based kernel: which still has compound_trans_order(),
which we had been using in a similar racy physical scan.
I'm not for a moment suggesting that this fragment is relevant to your
race; but it is something amusing to consider when you're thinking of
such races. Credit to Greg Thelen for thinking of the prep_compound_page()
end of it, when I'd been focussed on the __split_huge_page_refcount() end.
/*
 * It is not safe to use compound_lock (inside compound_trans_order)
 * until we have a reference on the page (okay, done above) and have
 * then seen PageLRU on it (just below): because mm/huge_memory.c uses
 * the non-atomic __SetPageUptodate on a freshly allocated THPage in
 * several places, believing it to be invisible to the outside world,
 * but liable to race and leave PG_compound_lock set when cleared here.
 */
nr_pages = 1;
if (PageHead(page)) {
	/*
	 * smp_rmb() against the smp_wmb() in the first iteration of
	 * prep_compound_page(), so that the PageTail test ensures
	 * that compound_order(page) is now correctly readable.
	 */
	smp_rmb();
	if (PageTail(page + 1)) {
		nr_pages = 1 << compound_order(page);
		/*
		 * Then smp_rmb() against smp_wmb() in last iteration of
		 * __split_huge_page_refcount(), to ensure that has not
		 * yet written something else into page[1].lru.prev.
		 */
		smp_rmb();
		if (!PageTail(page + 1))
			nr_pages = 1;
	}
}
Hugh
On Fri, 7 Aug 2015, Kirill A. Shutemov wrote:
> On Thu, Aug 06, 2015 at 12:24:22PM -0700, Hugh Dickins wrote:
> >
> > Oh, and I know a patchset which avoids these problems completely,
> > by not using compound pages at all ;)
>
> BTW, I haven't heard anything about the patchset for a while.
> What's the status?
It's gone well, and being put into wider use here. But I'm not
one for monthly updates of large patchsets myself, always too much
to do; and nobody else seemed anxious to have it yet, back in March.
As I said at the time of posting huge tmpfs against v3.19, it was
fully working (and little changed since), but once memory pressure
had disbanded a team to swap it out, there was nothing to put it
together again later, to restore the original hugepage performance.
I couldn't imagine people putting it into real use while that remained
the case, so spent the next months adding "huge tmpfs recovery" -
considered hooking into khugepaged, but settled on work item queued
from fault.
Which has worked out well, except that I had to rush it in before
I went on vacation in June, then spent last month fixing all the
concurrent hole-punching bugs Andres found with his fuzzing while
I was away. Busy time, stable now; but I do want to reconsider a
few rushed decisions before offering the rebased and extended set.
And there's three pieces of the work not begun:
The page-table allocation delay in mm/memory.c had been great for
the first posting, but not good enough for recovery (replacing ptes
by pmd): for the moment I skate around that by guarding with mmap_sem,
but mmap_sem usually ends up regrettable, and shouldn't be necessary -
there's just a lot of scattered walks to work through, adjusting them
to racy replacement of ptes by pmd. Maybe I can get away without
doing this for now, we seem to be working well enough without it.
And I suspect that my queueing a recovery work item from fault
is over eager, needs some stats and knobs to tune it down. Though
not surfaced as a problem yet; and I don't think we could live with
the opposite extreme, of khugepaged lumbering its way around the vmas.
But the one I think I shall have to do something better about before
posting, is NUMA. For a confluence of reasons, that rule out swapin
readahead for now, it's not a serious issue for us as yet. But swapin
readahead and NUMA have always been a joke in tmpfs, and I'll be
amplifying that joke with my current NUMA placement in recovery.
Unfortunately, there's a lot of opportunity to make silly mistakes
when trying to get NUMA right: I doubt I can get it right, but do
need to get it a little less wrong before letting others take over.
>
> Optimizing rmap operations in my patchset (see PG_double_map), I found
> that it would be very tricky to expand team pages to anon-THP without
> performance regression on rmap side due to amount of atomic ops it
> requires.
Thanks for thinking of it: I've been too busy with the recovery
to put more thought into extending teams to anon THP, though I'd
certainly like to try that once the huge tmpfs end is "complete".
Yes, there's not a doubt that starting from compound pages is more
rigid but should involve much less repetition; whereas starting from
the other end with a team of ordinary 4k pages, more flexible but a
lot of wasted effort. I can't predict where we shall meet.
>
> Is there any clever approach to the issue?
I'd been hoping that I could implement first, and then optimize away
the unnecessary; but you're right that it's easier to live with that
in the pagecache case, whereas with anon THP it would be a regression.
Hugh
>
> Team pages are probably fine for file mappings due different performance
> baseline. I'm less optimistic about anon-THP.
>
> --
> Kirill A. Shutemov