Occasionally, running this LTP test will trigger an error below on
s390:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c
this .config:
https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config
[ 6970.253173] LTP: starting dirtyc0w
[ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d
[ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d
[ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked)
[ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000
[ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
[ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
[ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1
[ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
[ 6971.599964] Call Trace:
[ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0
[ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238
[ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0
[ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658
[ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158
[ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8
[ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8
[ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230
[ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270
[ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98
[ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278
[ 6971.600053] 2 locks held by dirtyc0w_child/65238:
[ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270
[ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8
[ 6971.600487] Disabling lock debugging due to kernel taint
Once it happens, running it again will trigger in on another PFN.
[39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065
Any thoughts?
On Sat, Sep 12, 2020 at 09:54:12PM -0400, Qian Cai wrote:
> Occasionally, running this LTP test will trigger an error below on
> s390:
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c
>
> this .config:
> https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config
>
> [ 6970.253173] LTP: starting dirtyc0w
> [ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d
> [ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d
> [ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked)
> [ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000
> [ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
> [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
> [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1
> [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> [ 6971.599964] Call Trace:
> [ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0
> [ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238
> [ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0
> [ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658
> [ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158
> [ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8
> [ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8
> [ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230
> [ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270
> [ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98
> [ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278
> [ 6971.600053] 2 locks held by dirtyc0w_child/65238:
> [ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270
> [ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8
> [ 6971.600487] Disabling lock debugging due to kernel taint
>
> Once it happens, running it again will trigger in on another PFN.
>
> [39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065
>
> Any thoughts?
Alexander, Gerald, could you take a look?
On Wed, 16 Sep 2020 16:28:06 +0200
Heiko Carstens <[email protected]> wrote:
> On Sat, Sep 12, 2020 at 09:54:12PM -0400, Qian Cai wrote:
> > Occasionally, running this LTP test will trigger an error below on
> > s390:
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c
> >
> > this .config:
> > https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config
> >
> > [ 6970.253173] LTP: starting dirtyc0w
> > [ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d
> > [ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d
> > [ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked)
> > [ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000
> > [ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
> > [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> > [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
> > [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1
> > [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> > [ 6971.599964] Call Trace:
> > [ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0
> > [ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238
> > [ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0
> > [ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658
> > [ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158
> > [ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8
> > [ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8
> > [ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230
> > [ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270
> > [ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98
> > [ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278
> > [ 6971.600053] 2 locks held by dirtyc0w_child/65238:
> > [ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270
> > [ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8
> > [ 6971.600487] Disabling lock debugging due to kernel taint
> >
> > Once it happens, running it again will trigger in on another PFN.
> >
> > [39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065
> >
> > Any thoughts?
>
> Alexander, Gerald, could you take a look?
Thanks for reporting. From the header of dirtyc0w.c it seems that this
is testing some gup behavior. Given that we have an issue with gup_fast
on s390, this could be related. I'll try to reproduce and do more
analysis.
A fix for our gup_fast issue is also in linux-next now, as of 2020-09-20,
but it was not yet included in your kernel version 5.9.0-rc4-next-20200909.
So if this is related to the gup_fast issue, it should not occur again
with linux-next kernels after 2020-09-20.
On Tue, 22 Sep 2020 19:03:50 +0200
Gerald Schaefer <[email protected]> wrote:
> On Wed, 16 Sep 2020 16:28:06 +0200
> Heiko Carstens <[email protected]> wrote:
>
> > On Sat, Sep 12, 2020 at 09:54:12PM -0400, Qian Cai wrote:
> > > Occasionally, running this LTP test will trigger an error below on
> > > s390:
> > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w.c
> > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/security/dirtyc0w/dirtyc0w_child.c
> > >
> > > this .config:
> > > https://gitlab.com/cailca/linux-mm/-/blob/master/s390.config
> > >
> > > [ 6970.253173] LTP: starting dirtyc0w
> > > [ 6971.599102] BUG: Bad page state in process dirtyc0w_child pfn:8865d
> > > [ 6971.599867] page:000000001a8328d7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8865d
> > > [ 6971.599876] flags: 0x400000000008000e(referenced|uptodate|dirty|swapbacked)
> > > [ 6971.599886] raw: 400000000008000e 0000000000000100 0000000000000122 0000000000000000
> > > [ 6971.599893] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
> > > [ 6971.599900] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> > > [ 6971.599906] Modules linked in: loop kvm ip_tables x_tables dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
> > > [ 6971.599952] CPU: 1 PID: 65238 Comm: dirtyc0w_child Tainted: G O 5.9.0-rc4-next-20200909 #1
> > > [ 6971.599959] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> > > [ 6971.599964] Call Trace:
> > > [ 6971.599979] [<0000000073aec038>] show_stack+0x158/0x1f0
> > > [ 6971.599986] [<0000000073af724a>] dump_stack+0x1f2/0x238
> > > [ 6971.599994] [<0000000072ed086a>] bad_page+0x1ba/0x1c0
> > > [ 6971.600000] [<0000000072ed20c4>] free_pcp_prepare+0x4fc/0x658
> > > [ 6971.600006] [<0000000072ed96a6>] free_unref_page+0xae/0x158
> > > [ 6971.600013] [<0000000072e8286a>] unmap_page_range+0xb62/0x1df8
> > > [ 6971.600019] [<0000000072e83bbc>] unmap_single_vma+0xbc/0x1c8
> > > [ 6971.600025] [<0000000072e8418e>] zap_page_range+0x176/0x230
> > > [ 6971.600033] [<0000000072eece8e>] do_madvise+0xfde/0x1270
> > > [ 6971.600039] [<0000000072eed50a>] __s390x_sys_madvise+0x72/0x98
> > > [ 6971.600047] [<0000000073b1cce4>] system_call+0xdc/0x278
> > > [ 6971.600053] 2 locks held by dirtyc0w_child/65238:
> > > [ 6971.600058] #0: 000000013442fa18 (&mm->mmap_lock){++++}-{3:3}, at: do_madvise+0x17a/0x1270
> > > [ 6971.600432] #1: 00000001343f9060 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x640/0x1df8
> > > [ 6971.600487] Disabling lock debugging due to kernel taint
> > >
> > > Once it happens, running it again will trigger in on another PFN.
> > >
> > > [39717.085115] BUG: Bad page state in process dirtyc0w_child pfn:af065
> > >
> > > Any thoughts?
> >
> > Alexander, Gerald, could you take a look?
>
> Thanks for reporting. From the header of dirtyc0w.c it seems that this
> is testing some gup behavior. Given that we have an issue with gup_fast
> on s390, this could be related. I'll try to reproduce and do more
> analysis.
>
> A fix for our gup_fast issue is also in linux-next now, as of 2020-09-20,
> but it was not yet included in your kernel version 5.9.0-rc4-next-20200909.
> So if this is related to the gup_fast issue, it should not occur again
> with linux-next kernels after 2020-09-20.
OK, I can now reproduce this, and unfortunately also with the gup_fast
fix, so it is something different. Bisecting is a bit hard, as it will
not always show immediately, sometimes takes up to an hour.
Still, I think I found the culprit, merge commit b25d1dc9474e "Merge
branch 'simplify-do_wp_page'". Without those 4 patches, it works fine,
running over night.
Not sure why this only shows on s390, should not be architecture-specific,
but we do often see subtle races earlier than others due to hypervisor
impact.
The first commit 09854ba94c6a ("mm: do_wp_page() simplification") already
introduces this error. The dirtyc0w_child test seems to play with cow
and racing madvise(MADV_DONTNEED), but I have not yet fully understood
it and also not the changes from commit 09854ba94c6a. As Linus already
mentioned in the merge commit message, this is some bad timing for such
a change, so I don't want to delay this further with trying to understand
it better before reporting. Maybe Peter or Linus can spot some obvious
issue.
One thing that seems strange to me is that the page flags from the
bad page state output are (uptodate|swapbacked), see below, or
(referenced|uptodate|dirty|swapbacked) in the original report. But IIUC,
that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"
reason. So it seems that the flags have changed between check_free_page()
and __dump_page(), which would be very odd. Or maybe some issue with
compound pages, because __dump_page() looks at head->flags.
[ 1863.237707] BUG: Bad page state in process dirtyc0w_child pfn:58527d
[ 1863.237721] page:000000008866956b refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x58527d
[ 1863.237727] flags: 0x3ffff00000080004(uptodate|swapbacked)
[ 1863.237734] raw: 3ffff00000080004 0000000000000100 0000000000000122 0000000000000000
[ 1863.237738] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
[ 1863.237742] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[ 1863.237745] Modules linked in:
[ 1863.237752] CPU: 16 PID: 9074 Comm: dirtyc0w_child Tainted: G B 5.9.0-rc6-00020-geff48ddeab78-dirty #104
[ 1863.237756] Hardware name: IBM 3906 M03 703 (LPAR)
[ 1863.237759] Call Trace:
[ 1863.237768] [<0000000000115f28>] show_stack+0x100/0x158
[ 1863.237775] [<000000000096b41a>] dump_stack+0xa2/0xd8
[ 1863.237781] [<00000000003d497c>] bad_page+0xdc/0x140
[ 1863.237785] [<00000000003d5b62>] free_pcp_prepare+0x31a/0x360
[ 1863.237789] [<00000000003d906a>] free_unref_page+0x32/0xb8
[ 1863.237794] [<00000000003b05f4>] zap_p4d_range+0x64c/0xcf8
[ 1863.237797] [<00000000003b0e7a>] unmap_page_range+0x9a/0x110
[ 1863.237801] [<00000000003b0f84>] unmap_single_vma+0x94/0x100
[ 1863.237805] [<00000000003b14c2>] zap_page_range+0x14a/0x1f0
[ 1863.237809] [<00000000003e3a24>] do_madvise+0x75c/0x918
[ 1863.237812] [<00000000003e3c06>] __s390x_sys_madvise+0x26/0x38
[ 1863.237817] [<0000000000d280d4>] system_call+0xe0/0x2c0
[ 1863.237820] INFO: lockdep is turned off.
On Wed, Sep 23, 2020 at 6:39 AM Gerald Schaefer
<[email protected]> wrote:
>
> OK, I can now reproduce this, and unfortunately also with the gup_fast
> fix, so it is something different. Bisecting is a bit hard, as it will
> not always show immediately, sometimes takes up to an hour.
>
> Still, I think I found the culprit, merge commit b25d1dc9474e "Merge
> branch 'simplify-do_wp_page'". Without those 4 patches, it works fine,
> running over night.
Odd, but I have a strong suspicion that the "do_wp_page()
simplification" only ends up removing serialization that then hides
some existing thing.
> Not sure why this only shows on s390, should not be architecture-specific,
> but we do often see subtle races earlier than others due to hypervisor
> impact.
Yeah, so if it needs very particular timing, maybe the s390 page table
handling together with the hypervisor interfaces ends up being more
likely to trigger this, and thus the different timings at do_wp_page()
then ends up showing it.
> One thing that seems strange to me is that the page flags from the
> bad page state output are (uptodate|swapbacked), see below, or
> (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC,
> that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"
> reason. So it seems that the flags have changed between check_free_page()
> and __dump_page(), which would be very odd. Or maybe some issue with
> compound pages, because __dump_page() looks at head->flags.
The odd thing is that all of this _should_ be serialized by the page
table lock, as far as I can tell.
From your trace, it looks very much like it's do_madvise() ->
zap_pte_range() (your stack trace only has zap_p4d_range mentioned,
but all the lower levels are inlined) that races with presumably
fast-gup.
But zap_pte_range() has the pte lock, and fast-gup is - by design -
not allowed to change the page state other than taking a reference to
it, and should do that with a "try_get" operation, so even taking the
reference should never ever race with somebody doing the final free.
IOW, the fast-GUP code does that
page = pte_page(pte);
head = try_grab_compound_head(page, 1, flags);
if (!head)
goto pte_unmap;
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
put_compound_head(head, 1, flags);
goto pte_unmap;
}
where the important part is that "try_grab_compound_head()" which does
the whole careful atomic "increase page count only if it wasn't zero".
See page_cache_add_speculative().
So the rule is
- if you hold the page table lock, you can just do
"get_page(pte_page())" directly, because you know the pte cannot go
away from under you
- if you are fast-gup, the pte *can* go away from under you, so you
need to do that very careful "get page unless it's gone" dance
but I don't see us violating that.
There's maybe some interesting memory ordering in the above case, but
it does atomic_add_unless() which is ordered, and s390 is strongly
ordered anyway, isn't it?
(Yes, and it doesn't do the atomic stuff at all if TINY_RCU is set,
but that's only set for non-preemptible UP kernels, so that doesn't
matter).
So if zap_page_range() races with fast-gup, then either
zap_page_range() wins the race and puts the page - but then fast-gup
won't touch it, or fast-gup wins and gets a reference to the page, and
then zap_page_range() will clear it and drop the ref to it, but it
won't be the final ref.
Your dump seems to show that zap_page_range() *did* drop the final
ref, but something is racing with it to the point of actually
modifying the page flags.
Odd.
And the do_wp_page() change itself shouldn't be directly involved,
because that's all done under the page table lock. But it obviously
does change the page locking a lot, and changes timing a lot.
And in fact, the zap_pte_range() code itself doesn't take the page
lock (and cannot, because it's all under the page table spinlock).
So it does smell like timing to me. But possibly with some
s390-specific twist to it.
Ooh. One thing that is *very* different about s390 is that it frees
the page directly, and doesn't batch things up to happen after the TLB
flush.
Maybe THAT is the difference? Not that I can tell why it should
matter, for all the reasons outlines above. But on x86-64, the
__tlb_remove_page() function just adds the page to the "free this
later" TLB flush structure, and if it fills up it does the TLB flush
and then does the actual batched page freeing outside the page table
lock.
And that *has* been one of the things that the fast-gup code depended
on. We even have a big comment about it:
/*
* Disable interrupts. The nested form is used, in order to allow
* full, general purpose use of this routine.
*
* With interrupts disabled, we block page table pages from being
* freed from under us. See struct mmu_table_batch comments in
* include/asm-generic/tlb.h for more details.
*
* We do not adopt an rcu_read_lock(.) here as we also want to
* block IPIs that come from THPs splitting.
*/
and maybe that whole thing doesn't hold true for s390 at all.
Linus
On Wed, 23 Sep 2020 13:00:45 -0700
Linus Torvalds <[email protected]> wrote:
[...]
>
> Ooh. One thing that is *very* different about s390 is that it frees
> the page directly, and doesn't batch things up to happen after the TLB
> flush.
>
> Maybe THAT is the difference? Not that I can tell why it should
> matter, for all the reasons outlines above. But on x86-64, the
> __tlb_remove_page() function just adds the page to the "free this
> later" TLB flush structure, and if it fills up it does the TLB flush
> and then does the actual batched page freeing outside the page table
> lock.
>
> And that *has* been one of the things that the fast-gup code depended
> on. We even have a big comment about it:
>
> /*
> * Disable interrupts. The nested form is used, in order to allow
> * full, general purpose use of this routine.
> *
> * With interrupts disabled, we block page table pages from being
> * freed from under us. See struct mmu_table_batch comments in
> * include/asm-generic/tlb.h for more details.
> *
> * We do not adopt an rcu_read_lock(.) here as we also want to
> * block IPIs that come from THPs splitting.
> */
>
> and maybe that whole thing doesn't hold true for s390 at all.
Thanks, very nice walk-through, need some time to digest this. The TLB
aspect is interesting, and we do have our own __tlb_remove_page_size(),
which directly calls free_page_and_swap_cache() instead of the generic
batched approach.
I faintly remember that we also did have some batched and rcu_sched based
approach. It seems there was some rework with commit 9de7d833e370
("s390/tlb: Convert to generic mmu_gather") and discussion in
https://lore.kernel.org/linux-arch/[email protected]/
Given the comment you mentioned and also this one from mm/gup.c:
/*
* Fast GUP
*
* get_user_pages_fast attempts to pin user pages by walking the page
* tables directly and avoids taking locks. Thus the walker needs to be
* protected from page table pages being freed from under it, and should
* block any THP splits.
*
* One way to achieve this is to have the walker disable interrupts, and
* rely on IPIs from the TLB flushing code blocking before the page table
* pages are freed. This is unsuitable for architectures that do not need
* to broadcast an IPI when invalidating TLBs.
*
* Another way to achieve this is to batch up page table containing pages
* belonging to more than one mm_user, then rcu_sched a callback to free those
* pages. Disabling interrupts will allow the fast_gup walker to both block
* the rcu_sched callback, and an IPI that we broadcast for splitting THPs
* (which is a relatively rare event). The code below adopts this strategy.
It really sounds like we might have yet another issue with (common code)
gup_fast, and non-batched freeing of pagetable pages, though I wonder how
that would have worked with the s390-specific version of gup_fast before.
Certainly worth investigating, thanks a lot for the hint!
Meanwhile, out of curiosity, while I still fail to comprehend commit
09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there
is one detail that I find most confusing: the unlock_page() has moved
behind the wp_page_reuse(), while it was the other way round before.
Does it simply not really matter, or was it done on purpose, possibly
related to the "get rid of the nasty serialization on the page lock"
description?
On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer
<[email protected]> wrote:
>
> Thanks, very nice walk-through, need some time to digest this. The TLB
> aspect is interesting, and we do have our own __tlb_remove_page_size(),
> which directly calls free_page_and_swap_cache() instead of the generic
> batched approach.
So I don't think it's the free_page_and_swap_cache() itself that is the problem.
As mentioned, the actual pages themselves should be handled by the
reference counting being atomic.
The interrupt disable is really about just the page *tables* being
free'd - not the final page level.
So the issue is that at least on x86-64, we have the serialization
that we will only free the page tables after a cross-CPU IPI has
flushed the TLB.
I think s390 just RCU-free's the page tables instead, which should fix it.
So I think this is special, and s390 is very different from x86, but I
don't think it's the problem.
In fact, I think you pinpointed the real issue:
> Meanwhile, out of curiosity, while I still fail to comprehend commit
> 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there
> is one detail that I find most confusing: the unlock_page() has moved
> behind the wp_page_reuse(), while it was the other way round before.
You know what? That was just a mistake, and I think you may actually
have hit the real cause of the problem.
It means that we keep the page locked until after we do the
pte_unmap_unlock(), so now we have no guarantees that we hold the page
referecne.
And then we unlock it - while somebody else might be freeing it.
So somebody is freeing a locked page just as we're unlocking it, and
that matches the problem you see exactly: the debug thing will hit
because the last free happened while locked, and then by the time the
printout happens it has become unlocked so it doesn't show any more.
Duh.
Would you mind testing just moving the unlock_page() back to before
the wp_page_reuse()?
Does that make your debug check go away?
Linus
On Wed, 23 Sep 2020 14:50:36 -0700
Linus Torvalds <[email protected]> wrote:
> On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer
> <[email protected]> wrote:
> >
> > Thanks, very nice walk-through, need some time to digest this. The TLB
> > aspect is interesting, and we do have our own __tlb_remove_page_size(),
> > which directly calls free_page_and_swap_cache() instead of the generic
> > batched approach.
>
> So I don't think it's the free_page_and_swap_cache() itself that is the problem.
>
> As mentioned, the actual pages themselves should be handled by the
> reference counting being atomic.
>
> The interrupt disable is really about just the page *tables* being
> free'd - not the final page level.
>
> So the issue is that at least on x86-64, we have the serialization
> that we will only free the page tables after a cross-CPU IPI has
> flushed the TLB.
>
> I think s390 just RCU-free's the page tables instead, which should fix it.
>
> So I think this is special, and s390 is very different from x86, but I
> don't think it's the problem.
>
> In fact, I think you pinpointed the real issue:
>
> > Meanwhile, out of curiosity, while I still fail to comprehend commit
> > 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there
> > is one detail that I find most confusing: the unlock_page() has moved
> > behind the wp_page_reuse(), while it was the other way round before.
>
> You know what? That was just a mistake, and I think you may actually
> have hit the real cause of the problem.
>
> It means that we keep the page locked until after we do the
> pte_unmap_unlock(), so now we have no guarantees that we hold the page
> referecne.
>
> And then we unlock it - while somebody else might be freeing it.
>
> So somebody is freeing a locked page just as we're unlocking it, and
> that matches the problem you see exactly: the debug thing will hit
> because the last free happened while locked, and then by the time the
> printout happens it has become unlocked so it doesn't show any more.
>
> Duh.
>
> Would you mind testing just moving the unlock_page() back to before
> the wp_page_reuse()?
Sure, I'll give it a try running over the night again.
On Wed, 2020-09-23 at 15:39 +0200, Gerald Schaefer wrote:
> OK, I can now reproduce this, and unfortunately also with the gup_fast
> fix, so it is something different. Bisecting is a bit hard, as it will
> not always show immediately, sometimes takes up to an hour.
>
> Still, I think I found the culprit, merge commit b25d1dc9474e "Merge
> branch 'simplify-do_wp_page'". Without those 4 patches, it works fine,
> running over night.
>
> Not sure why this only shows on s390, should not be architecture-specific,
> but we do often see subtle races earlier than others due to hypervisor
> impact.
Apparently, someone can reproduce something similar on x86 as well:
https://lore.kernel.org/linux-mm/[email protected]/
Probably, Alex could revert the bad commits and confirm it there.
>
> The first commit 09854ba94c6a ("mm: do_wp_page() simplification") already
> introduces this error. The dirtyc0w_child test seems to play with cow
> and racing madvise(MADV_DONTNEED), but I have not yet fully understood
> it and also not the changes from commit 09854ba94c6a. As Linus already
> mentioned in the merge commit message, this is some bad timing for such
> a change, so I don't want to delay this further with trying to understand
> it better before reporting. Maybe Peter or Linus can spot some obvious
> issue.
>
> One thing that seems strange to me is that the page flags from the
> bad page state output are (uptodate|swapbacked), see below, or
> (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC,
> that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"
> reason. So it seems that the flags have changed between check_free_page()
> and __dump_page(), which would be very odd. Or maybe some issue with
> compound pages, because __dump_page() looks at head->flags.
>
> [ 1863.237707] BUG: Bad page state in process dirtyc0w_child pfn:58527d
> [ 1863.237721] page:000000008866956b refcount:0 mapcount:0
> mapping:0000000000000000 index:0x0 pfn:0x58527d
> [ 1863.237727] flags: 0x3ffff00000080004(uptodate|swapbacked)
> [ 1863.237734] raw: 3ffff00000080004 0000000000000100 0000000000000122
> 0000000000000000
> [ 1863.237738] raw: 0000000000000000 0000000000000000 ffffffff00000000
> 0000000000000000
> [ 1863.237742] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [ 1863.237745] Modules linked in:
> [ 1863.237752] CPU: 16 PID: 9074 Comm: dirtyc0w_child Tainted:
> G B 5.9.0-rc6-00020-geff48ddeab78-dirty #104
> [ 1863.237756] Hardware name: IBM 3906 M03 703 (LPAR)
> [ 1863.237759] Call Trace:
> [ 1863.237768] [<0000000000115f28>] show_stack+0x100/0x158
> [ 1863.237775] [<000000000096b41a>] dump_stack+0xa2/0xd8
> [ 1863.237781] [<00000000003d497c>] bad_page+0xdc/0x140
> [ 1863.237785] [<00000000003d5b62>] free_pcp_prepare+0x31a/0x360
> [ 1863.237789] [<00000000003d906a>] free_unref_page+0x32/0xb8
> [ 1863.237794] [<00000000003b05f4>] zap_p4d_range+0x64c/0xcf8
> [ 1863.237797] [<00000000003b0e7a>] unmap_page_range+0x9a/0x110
> [ 1863.237801] [<00000000003b0f84>] unmap_single_vma+0x94/0x100
> [ 1863.237805] [<00000000003b14c2>] zap_page_range+0x14a/0x1f0
> [ 1863.237809] [<00000000003e3a24>] do_madvise+0x75c/0x918
> [ 1863.237812] [<00000000003e3c06>] __s390x_sys_madvise+0x26/0x38
> [ 1863.237817] [<0000000000d280d4>] system_call+0xe0/0x2c0
> [ 1863.237820] INFO: lockdep is turned off.
>
On Thu, 24 Sep 2020 00:02:26 +0200
Gerald Schaefer <[email protected]> wrote:
> On Wed, 23 Sep 2020 14:50:36 -0700
> Linus Torvalds <[email protected]> wrote:
>
> > On Wed, Sep 23, 2020 at 2:33 PM Gerald Schaefer
> > <[email protected]> wrote:
> > >
> > > Thanks, very nice walk-through, need some time to digest this. The TLB
> > > aspect is interesting, and we do have our own __tlb_remove_page_size(),
> > > which directly calls free_page_and_swap_cache() instead of the generic
> > > batched approach.
> >
> > So I don't think it's the free_page_and_swap_cache() itself that is the problem.
> >
> > As mentioned, the actual pages themselves should be handled by the
> > reference counting being atomic.
> >
> > The interrupt disable is really about just the page *tables* being
> > free'd - not the final page level.
> >
> > So the issue is that at least on x86-64, we have the serialization
> > that we will only free the page tables after a cross-CPU IPI has
> > flushed the TLB.
> >
> > I think s390 just RCU-free's the page tables instead, which should fix it.
> >
> > So I think this is special, and s390 is very different from x86, but I
> > don't think it's the problem.
Ah of course, I got confused by freeing pagetable pages vs. the pages
themselves. For the pagetable pages we actually use the generic
tlb_remove_table_(sync_)one, including the IPI-synchronizing
smp_call_function (CONFIG_MMU_GATHER_RCU_TABLE_FREE=y).
The "s390 magic" then only starts in our own __tlb_remove_table,
where we take care of the special 2K vs. 4K pagetable stuff.
Thanks a lot for this very valuable abstract of "who is who and why"
in pagetable memory management :-)
> >
> > In fact, I think you pinpointed the real issue:
> >
> > > Meanwhile, out of curiosity, while I still fail to comprehend commit
> > > 09854ba94c6a ("mm: do_wp_page() simplification") in its entirety, there
> > > is one detail that I find most confusing: the unlock_page() has moved
> > > behind the wp_page_reuse(), while it was the other way round before.
> >
> > You know what? That was just a mistake, and I think you may actually
> > have hit the real cause of the problem.
> >
> > It means that we keep the page locked until after we do the
> > pte_unmap_unlock(), so now we have no guarantees that we hold the page
> > referecne.
> >
> > And then we unlock it - while somebody else might be freeing it.
> >
> > So somebody is freeing a locked page just as we're unlocking it, and
> > that matches the problem you see exactly: the debug thing will hit
> > because the last free happened while locked, and then by the time the
> > printout happens it has become unlocked so it doesn't show any more.
> >
> > Duh.
> >
> > Would you mind testing just moving the unlock_page() back to before
> > the wp_page_reuse()?
>
> Sure, I'll give it a try running over the night again.
It's all good now, no more occurrences with unlock_page() before
wp_page_reuse().
在 2020/9/24 上午10:06, Qian Cai 写道:
> On Wed, 2020-09-23 at 15:39 +0200, Gerald Schaefer wrote:
>> OK, I can now reproduce this, and unfortunately also with the gup_fast
>> fix, so it is something different. Bisecting is a bit hard, as it will
>> not always show immediately, sometimes takes up to an hour.
>>
>> Still, I think I found the culprit, merge commit b25d1dc9474e "Merge
>> branch 'simplify-do_wp_page'". Without those 4 patches, it works fine,
>> running over night.
>>
>> Not sure why this only shows on s390, should not be architecture-specific,
>> but we do often see subtle races earlier than others due to hypervisor
>> impact.
>
> Apparently, someone can reproduce something similar on x86 as well:
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Probably, Alex could revert the bad commits and confirm it there.
>
>>
>> The first commit 09854ba94c6a ("mm: do_wp_page() simplification") already
yes, after revert this commit, the BUG disappears.
Thanks
Alex
>> introduces this error. The dirtyc0w_child test seems to play with cow
>> and racing madvise(MADV_DONTNEED), but I have not yet fully understood
>> it and also not the changes from commit 09854ba94c6a. As Linus already
>> mentioned in the merge commit message, this is some bad timing for such
>> a change, so I don't want to delay this further with trying to understand
>> it better before reporting. Maybe Peter or Linus can spot some obvious
>> issue.
>>
>> One thing that seems strange to me is that the page flags from the
>> bad page state output are (uptodate|swapbacked), see below, or
>> (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC,
>> that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"
>> reason. So it seems that the flags have changed between check_free_page()
>> and __dump_page(), which would be very odd. Or maybe some issue with
>> compound pages, because __dump_page() looks at head->flags.
>>
>> [ 1863.237707] BUG: Bad page state in process dirtyc0w_child pfn:58527d
>> [ 1863.237721] page:000000008866956b refcount:0 mapcount:0
>> mapping:0000000000000000 index:0x0 pfn:0x58527d
>> [ 1863.237727] flags: 0x3ffff00000080004(uptodate|swapbacked)
>> [ 1863.237734] raw: 3ffff00000080004 0000000000000100 0000000000000122
>> 0000000000000000
>> [ 1863.237738] raw: 0000000000000000 0000000000000000 ffffffff00000000
>> 0000000000000000
>> [ 1863.237742] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>> [ 1863.237745] Modules linked in:
>> [ 1863.237752] CPU: 16 PID: 9074 Comm: dirtyc0w_child Tainted:
>> G B 5.9.0-rc6-00020-geff48ddeab78-dirty #104
>> [ 1863.237756] Hardware name: IBM 3906 M03 703 (LPAR)
>> [ 1863.237759] Call Trace:
>> [ 1863.237768] [<0000000000115f28>] show_stack+0x100/0x158
>> [ 1863.237775] [<000000000096b41a>] dump_stack+0xa2/0xd8
>> [ 1863.237781] [<00000000003d497c>] bad_page+0xdc/0x140
>> [ 1863.237785] [<00000000003d5b62>] free_pcp_prepare+0x31a/0x360
>> [ 1863.237789] [<00000000003d906a>] free_unref_page+0x32/0xb8
>> [ 1863.237794] [<00000000003b05f4>] zap_p4d_range+0x64c/0xcf8
>> [ 1863.237797] [<00000000003b0e7a>] unmap_page_range+0x9a/0x110
>> [ 1863.237801] [<00000000003b0f84>] unmap_single_vma+0x94/0x100
>> [ 1863.237805] [<00000000003b14c2>] zap_page_range+0x14a/0x1f0
>> [ 1863.237809] [<00000000003e3a24>] do_madvise+0x75c/0x918
>> [ 1863.237812] [<00000000003e3c06>] __s390x_sys_madvise+0x26/0x38
>> [ 1863.237817] [<0000000000d280d4>] system_call+0xe0/0x2c0
>> [ 1863.237820] INFO: lockdep is turned off.
>>
On Thu, Sep 24, 2020 at 5:06 AM Gerald Schaefer
<[email protected]> wrote:
>
> It's all good now, no more occurrences with unlock_page() before
> wp_page_reuse().
Thanks for the confirmation. When you pointed at that unlock_page()
change, I was pretty sure that was it ("D'oh!"), but it's always good
to have that verification.
Committed,
Linus