LinuxLists.cc - [BISECTED] Linux 3.12.7 introduces page map handling regression

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> <[email protected]> wrote:
> >
> > Odds are this also shows up in 3.13, right?

Reproduced using 3.13 on the PV guest:

[ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
[ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
[ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
[ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
[ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
[ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
[ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
[ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
[ 368.756815] Call Trace:
[ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
[ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
[ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
[ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
[ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
[ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
[ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
[ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
[ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
[ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
[ 368.756872] Disabling lock debugging due to kernel taint
[ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
[ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1

>
> Probably. I don't have a Xen PV setup to test with (and very little
> interest in setting one up).. And I have a suspicion that it might not
> be so much about Xen PV, as perhaps about the kind of hardware.
>
> I suspect the issue has something to do with the magic _PAGE_NUMA
> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> removing the _PAGE_PRESENT bit, and now the crazy numa code is
> confused.
>
> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> _PAGE_PRESENT.
>
> Adding Andrea to the Cc, because he's the author of that horridness.
> Putting Steven's test-case here as an attachement for Andrea, maybe
> that makes him go "Ahh, yes, silly case".
>
> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>
> Andrea, you can find the thread on lkml, but it boils down to commit
> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> attached test-case (but apparently only under Xen PV). There it
> apparently causes a "BUG: Bad page map .." error.
>
> And I suspect this is another of those "this bug is only visible on
> real numa machines, because _PAGE_NUMA isn't actually ever set
> otherwise". That has pretty much guaranteed that it gets basically
> zero testing, which is not a great idea when coupled with that subtle
> sharing of the _PAGE_PROTNONE bit..
>
> It may be that the whole "Xen PV" thing is a red herring, and that
> Steven only sees it on that one machine because the one he runs as a
> PV guest under is a real NUMA machine, and all the other machines he
> has tried it on haven't been numa. So it *may* be that that "only
> under Xen PV" is a red herring. But that's just a possible guess.

The PV and HVM guests are both on NUMA hosts, but we don't expose NUMA to the
PV guest, so it fakes a NUMA node at startup.

I've also tried running a PV guest on a dual socket host with interleaved
memory:

# dmesg | grep -i -e numa -e node
[ 0.000000] NUMA turned off
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x00000005607fffff]
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x5607fffff]
[ 0.000000] NODE_DATA [mem 0x55d4f2000-0x55d518fff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0x5607fffff]
[ 0.000000] On node 0 totalpages: 5638047
[ 0.000000] setup_percpu: NR_CPUS:4096 nr_cpumask_bits:16 nr_cpu_ids:16 nr_node_ids:1
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[ 0.010697] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
# dmesg | tail -n 21
[ 348.467265] BUG: Bad page map in process t pte:800000008a6ef165 pmd:53aa39067
[ 348.467280] page:ffffea000229bbc0 count:0 mapcount:-1 mapping: (null) index:0x0
[ 348.467286] page flags: 0x1ffc0000000014(referenced|dirty)
[ 348.467293] addr:00007f8c9fca0000 vm_flags:00100071 anon_vma:ffff88053aff19c0 mapping: (null) index:7f8c9fca0
[ 348.467301] CPU: 0 PID: 359 Comm: t Tainted: G B 3.12.8-1-ec2 #1
[ 348.467306] ffff8805396f71f8 ffff880539c49cc0 ffffffff814c77bb 00007f8c9fca0000
[ 348.467316] ffff880539c49d08 ffffffff8116788e 0000000000000000 0000000000000000
[ 348.467325] ffff88053aa39500 ffffea000229bbc0 00007f8c9fca1000 ffff880539c49e30
[ 348.467334] Call Trace:
[ 348.467346] [<ffffffff814c77bb>] dump_stack+0x45/0x56
[ 348.467355] [<ffffffff8116788e>] print_bad_pte+0x22e/0x250
[ 348.467362] [<ffffffff811690b3>] unmap_single_vma+0x583/0x890
[ 348.467369] [<ffffffff8116a445>] unmap_vmas+0x65/0x90
[ 348.467375] [<ffffffff8117051c>] unmap_region+0xac/0x120
[ 348.467382] [<ffffffff81170af9>] ? vma_rb_erase+0x1c9/0x210
[ 348.467389] [<ffffffff811726d0>] do_munmap+0x280/0x370
[ 348.467395] [<ffffffff81172801>] vm_munmap+0x41/0x60
[ 348.467404] [<ffffffff81173702>] SyS_munmap+0x22/0x30
[ 348.467413] [<ffffffff814d61ad>] system_call_fastpath+0x1a/0x1f
[ 348.470081] BUG: Bad rss-counter state mm:ffff88053a992100 idx:0 val:-1
[ 348.470091] BUG: Bad rss-counter state mm:ffff88053a992100 idx:1 val:1

As for bare metal Linux repro, I have a 2-socket Westmere box with NUMA enabled
running Linux 3.12.8. It doesn't repro:

$ sudo journalctl -b | grep -i -e node -e numa | cut -c 30-
SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x02 -> Node 0
SRAT: PXM 0 -> APIC 0x04 -> Node 0
SRAT: PXM 0 -> APIC 0x10 -> Node 0
SRAT: PXM 0 -> APIC 0x12 -> Node 0
SRAT: PXM 0 -> APIC 0x14 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 0 -> APIC 0x03 -> Node 0
SRAT: PXM 0 -> APIC 0x05 -> Node 0
SRAT: PXM 0 -> APIC 0x11 -> Node 0
SRAT: PXM 0 -> APIC 0x13 -> Node 0
SRAT: PXM 0 -> APIC 0x15 -> Node 0
SRAT: PXM 1 -> APIC 0x20 -> Node 1
SRAT: PXM 1 -> APIC 0x22 -> Node 1
SRAT: PXM 1 -> APIC 0x24 -> Node 1
SRAT: PXM 1 -> APIC 0x30 -> Node 1
SRAT: PXM 1 -> APIC 0x32 -> Node 1
SRAT: PXM 1 -> APIC 0x34 -> Node 1
SRAT: PXM 1 -> APIC 0x21 -> Node 1
SRAT: PXM 1 -> APIC 0x23 -> Node 1
SRAT: PXM 1 -> APIC 0x25 -> Node 1
SRAT: PXM 1 -> APIC 0x31 -> Node 1
SRAT: PXM 1 -> APIC 0x33 -> Node 1
SRAT: PXM 1 -> APIC 0x35 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
SRAT: Node 0 PXM 0 [mem 0x100000000-0x63fffffff]
SRAT: Node 1 PXM 1 [mem 0x640000000-0xc3fffffff]
NUMA: Initialized distance table, cnt=2
NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x63fffffff] -> [mem 0x00000000-0x63fffffff]
Initmem setup node 0 [mem 0x00000000-0x63fffffff]
NODE_DATA [mem 0x63ffd9000-0x63fffffff]
Initmem setup node 1 [mem 0x640000000-0xc3fffffff]
NODE_DATA [mem 0xc3ffd6000-0xc3fffcfff]
[ffffea0000000000-ffffea0018ffffff] PMD -> [ffff880627e00000-ffff88063fdfffff] on node 0
[ffffea0019000000-ffffea0030ffffff] PMD -> [ffff880c27600000-ffff880c3f5fffff] on node 1
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009bfff]
node 0: [mem 0x00100000-0xbf78ffff]
node 0: [mem 0x100000000-0x63fffffff]
node 1: [mem 0x640000000-0xc3fffffff]
On node 0 totalpages: 6289195
On node 1 totalpages: 6291456
setup_percpu: NR_CPUS:4096 nr_cpumask_bits:24 nr_cpu_ids:24 nr_node_ids:2
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=24, Nodes=2
Enabling automatic NUMA balancing. Configure with numa_balancing= or sysctl
Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
smpboot: Booting Node 0, Processors # 1 # 2 # 3 # 4 # 5 OK
smpboot: Booting Node 1, Processors # 6 # 7 # 8 # 9 # 10 # 11 OK
smpboot: Booting Node 0, Processors # 12 # 13 # 14 # 15 # 16 # 17 OK
smpboot: Booting Node 1, Processors # 18 # 19 # 20 # 21 # 22 # 23 OK
pci_bus 0000:00: on NUMA node 0 (pxm 0)
[...]
$ uname -r
3.12.8-1
$ sudo dmesg -c
$ gcc -O2 -o t t.c
$ ./t
$ dmesg
$

> Christ, how I hate that _PAGE_NUMA bit. Andrea: the fact that it gets
> no testing on any normal machines is a major problem. If it was simple
> and straightforward and the code was "obviously correct", it wouldn't
> be such a problem, but the _PAGE_NUMA code definitely does not fall
> under that "simple and obviously correct" heading.
>
> Guys, any ideas?
>
> Linus

2014-01-22 05:04:07

by Konrad Rzeszutek Wilk

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> > On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> > <[email protected]> wrote:

Adding extra folks to the party.
> > >
> > > Odds are this also shows up in 3.13, right?
>
> Reproduced using 3.13 on the PV guest:
>
> [ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
> [ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
> [ 368.756815] Call Trace:
> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> [ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
> [ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
> [ 368.756872] Disabling lock debugging due to kernel taint
> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1
>
> >
> > Probably. I don't have a Xen PV setup to test with (and very little
> > interest in setting one up).. And I have a suspicion that it might not
> > be so much about Xen PV, as perhaps about the kind of hardware.
> >
> > I suspect the issue has something to do with the magic _PAGE_NUMA
> > tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> > removing the _PAGE_PRESENT bit, and now the crazy numa code is
> > confused.
> >
> > The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> > bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> > _PAGE_PRESENT.
> >
> > Adding Andrea to the Cc, because he's the author of that horridness.
> > Putting Steven's test-case here as an attachement for Andrea, maybe
> > that makes him go "Ahh, yes, silly case".
> >
> > Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
> >
> > Andrea, you can find the thread on lkml, but it boils down to commit
> > 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> > attached test-case (but apparently only under Xen PV). There it
> > apparently causes a "BUG: Bad page map .." error.

I *think* it is due to the fact that pmd_numa and pte_numa is getting the _raw_
value of PMDs and PTEs. That is - it does not use the pvops interface
and instead reads the values directly from the page-table. Since the
page-table is also manipulated by the hypervisor - there are certain
flags it also sets to do its business. It might be that it uses
_PAGE_GLOBAL as well - and Linux picks up on that. If it was using
pte_flags that would invoke the pvops interface.

Elena, Dariof and George, you guys had been looking at this a bit deeper
than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?

This not-compiled-totally-bad-patch might shed some light on what I was
thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
It does not fix it for PMDs naturally (as there are no PMD paravirt ops
for that).

The other question is - how is AutoNUMA running when it is not enabled?
Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
turned on?

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index ce563be..9fa7088 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -370,12 +370,15 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
unsigned long pfn = mfn_to_pfn(mfn);

pteval_t flags = val & PTE_FLAGS_MASK;
+ /* No AutoNUMA for PV. TODO If Linux sees the PTE having
+ * said bit, just igore it. */
+ if (flags & _PAGE_NUMA)
+ flags = flags & ~_PAGE_NUMA;
if (unlikely(pfn == ~0))
val = flags & ~_PAGE_PRESENT;
else
val = ((pteval_t)pfn << PAGE_SHIFT) | flags;
}
-
return val;
}

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index db09234..a8bc07d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -644,7 +644,7 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#ifndef pte_numa
static inline int pte_numa(pte_t pte)
{
- return (pte_flags(pte) &
+ return (pte_val(pte) &
(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
}
#endif

2014-01-22 07:29:26

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
> > On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> > > On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> > > <[email protected]> wrote:
>
> Adding extra folks to the party.
> > > >
> > > > Odds are this also shows up in 3.13, right?
> >
> > Reproduced using 3.13 on the PV guest:
> >
> > [ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
> > [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
> > [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
> > [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
> > [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
> > [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
> > [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
> > [ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
> > [ 368.756815] Call Trace:
> > [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
> > [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> > [ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> > [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> > [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
> > [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
> > [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
> > [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
> > [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
> > [ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
> > [ 368.756872] Disabling lock debugging due to kernel taint
> > [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
> > [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1
> >
> > >
> > > Probably. I don't have a Xen PV setup to test with (and very little
> > > interest in setting one up).. And I have a suspicion that it might not
> > > be so much about Xen PV, as perhaps about the kind of hardware.
> > >
> > > I suspect the issue has something to do with the magic _PAGE_NUMA
> > > tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> > > removing the _PAGE_PRESENT bit, and now the crazy numa code is
> > > confused.
> > >
> > > The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> > > bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> > > _PAGE_PRESENT.
> > >
> > > Adding Andrea to the Cc, because he's the author of that horridness.
> > > Putting Steven's test-case here as an attachement for Andrea, maybe
> > > that makes him go "Ahh, yes, silly case".
> > >
> > > Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
> > >
> > > Andrea, you can find the thread on lkml, but it boils down to commit
> > > 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> > > attached test-case (but apparently only under Xen PV). There it
> > > apparently causes a "BUG: Bad page map .." error.
>
> I *think* it is due to the fact that pmd_numa and pte_numa is getting the _raw_
> value of PMDs and PTEs. That is - it does not use the pvops interface
> and instead reads the values directly from the page-table. Since the
> page-table is also manipulated by the hypervisor - there are certain
> flags it also sets to do its business. It might be that it uses
> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
> pte_flags that would invoke the pvops interface.
>
> Elena, Dariof and George, you guys had been looking at this a bit deeper
> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>
> This not-compiled-totally-bad-patch might shed some light on what I was
> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
> for that).

Unfortunately the Totally Bad Patch seems to make no difference. I am
still able to repro the issue:

[ 346.374929] BUG: Bad page map in process mp pte:80000004ae928065 pmd:e993f9067
[ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping: (null) index:0x0
[ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
[ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071 anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
[ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+ #1
[ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768 00007f06a9bbb000
[ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000 0000000000000000
[ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000 ffff880e991a3e30
[ 346.374979] Call Trace:
[ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
[ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
[ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
[ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
[ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
[ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
[ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
[ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
[ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
[ 346.375034] [<ffffffff814e712d>] system_call_fastpath+0x1a/0x1f
[ 346.375037] Disabling lock debugging due to kernel taint
[ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:0 val:-1
[ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:1 val:1

This dump doesn't look dramatically different, either.

>
> The other question is - how is AutoNUMA running when it is not enabled?
> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
> turned on?

Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
mean not enabled at runtime?

[1] http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64

2014-01-22 14:30:18

by Daniel Borkmann

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On 01/22/2014 08:29 AM, Steven Noonan wrote:
> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>>>> <[email protected]> wrote:
>>
>> Adding extra folks to the party.
>>>>>
>>>>> Odds are this also shows up in 3.13, right?
>>>
>>> Reproduced using 3.13 on the PV guest:
>>>
>>> [ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
>>> [ 368.756815] Call Trace:
>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>> [ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>>> [ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
>>> [ 368.756872] Disabling lock debugging due to kernel taint
>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1
>>>
>>>>
>>>> Probably. I don't have a Xen PV setup to test with (and very little
>>>> interest in setting one up).. And I have a suspicion that it might not
>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>>>>
>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>>>> confused.
>>>>
>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>>>> _PAGE_PRESENT.
>>>>
>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>>>> that makes him go "Ahh, yes, silly case".
>>>>
>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>>>>
>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>>>> attached test-case (but apparently only under Xen PV). There it
>>>> apparently causes a "BUG: Bad page map .." error.
>>
>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the _raw_
>> value of PMDs and PTEs. That is - it does not use the pvops interface
>> and instead reads the values directly from the page-table. Since the
>> page-table is also manipulated by the hypervisor - there are certain
>> flags it also sets to do its business. It might be that it uses
>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>> pte_flags that would invoke the pvops interface.
>>
>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>>
>> This not-compiled-totally-bad-patch might shed some light on what I was
>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>> for that).
>
> Unfortunately the Totally Bad Patch seems to make no difference. I am
> still able to repro the issue:

Maybe this one is also related to this BUG here (cc'ed people investigating
this one) ...

https://lkml.org/lkml/2014/1/10/427

... not sure, though.

> [ 346.374929] BUG: Bad page map in process mp pte:80000004ae928065 pmd:e993f9067
> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping: (null) index:0x0
> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071 anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+ #1
> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768 00007f06a9bbb000
> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000 0000000000000000
> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000 ffff880e991a3e30
> [ 346.374979] Call Trace:
> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
> [ 346.375034] [<ffffffff814e712d>] system_call_fastpath+0x1a/0x1f
> [ 346.375037] Disabling lock debugging due to kernel taint
> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:0 val:-1
> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:1 val:1
>
> This dump doesn't look dramatically different, either.
>
>>
>> The other question is - how is AutoNUMA running when it is not enabled?
>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>> turned on?
>
> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
> mean not enabled at runtime?
>
> [1] http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64

2014-01-22 18:08:00

by Rik van Riel

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On 01/21/2014 09:47 PM, Linus Torvalds wrote:
> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> <[email protected]> wrote:
>>
>> Odds are this also shows up in 3.13, right?
>
> Probably. I don't have a Xen PV setup to test with (and very little
> interest in setting one up).. And I have a suspicion that it might not
> be so much about Xen PV, as perhaps about the kind of hardware.
>
> I suspect the issue has something to do with the magic _PAGE_NUMA
> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> removing the _PAGE_PRESENT bit, and now the crazy numa code is
> confused.
>
> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> _PAGE_PRESENT.

The numa balancing code should clear _PAGE_PRESENT and
set _PAGE_NUMA / _PAGE_PROTNONE.

The difference between a numa pte and a protnone pte is
the VMA permissions.

When the VMA is protnone, do_page_fault will kill the
app with a segfault. When the VMA has proper permissions,
handle_pte_fault will call do_numa_page, and numa-y things
are done.

>
> Adding Andrea to the Cc, because he's the author of that horridness.
> Putting Steven's test-case here as an attachement for Andrea, maybe
> that makes him go "Ahh, yes, silly case".
>
> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>
> Andrea, you can find the thread on lkml, but it boils down to commit
> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> attached test-case (but apparently only under Xen PV). There it
> apparently causes a "BUG: Bad page map .." error.
>
> And I suspect this is another of those "this bug is only visible on
> real numa machines, because _PAGE_NUMA isn't actually ever set
> otherwise". That has pretty much guaranteed that it gets basically
> zero testing, which is not a great idea when coupled with that subtle
> sharing of the _PAGE_PROTNONE bit..
>
> It may be that the whole "Xen PV" thing is a red herring, and that
> Steven only sees it on that one machine because the one he runs as a
> PV guest under is a real NUMA machine, and all the other machines he
> has tried it on haven't been numa. So it *may* be that that "only
> under Xen PV" is a red herring. But that's just a possible guess.
>
> Christ, how I hate that _PAGE_NUMA bit. Andrea: the fact that it gets
> no testing on any normal machines is a major problem. If it was simple
> and straightforward and the code was "obviously correct", it wouldn't
> be such a problem, but the _PAGE_NUMA code definitely does not fall
> under that "simple and obviously correct" heading.
>
> Guys, any ideas?
>
> Linus
>

2014-01-22 18:24:11

by Linus Torvalds

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 10:07 AM, Rik van Riel <[email protected]> wrote:
>
> The difference between a numa pte and a protnone pte is
> the VMA permissions.

If that is indeed the only difference, then we should damn well get
rid of that f*cking stupid _PAGE_NUMA name entirely.

It's misleading crap. Really. Just do a quick grep for that bit, and
you see just *how* confused people are about it:

#define _PAGE_NUMA _PAGE_PROTNONE
...
if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &

think about it. Just *THINK* about how broken that code is. The whole
thing is a disaster. _PAGE_NUMA must die. It's shit.

Linus

2014-01-22 18:39:42

by Rik van Riel

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On 01/22/2014 01:24 PM, Linus Torvalds wrote:
> On Wed, Jan 22, 2014 at 10:07 AM, Rik van Riel <[email protected]> wrote:
>>
>> The difference between a numa pte and a protnone pte is
>> the VMA permissions.
>
> If that is indeed the only difference, then we should damn well get
> rid of that f*cking stupid _PAGE_NUMA name entirely.
>
> It's misleading crap. Really. Just do a quick grep for that bit, and
> you see just *how* confused people are about it:
>
> #define _PAGE_NUMA _PAGE_PROTNONE
> ...
> if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &
>
> think about it. Just *THINK* about how broken that code is. The whole
> thing is a disaster. _PAGE_NUMA must die. It's shit.

The reason things are this way is that we were
not sure whether we can indeed use _PAGE_PROTNONE
for NUMA balancing on all architectures.

If we are sure that _PAGE_PROTNONE can be used
everywhere, I agree we should get rid of the whole
_PAGE_NUMA naming, and replace that ambiguous
code with some comments and documentation instead.

2014-01-22 20:18:54

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
> On 01/22/2014 08:29 AM, Steven Noonan wrote:
>>
>> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>>>
>>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>>>>
>>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>>>>>
>>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>>>>> <[email protected]> wrote:
>>>
>>>
>>> Adding extra folks to the party.
>>>>>>
>>>>>>
>>>>>> Odds are this also shows up in 3.13, right?
>>>>
>>>>
>>>> Reproduced using 3.13 on the PV guest:
>>>>
>>>> [ 368.756763] BUG: Bad page map in process mp
>>>> pte:80000004a67c6165 pmd:e9b706067
>>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
>>>> mapping: (null) index:0x0
>>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
>>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
>>>> #1
>>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
>>>> ffffffff814d8748 00007fd1388b7000
>>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
>>>> 0000000000000000 0000000000000000
>>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
>>>> 00007fd1388b8000 ffff880e9eaf3e30
>>>> [ 368.756815] Call Trace:
>>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>>> [ 368.756837] [<ffffffff8116eae3>]
>>>> unmap_single_vma+0x583/0x890
>>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>>>> [ 368.756869] [<ffffffff814e70ed>]
>>>> system_call_fastpath+0x1a/0x1f
>>>> [ 368.756872] Disabling lock debugging due to kernel taint
>>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
>>>> idx:0 val:-1
>>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
>>>> idx:1 val:1
>>>>
>>>>>
>>>>> Probably. I don't have a Xen PV setup to test with (and very little
>>>>> interest in setting one up).. And I have a suspicion that it might not
>>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>>>>>
>>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>>>>> confused.
>>>>>
>>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>>>>> _PAGE_PRESENT.
>>>>>
>>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>>>>> that makes him go "Ahh, yes, silly case".
>>>>>
>>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>>>>>
>>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>>>>> attached test-case (but apparently only under Xen PV). There it
>>>>> apparently causes a "BUG: Bad page map .." error.
>>>
>>>
>>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
>>> _raw_
>>> value of PMDs and PTEs. That is - it does not use the pvops interface
>>> and instead reads the values directly from the page-table. Since the
>>> page-table is also manipulated by the hypervisor - there are certain
>>> flags it also sets to do its business. It might be that it uses
>>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>>> pte_flags that would invoke the pvops interface.
>>>
>>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>>>
>>> This not-compiled-totally-bad-patch might shed some light on what I was
>>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>>> for that).
>>
>>
>> Unfortunately the Totally Bad Patch seems to make no difference. I am
>> still able to repro the issue:

Steven, do you use numa=fake on boot cmd line for pv guest?

I had similar issue on pv guest. Let me check if the fix that resolved
this for me will help with 3.13.

>
>
> Maybe this one is also related to this BUG here (cc'ed people investigating
> this one) ...
>
> https://lkml.org/lkml/2014/1/10/427
>
> ... not sure, though.
>
>
>> [ 346.374929] BUG: Bad page map in process mp
>> pte:80000004ae928065 pmd:e993f9067
>> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
>> (null) index:0x0
>> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
>> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
>> #1
>> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
>> 00007f06a9bbb000
>> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
>> 0000000000000000
>> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
>> ffff880e991a3e30
>> [ 346.374979] Call Trace:
>> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>> [ 346.375034] [<ffffffff814e712d>]
>> system_call_fastpath+0x1a/0x1f
>> [ 346.375037] Disabling lock debugging due to kernel taint
>> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> idx:0 val:-1
>> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> idx:1 val:1
>>
>> This dump doesn't look dramatically different, either.
>>
>>>
>>> The other question is - how is AutoNUMA running when it is not enabled?
>>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>>> turned on?
>>
>>
>> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>> mean not enabled at runtime?
>>
>> [1]
>> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64

--
Elena

2014-01-22 20:33:47

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
> > On 01/22/2014 08:29 AM, Steven Noonan wrote:
> >>
> >> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
> >>>
> >>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
> >>>>
> >>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> >>>>>
> >>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> >>>>> <[email protected]> wrote:
> >>>
> >>>
> >>> Adding extra folks to the party.
> >>>>>>
> >>>>>>
> >>>>>> Odds are this also shows up in 3.13, right?
> >>>>
> >>>>
> >>>> Reproduced using 3.13 on the PV guest:
> >>>>
> >>>> [ 368.756763] BUG: Bad page map in process mp
> >>>> pte:80000004a67c6165 pmd:e9b706067
> >>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
> >>>> mapping: (null) index:0x0
> >>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
> >>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
> >>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
> >>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
> >>>> #1
> >>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
> >>>> ffffffff814d8748 00007fd1388b7000
> >>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
> >>>> 0000000000000000 0000000000000000
> >>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
> >>>> 00007fd1388b8000 ffff880e9eaf3e30
> >>>> [ 368.756815] Call Trace:
> >>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
> >>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> >>>> [ 368.756837] [<ffffffff8116eae3>]
> >>>> unmap_single_vma+0x583/0x890
> >>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> >>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
> >>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
> >>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
> >>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
> >>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
> >>>> [ 368.756869] [<ffffffff814e70ed>]
> >>>> system_call_fastpath+0x1a/0x1f
> >>>> [ 368.756872] Disabling lock debugging due to kernel taint
> >>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
> >>>> idx:0 val:-1
> >>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
> >>>> idx:1 val:1
> >>>>
> >>>>>
> >>>>> Probably. I don't have a Xen PV setup to test with (and very little
> >>>>> interest in setting one up).. And I have a suspicion that it might not
> >>>>> be so much about Xen PV, as perhaps about the kind of hardware.
> >>>>>
> >>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
> >>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> >>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
> >>>>> confused.
> >>>>>
> >>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> >>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> >>>>> _PAGE_PRESENT.
> >>>>>
> >>>>> Adding Andrea to the Cc, because he's the author of that horridness.
> >>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
> >>>>> that makes him go "Ahh, yes, silly case".
> >>>>>
> >>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
> >>>>>
> >>>>> Andrea, you can find the thread on lkml, but it boils down to commit
> >>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> >>>>> attached test-case (but apparently only under Xen PV). There it
> >>>>> apparently causes a "BUG: Bad page map .." error.
> >>>
> >>>
> >>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
> >>> _raw_
> >>> value of PMDs and PTEs. That is - it does not use the pvops interface
> >>> and instead reads the values directly from the page-table. Since the
> >>> page-table is also manipulated by the hypervisor - there are certain
> >>> flags it also sets to do its business. It might be that it uses
> >>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
> >>> pte_flags that would invoke the pvops interface.
> >>>
> >>> Elena, Dariof and George, you guys had been looking at this a bit deeper
> >>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
> >>>
> >>> This not-compiled-totally-bad-patch might shed some light on what I was
> >>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
> >>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
> >>> for that).
> >>
> >>
> >> Unfortunately the Totally Bad Patch seems to make no difference. I am
> >> still able to repro the issue:
>
> Steven, do you use numa=fake on boot cmd line for pv guest?
>
> I had similar issue on pv guest. Let me check if the fix that resolved
> this for me will help with 3.13.

Nope:

# cat /proc/cmdline
root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7

>
> >
> >
> > Maybe this one is also related to this BUG here (cc'ed people investigating
> > this one) ...
> >
> > https://lkml.org/lkml/2014/1/10/427
> >
> > ... not sure, though.
> >
> >
> >> [ 346.374929] BUG: Bad page map in process mp
> >> pte:80000004ae928065 pmd:e993f9067
> >> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
> >> (null) index:0x0
> >> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
> >> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
> >> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
> >> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
> >> #1
> >> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
> >> 00007f06a9bbb000
> >> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
> >> 0000000000000000
> >> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
> >> ffff880e991a3e30
> >> [ 346.374979] Call Trace:
> >> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
> >> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> >> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> >> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> >> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
> >> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
> >> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
> >> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
> >> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
> >> [ 346.375034] [<ffffffff814e712d>]
> >> system_call_fastpath+0x1a/0x1f
> >> [ 346.375037] Disabling lock debugging due to kernel taint
> >> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
> >> idx:0 val:-1
> >> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
> >> idx:1 val:1
> >>
> >> This dump doesn't look dramatically different, either.
> >>
> >>>
> >>> The other question is - how is AutoNUMA running when it is not enabled?
> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
> >>> turned on?
> >>
> >>
> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
> >> mean not enabled at runtime?
> >>
> >> [1]
> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>
>
>
> --
> Elena

2014-01-23 16:23:42

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <[email protected]> wrote:
> On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
>> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
>> > On 01/22/2014 08:29 AM, Steven Noonan wrote:
>> >>
>> >> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>> >>>
>> >>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>> >>>>
>> >>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>> >>>>>
>> >>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>> >>>>> <[email protected]> wrote:
>> >>>
>> >>>
>> >>> Adding extra folks to the party.
>> >>>>>>
>> >>>>>>
>> >>>>>> Odds are this also shows up in 3.13, right?
>> >>>>
>> >>>>
>> >>>> Reproduced using 3.13 on the PV guest:
>> >>>>
>> >>>> [ 368.756763] BUG: Bad page map in process mp
>> >>>> pte:80000004a67c6165 pmd:e9b706067
>> >>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
>> >>>> mapping: (null) index:0x0
>> >>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>> >>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
>> >>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>> >>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
>> >>>> #1
>> >>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
>> >>>> ffffffff814d8748 00007fd1388b7000
>> >>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
>> >>>> 0000000000000000 0000000000000000
>> >>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
>> >>>> 00007fd1388b8000 ffff880e9eaf3e30
>> >>>> [ 368.756815] Call Trace:
>> >>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>> >>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> >>>> [ 368.756837] [<ffffffff8116eae3>]
>> >>>> unmap_single_vma+0x583/0x890
>> >>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> >>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>> >>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>> >>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>> >>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>> >>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>> >>>> [ 368.756869] [<ffffffff814e70ed>]
>> >>>> system_call_fastpath+0x1a/0x1f
>> >>>> [ 368.756872] Disabling lock debugging due to kernel taint
>> >>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
>> >>>> idx:0 val:-1
>> >>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
>> >>>> idx:1 val:1
>> >>>>
>> >>>>>
>> >>>>> Probably. I don't have a Xen PV setup to test with (and very little
>> >>>>> interest in setting one up).. And I have a suspicion that it might not
>> >>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>> >>>>>
>> >>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>> >>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>> >>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>> >>>>> confused.
>> >>>>>
>> >>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>> >>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>> >>>>> _PAGE_PRESENT.
>> >>>>>
>> >>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>> >>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>> >>>>> that makes him go "Ahh, yes, silly case".
>> >>>>>
>> >>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>> >>>>>
>> >>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>> >>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>> >>>>> attached test-case (but apparently only under Xen PV). There it
>> >>>>> apparently causes a "BUG: Bad page map .." error.
>> >>>
>> >>>
>> >>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
>> >>> _raw_
>> >>> value of PMDs and PTEs. That is - it does not use the pvops interface
>> >>> and instead reads the values directly from the page-table. Since the
>> >>> page-table is also manipulated by the hypervisor - there are certain
>> >>> flags it also sets to do its business. It might be that it uses
>> >>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>> >>> pte_flags that would invoke the pvops interface.
>> >>>
>> >>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>> >>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?

It does use _PAGE_GLOBAL for guest user pages

>> >>>
>> >>> This not-compiled-totally-bad-patch might shed some light on what I was
>> >>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>> >>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>> >>> for that).
>> >>
>> >>
>> >> Unfortunately the Totally Bad Patch seems to make no difference. I am
>> >> still able to repro the issue:
>>
>> Steven, do you use numa=fake on boot cmd line for pv guest?
>>
>> I had similar issue on pv guest. Let me check if the fix that resolved
>> this for me will help with 3.13.
>
> Nope:
>
> # cat /proc/cmdline
> root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7

>
>>
>> >
>> >
>> > Maybe this one is also related to this BUG here (cc'ed people investigating
>> > this one) ...
>> >
>> > https://lkml.org/lkml/2014/1/10/427
>> >
>> > ... not sure, though.
>> >
>> >
>> >> [ 346.374929] BUG: Bad page map in process mp
>> >> pte:80000004ae928065 pmd:e993f9067
>> >> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
>> >> (null) index:0x0
>> >> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>> >> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
>> >> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>> >> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
>> >> #1
>> >> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
>> >> 00007f06a9bbb000
>> >> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
>> >> 0000000000000000
>> >> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
>> >> ffff880e991a3e30
>> >> [ 346.374979] Call Trace:
>> >> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>> >> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> >> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>> >> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> >> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>> >> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>> >> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>> >> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>> >> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>> >> [ 346.375034] [<ffffffff814e712d>]
>> >> system_call_fastpath+0x1a/0x1f
>> >> [ 346.375037] Disabling lock debugging due to kernel taint
>> >> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> >> idx:0 val:-1
>> >> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> >> idx:1 val:1
>> >>
>> >> This dump doesn't look dramatically different, either.
>> >>
>> >>>
>> >>> The other question is - how is AutoNUMA running when it is not enabled?
>> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>> >>> turned on?
>> >>
>> >>
>> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>> >> mean not enabled at runtime?
>> >>
>> >> [1]
>> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>>
>>
>>
>> --
>> Elena

I was able to reproduce this consistently, also with the latest mm
patches from yesterday.
Can you please try this:

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index ce563be..76dcf96 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
*mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || ((val &
(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || ((val &
(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;

2014-01-23 17:03:44

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Tue, Jan 21, 2014 at 03:27:08PM -0800, Steven Noonan wrote:
> A user reported a problem starting vsftpd on a Xen paravirtualized
> guest, with this in dmesg:
>
> [ 60.654862] BUG: Bad page map in process vsftpd pte:8000000493b88165 pmd:e9cc01067
> [ 60.654876] page:ffffea00124ee200 count:0 mapcount:-1 mapping: (null) index:0x0
> [ 60.654879] page flags: 0x2ffc0000000014(referenced|dirty)
> [ 60.654885] addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping: (null) index:7f97eea74
> [ 60.654890] CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
> [ 60.654893] ffff880e9cc6ec38 ffff880e9cc61ca0 ffffffff814c763b 00007f97eea74000
> [ 60.654900] ffff880e9cc61ce8 ffffffff8116784e 0000000000000000 0000000000000000
> [ 60.654906] ffff880e9cc013a0 ffffea00124ee200 00007f97eea75000 ffff880e9cc61e10
> [ 60.654912] Call Trace:
> [ 60.654921] [<ffffffff814c763b>] dump_stack+0x45/0x56
> [ 60.654928] [<ffffffff8116784e>] print_bad_pte+0x22e/0x250
> [ 60.654933] [<ffffffff81169073>] unmap_single_vma+0x583/0x890
> [ 60.654938] [<ffffffff8116a405>] unmap_vmas+0x65/0x90
> [ 60.654942] [<ffffffff81173795>] exit_mmap+0xc5/0x170
> [ 60.654948] [<ffffffff8105d295>] mmput+0x65/0x100
> [ 60.654952] [<ffffffff81062983>] do_exit+0x393/0x9e0
> [ 60.654955] [<ffffffff810630dc>] do_group_exit+0xcc/0x140
> [ 60.654959] [<ffffffff81063164>] SyS_exit_group+0x14/0x20
> [ 60.654965] [<ffffffff814d602d>] system_call_fastpath+0x1a/0x1f
> [ 60.654968] Disabling lock debugging due to kernel taint
> [ 60.655191] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
> [ 60.655196] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1
>
>
> The issue could not be reproduced under an HVM instance with the same
> kernel, so it appears to be exclusive to paravirtual Xen guests.
>
> I noted that it wasn't present in 3.10.27, but was present in 3.12.7 and
> 3.12.8. I ran through a bisection to find the root cause:
>
> # start: 'v3.12.7' 'v3.10.27'
> # bad: [4301b7a8] Linux 3.12.7
> # good: [1071ea6e] Linux 3.10.27
> # good: [8bb495e3] Linux 3.10
> # good: [8fe73691] staging: comedi: comedi_bond: change return value
> # good: [22e04f6b] Merge branch 'for-linus' of git://git.kernel.org/p
> # good: [b7c09ad4] Merge branch 'for-linus' of git://git.kernel.org/p
> # good: [13caa8ed] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
> # good: [13caa8ed] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
> # good: [f5fa9283] ipv6: reset dst.expires value when clearing expire
> # good: [4af9d888] bridge: flush br's address entry in fdb when remov
> # good: [8c13daf6] dm delay: fix a possible deadlock due to shared wo
> # good: [93c02d70] firewire: sbp2: bring back WRITE SAME support
> # good: [18065245] ACPI / PCI / hotplug: Avoid warning when _ADR not
> # bad: [8807a436] mm/memory-failure.c: transfer page count from head
> # bad: [fd5df800] mm: numa: avoid unnecessary disruption of NUMA hin
> # good: [c18e3316] mm: numa: do not clear PMD during PTE update scan
> # good: [f3b578d9] mm: numa: avoid unnecessary work on the failure pa
> # bad: [3d792d61] mm: numa: clear numa hinting information on mprote
> # good: [cefeb279] sched: numa: skip inaccessible VMAs
> # first bad: [3d792d61] mm: numa: clear numa hinting information on mprote
>
> If only I'd tested v3.12.0, that bisection would have been a lot shorter!
>
>
> It looks like this is the change implicated (introduced in v3.12.7):
>
> commit 3d792d616ba408ab55a54c1bb75a9367d997acfa
> Author: Mel Gorman <[email protected]>
> Date: Tue Jan 7 14:00:44 2014 +0000
>
> mm: numa: clear numa hinting information on mprotect
>
> commit 1667918b6483b12a6496bf54151b827b8235d7b1 upstream.
>
> On a protection change it is no longer clear if the page should be still
> accessible. This patch clears the NUMA hinting fault bits on a
> protection change.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Cc: Alex Thorlton <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>

Sorry for the long delay in responding. I've only managed to get as far as
this mail now and it's already my evening and I won't be able to work on
it now either. Based on the patch a possible explanation is that I called
the page table mknonnuma helpers on a non-numa PTE and there is now a
mixup between the _PAGE_PRESENT and _PAGE_PROTNONE bits. It feels like a
bad fit though and would not explain why it only affects Xen. A slightly
better fit would be if the page table helpers for numa hinting flags are
colliding badly with the paravirt interface somehow. A git grep for
_PAGE_PROTNONE on xen showed up nothing but _PAGE_PRESENT has special
meaning and that might be what I trashed.

--
Mel Gorman
SUSE Labs

2014-01-23 23:20:22

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
> On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <[email protected]> wrote:
> > On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
> >> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
> >> > On 01/22/2014 08:29 AM, Steven Noonan wrote:
> >> >>
> >> >> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
> >> >>>
> >> >>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
> >> >>>>
> >> >>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> >> >>>>>
> >> >>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> >> >>>>> <[email protected]> wrote:
> >> >>>
> >> >>>
> >> >>> Adding extra folks to the party.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Odds are this also shows up in 3.13, right?
> >> >>>>
> >> >>>>
> >> >>>> Reproduced using 3.13 on the PV guest:
> >> >>>>
> >> >>>> [ 368.756763] BUG: Bad page map in process mp
> >> >>>> pte:80000004a67c6165 pmd:e9b706067
> >> >>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
> >> >>>> mapping: (null) index:0x0
> >> >>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
> >> >>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
> >> >>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
> >> >>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
> >> >>>> #1
> >> >>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
> >> >>>> ffffffff814d8748 00007fd1388b7000
> >> >>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
> >> >>>> 0000000000000000 0000000000000000
> >> >>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
> >> >>>> 00007fd1388b8000 ffff880e9eaf3e30
> >> >>>> [ 368.756815] Call Trace:
> >> >>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
> >> >>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> >> >>>> [ 368.756837] [<ffffffff8116eae3>]
> >> >>>> unmap_single_vma+0x583/0x890
> >> >>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> >> >>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
> >> >>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
> >> >>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
> >> >>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
> >> >>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
> >> >>>> [ 368.756869] [<ffffffff814e70ed>]
> >> >>>> system_call_fastpath+0x1a/0x1f
> >> >>>> [ 368.756872] Disabling lock debugging due to kernel taint
> >> >>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
> >> >>>> idx:0 val:-1
> >> >>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
> >> >>>> idx:1 val:1
> >> >>>>
> >> >>>>>
> >> >>>>> Probably. I don't have a Xen PV setup to test with (and very little
> >> >>>>> interest in setting one up).. And I have a suspicion that it might not
> >> >>>>> be so much about Xen PV, as perhaps about the kind of hardware.
> >> >>>>>
> >> >>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
> >> >>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> >> >>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
> >> >>>>> confused.
> >> >>>>>
> >> >>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> >> >>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> >> >>>>> _PAGE_PRESENT.
> >> >>>>>
> >> >>>>> Adding Andrea to the Cc, because he's the author of that horridness.
> >> >>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
> >> >>>>> that makes him go "Ahh, yes, silly case".
> >> >>>>>
> >> >>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
> >> >>>>>
> >> >>>>> Andrea, you can find the thread on lkml, but it boils down to commit
> >> >>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> >> >>>>> attached test-case (but apparently only under Xen PV). There it
> >> >>>>> apparently causes a "BUG: Bad page map .." error.
> >> >>>
> >> >>>
> >> >>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
> >> >>> _raw_
> >> >>> value of PMDs and PTEs. That is - it does not use the pvops interface
> >> >>> and instead reads the values directly from the page-table. Since the
> >> >>> page-table is also manipulated by the hypervisor - there are certain
> >> >>> flags it also sets to do its business. It might be that it uses
> >> >>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
> >> >>> pte_flags that would invoke the pvops interface.
> >> >>>
> >> >>> Elena, Dariof and George, you guys had been looking at this a bit deeper
> >> >>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>
> It does use _PAGE_GLOBAL for guest user pages
>
> >> >>>
> >> >>> This not-compiled-totally-bad-patch might shed some light on what I was
> >> >>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
> >> >>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
> >> >>> for that).
> >> >>
> >> >>
> >> >> Unfortunately the Totally Bad Patch seems to make no difference. I am
> >> >> still able to repro the issue:
> >>
> >> Steven, do you use numa=fake on boot cmd line for pv guest?
> >>
> >> I had similar issue on pv guest. Let me check if the fix that resolved
> >> this for me will help with 3.13.
> >
> > Nope:
> >
> > # cat /proc/cmdline
> > root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7
>
> >
> >>
> >> >
> >> >
> >> > Maybe this one is also related to this BUG here (cc'ed people investigating
> >> > this one) ...
> >> >
> >> > https://lkml.org/lkml/2014/1/10/427
> >> >
> >> > ... not sure, though.
> >> >
> >> >
> >> >> [ 346.374929] BUG: Bad page map in process mp
> >> >> pte:80000004ae928065 pmd:e993f9067
> >> >> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
> >> >> (null) index:0x0
> >> >> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
> >> >> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
> >> >> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
> >> >> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
> >> >> #1
> >> >> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
> >> >> 00007f06a9bbb000
> >> >> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
> >> >> 0000000000000000
> >> >> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
> >> >> ffff880e991a3e30
> >> >> [ 346.374979] Call Trace:
> >> >> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
> >> >> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> >> >> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> >> >> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> >> >> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
> >> >> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
> >> >> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
> >> >> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
> >> >> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
> >> >> [ 346.375034] [<ffffffff814e712d>]
> >> >> system_call_fastpath+0x1a/0x1f
> >> >> [ 346.375037] Disabling lock debugging due to kernel taint
> >> >> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
> >> >> idx:0 val:-1
> >> >> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
> >> >> idx:1 val:1
> >> >>
> >> >> This dump doesn't look dramatically different, either.
> >> >>
> >> >>>
> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
> >> >>> turned on?
> >> >>
> >> >>
> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
> >> >> mean not enabled at runtime?
> >> >>
> >> >> [1]
> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
> >>
> >>
> >>
> >> --
> >> Elena
>
> I was able to reproduce this consistently, also with the latest mm
> patches from yesterday.
> Can you please try this:
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index ce563be..76dcf96 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
> *mm, unsigned long addr,
> /* Assume pteval_t is equivalent to all the other *val_t types. */
> static pteval_t pte_mfn_to_pfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> unsigned long pfn = mfn_to_pfn(mfn);
>
> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>
> static pteval_t pte_pfn_to_mfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> pteval_t flags = val & PTE_FLAGS_MASK;
> unsigned long mfn;

Thanks Elena, I just tested that and it does un-break things.

- Steven

2014-01-24 04:28:03

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Thu, Jan 23, 2014 at 6:20 PM, Steven Noonan <[email protected]> wrote:
> On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
>> On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <[email protected]> wrote:
>> > On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
>> >> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
>> >> > On 01/22/2014 08:29 AM, Steven Noonan wrote:
>> >> >>
>> >> >> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>> >> >>>
>> >> >>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>> >> >>>>
>> >> >>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>> >> >>>>>
>> >> >>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>> >> >>>>> <[email protected]> wrote:
>> >> >>>
>> >> >>>
>> >> >>> Adding extra folks to the party.
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Odds are this also shows up in 3.13, right?
>> >> >>>>
>> >> >>>>
>> >> >>>> Reproduced using 3.13 on the PV guest:
>> >> >>>>
>> >> >>>> [ 368.756763] BUG: Bad page map in process mp
>> >> >>>> pte:80000004a67c6165 pmd:e9b706067
>> >> >>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
>> >> >>>> mapping: (null) index:0x0
>> >> >>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>> >> >>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
>> >> >>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>> >> >>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
>> >> >>>> #1
>> >> >>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
>> >> >>>> ffffffff814d8748 00007fd1388b7000
>> >> >>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
>> >> >>>> 0000000000000000 0000000000000000
>> >> >>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
>> >> >>>> 00007fd1388b8000 ffff880e9eaf3e30
>> >> >>>> [ 368.756815] Call Trace:
>> >> >>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>> >> >>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> >> >>>> [ 368.756837] [<ffffffff8116eae3>]
>> >> >>>> unmap_single_vma+0x583/0x890
>> >> >>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> >> >>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>> >> >>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>> >> >>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>> >> >>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>> >> >>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>> >> >>>> [ 368.756869] [<ffffffff814e70ed>]
>> >> >>>> system_call_fastpath+0x1a/0x1f
>> >> >>>> [ 368.756872] Disabling lock debugging due to kernel taint
>> >> >>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
>> >> >>>> idx:0 val:-1
>> >> >>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
>> >> >>>> idx:1 val:1
>> >> >>>>
>> >> >>>>>
>> >> >>>>> Probably. I don't have a Xen PV setup to test with (and very little
>> >> >>>>> interest in setting one up).. And I have a suspicion that it might not
>> >> >>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>> >> >>>>>
>> >> >>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>> >> >>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>> >> >>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>> >> >>>>> confused.
>> >> >>>>>
>> >> >>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>> >> >>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>> >> >>>>> _PAGE_PRESENT.
>> >> >>>>>
>> >> >>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>> >> >>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>> >> >>>>> that makes him go "Ahh, yes, silly case".
>> >> >>>>>
>> >> >>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>> >> >>>>>
>> >> >>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>> >> >>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>> >> >>>>> attached test-case (but apparently only under Xen PV). There it
>> >> >>>>> apparently causes a "BUG: Bad page map .." error.
>> >> >>>
>> >> >>>
>> >> >>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
>> >> >>> _raw_
>> >> >>> value of PMDs and PTEs. That is - it does not use the pvops interface
>> >> >>> and instead reads the values directly from the page-table. Since the
>> >> >>> page-table is also manipulated by the hypervisor - there are certain
>> >> >>> flags it also sets to do its business. It might be that it uses
>> >> >>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>> >> >>> pte_flags that would invoke the pvops interface.
>> >> >>>
>> >> >>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>> >> >>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>>
>> It does use _PAGE_GLOBAL for guest user pages
>>
>> >> >>>
>> >> >>> This not-compiled-totally-bad-patch might shed some light on what I was
>> >> >>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>> >> >>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>> >> >>> for that).
>> >> >>
>> >> >>
>> >> >> Unfortunately the Totally Bad Patch seems to make no difference. I am
>> >> >> still able to repro the issue:
>> >>
>> >> Steven, do you use numa=fake on boot cmd line for pv guest?
>> >>
>> >> I had similar issue on pv guest. Let me check if the fix that resolved
>> >> this for me will help with 3.13.
>> >
>> > Nope:
>> >
>> > # cat /proc/cmdline
>> > root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7
>>
>> >
>> >>
>> >> >
>> >> >
>> >> > Maybe this one is also related to this BUG here (cc'ed people investigating
>> >> > this one) ...
>> >> >
>> >> > https://lkml.org/lkml/2014/1/10/427
>> >> >
>> >> > ... not sure, though.
>> >> >
>> >> >
>> >> >> [ 346.374929] BUG: Bad page map in process mp
>> >> >> pte:80000004ae928065 pmd:e993f9067
>> >> >> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
>> >> >> (null) index:0x0
>> >> >> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>> >> >> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
>> >> >> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>> >> >> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
>> >> >> #1
>> >> >> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
>> >> >> 00007f06a9bbb000
>> >> >> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
>> >> >> 0000000000000000
>> >> >> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
>> >> >> ffff880e991a3e30
>> >> >> [ 346.374979] Call Trace:
>> >> >> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>> >> >> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> >> >> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>> >> >> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> >> >> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>> >> >> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>> >> >> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>> >> >> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>> >> >> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>> >> >> [ 346.375034] [<ffffffff814e712d>]
>> >> >> system_call_fastpath+0x1a/0x1f
>> >> >> [ 346.375037] Disabling lock debugging due to kernel taint
>> >> >> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> >> >> idx:0 val:-1
>> >> >> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> >> >> idx:1 val:1
>> >> >>
>> >> >> This dump doesn't look dramatically different, either.
>> >> >>
>> >> >>>
>> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
>> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>> >> >>> turned on?
>> >> >>
>> >> >>
>> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>> >> >> mean not enabled at runtime?
>> >> >>
>> >> >> [1]
>> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>> >>
>> >>
>> >>
>> >> --
>> >> Elena
>>
>> I was able to reproduce this consistently, also with the latest mm
>> patches from yesterday.
>> Can you please try this:
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index ce563be..76dcf96 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
>> *mm, unsigned long addr,
>> /* Assume pteval_t is equivalent to all the other *val_t types. */
>> static pteval_t pte_mfn_to_pfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || ((val &
>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> unsigned long pfn = mfn_to_pfn(mfn);
>>
>> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>>
>> static pteval_t pte_pfn_to_mfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || ((val &
>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> pteval_t flags = val & PTE_FLAGS_MASK;
>> unsigned long mfn;
>
> Thanks Elena, I just tested that and it does un-break things.
>
> - Steven

Good sign ) Ok, Ill format this patch for submission after it will be
clear what path is taken here (as its not numa auto balancing related
thing).
--
Elena

2014-01-24 11:05:27

by David Vrabel

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On 23/01/14 16:23, Elena Ufimtseva wrote:
> On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <[email protected]> wrote:
>> On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
>>> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
>>>> On 01/22/2014 08:29 AM, Steven Noonan wrote:
>>>>>
>>>>> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>>>>>>
>>>>>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>>>>>>>
>>>>>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>>>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Adding extra folks to the party.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Odds are this also shows up in 3.13, right?
>>>>>>>
>>>>>>>
>>>>>>> Reproduced using 3.13 on the PV guest:
>>>>>>>
>>>>>>> [ 368.756763] BUG: Bad page map in process mp
>>>>>>> pte:80000004a67c6165 pmd:e9b706067
>>>>>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
>>>>>>> mapping: (null) index:0x0
>>>>>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>>>>>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
>>>>>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>>>>>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
>>>>>>> #1
>>>>>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
>>>>>>> ffffffff814d8748 00007fd1388b7000
>>>>>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
>>>>>>> 0000000000000000 0000000000000000
>>>>>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
>>>>>>> 00007fd1388b8000 ffff880e9eaf3e30
>>>>>>> [ 368.756815] Call Trace:
>>>>>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>>>>>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>>>>>> [ 368.756837] [<ffffffff8116eae3>]
>>>>>>> unmap_single_vma+0x583/0x890
>>>>>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>>>>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>>>>>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>>>>>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>>>>>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>>>>>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>>>>>>> [ 368.756869] [<ffffffff814e70ed>]
>>>>>>> system_call_fastpath+0x1a/0x1f
>>>>>>> [ 368.756872] Disabling lock debugging due to kernel taint
>>>>>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
>>>>>>> idx:0 val:-1
>>>>>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
>>>>>>> idx:1 val:1
>>>>>>>
>>>>>>>>
>>>>>>>> Probably. I don't have a Xen PV setup to test with (and very little
>>>>>>>> interest in setting one up).. And I have a suspicion that it might not
>>>>>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>>>>>>>>
>>>>>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>>>>>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>>>>>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>>>>>>>> confused.
>>>>>>>>
>>>>>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>>>>>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>>>>>>>> _PAGE_PRESENT.
>>>>>>>>
>>>>>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>>>>>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>>>>>>>> that makes him go "Ahh, yes, silly case".
>>>>>>>>
>>>>>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>>>>>>>>
>>>>>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>>>>>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>>>>>>>> attached test-case (but apparently only under Xen PV). There it
>>>>>>>> apparently causes a "BUG: Bad page map .." error.
>>>>>>
>>>>>>
>>>>>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
>>>>>> _raw_
>>>>>> value of PMDs and PTEs. That is - it does not use the pvops interface
>>>>>> and instead reads the values directly from the page-table. Since the
>>>>>> page-table is also manipulated by the hypervisor - there are certain
>>>>>> flags it also sets to do its business. It might be that it uses
>>>>>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>>>>>> pte_flags that would invoke the pvops interface.
>>>>>>
>>>>>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>>>>>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>
> It does use _PAGE_GLOBAL for guest user pages
>
>>>>>>
>>>>>> This not-compiled-totally-bad-patch might shed some light on what I was
>>>>>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>>>>>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>>>>>> for that).
>>>>>
>>>>>
>>>>> Unfortunately the Totally Bad Patch seems to make no difference. I am
>>>>> still able to repro the issue:
>>>
>>> Steven, do you use numa=fake on boot cmd line for pv guest?
>>>
>>> I had similar issue on pv guest. Let me check if the fix that resolved
>>> this for me will help with 3.13.
>>
>> Nope:
>>
>> # cat /proc/cmdline
>> root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7
>
>>
>>>
>>>>
>>>>
>>>> Maybe this one is also related to this BUG here (cc'ed people investigating
>>>> this one) ...
>>>>
>>>> https://lkml.org/lkml/2014/1/10/427
>>>>
>>>> ... not sure, though.
>>>>
>>>>
>>>>> [ 346.374929] BUG: Bad page map in process mp
>>>>> pte:80000004ae928065 pmd:e993f9067
>>>>> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
>>>>> (null) index:0x0
>>>>> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>>>>> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
>>>>> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>>>>> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
>>>>> #1
>>>>> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
>>>>> 00007f06a9bbb000
>>>>> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
>>>>> 0000000000000000
>>>>> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
>>>>> ffff880e991a3e30
>>>>> [ 346.374979] Call Trace:
>>>>> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>>>>> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>>>> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>>>>> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>>>> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>>>>> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>>>>> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>>>>> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>>>>> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>>>>> [ 346.375034] [<ffffffff814e712d>]
>>>>> system_call_fastpath+0x1a/0x1f
>>>>> [ 346.375037] Disabling lock debugging due to kernel taint
>>>>> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>>>>> idx:0 val:-1
>>>>> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>>>>> idx:1 val:1
>>>>>
>>>>> This dump doesn't look dramatically different, either.
>>>>>
>>>>>>
>>>>>> The other question is - how is AutoNUMA running when it is not enabled?
>>>>>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>>>>>> turned on?
>>>>>
>>>>>
>>>>> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>>>>> mean not enabled at runtime?
>>>>>
>>>>> [1]
>>>>> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>>>
>>>
>>>
>>> --
>>> Elena
>
> I was able to reproduce this consistently, also with the latest mm
> patches from yesterday.
> Can you please try this:
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index ce563be..76dcf96 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
> *mm, unsigned long addr,
> /* Assume pteval_t is equivalent to all the other *val_t types. */
> static pteval_t pte_mfn_to_pfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {

if (val & (_PAGE_PRESENT | _PAGE_NUMA))

is equivalent.

David

2014-01-24 11:43:53

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 01:39:32PM -0500, Rik van Riel wrote:
> On 01/22/2014 01:24 PM, Linus Torvalds wrote:
> >On Wed, Jan 22, 2014 at 10:07 AM, Rik van Riel <[email protected]> wrote:
> >>
> >>The difference between a numa pte and a protnone pte is
> >>the VMA permissions.
> >
> >If that is indeed the only difference, then we should damn well get
> >rid of that f*cking stupid _PAGE_NUMA name entirely.
> >
> >It's misleading crap. Really. Just do a quick grep for that bit, and
> >you see just *how* confused people are about it:
> >
> > #define _PAGE_NUMA _PAGE_PROTNONE
> > ...
> > if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &
> >
> >think about it. Just *THINK* about how broken that code is. The whole
> >thing is a disaster. _PAGE_NUMA must die. It's shit.
>
> The reason things are this way is that we were
> not sure whether we can indeed use _PAGE_PROTNONE
> for NUMA balancing on all architectures.
>

Power is not using _PAGE_PROTNONE to trap NUMA hinting faults because they
do not have that bit. Instead they reuse _PAGE_COHERENT with various tricks,
patches are in -next.

92c08a0d522c7e62c01a63e42597f0c2b02c4245 powerpc/mm: Use HPTE constants when updating hpte bits
c8c06f5a0dde0fed260c54d550962187f266ed0d powerpc/mm: Free up _PAGE_COHERENCE for numa fault use later
8937ba48dcf62b5cdf7abb93652914af16756f50 powerpc/mm: Only check for _PAGE_PRESENT in set_pte/pmd functions
c34a51ce49b40b9667cd7f5cc2e40475af8b4c3d powerpc/mm: Enable _PAGE_NUMA for book3s

As confusing as _PAGE_NUMA is, the intent was to express support in an
architecture-independent manner. If we had started with _PAGE_PROTNONE
then it would still be ambiguous -- are we interested in NUMA hinting
information or is this really PROTNONE protection? The power people would
then have had to add something like _PAGE_NUMA anyway when they had no
_PAGE_PROTNONE bit or define _PAGE_PROTNONE as _PAGE_COHERENT which is
just a different type of misleading.

> If we are sure that _PAGE_PROTNONE can be used
> everywhere, I agree we should get rid of the whole
> _PAGE_NUMA naming, and replace that ambiguous
> code with some comments and documentation instead.

We are sure that _PAGE_PROTNONE cannot be used everywhere.

--
Mel Gorman
SUSE Labs

2014-01-24 13:38:38

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
> >> >> <SNIP>
> >> >>
> >> >> This dump doesn't look dramatically different, either.
> >> >>
> >> >>>
> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
> >> >>> turned on?
> >> >>
> >> >>
> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
> >> >> mean not enabled at runtime?
> >> >>
> >> >> [1]
> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
> >>
> >>
> >>
> >> --
> >> Elena
>
> I was able to reproduce this consistently, also with the latest mm
> patches from yesterday.
> Can you please try this:
>

Thanks Elena,

> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index ce563be..76dcf96 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
> *mm, unsigned long addr,
> /* Assume pteval_t is equivalent to all the other *val_t types. */
> static pteval_t pte_mfn_to_pfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> unsigned long pfn = mfn_to_pfn(mfn);
>
> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>
> static pteval_t pte_pfn_to_mfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> pteval_t flags = val & PTE_FLAGS_MASK;
> unsigned long mfn;

Would reusing pte_present be an option? Ordinarily I expect that
PAGE_NUMA/PAGE_PROTNONE is only set if PAGE_PRESENT is not set and pte_present
is defined as

static inline int pte_present(pte_t a)
{
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
_PAGE_NUMA);
}

So it looks like it work work. Of course it would need to be split to
reuse it within xen if pte_present was split to have a pteval_present
helper like so

static inline int pteval_present(pteval_t val)
{
/*
* Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
* way clearly states that the intent is that a protnone and numa
* hinting ptes are considered present for the purposes of
* pagetable operations like zapping, protection changes, gup etc.
*/
return val & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
}

static inline int pte_present(pte_t pte)
{
return pteval_present(pte_flags(pte))
}

If Xen is doing some other tricks with _PAGE_PRESENT then it might be
ruled out as an option. If so, then maybe it could still be made a
little clearer for future reference?

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index c1d406f..ff621de 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 8e4f41d..693fe00 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -654,10 +654,14 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
* (because _PAGE_PRESENT is not set).
*/
#ifndef pte_numa
+static inline int pteval_numa(pteval_t pteval)
+{
+ return (pteval & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
static inline int pte_numa(pte_t pte)
{
- return (pte_flags(pte) &
- (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+ return pteval_numa(pte_flags(pte));
}
#endif

2014-01-26 18:02:38

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Fri, Jan 24, 2014 at 8:38 AM, Mel Gorman <[email protected]> wrote:
> On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
>> >> >> <SNIP>
>> >> >>
>> >> >> This dump doesn't look dramatically different, either.
>> >> >>
>> >> >>>
>> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
>> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>> >> >>> turned on?
>> >> >>
>> >> >>
>> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>> >> >> mean not enabled at runtime?
>> >> >>
>> >> >> [1]
>> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>> >>
>> >>
>> >>
>> >> --
>> >> Elena
>>
>> I was able to reproduce this consistently, also with the latest mm
>> patches from yesterday.
>> Can you please try this:
>>
>
> Thanks Elena,
>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index ce563be..76dcf96 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
>> *mm, unsigned long addr,
>> /* Assume pteval_t is equivalent to all the other *val_t types. */
>> static pteval_t pte_mfn_to_pfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || ((val &
>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> unsigned long pfn = mfn_to_pfn(mfn);
>>
>> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>>
>> static pteval_t pte_pfn_to_mfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || ((val &
>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> pteval_t flags = val & PTE_FLAGS_MASK;
>> unsigned long mfn;
>
> Would reusing pte_present be an option? Ordinarily I expect that
> PAGE_NUMA/PAGE_PROTNONE is only set if PAGE_PRESENT is not set and pte_present
> is defined as
>
> static inline int pte_present(pte_t a)
> {
> return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> _PAGE_NUMA);
> }
>
> So it looks like it work work. Of course it would need to be split to
> reuse it within xen if pte_present was split to have a pteval_present
> helper like so
>
> static inline int pteval_present(pteval_t val)
> {
> /*
> * Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
> * way clearly states that the intent is that a protnone and numa
> * hinting ptes are considered present for the purposes of
> * pagetable operations like zapping, protection changes, gup etc.
> */
> return val & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
> }
>
> static inline int pte_present(pte_t pte)
> {
> return pteval_present(pte_flags(pte))
> }
>
> If Xen is doing some other tricks with _PAGE_PRESENT then it might be
> ruled out as an option. If so, then maybe it could still be made a
> little clearer for future reference?

Yes, sure, it should work, I tried it.
Thank you Mel.

>
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index c1d406f..ff621de 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
> /* Assume pteval_t is equivalent to all the other *val_t types. */
> static pteval_t pte_mfn_to_pfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> unsigned long pfn = mfn_to_pfn(mfn);
>
> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>
> static pteval_t pte_pfn_to_mfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> pteval_t flags = val & PTE_FLAGS_MASK;
> unsigned long mfn;
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 8e4f41d..693fe00 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -654,10 +654,14 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
> * (because _PAGE_PRESENT is not set).
> */
> #ifndef pte_numa
> +static inline int pteval_numa(pteval_t pteval)
> +{
> + return (pteval & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> +}
> +
> static inline int pte_numa(pte_t pte)
> {
> - return (pte_flags(pte) &
> - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> + return pteval_numa(pte_flags(pte));
> }
> #endif
>

--
Elena

2014-02-04 06:58:39

[permalink] [raw]

Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Sun, Jan 26, 2014 at 1:02 PM, Elena Ufimtseva <[email protected]> wrote:
> On Fri, Jan 24, 2014 at 8:38 AM, Mel Gorman <[email protected]> wrote:
>> On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
>>> >> >> <SNIP>
>>> >> >>
>>> >> >> This dump doesn't look dramatically different, either.
>>> >> >>
>>> >> >>>
>>> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
>>> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>>> >> >>> turned on?
>>> >> >>
>>> >> >>
>>> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>>> >> >> mean not enabled at runtime?
>>> >> >>
>>> >> >> [1]
>>> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Elena
>>>
>>> I was able to reproduce this consistently, also with the latest mm
>>> patches from yesterday.
>>> Can you please try this:
>>>
>>
>> Thanks Elena,
>>
>>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>>> index ce563be..76dcf96 100644
>>> --- a/arch/x86/xen/mmu.c
>>> +++ b/arch/x86/xen/mmu.c
>>> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
>>> *mm, unsigned long addr,
>>> /* Assume pteval_t is equivalent to all the other *val_t types. */
>>> static pteval_t pte_mfn_to_pfn(pteval_t val)
>>> {
>>> - if (val & _PAGE_PRESENT) {
>>> + if ((val & _PAGE_PRESENT) || ((val &
>>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>>> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>>> unsigned long pfn = mfn_to_pfn(mfn);
>>>
>>> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>>>
>>> static pteval_t pte_pfn_to_mfn(pteval_t val)
>>> {
>>> - if (val & _PAGE_PRESENT) {
>>> + if ((val & _PAGE_PRESENT) || ((val &
>>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>>> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>>> pteval_t flags = val & PTE_FLAGS_MASK;
>>> unsigned long mfn;
>>
>> Would reusing pte_present be an option? Ordinarily I expect that
>> PAGE_NUMA/PAGE_PROTNONE is only set if PAGE_PRESENT is not set and pte_present
>> is defined as
>>
>> static inline int pte_present(pte_t a)
>> {
>> return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
>> _PAGE_NUMA);
>> }
>>
>> So it looks like it work work. Of course it would need to be split to
>> reuse it within xen if pte_present was split to have a pteval_present
>> helper like so
>>
>> static inline int pteval_present(pteval_t val)
>> {
>> /*
>> * Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
>> * way clearly states that the intent is that a protnone and numa
>> * hinting ptes are considered present for the purposes of
>> * pagetable operations like zapping, protection changes, gup etc.
>> */
>> return val & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
>> }
>>
>> static inline int pte_present(pte_t pte)
>> {
>> return pteval_present(pte_flags(pte))
>> }
>>
>> If Xen is doing some other tricks with _PAGE_PRESENT then it might be
>> ruled out as an option. If so, then maybe it could still be made a
>> little clearer for future reference?
>
> Yes, sure, it should work, I tried it.
> Thank you Mel.
>
>>
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index c1d406f..ff621de 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
>> /* Assume pteval_t is equivalent to all the other *val_t types. */
>> static pteval_t pte_mfn_to_pfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
>> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> unsigned long pfn = mfn_to_pfn(mfn);
>>
>> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>>
>> static pteval_t pte_pfn_to_mfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
>> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> pteval_t flags = val & PTE_FLAGS_MASK;
>> unsigned long mfn;
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> index 8e4f41d..693fe00 100644
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -654,10 +654,14 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
>> * (because _PAGE_PRESENT is not set).
>> */
>> #ifndef pte_numa
>> +static inline int pteval_numa(pteval_t pteval)
>> +{
>> + return (pteval & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
>> +}
>> +
>> static inline int pte_numa(pte_t pte)
>> {
>> - return (pte_flags(pte) &
>> - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
>> + return pteval_numa(pte_flags(pte));
>> }
>> #endif
>>
>
>
>
> --
> Elena

Here are two variants of this change . First one adds check for
_PAGE_NUMA flag in xen pte translations.
Second adds proposed by Mel pteval_present (comments are left
untouched :) and respective patch for xen pte translation that
uses pteval_present.
Mel, you can pick any of these two if they look fine and xen
maintainers are ok with the xen change (Konrad, David?)

Subject: [PATCH] xen: add _PAGE_NUMA for pte translations

xen: add _PAGE_NUMA for pte translations

Adds check in xen guest pte addresses translations for
_PAGE_NUMA flag. This resolves reported issues and will
be essential for NUMA support in xen guest with future
vNUMA patches.

Signed-off-by: Elena Ufimtseva <[email protected]>
---
arch/x86/xen/mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 2423ef0..c804d58 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
*mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || (val & (_PAGE_NUMA|_PAGE_PRESENT))) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || (val & (_PAGE_NUMA|_PAGE_PRESENT))) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;
--
1.7.10.4

Subject: [PATCH] mm: adds pteval_present

As suggested by Mel Gorman in https://lkml.org/lkml/2014/1/24/174.
Adds pteval_present to clarify that hinting ptes are considered present
for the purposes of pagetable operations like zapping, protection changes,
gup etc.

Signed-off-by: Elena Ufimtseva <[email protected]>
---
arch/x86/include/asm/pgtable.h | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index bbc8b12..205b00d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -445,10 +445,20 @@ static inline int pte_same(pte_t a, pte_t b)
return a.pte == b.pte;
}

+static inline int pteval_present(pteval_t pteval)
+{
+ /*
+ * Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
+ * way clearly states that the intent is that a protnone and numa
+ * hinting ptes are considered present for the purposes of
+ * pagetable operations like zapping, protection changes, gup etc.
+ */
+ return pteval & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
+}
+
static inline int pte_present(pte_t a)
{
- return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
- _PAGE_NUMA);
+ return pteval_present(pte_flags(a));
}

#define pte_accessible pte_accessible
--
1.7.10.4

Subject: [PATCH] xen: use pteval_present for xen pte translations

Uses pteval_present for xen pte translations as sugested
by Mel Gorman in https://lkml.org/lkml/2014/1/24/174.
This also takes into account _PAGE_NUMA flag.

Signed-off-by: Elena Ufimtseva <[email protected]>
---
arch/x86/xen/mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 2423ef0..256282e 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
*mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if (pteval_present(val)) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if (pteval_present(val)) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;
--
1.7.10.4

2014-02-04 11:44:55