2014-01-21 23:27:19

by Steven Noonan

[permalink] [raw]
Subject: [BISECTED] Linux 3.12.7 introduces page map handling regression

A user reported a problem starting vsftpd on a Xen paravirtualized
guest, with this in dmesg:

[ 60.654862] BUG: Bad page map in process vsftpd pte:8000000493b88165 pmd:e9cc01067
[ 60.654876] page:ffffea00124ee200 count:0 mapcount:-1 mapping: (null) index:0x0
[ 60.654879] page flags: 0x2ffc0000000014(referenced|dirty)
[ 60.654885] addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping: (null) index:7f97eea74
[ 60.654890] CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
[ 60.654893] ffff880e9cc6ec38 ffff880e9cc61ca0 ffffffff814c763b 00007f97eea74000
[ 60.654900] ffff880e9cc61ce8 ffffffff8116784e 0000000000000000 0000000000000000
[ 60.654906] ffff880e9cc013a0 ffffea00124ee200 00007f97eea75000 ffff880e9cc61e10
[ 60.654912] Call Trace:
[ 60.654921] [<ffffffff814c763b>] dump_stack+0x45/0x56
[ 60.654928] [<ffffffff8116784e>] print_bad_pte+0x22e/0x250
[ 60.654933] [<ffffffff81169073>] unmap_single_vma+0x583/0x890
[ 60.654938] [<ffffffff8116a405>] unmap_vmas+0x65/0x90
[ 60.654942] [<ffffffff81173795>] exit_mmap+0xc5/0x170
[ 60.654948] [<ffffffff8105d295>] mmput+0x65/0x100
[ 60.654952] [<ffffffff81062983>] do_exit+0x393/0x9e0
[ 60.654955] [<ffffffff810630dc>] do_group_exit+0xcc/0x140
[ 60.654959] [<ffffffff81063164>] SyS_exit_group+0x14/0x20
[ 60.654965] [<ffffffff814d602d>] system_call_fastpath+0x1a/0x1f
[ 60.654968] Disabling lock debugging due to kernel taint
[ 60.655191] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
[ 60.655196] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1


The issue could not be reproduced under an HVM instance with the same
kernel, so it appears to be exclusive to paravirtual Xen guests.

I noted that it wasn't present in 3.10.27, but was present in 3.12.7 and
3.12.8. I ran through a bisection to find the root cause:

# start: 'v3.12.7' 'v3.10.27'
# bad: [4301b7a8] Linux 3.12.7
# good: [1071ea6e] Linux 3.10.27
# good: [8bb495e3] Linux 3.10
# good: [8fe73691] staging: comedi: comedi_bond: change return value
# good: [22e04f6b] Merge branch 'for-linus' of git://git.kernel.org/p
# good: [b7c09ad4] Merge branch 'for-linus' of git://git.kernel.org/p
# good: [13caa8ed] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
# good: [13caa8ed] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
# good: [f5fa9283] ipv6: reset dst.expires value when clearing expire
# good: [4af9d888] bridge: flush br's address entry in fdb when remov
# good: [8c13daf6] dm delay: fix a possible deadlock due to shared wo
# good: [93c02d70] firewire: sbp2: bring back WRITE SAME support
# good: [18065245] ACPI / PCI / hotplug: Avoid warning when _ADR not
# bad: [8807a436] mm/memory-failure.c: transfer page count from head
# bad: [fd5df800] mm: numa: avoid unnecessary disruption of NUMA hin
# good: [c18e3316] mm: numa: do not clear PMD during PTE update scan
# good: [f3b578d9] mm: numa: avoid unnecessary work on the failure pa
# bad: [3d792d61] mm: numa: clear numa hinting information on mprote
# good: [cefeb279] sched: numa: skip inaccessible VMAs
# first bad: [3d792d61] mm: numa: clear numa hinting information on mprote

If only I'd tested v3.12.0, that bisection would have been a lot shorter!


It looks like this is the change implicated (introduced in v3.12.7):

commit 3d792d616ba408ab55a54c1bb75a9367d997acfa
Author: Mel Gorman <[email protected]>
Date: Tue Jan 7 14:00:44 2014 +0000

mm: numa: clear numa hinting information on mprotect

commit 1667918b6483b12a6496bf54151b827b8235d7b1 upstream.

On a protection change it is no longer clear if the page should be still
accessible. This patch clears the NUMA hinting fault bits on a
protection change.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Alex Thorlton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>


This clearly points to breakage of mprotect() in particular. Checking
what vsftpd was doing via strace, I was able to come up with a simple
test case which triggers the issue:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

void die(const char *what)
{
perror(what);
exit(1);
}

int main(int arg, char **argv)
{
void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (p == MAP_FAILED)
die("mmap");

/* Tickle the page. */
((char *)p)[0] = 0;

if (mprotect(p, 4096, PROT_NONE) != 0)
die("mprotect");

if (mprotect(p, 4096, PROT_READ) != 0)
die("mprotect");

if (munmap(p, 4096) != 0)
die("munmap");

return 0;
}

This could probably be reduced further. I didn't spend much time on it.

Adding people cited in the patch to CC, as well as Konrad since this is
a Xen issue (I haven't been able to repro on HVM or bare metal so far).

Any ideas what's causing the BUG, and how we can fix it?

- Steven


2014-01-22 01:48:32

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Tue, Jan 21, 2014 at 03:27:08PM -0800, Steven Noonan wrote:
> A user reported a problem starting vsftpd on a Xen paravirtualized
> guest, with this in dmesg:
>
> [ 60.654862] BUG: Bad page map in process vsftpd pte:8000000493b88165 pmd:e9cc01067
> [ 60.654876] page:ffffea00124ee200 count:0 mapcount:-1 mapping: (null) index:0x0
> [ 60.654879] page flags: 0x2ffc0000000014(referenced|dirty)
> [ 60.654885] addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping: (null) index:7f97eea74
> [ 60.654890] CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
> [ 60.654893] ffff880e9cc6ec38 ffff880e9cc61ca0 ffffffff814c763b 00007f97eea74000
> [ 60.654900] ffff880e9cc61ce8 ffffffff8116784e 0000000000000000 0000000000000000
> [ 60.654906] ffff880e9cc013a0 ffffea00124ee200 00007f97eea75000 ffff880e9cc61e10
> [ 60.654912] Call Trace:
> [ 60.654921] [<ffffffff814c763b>] dump_stack+0x45/0x56
> [ 60.654928] [<ffffffff8116784e>] print_bad_pte+0x22e/0x250
> [ 60.654933] [<ffffffff81169073>] unmap_single_vma+0x583/0x890
> [ 60.654938] [<ffffffff8116a405>] unmap_vmas+0x65/0x90
> [ 60.654942] [<ffffffff81173795>] exit_mmap+0xc5/0x170
> [ 60.654948] [<ffffffff8105d295>] mmput+0x65/0x100
> [ 60.654952] [<ffffffff81062983>] do_exit+0x393/0x9e0
> [ 60.654955] [<ffffffff810630dc>] do_group_exit+0xcc/0x140
> [ 60.654959] [<ffffffff81063164>] SyS_exit_group+0x14/0x20
> [ 60.654965] [<ffffffff814d602d>] system_call_fastpath+0x1a/0x1f
> [ 60.654968] Disabling lock debugging due to kernel taint
> [ 60.655191] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
> [ 60.655196] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1
>
>
> The issue could not be reproduced under an HVM instance with the same
> kernel, so it appears to be exclusive to paravirtual Xen guests.
>
> I noted that it wasn't present in 3.10.27, but was present in 3.12.7 and
> 3.12.8. I ran through a bisection to find the root cause:
>
> # start: 'v3.12.7' 'v3.10.27'
> # bad: [4301b7a8] Linux 3.12.7
> # good: [1071ea6e] Linux 3.10.27
> # good: [8bb495e3] Linux 3.10
> # good: [8fe73691] staging: comedi: comedi_bond: change return value
> # good: [22e04f6b] Merge branch 'for-linus' of git://git.kernel.org/p
> # good: [b7c09ad4] Merge branch 'for-linus' of git://git.kernel.org/p
> # good: [13caa8ed] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
> # good: [13caa8ed] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
> # good: [f5fa9283] ipv6: reset dst.expires value when clearing expire
> # good: [4af9d888] bridge: flush br's address entry in fdb when remov
> # good: [8c13daf6] dm delay: fix a possible deadlock due to shared wo
> # good: [93c02d70] firewire: sbp2: bring back WRITE SAME support
> # good: [18065245] ACPI / PCI / hotplug: Avoid warning when _ADR not
> # bad: [8807a436] mm/memory-failure.c: transfer page count from head
> # bad: [fd5df800] mm: numa: avoid unnecessary disruption of NUMA hin
> # good: [c18e3316] mm: numa: do not clear PMD during PTE update scan
> # good: [f3b578d9] mm: numa: avoid unnecessary work on the failure pa
> # bad: [3d792d61] mm: numa: clear numa hinting information on mprote
> # good: [cefeb279] sched: numa: skip inaccessible VMAs
> # first bad: [3d792d61] mm: numa: clear numa hinting information on mprote
>
> If only I'd tested v3.12.0, that bisection would have been a lot shorter!
>
>
> It looks like this is the change implicated (introduced in v3.12.7):
>
> commit 3d792d616ba408ab55a54c1bb75a9367d997acfa
> Author: Mel Gorman <[email protected]>
> Date: Tue Jan 7 14:00:44 2014 +0000
>
> mm: numa: clear numa hinting information on mprotect
>
> commit 1667918b6483b12a6496bf54151b827b8235d7b1 upstream.
>
> On a protection change it is no longer clear if the page should be still
> accessible. This patch clears the NUMA hinting fault bits on a
> protection change.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Cc: Alex Thorlton <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
>
> This clearly points to breakage of mprotect() in particular. Checking
> what vsftpd was doing via strace, I was able to come up with a simple
> test case which triggers the issue:
>
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
>
> void die(const char *what)
> {
> perror(what);
> exit(1);
> }
>
> int main(int arg, char **argv)
> {
> void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
> if (p == MAP_FAILED)
> die("mmap");
>
> /* Tickle the page. */
> ((char *)p)[0] = 0;
>
> if (mprotect(p, 4096, PROT_NONE) != 0)
> die("mprotect");
>
> if (mprotect(p, 4096, PROT_READ) != 0)
> die("mprotect");
>
> if (munmap(p, 4096) != 0)
> die("munmap");
>
> return 0;
> }
>
> This could probably be reduced further. I didn't spend much time on it.
>
> Adding people cited in the patch to CC, as well as Konrad since this is
> a Xen issue (I haven't been able to repro on HVM or bare metal so far).

Odds are this also shows up in 3.13, right?

thanks,

greg k-h

2014-01-22 02:47:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
<[email protected]> wrote:
>
> Odds are this also shows up in 3.13, right?

Probably. I don't have a Xen PV setup to test with (and very little
interest in setting one up).. And I have a suspicion that it might not
be so much about Xen PV, as perhaps about the kind of hardware.

I suspect the issue has something to do with the magic _PAGE_NUMA
tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
removing the _PAGE_PRESENT bit, and now the crazy numa code is
confused.

The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
bit with _PAGE_PROTNONE, which is why it then has that tie-in to
_PAGE_PRESENT.

Adding Andrea to the Cc, because he's the author of that horridness.
Putting Steven's test-case here as an attachement for Andrea, maybe
that makes him go "Ahh, yes, silly case".

Also added Kirill, because he was involved the last _PAGE_NUMA debacle.

Andrea, you can find the thread on lkml, but it boils down to commit
1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
attached test-case (but apparently only under Xen PV). There it
apparently causes a "BUG: Bad page map .." error.

And I suspect this is another of those "this bug is only visible on
real numa machines, because _PAGE_NUMA isn't actually ever set
otherwise". That has pretty much guaranteed that it gets basically
zero testing, which is not a great idea when coupled with that subtle
sharing of the _PAGE_PROTNONE bit..

It may be that the whole "Xen PV" thing is a red herring, and that
Steven only sees it on that one machine because the one he runs as a
PV guest under is a real NUMA machine, and all the other machines he
has tried it on haven't been numa. So it *may* be that that "only
under Xen PV" is a red herring. But that's just a possible guess.

Christ, how I hate that _PAGE_NUMA bit. Andrea: the fact that it gets
no testing on any normal machines is a major problem. If it was simple
and straightforward and the code was "obviously correct", it wouldn't
be such a problem, but the _PAGE_NUMA code definitely does not fall
under that "simple and obviously correct" heading.

Guys, any ideas?

Linus


Attachments:
t.c (532.00 B)

2014-01-22 03:21:00

by Steven Noonan

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> <[email protected]> wrote:
> >
> > Odds are this also shows up in 3.13, right?

Reproduced using 3.13 on the PV guest:

[ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
[ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
[ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
[ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
[ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
[ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
[ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
[ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
[ 368.756815] Call Trace:
[ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
[ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
[ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
[ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
[ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
[ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
[ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
[ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
[ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
[ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
[ 368.756872] Disabling lock debugging due to kernel taint
[ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
[ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1

>
> Probably. I don't have a Xen PV setup to test with (and very little
> interest in setting one up).. And I have a suspicion that it might not
> be so much about Xen PV, as perhaps about the kind of hardware.
>
> I suspect the issue has something to do with the magic _PAGE_NUMA
> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> removing the _PAGE_PRESENT bit, and now the crazy numa code is
> confused.
>
> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> _PAGE_PRESENT.
>
> Adding Andrea to the Cc, because he's the author of that horridness.
> Putting Steven's test-case here as an attachement for Andrea, maybe
> that makes him go "Ahh, yes, silly case".
>
> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>
> Andrea, you can find the thread on lkml, but it boils down to commit
> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> attached test-case (but apparently only under Xen PV). There it
> apparently causes a "BUG: Bad page map .." error.
>
> And I suspect this is another of those "this bug is only visible on
> real numa machines, because _PAGE_NUMA isn't actually ever set
> otherwise". That has pretty much guaranteed that it gets basically
> zero testing, which is not a great idea when coupled with that subtle
> sharing of the _PAGE_PROTNONE bit..
>
> It may be that the whole "Xen PV" thing is a red herring, and that
> Steven only sees it on that one machine because the one he runs as a
> PV guest under is a real NUMA machine, and all the other machines he
> has tried it on haven't been numa. So it *may* be that that "only
> under Xen PV" is a red herring. But that's just a possible guess.

The PV and HVM guests are both on NUMA hosts, but we don't expose NUMA to the
PV guest, so it fakes a NUMA node at startup.

I've also tried running a PV guest on a dual socket host with interleaved
memory:

# dmesg | grep -i -e numa -e node
[ 0.000000] NUMA turned off
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x00000005607fffff]
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x5607fffff]
[ 0.000000] NODE_DATA [mem 0x55d4f2000-0x55d518fff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0x5607fffff]
[ 0.000000] On node 0 totalpages: 5638047
[ 0.000000] setup_percpu: NR_CPUS:4096 nr_cpumask_bits:16 nr_cpu_ids:16 nr_node_ids:1
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[ 0.010697] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
# dmesg | tail -n 21
[ 348.467265] BUG: Bad page map in process t pte:800000008a6ef165 pmd:53aa39067
[ 348.467280] page:ffffea000229bbc0 count:0 mapcount:-1 mapping: (null) index:0x0
[ 348.467286] page flags: 0x1ffc0000000014(referenced|dirty)
[ 348.467293] addr:00007f8c9fca0000 vm_flags:00100071 anon_vma:ffff88053aff19c0 mapping: (null) index:7f8c9fca0
[ 348.467301] CPU: 0 PID: 359 Comm: t Tainted: G B 3.12.8-1-ec2 #1
[ 348.467306] ffff8805396f71f8 ffff880539c49cc0 ffffffff814c77bb 00007f8c9fca0000
[ 348.467316] ffff880539c49d08 ffffffff8116788e 0000000000000000 0000000000000000
[ 348.467325] ffff88053aa39500 ffffea000229bbc0 00007f8c9fca1000 ffff880539c49e30
[ 348.467334] Call Trace:
[ 348.467346] [<ffffffff814c77bb>] dump_stack+0x45/0x56
[ 348.467355] [<ffffffff8116788e>] print_bad_pte+0x22e/0x250
[ 348.467362] [<ffffffff811690b3>] unmap_single_vma+0x583/0x890
[ 348.467369] [<ffffffff8116a445>] unmap_vmas+0x65/0x90
[ 348.467375] [<ffffffff8117051c>] unmap_region+0xac/0x120
[ 348.467382] [<ffffffff81170af9>] ? vma_rb_erase+0x1c9/0x210
[ 348.467389] [<ffffffff811726d0>] do_munmap+0x280/0x370
[ 348.467395] [<ffffffff81172801>] vm_munmap+0x41/0x60
[ 348.467404] [<ffffffff81173702>] SyS_munmap+0x22/0x30
[ 348.467413] [<ffffffff814d61ad>] system_call_fastpath+0x1a/0x1f
[ 348.470081] BUG: Bad rss-counter state mm:ffff88053a992100 idx:0 val:-1
[ 348.470091] BUG: Bad rss-counter state mm:ffff88053a992100 idx:1 val:1

As for bare metal Linux repro, I have a 2-socket Westmere box with NUMA enabled
running Linux 3.12.8. It doesn't repro:

$ sudo journalctl -b | grep -i -e node -e numa | cut -c 30-
SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x02 -> Node 0
SRAT: PXM 0 -> APIC 0x04 -> Node 0
SRAT: PXM 0 -> APIC 0x10 -> Node 0
SRAT: PXM 0 -> APIC 0x12 -> Node 0
SRAT: PXM 0 -> APIC 0x14 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 0 -> APIC 0x03 -> Node 0
SRAT: PXM 0 -> APIC 0x05 -> Node 0
SRAT: PXM 0 -> APIC 0x11 -> Node 0
SRAT: PXM 0 -> APIC 0x13 -> Node 0
SRAT: PXM 0 -> APIC 0x15 -> Node 0
SRAT: PXM 1 -> APIC 0x20 -> Node 1
SRAT: PXM 1 -> APIC 0x22 -> Node 1
SRAT: PXM 1 -> APIC 0x24 -> Node 1
SRAT: PXM 1 -> APIC 0x30 -> Node 1
SRAT: PXM 1 -> APIC 0x32 -> Node 1
SRAT: PXM 1 -> APIC 0x34 -> Node 1
SRAT: PXM 1 -> APIC 0x21 -> Node 1
SRAT: PXM 1 -> APIC 0x23 -> Node 1
SRAT: PXM 1 -> APIC 0x25 -> Node 1
SRAT: PXM 1 -> APIC 0x31 -> Node 1
SRAT: PXM 1 -> APIC 0x33 -> Node 1
SRAT: PXM 1 -> APIC 0x35 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
SRAT: Node 0 PXM 0 [mem 0x100000000-0x63fffffff]
SRAT: Node 1 PXM 1 [mem 0x640000000-0xc3fffffff]
NUMA: Initialized distance table, cnt=2
NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x63fffffff] -> [mem 0x00000000-0x63fffffff]
Initmem setup node 0 [mem 0x00000000-0x63fffffff]
NODE_DATA [mem 0x63ffd9000-0x63fffffff]
Initmem setup node 1 [mem 0x640000000-0xc3fffffff]
NODE_DATA [mem 0xc3ffd6000-0xc3fffcfff]
[ffffea0000000000-ffffea0018ffffff] PMD -> [ffff880627e00000-ffff88063fdfffff] on node 0
[ffffea0019000000-ffffea0030ffffff] PMD -> [ffff880c27600000-ffff880c3f5fffff] on node 1
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009bfff]
node 0: [mem 0x00100000-0xbf78ffff]
node 0: [mem 0x100000000-0x63fffffff]
node 1: [mem 0x640000000-0xc3fffffff]
On node 0 totalpages: 6289195
On node 1 totalpages: 6291456
setup_percpu: NR_CPUS:4096 nr_cpumask_bits:24 nr_cpu_ids:24 nr_node_ids:2
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=24, Nodes=2
Enabling automatic NUMA balancing. Configure with numa_balancing= or sysctl
Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes)
smpboot: Booting Node 0, Processors # 1 # 2 # 3 # 4 # 5 OK
smpboot: Booting Node 1, Processors # 6 # 7 # 8 # 9 # 10 # 11 OK
smpboot: Booting Node 0, Processors # 12 # 13 # 14 # 15 # 16 # 17 OK
smpboot: Booting Node 1, Processors # 18 # 19 # 20 # 21 # 22 # 23 OK
pci_bus 0000:00: on NUMA node 0 (pxm 0)
[...]
$ uname -r
3.12.8-1
$ sudo dmesg -c
$ gcc -O2 -o t t.c
$ ./t
$ dmesg
$


> Christ, how I hate that _PAGE_NUMA bit. Andrea: the fact that it gets
> no testing on any normal machines is a major problem. If it was simple
> and straightforward and the code was "obviously correct", it wouldn't
> be such a problem, but the _PAGE_NUMA code definitely does not fall
> under that "simple and obviously correct" heading.
>
> Guys, any ideas?
>
> Linus

2014-01-22 05:04:07

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> > On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> > <[email protected]> wrote:

Adding extra folks to the party.
> > >
> > > Odds are this also shows up in 3.13, right?
>
> Reproduced using 3.13 on the PV guest:
>
> [ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
> [ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
> [ 368.756815] Call Trace:
> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> [ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
> [ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
> [ 368.756872] Disabling lock debugging due to kernel taint
> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1
>
> >
> > Probably. I don't have a Xen PV setup to test with (and very little
> > interest in setting one up).. And I have a suspicion that it might not
> > be so much about Xen PV, as perhaps about the kind of hardware.
> >
> > I suspect the issue has something to do with the magic _PAGE_NUMA
> > tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> > removing the _PAGE_PRESENT bit, and now the crazy numa code is
> > confused.
> >
> > The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> > bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> > _PAGE_PRESENT.
> >
> > Adding Andrea to the Cc, because he's the author of that horridness.
> > Putting Steven's test-case here as an attachement for Andrea, maybe
> > that makes him go "Ahh, yes, silly case".
> >
> > Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
> >
> > Andrea, you can find the thread on lkml, but it boils down to commit
> > 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> > attached test-case (but apparently only under Xen PV). There it
> > apparently causes a "BUG: Bad page map .." error.

I *think* it is due to the fact that pmd_numa and pte_numa is getting the _raw_
value of PMDs and PTEs. That is - it does not use the pvops interface
and instead reads the values directly from the page-table. Since the
page-table is also manipulated by the hypervisor - there are certain
flags it also sets to do its business. It might be that it uses
_PAGE_GLOBAL as well - and Linux picks up on that. If it was using
pte_flags that would invoke the pvops interface.

Elena, Dariof and George, you guys had been looking at this a bit deeper
than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?

This not-compiled-totally-bad-patch might shed some light on what I was
thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
It does not fix it for PMDs naturally (as there are no PMD paravirt ops
for that).

The other question is - how is AutoNUMA running when it is not enabled?
Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
turned on?


diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index ce563be..9fa7088 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -370,12 +370,15 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
unsigned long pfn = mfn_to_pfn(mfn);

pteval_t flags = val & PTE_FLAGS_MASK;
+ /* No AutoNUMA for PV. TODO If Linux sees the PTE having
+ * said bit, just igore it. */
+ if (flags & _PAGE_NUMA)
+ flags = flags & ~_PAGE_NUMA;
if (unlikely(pfn == ~0))
val = flags & ~_PAGE_PRESENT;
else
val = ((pteval_t)pfn << PAGE_SHIFT) | flags;
}
-
return val;
}

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index db09234..a8bc07d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -644,7 +644,7 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#ifndef pte_numa
static inline int pte_numa(pte_t pte)
{
- return (pte_flags(pte) &
+ return (pte_val(pte) &
(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
}
#endif

2014-01-22 07:29:26

by Steven Noonan

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
> > On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> > > On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> > > <[email protected]> wrote:
>
> Adding extra folks to the party.
> > > >
> > > > Odds are this also shows up in 3.13, right?
> >
> > Reproduced using 3.13 on the PV guest:
> >
> > [ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
> > [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
> > [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
> > [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
> > [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
> > [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
> > [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
> > [ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
> > [ 368.756815] Call Trace:
> > [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
> > [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> > [ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> > [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> > [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
> > [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
> > [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
> > [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
> > [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
> > [ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
> > [ 368.756872] Disabling lock debugging due to kernel taint
> > [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
> > [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1
> >
> > >
> > > Probably. I don't have a Xen PV setup to test with (and very little
> > > interest in setting one up).. And I have a suspicion that it might not
> > > be so much about Xen PV, as perhaps about the kind of hardware.
> > >
> > > I suspect the issue has something to do with the magic _PAGE_NUMA
> > > tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> > > removing the _PAGE_PRESENT bit, and now the crazy numa code is
> > > confused.
> > >
> > > The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> > > bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> > > _PAGE_PRESENT.
> > >
> > > Adding Andrea to the Cc, because he's the author of that horridness.
> > > Putting Steven's test-case here as an attachement for Andrea, maybe
> > > that makes him go "Ahh, yes, silly case".
> > >
> > > Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
> > >
> > > Andrea, you can find the thread on lkml, but it boils down to commit
> > > 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> > > attached test-case (but apparently only under Xen PV). There it
> > > apparently causes a "BUG: Bad page map .." error.
>
> I *think* it is due to the fact that pmd_numa and pte_numa is getting the _raw_
> value of PMDs and PTEs. That is - it does not use the pvops interface
> and instead reads the values directly from the page-table. Since the
> page-table is also manipulated by the hypervisor - there are certain
> flags it also sets to do its business. It might be that it uses
> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
> pte_flags that would invoke the pvops interface.
>
> Elena, Dariof and George, you guys had been looking at this a bit deeper
> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>
> This not-compiled-totally-bad-patch might shed some light on what I was
> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
> for that).

Unfortunately the Totally Bad Patch seems to make no difference. I am
still able to repro the issue:

[ 346.374929] BUG: Bad page map in process mp pte:80000004ae928065 pmd:e993f9067
[ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping: (null) index:0x0
[ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
[ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071 anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
[ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+ #1
[ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768 00007f06a9bbb000
[ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000 0000000000000000
[ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000 ffff880e991a3e30
[ 346.374979] Call Trace:
[ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
[ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
[ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
[ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
[ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
[ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
[ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
[ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
[ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
[ 346.375034] [<ffffffff814e712d>] system_call_fastpath+0x1a/0x1f
[ 346.375037] Disabling lock debugging due to kernel taint
[ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:0 val:-1
[ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:1 val:1

This dump doesn't look dramatically different, either.

>
> The other question is - how is AutoNUMA running when it is not enabled?
> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
> turned on?

Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
mean not enabled at runtime?

[1] http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64

2014-01-22 14:30:18

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On 01/22/2014 08:29 AM, Steven Noonan wrote:
> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>>>> <[email protected]> wrote:
>>
>> Adding extra folks to the party.
>>>>>
>>>>> Odds are this also shows up in 3.13, right?
>>>
>>> Reproduced using 3.13 on the PV guest:
>>>
>>> [ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
>>> [ 368.756815] Call Trace:
>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>> [ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>>> [ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
>>> [ 368.756872] Disabling lock debugging due to kernel taint
>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1
>>>
>>>>
>>>> Probably. I don't have a Xen PV setup to test with (and very little
>>>> interest in setting one up).. And I have a suspicion that it might not
>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>>>>
>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>>>> confused.
>>>>
>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>>>> _PAGE_PRESENT.
>>>>
>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>>>> that makes him go "Ahh, yes, silly case".
>>>>
>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>>>>
>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>>>> attached test-case (but apparently only under Xen PV). There it
>>>> apparently causes a "BUG: Bad page map .." error.
>>
>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the _raw_
>> value of PMDs and PTEs. That is - it does not use the pvops interface
>> and instead reads the values directly from the page-table. Since the
>> page-table is also manipulated by the hypervisor - there are certain
>> flags it also sets to do its business. It might be that it uses
>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>> pte_flags that would invoke the pvops interface.
>>
>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>>
>> This not-compiled-totally-bad-patch might shed some light on what I was
>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>> for that).
>
> Unfortunately the Totally Bad Patch seems to make no difference. I am
> still able to repro the issue:

Maybe this one is also related to this BUG here (cc'ed people investigating
this one) ...

https://lkml.org/lkml/2014/1/10/427

... not sure, though.

> [ 346.374929] BUG: Bad page map in process mp pte:80000004ae928065 pmd:e993f9067
> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping: (null) index:0x0
> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071 anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+ #1
> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768 00007f06a9bbb000
> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000 0000000000000000
> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000 ffff880e991a3e30
> [ 346.374979] Call Trace:
> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
> [ 346.375034] [<ffffffff814e712d>] system_call_fastpath+0x1a/0x1f
> [ 346.375037] Disabling lock debugging due to kernel taint
> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:0 val:-1
> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:1 val:1
>
> This dump doesn't look dramatically different, either.
>
>>
>> The other question is - how is AutoNUMA running when it is not enabled?
>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>> turned on?
>
> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
> mean not enabled at runtime?
>
> [1] http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64

2014-01-22 18:08:00

by Rik van Riel

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On 01/21/2014 09:47 PM, Linus Torvalds wrote:
> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> <[email protected]> wrote:
>>
>> Odds are this also shows up in 3.13, right?
>
> Probably. I don't have a Xen PV setup to test with (and very little
> interest in setting one up).. And I have a suspicion that it might not
> be so much about Xen PV, as perhaps about the kind of hardware.
>
> I suspect the issue has something to do with the magic _PAGE_NUMA
> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> removing the _PAGE_PRESENT bit, and now the crazy numa code is
> confused.
>
> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> _PAGE_PRESENT.

The numa balancing code should clear _PAGE_PRESENT and
set _PAGE_NUMA / _PAGE_PROTNONE.

The difference between a numa pte and a protnone pte is
the VMA permissions.

When the VMA is protnone, do_page_fault will kill the
app with a segfault. When the VMA has proper permissions,
handle_pte_fault will call do_numa_page, and numa-y things
are done.



>
> Adding Andrea to the Cc, because he's the author of that horridness.
> Putting Steven's test-case here as an attachement for Andrea, maybe
> that makes him go "Ahh, yes, silly case".
>
> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>
> Andrea, you can find the thread on lkml, but it boils down to commit
> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> attached test-case (but apparently only under Xen PV). There it
> apparently causes a "BUG: Bad page map .." error.
>
> And I suspect this is another of those "this bug is only visible on
> real numa machines, because _PAGE_NUMA isn't actually ever set
> otherwise". That has pretty much guaranteed that it gets basically
> zero testing, which is not a great idea when coupled with that subtle
> sharing of the _PAGE_PROTNONE bit..
>
> It may be that the whole "Xen PV" thing is a red herring, and that
> Steven only sees it on that one machine because the one he runs as a
> PV guest under is a real NUMA machine, and all the other machines he
> has tried it on haven't been numa. So it *may* be that that "only
> under Xen PV" is a red herring. But that's just a possible guess.
>
> Christ, how I hate that _PAGE_NUMA bit. Andrea: the fact that it gets
> no testing on any normal machines is a major problem. If it was simple
> and straightforward and the code was "obviously correct", it wouldn't
> be such a problem, but the _PAGE_NUMA code definitely does not fall
> under that "simple and obviously correct" heading.
>
> Guys, any ideas?
>
> Linus
>

2014-01-22 18:24:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 10:07 AM, Rik van Riel <[email protected]> wrote:
>
> The difference between a numa pte and a protnone pte is
> the VMA permissions.

If that is indeed the only difference, then we should damn well get
rid of that f*cking stupid _PAGE_NUMA name entirely.

It's misleading crap. Really. Just do a quick grep for that bit, and
you see just *how* confused people are about it:

#define _PAGE_NUMA _PAGE_PROTNONE
...
if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &

think about it. Just *THINK* about how broken that code is. The whole
thing is a disaster. _PAGE_NUMA must die. It's shit.

Linus

2014-01-22 18:39:42

by Rik van Riel

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On 01/22/2014 01:24 PM, Linus Torvalds wrote:
> On Wed, Jan 22, 2014 at 10:07 AM, Rik van Riel <[email protected]> wrote:
>>
>> The difference between a numa pte and a protnone pte is
>> the VMA permissions.
>
> If that is indeed the only difference, then we should damn well get
> rid of that f*cking stupid _PAGE_NUMA name entirely.
>
> It's misleading crap. Really. Just do a quick grep for that bit, and
> you see just *how* confused people are about it:
>
> #define _PAGE_NUMA _PAGE_PROTNONE
> ...
> if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &
>
> think about it. Just *THINK* about how broken that code is. The whole
> thing is a disaster. _PAGE_NUMA must die. It's shit.

The reason things are this way is that we were
not sure whether we can indeed use _PAGE_PROTNONE
for NUMA balancing on all architectures.

If we are sure that _PAGE_PROTNONE can be used
everywhere, I agree we should get rid of the whole
_PAGE_NUMA naming, and replace that ambiguous
code with some comments and documentation instead.

2014-01-22 20:18:54

by Elena Ufimtseva

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
> On 01/22/2014 08:29 AM, Steven Noonan wrote:
>>
>> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>>>
>>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>>>>
>>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>>>>>
>>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>>>>> <[email protected]> wrote:
>>>
>>>
>>> Adding extra folks to the party.
>>>>>>
>>>>>>
>>>>>> Odds are this also shows up in 3.13, right?
>>>>
>>>>
>>>> Reproduced using 3.13 on the PV guest:
>>>>
>>>> [ 368.756763] BUG: Bad page map in process mp
>>>> pte:80000004a67c6165 pmd:e9b706067
>>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
>>>> mapping: (null) index:0x0
>>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
>>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
>>>> #1
>>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
>>>> ffffffff814d8748 00007fd1388b7000
>>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
>>>> 0000000000000000 0000000000000000
>>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
>>>> 00007fd1388b8000 ffff880e9eaf3e30
>>>> [ 368.756815] Call Trace:
>>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>>> [ 368.756837] [<ffffffff8116eae3>]
>>>> unmap_single_vma+0x583/0x890
>>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>>>> [ 368.756869] [<ffffffff814e70ed>]
>>>> system_call_fastpath+0x1a/0x1f
>>>> [ 368.756872] Disabling lock debugging due to kernel taint
>>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
>>>> idx:0 val:-1
>>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
>>>> idx:1 val:1
>>>>
>>>>>
>>>>> Probably. I don't have a Xen PV setup to test with (and very little
>>>>> interest in setting one up).. And I have a suspicion that it might not
>>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>>>>>
>>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>>>>> confused.
>>>>>
>>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>>>>> _PAGE_PRESENT.
>>>>>
>>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>>>>> that makes him go "Ahh, yes, silly case".
>>>>>
>>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>>>>>
>>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>>>>> attached test-case (but apparently only under Xen PV). There it
>>>>> apparently causes a "BUG: Bad page map .." error.
>>>
>>>
>>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
>>> _raw_
>>> value of PMDs and PTEs. That is - it does not use the pvops interface
>>> and instead reads the values directly from the page-table. Since the
>>> page-table is also manipulated by the hypervisor - there are certain
>>> flags it also sets to do its business. It might be that it uses
>>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>>> pte_flags that would invoke the pvops interface.
>>>
>>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>>>
>>> This not-compiled-totally-bad-patch might shed some light on what I was
>>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>>> for that).
>>
>>
>> Unfortunately the Totally Bad Patch seems to make no difference. I am
>> still able to repro the issue:

Steven, do you use numa=fake on boot cmd line for pv guest?

I had similar issue on pv guest. Let me check if the fix that resolved
this for me will help with 3.13.


>
>
> Maybe this one is also related to this BUG here (cc'ed people investigating
> this one) ...
>
> https://lkml.org/lkml/2014/1/10/427
>
> ... not sure, though.
>
>
>> [ 346.374929] BUG: Bad page map in process mp
>> pte:80000004ae928065 pmd:e993f9067
>> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
>> (null) index:0x0
>> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
>> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
>> #1
>> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
>> 00007f06a9bbb000
>> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
>> 0000000000000000
>> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
>> ffff880e991a3e30
>> [ 346.374979] Call Trace:
>> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>> [ 346.375034] [<ffffffff814e712d>]
>> system_call_fastpath+0x1a/0x1f
>> [ 346.375037] Disabling lock debugging due to kernel taint
>> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> idx:0 val:-1
>> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> idx:1 val:1
>>
>> This dump doesn't look dramatically different, either.
>>
>>>
>>> The other question is - how is AutoNUMA running when it is not enabled?
>>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>>> turned on?
>>
>>
>> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>> mean not enabled at runtime?
>>
>> [1]
>> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64



--
Elena

2014-01-22 20:33:47

by Steven Noonan

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
> > On 01/22/2014 08:29 AM, Steven Noonan wrote:
> >>
> >> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
> >>>
> >>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
> >>>>
> >>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> >>>>>
> >>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> >>>>> <[email protected]> wrote:
> >>>
> >>>
> >>> Adding extra folks to the party.
> >>>>>>
> >>>>>>
> >>>>>> Odds are this also shows up in 3.13, right?
> >>>>
> >>>>
> >>>> Reproduced using 3.13 on the PV guest:
> >>>>
> >>>> [ 368.756763] BUG: Bad page map in process mp
> >>>> pte:80000004a67c6165 pmd:e9b706067
> >>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
> >>>> mapping: (null) index:0x0
> >>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
> >>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
> >>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
> >>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
> >>>> #1
> >>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
> >>>> ffffffff814d8748 00007fd1388b7000
> >>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
> >>>> 0000000000000000 0000000000000000
> >>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
> >>>> 00007fd1388b8000 ffff880e9eaf3e30
> >>>> [ 368.756815] Call Trace:
> >>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
> >>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> >>>> [ 368.756837] [<ffffffff8116eae3>]
> >>>> unmap_single_vma+0x583/0x890
> >>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> >>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
> >>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
> >>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
> >>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
> >>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
> >>>> [ 368.756869] [<ffffffff814e70ed>]
> >>>> system_call_fastpath+0x1a/0x1f
> >>>> [ 368.756872] Disabling lock debugging due to kernel taint
> >>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
> >>>> idx:0 val:-1
> >>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
> >>>> idx:1 val:1
> >>>>
> >>>>>
> >>>>> Probably. I don't have a Xen PV setup to test with (and very little
> >>>>> interest in setting one up).. And I have a suspicion that it might not
> >>>>> be so much about Xen PV, as perhaps about the kind of hardware.
> >>>>>
> >>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
> >>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> >>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
> >>>>> confused.
> >>>>>
> >>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> >>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> >>>>> _PAGE_PRESENT.
> >>>>>
> >>>>> Adding Andrea to the Cc, because he's the author of that horridness.
> >>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
> >>>>> that makes him go "Ahh, yes, silly case".
> >>>>>
> >>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
> >>>>>
> >>>>> Andrea, you can find the thread on lkml, but it boils down to commit
> >>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> >>>>> attached test-case (but apparently only under Xen PV). There it
> >>>>> apparently causes a "BUG: Bad page map .." error.
> >>>
> >>>
> >>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
> >>> _raw_
> >>> value of PMDs and PTEs. That is - it does not use the pvops interface
> >>> and instead reads the values directly from the page-table. Since the
> >>> page-table is also manipulated by the hypervisor - there are certain
> >>> flags it also sets to do its business. It might be that it uses
> >>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
> >>> pte_flags that would invoke the pvops interface.
> >>>
> >>> Elena, Dariof and George, you guys had been looking at this a bit deeper
> >>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
> >>>
> >>> This not-compiled-totally-bad-patch might shed some light on what I was
> >>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
> >>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
> >>> for that).
> >>
> >>
> >> Unfortunately the Totally Bad Patch seems to make no difference. I am
> >> still able to repro the issue:
>
> Steven, do you use numa=fake on boot cmd line for pv guest?
>
> I had similar issue on pv guest. Let me check if the fix that resolved
> this for me will help with 3.13.

Nope:

# cat /proc/cmdline
root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7

>
> >
> >
> > Maybe this one is also related to this BUG here (cc'ed people investigating
> > this one) ...
> >
> > https://lkml.org/lkml/2014/1/10/427
> >
> > ... not sure, though.
> >
> >
> >> [ 346.374929] BUG: Bad page map in process mp
> >> pte:80000004ae928065 pmd:e993f9067
> >> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
> >> (null) index:0x0
> >> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
> >> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
> >> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
> >> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
> >> #1
> >> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
> >> 00007f06a9bbb000
> >> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
> >> 0000000000000000
> >> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
> >> ffff880e991a3e30
> >> [ 346.374979] Call Trace:
> >> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
> >> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> >> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> >> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> >> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
> >> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
> >> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
> >> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
> >> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
> >> [ 346.375034] [<ffffffff814e712d>]
> >> system_call_fastpath+0x1a/0x1f
> >> [ 346.375037] Disabling lock debugging due to kernel taint
> >> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
> >> idx:0 val:-1
> >> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
> >> idx:1 val:1
> >>
> >> This dump doesn't look dramatically different, either.
> >>
> >>>
> >>> The other question is - how is AutoNUMA running when it is not enabled?
> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
> >>> turned on?
> >>
> >>
> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
> >> mean not enabled at runtime?
> >>
> >> [1]
> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>
>
>
> --
> Elena

2014-01-23 16:23:42

by Elena Ufimtseva

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <[email protected]> wrote:
> On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
>> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
>> > On 01/22/2014 08:29 AM, Steven Noonan wrote:
>> >>
>> >> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>> >>>
>> >>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>> >>>>
>> >>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>> >>>>>
>> >>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>> >>>>> <[email protected]> wrote:
>> >>>
>> >>>
>> >>> Adding extra folks to the party.
>> >>>>>>
>> >>>>>>
>> >>>>>> Odds are this also shows up in 3.13, right?
>> >>>>
>> >>>>
>> >>>> Reproduced using 3.13 on the PV guest:
>> >>>>
>> >>>> [ 368.756763] BUG: Bad page map in process mp
>> >>>> pte:80000004a67c6165 pmd:e9b706067
>> >>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
>> >>>> mapping: (null) index:0x0
>> >>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>> >>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
>> >>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>> >>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
>> >>>> #1
>> >>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
>> >>>> ffffffff814d8748 00007fd1388b7000
>> >>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
>> >>>> 0000000000000000 0000000000000000
>> >>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
>> >>>> 00007fd1388b8000 ffff880e9eaf3e30
>> >>>> [ 368.756815] Call Trace:
>> >>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>> >>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> >>>> [ 368.756837] [<ffffffff8116eae3>]
>> >>>> unmap_single_vma+0x583/0x890
>> >>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> >>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>> >>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>> >>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>> >>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>> >>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>> >>>> [ 368.756869] [<ffffffff814e70ed>]
>> >>>> system_call_fastpath+0x1a/0x1f
>> >>>> [ 368.756872] Disabling lock debugging due to kernel taint
>> >>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
>> >>>> idx:0 val:-1
>> >>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
>> >>>> idx:1 val:1
>> >>>>
>> >>>>>
>> >>>>> Probably. I don't have a Xen PV setup to test with (and very little
>> >>>>> interest in setting one up).. And I have a suspicion that it might not
>> >>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>> >>>>>
>> >>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>> >>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>> >>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>> >>>>> confused.
>> >>>>>
>> >>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>> >>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>> >>>>> _PAGE_PRESENT.
>> >>>>>
>> >>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>> >>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>> >>>>> that makes him go "Ahh, yes, silly case".
>> >>>>>
>> >>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>> >>>>>
>> >>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>> >>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>> >>>>> attached test-case (but apparently only under Xen PV). There it
>> >>>>> apparently causes a "BUG: Bad page map .." error.
>> >>>
>> >>>
>> >>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
>> >>> _raw_
>> >>> value of PMDs and PTEs. That is - it does not use the pvops interface
>> >>> and instead reads the values directly from the page-table. Since the
>> >>> page-table is also manipulated by the hypervisor - there are certain
>> >>> flags it also sets to do its business. It might be that it uses
>> >>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>> >>> pte_flags that would invoke the pvops interface.
>> >>>
>> >>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>> >>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?

It does use _PAGE_GLOBAL for guest user pages

>> >>>
>> >>> This not-compiled-totally-bad-patch might shed some light on what I was
>> >>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>> >>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>> >>> for that).
>> >>
>> >>
>> >> Unfortunately the Totally Bad Patch seems to make no difference. I am
>> >> still able to repro the issue:
>>
>> Steven, do you use numa=fake on boot cmd line for pv guest?
>>
>> I had similar issue on pv guest. Let me check if the fix that resolved
>> this for me will help with 3.13.
>
> Nope:
>
> # cat /proc/cmdline
> root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7

>
>>
>> >
>> >
>> > Maybe this one is also related to this BUG here (cc'ed people investigating
>> > this one) ...
>> >
>> > https://lkml.org/lkml/2014/1/10/427
>> >
>> > ... not sure, though.
>> >
>> >
>> >> [ 346.374929] BUG: Bad page map in process mp
>> >> pte:80000004ae928065 pmd:e993f9067
>> >> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
>> >> (null) index:0x0
>> >> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>> >> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
>> >> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>> >> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
>> >> #1
>> >> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
>> >> 00007f06a9bbb000
>> >> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
>> >> 0000000000000000
>> >> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
>> >> ffff880e991a3e30
>> >> [ 346.374979] Call Trace:
>> >> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>> >> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> >> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>> >> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> >> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>> >> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>> >> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>> >> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>> >> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>> >> [ 346.375034] [<ffffffff814e712d>]
>> >> system_call_fastpath+0x1a/0x1f
>> >> [ 346.375037] Disabling lock debugging due to kernel taint
>> >> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> >> idx:0 val:-1
>> >> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> >> idx:1 val:1
>> >>
>> >> This dump doesn't look dramatically different, either.
>> >>
>> >>>
>> >>> The other question is - how is AutoNUMA running when it is not enabled?
>> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>> >>> turned on?
>> >>
>> >>
>> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>> >> mean not enabled at runtime?
>> >>
>> >> [1]
>> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>>
>>
>>
>> --
>> Elena

I was able to reproduce this consistently, also with the latest mm
patches from yesterday.
Can you please try this:

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index ce563be..76dcf96 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
*mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || ((val &
(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || ((val &
(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;

2014-01-23 17:03:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Tue, Jan 21, 2014 at 03:27:08PM -0800, Steven Noonan wrote:
> A user reported a problem starting vsftpd on a Xen paravirtualized
> guest, with this in dmesg:
>
> [ 60.654862] BUG: Bad page map in process vsftpd pte:8000000493b88165 pmd:e9cc01067
> [ 60.654876] page:ffffea00124ee200 count:0 mapcount:-1 mapping: (null) index:0x0
> [ 60.654879] page flags: 0x2ffc0000000014(referenced|dirty)
> [ 60.654885] addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping: (null) index:7f97eea74
> [ 60.654890] CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
> [ 60.654893] ffff880e9cc6ec38 ffff880e9cc61ca0 ffffffff814c763b 00007f97eea74000
> [ 60.654900] ffff880e9cc61ce8 ffffffff8116784e 0000000000000000 0000000000000000
> [ 60.654906] ffff880e9cc013a0 ffffea00124ee200 00007f97eea75000 ffff880e9cc61e10
> [ 60.654912] Call Trace:
> [ 60.654921] [<ffffffff814c763b>] dump_stack+0x45/0x56
> [ 60.654928] [<ffffffff8116784e>] print_bad_pte+0x22e/0x250
> [ 60.654933] [<ffffffff81169073>] unmap_single_vma+0x583/0x890
> [ 60.654938] [<ffffffff8116a405>] unmap_vmas+0x65/0x90
> [ 60.654942] [<ffffffff81173795>] exit_mmap+0xc5/0x170
> [ 60.654948] [<ffffffff8105d295>] mmput+0x65/0x100
> [ 60.654952] [<ffffffff81062983>] do_exit+0x393/0x9e0
> [ 60.654955] [<ffffffff810630dc>] do_group_exit+0xcc/0x140
> [ 60.654959] [<ffffffff81063164>] SyS_exit_group+0x14/0x20
> [ 60.654965] [<ffffffff814d602d>] system_call_fastpath+0x1a/0x1f
> [ 60.654968] Disabling lock debugging due to kernel taint
> [ 60.655191] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
> [ 60.655196] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1
>
>
> The issue could not be reproduced under an HVM instance with the same
> kernel, so it appears to be exclusive to paravirtual Xen guests.
>
> I noted that it wasn't present in 3.10.27, but was present in 3.12.7 and
> 3.12.8. I ran through a bisection to find the root cause:
>
> # start: 'v3.12.7' 'v3.10.27'
> # bad: [4301b7a8] Linux 3.12.7
> # good: [1071ea6e] Linux 3.10.27
> # good: [8bb495e3] Linux 3.10
> # good: [8fe73691] staging: comedi: comedi_bond: change return value
> # good: [22e04f6b] Merge branch 'for-linus' of git://git.kernel.org/p
> # good: [b7c09ad4] Merge branch 'for-linus' of git://git.kernel.org/p
> # good: [13caa8ed] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
> # good: [13caa8ed] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
> # good: [f5fa9283] ipv6: reset dst.expires value when clearing expire
> # good: [4af9d888] bridge: flush br's address entry in fdb when remov
> # good: [8c13daf6] dm delay: fix a possible deadlock due to shared wo
> # good: [93c02d70] firewire: sbp2: bring back WRITE SAME support
> # good: [18065245] ACPI / PCI / hotplug: Avoid warning when _ADR not
> # bad: [8807a436] mm/memory-failure.c: transfer page count from head
> # bad: [fd5df800] mm: numa: avoid unnecessary disruption of NUMA hin
> # good: [c18e3316] mm: numa: do not clear PMD during PTE update scan
> # good: [f3b578d9] mm: numa: avoid unnecessary work on the failure pa
> # bad: [3d792d61] mm: numa: clear numa hinting information on mprote
> # good: [cefeb279] sched: numa: skip inaccessible VMAs
> # first bad: [3d792d61] mm: numa: clear numa hinting information on mprote
>
> If only I'd tested v3.12.0, that bisection would have been a lot shorter!
>
>
> It looks like this is the change implicated (introduced in v3.12.7):
>
> commit 3d792d616ba408ab55a54c1bb75a9367d997acfa
> Author: Mel Gorman <[email protected]>
> Date: Tue Jan 7 14:00:44 2014 +0000
>
> mm: numa: clear numa hinting information on mprotect
>
> commit 1667918b6483b12a6496bf54151b827b8235d7b1 upstream.
>
> On a protection change it is no longer clear if the page should be still
> accessible. This patch clears the NUMA hinting fault bits on a
> protection change.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Cc: Alex Thorlton <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>

Sorry for the long delay in responding. I've only managed to get as far as
this mail now and it's already my evening and I won't be able to work on
it now either. Based on the patch a possible explanation is that I called
the page table mknonnuma helpers on a non-numa PTE and there is now a
mixup between the _PAGE_PRESENT and _PAGE_PROTNONE bits. It feels like a
bad fit though and would not explain why it only affects Xen. A slightly
better fit would be if the page table helpers for numa hinting flags are
colliding badly with the paravirt interface somehow. A git grep for
_PAGE_PROTNONE on xen showed up nothing but _PAGE_PRESENT has special
meaning and that might be what I trashed.

--
Mel Gorman
SUSE Labs

2014-01-23 23:20:22

by Steven Noonan

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
> On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <[email protected]> wrote:
> > On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
> >> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
> >> > On 01/22/2014 08:29 AM, Steven Noonan wrote:
> >> >>
> >> >> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
> >> >>>
> >> >>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
> >> >>>>
> >> >>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
> >> >>>>>
> >> >>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
> >> >>>>> <[email protected]> wrote:
> >> >>>
> >> >>>
> >> >>> Adding extra folks to the party.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Odds are this also shows up in 3.13, right?
> >> >>>>
> >> >>>>
> >> >>>> Reproduced using 3.13 on the PV guest:
> >> >>>>
> >> >>>> [ 368.756763] BUG: Bad page map in process mp
> >> >>>> pte:80000004a67c6165 pmd:e9b706067
> >> >>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
> >> >>>> mapping: (null) index:0x0
> >> >>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
> >> >>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
> >> >>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
> >> >>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
> >> >>>> #1
> >> >>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
> >> >>>> ffffffff814d8748 00007fd1388b7000
> >> >>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
> >> >>>> 0000000000000000 0000000000000000
> >> >>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
> >> >>>> 00007fd1388b8000 ffff880e9eaf3e30
> >> >>>> [ 368.756815] Call Trace:
> >> >>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
> >> >>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> >> >>>> [ 368.756837] [<ffffffff8116eae3>]
> >> >>>> unmap_single_vma+0x583/0x890
> >> >>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> >> >>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
> >> >>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
> >> >>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
> >> >>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
> >> >>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
> >> >>>> [ 368.756869] [<ffffffff814e70ed>]
> >> >>>> system_call_fastpath+0x1a/0x1f
> >> >>>> [ 368.756872] Disabling lock debugging due to kernel taint
> >> >>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
> >> >>>> idx:0 val:-1
> >> >>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
> >> >>>> idx:1 val:1
> >> >>>>
> >> >>>>>
> >> >>>>> Probably. I don't have a Xen PV setup to test with (and very little
> >> >>>>> interest in setting one up).. And I have a suspicion that it might not
> >> >>>>> be so much about Xen PV, as perhaps about the kind of hardware.
> >> >>>>>
> >> >>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
> >> >>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
> >> >>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
> >> >>>>> confused.
> >> >>>>>
> >> >>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
> >> >>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
> >> >>>>> _PAGE_PRESENT.
> >> >>>>>
> >> >>>>> Adding Andrea to the Cc, because he's the author of that horridness.
> >> >>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
> >> >>>>> that makes him go "Ahh, yes, silly case".
> >> >>>>>
> >> >>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
> >> >>>>>
> >> >>>>> Andrea, you can find the thread on lkml, but it boils down to commit
> >> >>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
> >> >>>>> attached test-case (but apparently only under Xen PV). There it
> >> >>>>> apparently causes a "BUG: Bad page map .." error.
> >> >>>
> >> >>>
> >> >>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
> >> >>> _raw_
> >> >>> value of PMDs and PTEs. That is - it does not use the pvops interface
> >> >>> and instead reads the values directly from the page-table. Since the
> >> >>> page-table is also manipulated by the hypervisor - there are certain
> >> >>> flags it also sets to do its business. It might be that it uses
> >> >>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
> >> >>> pte_flags that would invoke the pvops interface.
> >> >>>
> >> >>> Elena, Dariof and George, you guys had been looking at this a bit deeper
> >> >>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>
> It does use _PAGE_GLOBAL for guest user pages
>
> >> >>>
> >> >>> This not-compiled-totally-bad-patch might shed some light on what I was
> >> >>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
> >> >>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
> >> >>> for that).
> >> >>
> >> >>
> >> >> Unfortunately the Totally Bad Patch seems to make no difference. I am
> >> >> still able to repro the issue:
> >>
> >> Steven, do you use numa=fake on boot cmd line for pv guest?
> >>
> >> I had similar issue on pv guest. Let me check if the fix that resolved
> >> this for me will help with 3.13.
> >
> > Nope:
> >
> > # cat /proc/cmdline
> > root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7
>
> >
> >>
> >> >
> >> >
> >> > Maybe this one is also related to this BUG here (cc'ed people investigating
> >> > this one) ...
> >> >
> >> > https://lkml.org/lkml/2014/1/10/427
> >> >
> >> > ... not sure, though.
> >> >
> >> >
> >> >> [ 346.374929] BUG: Bad page map in process mp
> >> >> pte:80000004ae928065 pmd:e993f9067
> >> >> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
> >> >> (null) index:0x0
> >> >> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
> >> >> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
> >> >> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
> >> >> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
> >> >> #1
> >> >> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
> >> >> 00007f06a9bbb000
> >> >> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
> >> >> 0000000000000000
> >> >> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
> >> >> ffff880e991a3e30
> >> >> [ 346.374979] Call Trace:
> >> >> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
> >> >> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
> >> >> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
> >> >> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
> >> >> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
> >> >> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
> >> >> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
> >> >> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
> >> >> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
> >> >> [ 346.375034] [<ffffffff814e712d>]
> >> >> system_call_fastpath+0x1a/0x1f
> >> >> [ 346.375037] Disabling lock debugging due to kernel taint
> >> >> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
> >> >> idx:0 val:-1
> >> >> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
> >> >> idx:1 val:1
> >> >>
> >> >> This dump doesn't look dramatically different, either.
> >> >>
> >> >>>
> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
> >> >>> turned on?
> >> >>
> >> >>
> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
> >> >> mean not enabled at runtime?
> >> >>
> >> >> [1]
> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
> >>
> >>
> >>
> >> --
> >> Elena
>
> I was able to reproduce this consistently, also with the latest mm
> patches from yesterday.
> Can you please try this:
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index ce563be..76dcf96 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
> *mm, unsigned long addr,
> /* Assume pteval_t is equivalent to all the other *val_t types. */
> static pteval_t pte_mfn_to_pfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> unsigned long pfn = mfn_to_pfn(mfn);
>
> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>
> static pteval_t pte_pfn_to_mfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> pteval_t flags = val & PTE_FLAGS_MASK;
> unsigned long mfn;

Thanks Elena, I just tested that and it does un-break things.

- Steven

2014-01-24 04:28:03

by Elena Ufimtseva

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Thu, Jan 23, 2014 at 6:20 PM, Steven Noonan <[email protected]> wrote:
> On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
>> On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <[email protected]> wrote:
>> > On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
>> >> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
>> >> > On 01/22/2014 08:29 AM, Steven Noonan wrote:
>> >> >>
>> >> >> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>> >> >>>
>> >> >>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>> >> >>>>
>> >> >>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>> >> >>>>>
>> >> >>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>> >> >>>>> <[email protected]> wrote:
>> >> >>>
>> >> >>>
>> >> >>> Adding extra folks to the party.
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> Odds are this also shows up in 3.13, right?
>> >> >>>>
>> >> >>>>
>> >> >>>> Reproduced using 3.13 on the PV guest:
>> >> >>>>
>> >> >>>> [ 368.756763] BUG: Bad page map in process mp
>> >> >>>> pte:80000004a67c6165 pmd:e9b706067
>> >> >>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
>> >> >>>> mapping: (null) index:0x0
>> >> >>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>> >> >>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
>> >> >>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>> >> >>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
>> >> >>>> #1
>> >> >>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
>> >> >>>> ffffffff814d8748 00007fd1388b7000
>> >> >>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
>> >> >>>> 0000000000000000 0000000000000000
>> >> >>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
>> >> >>>> 00007fd1388b8000 ffff880e9eaf3e30
>> >> >>>> [ 368.756815] Call Trace:
>> >> >>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>> >> >>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> >> >>>> [ 368.756837] [<ffffffff8116eae3>]
>> >> >>>> unmap_single_vma+0x583/0x890
>> >> >>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> >> >>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>> >> >>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>> >> >>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>> >> >>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>> >> >>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>> >> >>>> [ 368.756869] [<ffffffff814e70ed>]
>> >> >>>> system_call_fastpath+0x1a/0x1f
>> >> >>>> [ 368.756872] Disabling lock debugging due to kernel taint
>> >> >>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
>> >> >>>> idx:0 val:-1
>> >> >>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
>> >> >>>> idx:1 val:1
>> >> >>>>
>> >> >>>>>
>> >> >>>>> Probably. I don't have a Xen PV setup to test with (and very little
>> >> >>>>> interest in setting one up).. And I have a suspicion that it might not
>> >> >>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>> >> >>>>>
>> >> >>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>> >> >>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>> >> >>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>> >> >>>>> confused.
>> >> >>>>>
>> >> >>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>> >> >>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>> >> >>>>> _PAGE_PRESENT.
>> >> >>>>>
>> >> >>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>> >> >>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>> >> >>>>> that makes him go "Ahh, yes, silly case".
>> >> >>>>>
>> >> >>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>> >> >>>>>
>> >> >>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>> >> >>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>> >> >>>>> attached test-case (but apparently only under Xen PV). There it
>> >> >>>>> apparently causes a "BUG: Bad page map .." error.
>> >> >>>
>> >> >>>
>> >> >>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
>> >> >>> _raw_
>> >> >>> value of PMDs and PTEs. That is - it does not use the pvops interface
>> >> >>> and instead reads the values directly from the page-table. Since the
>> >> >>> page-table is also manipulated by the hypervisor - there are certain
>> >> >>> flags it also sets to do its business. It might be that it uses
>> >> >>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>> >> >>> pte_flags that would invoke the pvops interface.
>> >> >>>
>> >> >>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>> >> >>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>>
>> It does use _PAGE_GLOBAL for guest user pages
>>
>> >> >>>
>> >> >>> This not-compiled-totally-bad-patch might shed some light on what I was
>> >> >>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>> >> >>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>> >> >>> for that).
>> >> >>
>> >> >>
>> >> >> Unfortunately the Totally Bad Patch seems to make no difference. I am
>> >> >> still able to repro the issue:
>> >>
>> >> Steven, do you use numa=fake on boot cmd line for pv guest?
>> >>
>> >> I had similar issue on pv guest. Let me check if the fix that resolved
>> >> this for me will help with 3.13.
>> >
>> > Nope:
>> >
>> > # cat /proc/cmdline
>> > root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7
>>
>> >
>> >>
>> >> >
>> >> >
>> >> > Maybe this one is also related to this BUG here (cc'ed people investigating
>> >> > this one) ...
>> >> >
>> >> > https://lkml.org/lkml/2014/1/10/427
>> >> >
>> >> > ... not sure, though.
>> >> >
>> >> >
>> >> >> [ 346.374929] BUG: Bad page map in process mp
>> >> >> pte:80000004ae928065 pmd:e993f9067
>> >> >> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
>> >> >> (null) index:0x0
>> >> >> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>> >> >> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
>> >> >> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>> >> >> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
>> >> >> #1
>> >> >> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
>> >> >> 00007f06a9bbb000
>> >> >> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
>> >> >> 0000000000000000
>> >> >> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
>> >> >> ffff880e991a3e30
>> >> >> [ 346.374979] Call Trace:
>> >> >> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>> >> >> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>> >> >> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>> >> >> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>> >> >> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>> >> >> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>> >> >> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>> >> >> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>> >> >> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>> >> >> [ 346.375034] [<ffffffff814e712d>]
>> >> >> system_call_fastpath+0x1a/0x1f
>> >> >> [ 346.375037] Disabling lock debugging due to kernel taint
>> >> >> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> >> >> idx:0 val:-1
>> >> >> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>> >> >> idx:1 val:1
>> >> >>
>> >> >> This dump doesn't look dramatically different, either.
>> >> >>
>> >> >>>
>> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
>> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>> >> >>> turned on?
>> >> >>
>> >> >>
>> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>> >> >> mean not enabled at runtime?
>> >> >>
>> >> >> [1]
>> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>> >>
>> >>
>> >>
>> >> --
>> >> Elena
>>
>> I was able to reproduce this consistently, also with the latest mm
>> patches from yesterday.
>> Can you please try this:
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index ce563be..76dcf96 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
>> *mm, unsigned long addr,
>> /* Assume pteval_t is equivalent to all the other *val_t types. */
>> static pteval_t pte_mfn_to_pfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || ((val &
>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> unsigned long pfn = mfn_to_pfn(mfn);
>>
>> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>>
>> static pteval_t pte_pfn_to_mfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || ((val &
>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> pteval_t flags = val & PTE_FLAGS_MASK;
>> unsigned long mfn;
>
> Thanks Elena, I just tested that and it does un-break things.
>
> - Steven


Good sign ) Ok, Ill format this patch for submission after it will be
clear what path is taken here (as its not numa auto balancing related
thing).
--
Elena

2014-01-24 11:05:27

by David Vrabel

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On 23/01/14 16:23, Elena Ufimtseva wrote:
> On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <[email protected]> wrote:
>> On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
>>> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <[email protected]> wrote:
>>>> On 01/22/2014 08:29 AM, Steven Noonan wrote:
>>>>>
>>>>> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>>>>>>
>>>>>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>>>>>>>
>>>>>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>>>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Adding extra folks to the party.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Odds are this also shows up in 3.13, right?
>>>>>>>
>>>>>>>
>>>>>>> Reproduced using 3.13 on the PV guest:
>>>>>>>
>>>>>>> [ 368.756763] BUG: Bad page map in process mp
>>>>>>> pte:80000004a67c6165 pmd:e9b706067
>>>>>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1
>>>>>>> mapping: (null) index:0x0
>>>>>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>>>>>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071
>>>>>>> anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>>>>>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2
>>>>>>> #1
>>>>>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0
>>>>>>> ffffffff814d8748 00007fd1388b7000
>>>>>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289
>>>>>>> 0000000000000000 0000000000000000
>>>>>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180
>>>>>>> 00007fd1388b8000 ffff880e9eaf3e30
>>>>>>> [ 368.756815] Call Trace:
>>>>>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>>>>>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>>>>>> [ 368.756837] [<ffffffff8116eae3>]
>>>>>>> unmap_single_vma+0x583/0x890
>>>>>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>>>>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>>>>>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>>>>>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>>>>>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>>>>>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>>>>>>> [ 368.756869] [<ffffffff814e70ed>]
>>>>>>> system_call_fastpath+0x1a/0x1f
>>>>>>> [ 368.756872] Disabling lock debugging due to kernel taint
>>>>>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680
>>>>>>> idx:0 val:-1
>>>>>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680
>>>>>>> idx:1 val:1
>>>>>>>
>>>>>>>>
>>>>>>>> Probably. I don't have a Xen PV setup to test with (and very little
>>>>>>>> interest in setting one up).. And I have a suspicion that it might not
>>>>>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>>>>>>>>
>>>>>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>>>>>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>>>>>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>>>>>>>> confused.
>>>>>>>>
>>>>>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>>>>>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>>>>>>>> _PAGE_PRESENT.
>>>>>>>>
>>>>>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>>>>>>>> Putting Steven's test-case here as an attachement for Andrea, maybe
>>>>>>>> that makes him go "Ahh, yes, silly case".
>>>>>>>>
>>>>>>>> Also added Kirill, because he was involved the last _PAGE_NUMA debacle.
>>>>>>>>
>>>>>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>>>>>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>>>>>>>> attached test-case (but apparently only under Xen PV). There it
>>>>>>>> apparently causes a "BUG: Bad page map .." error.
>>>>>>
>>>>>>
>>>>>> I *think* it is due to the fact that pmd_numa and pte_numa is getting the
>>>>>> _raw_
>>>>>> value of PMDs and PTEs. That is - it does not use the pvops interface
>>>>>> and instead reads the values directly from the page-table. Since the
>>>>>> page-table is also manipulated by the hypervisor - there are certain
>>>>>> flags it also sets to do its business. It might be that it uses
>>>>>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>>>>>> pte_flags that would invoke the pvops interface.
>>>>>>
>>>>>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>>>>>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>
> It does use _PAGE_GLOBAL for guest user pages
>
>>>>>>
>>>>>> This not-compiled-totally-bad-patch might shed some light on what I was
>>>>>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>>>>>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>>>>>> for that).
>>>>>
>>>>>
>>>>> Unfortunately the Totally Bad Patch seems to make no difference. I am
>>>>> still able to repro the issue:
>>>
>>> Steven, do you use numa=fake on boot cmd line for pv guest?
>>>
>>> I had similar issue on pv guest. Let me check if the fix that resolved
>>> this for me will help with 3.13.
>>
>> Nope:
>>
>> # cat /proc/cmdline
>> root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7
>
>>
>>>
>>>>
>>>>
>>>> Maybe this one is also related to this BUG here (cc'ed people investigating
>>>> this one) ...
>>>>
>>>> https://lkml.org/lkml/2014/1/10/427
>>>>
>>>> ... not sure, though.
>>>>
>>>>
>>>>> [ 346.374929] BUG: Bad page map in process mp
>>>>> pte:80000004ae928065 pmd:e993f9067
>>>>> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping:
>>>>> (null) index:0x0
>>>>> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>>>>> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071
>>>>> anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>>>>> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+
>>>>> #1
>>>>> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768
>>>>> 00007f06a9bbb000
>>>>> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000
>>>>> 0000000000000000
>>>>> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000
>>>>> ffff880e991a3e30
>>>>> [ 346.374979] Call Trace:
>>>>> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>>>>> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>>>> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>>>>> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>>>> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>>>>> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>>>>> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>>>>> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>>>>> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>>>>> [ 346.375034] [<ffffffff814e712d>]
>>>>> system_call_fastpath+0x1a/0x1f
>>>>> [ 346.375037] Disabling lock debugging due to kernel taint
>>>>> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>>>>> idx:0 val:-1
>>>>> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00
>>>>> idx:1 val:1
>>>>>
>>>>> This dump doesn't look dramatically different, either.
>>>>>
>>>>>>
>>>>>> The other question is - how is AutoNUMA running when it is not enabled?
>>>>>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>>>>>> turned on?
>>>>>
>>>>>
>>>>> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>>>>> mean not enabled at runtime?
>>>>>
>>>>> [1]
>>>>> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>>>
>>>
>>>
>>> --
>>> Elena
>
> I was able to reproduce this consistently, also with the latest mm
> patches from yesterday.
> Can you please try this:
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index ce563be..76dcf96 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
> *mm, unsigned long addr,
> /* Assume pteval_t is equivalent to all the other *val_t types. */
> static pteval_t pte_mfn_to_pfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {

if (val & (_PAGE_PRESENT | _PAGE_NUMA))

is equivalent.

David

2014-01-24 11:43:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Wed, Jan 22, 2014 at 01:39:32PM -0500, Rik van Riel wrote:
> On 01/22/2014 01:24 PM, Linus Torvalds wrote:
> >On Wed, Jan 22, 2014 at 10:07 AM, Rik van Riel <[email protected]> wrote:
> >>
> >>The difference between a numa pte and a protnone pte is
> >>the VMA permissions.
> >
> >If that is indeed the only difference, then we should damn well get
> >rid of that f*cking stupid _PAGE_NUMA name entirely.
> >
> >It's misleading crap. Really. Just do a quick grep for that bit, and
> >you see just *how* confused people are about it:
> >
> > #define _PAGE_NUMA _PAGE_PROTNONE
> > ...
> > if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &
> >
> >think about it. Just *THINK* about how broken that code is. The whole
> >thing is a disaster. _PAGE_NUMA must die. It's shit.
>
> The reason things are this way is that we were
> not sure whether we can indeed use _PAGE_PROTNONE
> for NUMA balancing on all architectures.
>

Power is not using _PAGE_PROTNONE to trap NUMA hinting faults because they
do not have that bit. Instead they reuse _PAGE_COHERENT with various tricks,
patches are in -next.

92c08a0d522c7e62c01a63e42597f0c2b02c4245 powerpc/mm: Use HPTE constants when updating hpte bits
c8c06f5a0dde0fed260c54d550962187f266ed0d powerpc/mm: Free up _PAGE_COHERENCE for numa fault use later
8937ba48dcf62b5cdf7abb93652914af16756f50 powerpc/mm: Only check for _PAGE_PRESENT in set_pte/pmd functions
c34a51ce49b40b9667cd7f5cc2e40475af8b4c3d powerpc/mm: Enable _PAGE_NUMA for book3s

As confusing as _PAGE_NUMA is, the intent was to express support in an
architecture-independent manner. If we had started with _PAGE_PROTNONE
then it would still be ambiguous -- are we interested in NUMA hinting
information or is this really PROTNONE protection? The power people would
then have had to add something like _PAGE_NUMA anyway when they had no
_PAGE_PROTNONE bit or define _PAGE_PROTNONE as _PAGE_COHERENT which is
just a different type of misleading.

> If we are sure that _PAGE_PROTNONE can be used
> everywhere, I agree we should get rid of the whole
> _PAGE_NUMA naming, and replace that ambiguous
> code with some comments and documentation instead.

We are sure that _PAGE_PROTNONE cannot be used everywhere.

--
Mel Gorman
SUSE Labs

2014-01-24 13:38:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
> >> >> <SNIP>
> >> >>
> >> >> This dump doesn't look dramatically different, either.
> >> >>
> >> >>>
> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
> >> >>> turned on?
> >> >>
> >> >>
> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
> >> >> mean not enabled at runtime?
> >> >>
> >> >> [1]
> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
> >>
> >>
> >>
> >> --
> >> Elena
>
> I was able to reproduce this consistently, also with the latest mm
> patches from yesterday.
> Can you please try this:
>

Thanks Elena,

> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index ce563be..76dcf96 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
> *mm, unsigned long addr,
> /* Assume pteval_t is equivalent to all the other *val_t types. */
> static pteval_t pte_mfn_to_pfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> unsigned long pfn = mfn_to_pfn(mfn);
>
> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>
> static pteval_t pte_pfn_to_mfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || ((val &
> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> pteval_t flags = val & PTE_FLAGS_MASK;
> unsigned long mfn;

Would reusing pte_present be an option? Ordinarily I expect that
PAGE_NUMA/PAGE_PROTNONE is only set if PAGE_PRESENT is not set and pte_present
is defined as

static inline int pte_present(pte_t a)
{
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
_PAGE_NUMA);
}

So it looks like it work work. Of course it would need to be split to
reuse it within xen if pte_present was split to have a pteval_present
helper like so

static inline int pteval_present(pteval_t val)
{
/*
* Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
* way clearly states that the intent is that a protnone and numa
* hinting ptes are considered present for the purposes of
* pagetable operations like zapping, protection changes, gup etc.
*/
return val & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
}

static inline int pte_present(pte_t pte)
{
return pteval_present(pte_flags(pte))
}

If Xen is doing some other tricks with _PAGE_PRESENT then it might be
ruled out as an option. If so, then maybe it could still be made a
little clearer for future reference?


diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index c1d406f..ff621de 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 8e4f41d..693fe00 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -654,10 +654,14 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
* (because _PAGE_PRESENT is not set).
*/
#ifndef pte_numa
+static inline int pteval_numa(pteval_t pteval)
+{
+ return (pteval & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
static inline int pte_numa(pte_t pte)
{
- return (pte_flags(pte) &
- (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+ return pteval_numa(pte_flags(pte));
}
#endif

2014-01-26 18:02:38

by Elena Ufimtseva

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Fri, Jan 24, 2014 at 8:38 AM, Mel Gorman <[email protected]> wrote:
> On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
>> >> >> <SNIP>
>> >> >>
>> >> >> This dump doesn't look dramatically different, either.
>> >> >>
>> >> >>>
>> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
>> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>> >> >>> turned on?
>> >> >>
>> >> >>
>> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>> >> >> mean not enabled at runtime?
>> >> >>
>> >> >> [1]
>> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>> >>
>> >>
>> >>
>> >> --
>> >> Elena
>>
>> I was able to reproduce this consistently, also with the latest mm
>> patches from yesterday.
>> Can you please try this:
>>
>
> Thanks Elena,
>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index ce563be..76dcf96 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
>> *mm, unsigned long addr,
>> /* Assume pteval_t is equivalent to all the other *val_t types. */
>> static pteval_t pte_mfn_to_pfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || ((val &
>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> unsigned long pfn = mfn_to_pfn(mfn);
>>
>> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>>
>> static pteval_t pte_pfn_to_mfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || ((val &
>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> pteval_t flags = val & PTE_FLAGS_MASK;
>> unsigned long mfn;
>
> Would reusing pte_present be an option? Ordinarily I expect that
> PAGE_NUMA/PAGE_PROTNONE is only set if PAGE_PRESENT is not set and pte_present
> is defined as
>
> static inline int pte_present(pte_t a)
> {
> return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> _PAGE_NUMA);
> }
>
> So it looks like it work work. Of course it would need to be split to
> reuse it within xen if pte_present was split to have a pteval_present
> helper like so
>
> static inline int pteval_present(pteval_t val)
> {
> /*
> * Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
> * way clearly states that the intent is that a protnone and numa
> * hinting ptes are considered present for the purposes of
> * pagetable operations like zapping, protection changes, gup etc.
> */
> return val & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
> }
>
> static inline int pte_present(pte_t pte)
> {
> return pteval_present(pte_flags(pte))
> }
>
> If Xen is doing some other tricks with _PAGE_PRESENT then it might be
> ruled out as an option. If so, then maybe it could still be made a
> little clearer for future reference?

Yes, sure, it should work, I tried it.
Thank you Mel.

>
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index c1d406f..ff621de 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
> /* Assume pteval_t is equivalent to all the other *val_t types. */
> static pteval_t pte_mfn_to_pfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> unsigned long pfn = mfn_to_pfn(mfn);
>
> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>
> static pteval_t pte_pfn_to_mfn(pteval_t val)
> {
> - if (val & _PAGE_PRESENT) {
> + if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> pteval_t flags = val & PTE_FLAGS_MASK;
> unsigned long mfn;
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 8e4f41d..693fe00 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -654,10 +654,14 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
> * (because _PAGE_PRESENT is not set).
> */
> #ifndef pte_numa
> +static inline int pteval_numa(pteval_t pteval)
> +{
> + return (pteval & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> +}
> +
> static inline int pte_numa(pte_t pte)
> {
> - return (pte_flags(pte) &
> - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> + return pteval_numa(pte_flags(pte));
> }
> #endif
>



--
Elena

2014-02-04 06:58:39

by Elena Ufimtseva

[permalink] [raw]
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression

On Sun, Jan 26, 2014 at 1:02 PM, Elena Ufimtseva <[email protected]> wrote:
> On Fri, Jan 24, 2014 at 8:38 AM, Mel Gorman <[email protected]> wrote:
>> On Thu, Jan 23, 2014 at 11:23:37AM -0500, Elena Ufimtseva wrote:
>>> >> >> <SNIP>
>>> >> >>
>>> >> >> This dump doesn't look dramatically different, either.
>>> >> >>
>>> >> >>>
>>> >> >>> The other question is - how is AutoNUMA running when it is not enabled?
>>> >> >>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>>> >> >>> turned on?
>>> >> >>
>>> >> >>
>>> >> >> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>>> >> >> mean not enabled at runtime?
>>> >> >>
>>> >> >> [1]
>>> >> >> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Elena
>>>
>>> I was able to reproduce this consistently, also with the latest mm
>>> patches from yesterday.
>>> Can you please try this:
>>>
>>
>> Thanks Elena,
>>
>>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>>> index ce563be..76dcf96 100644
>>> --- a/arch/x86/xen/mmu.c
>>> +++ b/arch/x86/xen/mmu.c
>>> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
>>> *mm, unsigned long addr,
>>> /* Assume pteval_t is equivalent to all the other *val_t types. */
>>> static pteval_t pte_mfn_to_pfn(pteval_t val)
>>> {
>>> - if (val & _PAGE_PRESENT) {
>>> + if ((val & _PAGE_PRESENT) || ((val &
>>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>>> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>>> unsigned long pfn = mfn_to_pfn(mfn);
>>>
>>> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>>>
>>> static pteval_t pte_pfn_to_mfn(pteval_t val)
>>> {
>>> - if (val & _PAGE_PRESENT) {
>>> + if ((val & _PAGE_PRESENT) || ((val &
>>> (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {
>>> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>>> pteval_t flags = val & PTE_FLAGS_MASK;
>>> unsigned long mfn;
>>
>> Would reusing pte_present be an option? Ordinarily I expect that
>> PAGE_NUMA/PAGE_PROTNONE is only set if PAGE_PRESENT is not set and pte_present
>> is defined as
>>
>> static inline int pte_present(pte_t a)
>> {
>> return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
>> _PAGE_NUMA);
>> }
>>
>> So it looks like it work work. Of course it would need to be split to
>> reuse it within xen if pte_present was split to have a pteval_present
>> helper like so
>>
>> static inline int pteval_present(pteval_t val)
>> {
>> /*
>> * Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
>> * way clearly states that the intent is that a protnone and numa
>> * hinting ptes are considered present for the purposes of
>> * pagetable operations like zapping, protection changes, gup etc.
>> */
>> return val & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
>> }
>>
>> static inline int pte_present(pte_t pte)
>> {
>> return pteval_present(pte_flags(pte))
>> }
>>
>> If Xen is doing some other tricks with _PAGE_PRESENT then it might be
>> ruled out as an option. If so, then maybe it could still be made a
>> little clearer for future reference?
>
> Yes, sure, it should work, I tried it.
> Thank you Mel.
>
>>
>>
>> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>> index c1d406f..ff621de 100644
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
>> /* Assume pteval_t is equivalent to all the other *val_t types. */
>> static pteval_t pte_mfn_to_pfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
>> unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> unsigned long pfn = mfn_to_pfn(mfn);
>>
>> @@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
>>
>> static pteval_t pte_pfn_to_mfn(pteval_t val)
>> {
>> - if (val & _PAGE_PRESENT) {
>> + if ((val & _PAGE_PRESENT) || pteval_numa(val)) {
>> unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
>> pteval_t flags = val & PTE_FLAGS_MASK;
>> unsigned long mfn;
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> index 8e4f41d..693fe00 100644
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -654,10 +654,14 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
>> * (because _PAGE_PRESENT is not set).
>> */
>> #ifndef pte_numa
>> +static inline int pteval_numa(pteval_t pteval)
>> +{
>> + return (pteval & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
>> +}
>> +
>> static inline int pte_numa(pte_t pte)
>> {
>> - return (pte_flags(pte) &
>> - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
>> + return pteval_numa(pte_flags(pte));
>> }
>> #endif
>>
>
>
>
> --
> Elena

Here are two variants of this change . First one adds check for
_PAGE_NUMA flag in xen pte translations.
Second adds proposed by Mel pteval_present (comments are left
untouched :) and respective patch for xen pte translation that
uses pteval_present.
Mel, you can pick any of these two if they look fine and xen
maintainers are ok with the xen change (Konrad, David?)

Subject: [PATCH] xen: add _PAGE_NUMA for pte translations

xen: add _PAGE_NUMA for pte translations

Adds check in xen guest pte addresses translations for
_PAGE_NUMA flag. This resolves reported issues and will
be essential for NUMA support in xen guest with future
vNUMA patches.

Signed-off-by: Elena Ufimtseva <[email protected]>
---
arch/x86/xen/mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 2423ef0..c804d58 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
*mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || (val & (_PAGE_NUMA|_PAGE_PRESENT))) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if ((val & _PAGE_PRESENT) || (val & (_PAGE_NUMA|_PAGE_PRESENT))) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;
--
1.7.10.4




Subject: [PATCH] mm: adds pteval_present

As suggested by Mel Gorman in https://lkml.org/lkml/2014/1/24/174.
Adds pteval_present to clarify that hinting ptes are considered present
for the purposes of pagetable operations like zapping, protection changes,
gup etc.

Signed-off-by: Elena Ufimtseva <[email protected]>
---
arch/x86/include/asm/pgtable.h | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index bbc8b12..205b00d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -445,10 +445,20 @@ static inline int pte_same(pte_t a, pte_t b)
return a.pte == b.pte;
}

+static inline int pteval_present(pteval_t pteval)
+{
+ /*
+ * Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
+ * way clearly states that the intent is that a protnone and numa
+ * hinting ptes are considered present for the purposes of
+ * pagetable operations like zapping, protection changes, gup etc.
+ */
+ return pteval & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
+}
+
static inline int pte_present(pte_t a)
{
- return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
- _PAGE_NUMA);
+ return pteval_present(pte_flags(a));
}

#define pte_accessible pte_accessible
--
1.7.10.4


Subject: [PATCH] xen: use pteval_present for xen pte translations

Uses pteval_present for xen pte translations as sugested
by Mel Gorman in https://lkml.org/lkml/2014/1/24/174.
This also takes into account _PAGE_NUMA flag.

Signed-off-by: Elena Ufimtseva <[email protected]>
---
arch/x86/xen/mmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 2423ef0..256282e 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct
*mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if (pteval_present(val)) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if (pteval_present(val)) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;
--
1.7.10.4

2014-02-04 11:44:55

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] Subject: [PATCH] xen: Properly account for _PAGE_NUMA during xen pte translations

Steven Noonan forwarded a users report where they had a problem starting
vsftpd on a Xen paravirtualized guest, with this in dmesg:

[ 60.654862] BUG: Bad page map in process vsftpd pte:8000000493b88165 pmd:e9cc01067
[ 60.654876] page:ffffea00124ee200 count:0 mapcount:-1 mapping: (null) index:0x0
[ 60.654879] page flags: 0x2ffc0000000014(referenced|dirty)
[ 60.654885] addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping: (null) index:7f97eea74
[ 60.654890] CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
[ 60.654893] ffff880e9cc6ec38 ffff880e9cc61ca0 ffffffff814c763b 00007f97eea74000
[ 60.654900] ffff880e9cc61ce8 ffffffff8116784e 0000000000000000 0000000000000000
[ 60.654906] ffff880e9cc013a0 ffffea00124ee200 00007f97eea75000 ffff880e9cc61e10
[ 60.654912] Call Trace:
[ 60.654921] [<ffffffff814c763b>] dump_stack+0x45/0x56
[ 60.654928] [<ffffffff8116784e>] print_bad_pte+0x22e/0x250
[ 60.654933] [<ffffffff81169073>] unmap_single_vma+0x583/0x890
[ 60.654938] [<ffffffff8116a405>] unmap_vmas+0x65/0x90
[ 60.654942] [<ffffffff81173795>] exit_mmap+0xc5/0x170
[ 60.654948] [<ffffffff8105d295>] mmput+0x65/0x100
[ 60.654952] [<ffffffff81062983>] do_exit+0x393/0x9e0
[ 60.654955] [<ffffffff810630dc>] do_group_exit+0xcc/0x140
[ 60.654959] [<ffffffff81063164>] SyS_exit_group+0x14/0x20
[ 60.654965] [<ffffffff814d602d>] system_call_fastpath+0x1a/0x1f
[ 60.654968] Disabling lock debugging due to kernel taint
[ 60.655191] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
[ 60.655196] BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1

The issue could not be reproduced under an HVM instance with the same kernel,
so it appears to be exclusive to paravirtual Xen guests. He bisected the
problem to commit 1667918b (mm: numa: clear numa hinting information on
mprotect) that was also included in 3.12-stable.

The problem was related to how xen translates ptes because it was not
accounting for the _PAGE_NUMA bit. This patch splits pte_present to add
a pteval_present helper for use by xen so both bare metal and xen use
the same code when checking if a PTE is present.

[[email protected]: Wrote changelog, proposed minor modifications]
Reported-and-tested-by: Steven Noonan <[email protected]>
Signed-off-by: Elena Ufimtseva <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Cc: [email protected] # 3.12+
---
arch/x86/include/asm/pgtable.h | 14 ++++++++++++--
arch/x86/xen/mmu.c | 4 ++--
2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index bbc8b12..19e3706 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -445,10 +445,20 @@ static inline int pte_same(pte_t a, pte_t b)
return a.pte == b.pte;
}

+static inline int pteval_present(pteval_t pteval)
+{
+ /*
+ * Yes Linus, _PAGE_PROTNONE == _PAGE_NUMA. Expressing it this
+ * way clearly states that the intent is that a protnone and numa
+ * hinting ptes are considered present for the purposes of
+ * pagetable operations like zapping, protection changes, gup etc.
+ */
+ return pteval & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA);
+}
+
static inline int pte_present(pte_t a)
{
- return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
- _PAGE_NUMA);
+ return pteval_present(pte_flags(a));
}

#define pte_accessible pte_accessible
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 2423ef0..256282e 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
/* Assume pteval_t is equivalent to all the other *val_t types. */
static pteval_t pte_mfn_to_pfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if (pteval_present(val)) {
unsigned long mfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
unsigned long pfn = mfn_to_pfn(mfn);

@@ -381,7 +381,7 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)

static pteval_t pte_pfn_to_mfn(pteval_t val)
{
- if (val & _PAGE_PRESENT) {
+ if (pteval_present(val)) {
unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
pteval_t flags = val & PTE_FLAGS_MASK;
unsigned long mfn;

2014-02-04 11:48:52

by David Vrabel

[permalink] [raw]
Subject: Re: [PATCH] Subject: [PATCH] xen: Properly account for _PAGE_NUMA during xen pte translations

On 04/02/14 11:44, Mel Gorman wrote:
> Steven Noonan forwarded a users report where they had a problem starting
> vsftpd on a Xen paravirtualized guest, with this in dmesg:
>
[...]
>
> The issue could not be reproduced under an HVM instance with the same kernel,
> so it appears to be exclusive to paravirtual Xen guests. He bisected the
> problem to commit 1667918b (mm: numa: clear numa hinting information on
> mprotect) that was also included in 3.12-stable.
>
> The problem was related to how xen translates ptes because it was not
> accounting for the _PAGE_NUMA bit. This patch splits pte_present to add
> a pteval_present helper for use by xen so both bare metal and xen use
> the same code when checking if a PTE is present.

Reviewed-by: David Vrabel <[email protected]>

Thanks.

David

2014-02-04 14:39:42

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH] Subject: [PATCH] xen: Properly account for _PAGE_NUMA during xen pte translations

On Tue, Feb 04, 2014 at 11:48:45AM +0000, David Vrabel wrote:
> On 04/02/14 11:44, Mel Gorman wrote:
> > Steven Noonan forwarded a users report where they had a problem starting
> > vsftpd on a Xen paravirtualized guest, with this in dmesg:
> >
> [...]
> >
> > The issue could not be reproduced under an HVM instance with the same kernel,
> > so it appears to be exclusive to paravirtual Xen guests. He bisected the
> > problem to commit 1667918b (mm: numa: clear numa hinting information on
> > mprotect) that was also included in 3.12-stable.
> >
> > The problem was related to how xen translates ptes because it was not
> > accounting for the _PAGE_NUMA bit. This patch splits pte_present to add
> > a pteval_present helper for use by xen so both bare metal and xen use
> > the same code when checking if a PTE is present.
>
> Reviewed-by: David Vrabel <[email protected]>

Thank you for fixing it!

Acked-by: Konrad Rzeszutek Wilk <[email protected]>

I can ingest it through the Xen tree for rc2. Or let Linus handle it
if he prefers it.

>
> Thanks.
>
> David