On Tue, 30 Sep 2014, Linus Torvalds wrote:
> On Tue, Sep 30, 2014 at 11:20 AM, Dave Jones <[email protected]> wrote:
> >
> > page_fault_kernel: address=__per_cpu_end ip=copy_page_to_iter error_code=0x2
>
> Interesting. "error_code" in particular. The value "2" means that the
> CPU thinks that the page is not present (bit zero is clear).
>
> (That "address" is useless - it's tried to turn a user address into a
> kernel symbol, and the percpu symbols are zero-based, so it picks the
> last of them. The "ip" is useless too, since it doesn't give the
> offset)
>
> So the CPU thinks it's a write to a not-present page, which means that
> _PAGE_PRESENT bit is clear.
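(For reference, those x86 fault error-code bits can be decoded like this;
a minimal userspace sketch, with the PF_* names following the definitions
in arch/x86/mm/fault.c:)

#include <stdio.h>

/* x86 page fault error code bits, as in arch/x86/mm/fault.c */
#define PF_PROT  (1UL << 0)  /* set: protection fault; clear: page not present */
#define PF_WRITE (1UL << 1)  /* set: write access; clear: read access */
#define PF_USER  (1UL << 2)  /* set: fault taken in user mode */

int main(void)
{
	unsigned long error_code = 0x2;  /* the value from Dave's trace */

	printf("%s fault, %s access, %s mode\n",
	       (error_code & PF_PROT) ? "protection" : "not-present",
	       (error_code & PF_WRITE) ? "write" : "read",
	       (error_code & PF_USER) ? "user" : "kernel");
	return 0;
}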
>
> Now the *kernel* thinks a page is present not just if _PAGE_PRESENT is
> set, but also if _PAGE_PROTNONE or _PAGE_NUMA are set. Sadly, your
> trace is not very useful, because inlining has caused pretty much all
> the cases to be in "handle_mm_fault()", so the trace doesn't really
> tell which path this all takes.
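(The x86 pte_present() behind that statement read roughly like this at
the time, from arch/x86/include/asm/pgtable.h:)

static inline int pte_present(pte_t a)
{
	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
			       _PAGE_NUMA);
}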
>
> But we can still do *some* analysis on the trace: do_wp_page()
> shouldn't have been inlined, so it would have shown up in the trace if
> it had been called. So I think we can be pretty confident that the
> ptep_set_access_flags() we see is the one from handle_pte_fault().
>
> And if that is the case, then we know that "pte_present()" is indeed
> true as far as the kernel is concerned. So with _PAGE_PRESENT not being
> set (based on the error code), we know that _PAGE_PROTNONE must be
> set, otherwise we'd have triggered the pte_numa() check and exited
> through do_numa_page().
>
> So it smells like we have a PROT_NONE VM area (at least the page table
> entries imply that). But "access_error()" should have flagged that (it
> checks "vma->vm_flags & VM_WRITE"). How do we have a page table entry
> marked _PAGE_PROTNONE, but VM_WRITE set in the vma?
>
> Or, possibly, we have some confusion about the page tables themselves
> (corruption, wrong %cr3 value, whatever), explaining why the CPU
> thinks one thing, but our software page table walker thinks another.
>
> I'm not seeing how this all happens. But I'm adding Kirill to the cc,
> since he might see something I missed, and he touched some of this
> code last ("tag, you're it").
>
> Kirill: the thread is on lkml, but basically it boils down to the
> second byte write in fault_in_pages_writeable() faulting forever,
> despite handle_mm_fault() apparently thinking that everything is fine.
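(For context, the fault_in_pages_writeable() of that era pokes the first
byte of the range and, if the range crosses a page boundary, the last
byte too; roughly, from include/linux/pagemap.h:)

static inline int fault_in_pages_writeable(char __user *uaddr, int size)
{
	int ret;

	if (unlikely(size == 0))
		return 0;

	/* writing zeroes is fine: the caller overwrites them on success */
	ret = __put_user(0, uaddr);
	if (ret == 0) {
		char __user *end = uaddr + size - 1;

		/* the "second byte" write: only done across a page boundary */
		if (((unsigned long)uaddr & PAGE_MASK) !=
		    ((unsigned long)end & PAGE_MASK))
			ret = __put_user(0, end);
	}
	return ret;
}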
>
> Also adding Hugh Dickins, just because the more people who know this
> code that are involved, the better.
I've tried, but failed to explain it.
I think it's likely related to the VM_BUG_ON(!(val & _PAGE_PRESENT))
which linux-next has in pte_mknuma(), which Sasha Levin first reported
hitting in https://lkml.org/lkml/2014/8/26/869 (a resumption of the
"mm: BUG in unmap_page_range" thread, though its subject bug is fixed).
Mel and I gave it a lot of thought, but that too remains unexplained.
Sasha could reproduce it fairly easily on linux-next, but could not
reproduce it on 3.17-rc4 (plus the VM_BUG_ON); maybe Dave is doing
something different enough to get it on 3.17-rc7.
I say they're likely related because both could be explained if
there's some way in which a PROTNONE pte can get left behind after
the vma has been mprotected back from PROT_NONE to read-writable.
But we cannot see how (even when racing with page migration).
Irrelevance follows...
There *appears* to be a risk of hitting the VM_BUG_ON, or with no
VM_BUG_ON (as in 3.17-rc) pte_mknuma proceeding to add _PAGE_NUMA
to _PAGE_PROTNONE - making the pte then fail the pte_numa test,
but pass the pte_special test, hence fail the vm_normal_page test:
when coming from change_prot_numa serving MPOL_MF_LAZY for mbind.
However, that would still not explain Dave's endless refaulting;
though I was reminded to send you a patch to fix it, except that
when I came to test the fix, I could not produce the problem, and
eventually discovered a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP
and MPOL_MF_LAZY from userspace for now") - that call to
change_prot_numa is still just dead code, so we're still safe from
its use on PROT_NONE areas (which task_numa_work carefully avoids).
Some time wasted on that, but I learnt a valuable debugging technique:
#undef EINVAL
#define EINVAL __LINE__
Hugh
On Wed, Oct 1, 2014 at 1:19 AM, Hugh Dickins <[email protected]> wrote:
>
> Irrelevance follows...
Maybe not irrelevant.
> There *appears* to be a risk of hitting the VM_BUG_ON, or with no
> VM_BUG_ON (as in 3.17-rc) pte_mknuma proceeding to add _PAGE_NUMA
> to _PAGE_PROTNONE - making the pte then fail the pte_numa test,
> but pass the pte_special test, hence fail the vm_normal_page test:
> when coming from change_prot_numa serving MPOL_MF_LAZY for mbind.
Ugh, yes. The whole _PAGE_NUMA is still a f*cking mess. I hate it.
Hate it, hate it, hate it.
We need to get rid of it, and just make it the same as pte_protnone().
And then the real protnone is in the vma flags, and if you actually
ever get to a pte that is marked protnone, you know it's a numa page.
Seriously.
I never understood what the objection to that was, but every time I
tell people to do it, they go crazy and think _PAGE_NUMA makes sense.
It doesn't. There's no excuse. Rik, what was your broken excuse again?
Something to do with Powerpc, but it is obviously not true, since
powerpc supports protnone just fine.
Even our own comments are confused, with include/asm-generic/pgtable.h saying:
* _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
* same bit too).
but no, it's not the same bit.
Can we please just get rid of _PAGE_NUMA. There is no excuse for it.
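On x86 the helper would be about this much code (an untested sketch):

static inline int pte_protnone(pte_t pte)
{
	/* present as far as the kernel is concerned, but the hardware
	 * present bit is clear and no access rights are granted */
	return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
		== _PAGE_PROTNONE;
}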
> However, that would still not explain Dave's endless refaulting;
Why not? You start out with a PROTNONE, trigger shrink_page_list() on
a hugepage, which calls add_to_swap(), which does
split_huge_page_to_list(), which in turn calls __split_huge_page(),
and that turns (_PAGE_PROTNONE) into (_PAGE_PROTNONE|_PAGE_NUMA),
which you will then fault on forever, because the kernel thinks the
page is present, but not a NUMA page.
IOW, it's *exactly* the same f*cking confusion between _PAGE_NUMA and
_PAGE_PROTNONE I've complained about before.
> Some time wasted on that, but I learnt a valuable debugging technique:
> #undef EINVAL
> #define EINVAL __LINE__
Wow. There's a certain beauty in the pure craziness.
However, be careful: our IS_ERR() handling knows that error numbers
are < 4096. So in a big file, with error pointers, that doesn't work
reliably.
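Something like this shows both the trick and the caveat (a userspace
sketch of the IS_ERR() convention; frobnicate() is made up):

#include <stdio.h>

/* the kernel treats the top 4095 values as encoded error numbers */
#define MAX_ERRNO	4095
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

#undef EINVAL
#define EINVAL __LINE__		/* Hugh's trick: the errno becomes the line */

static long frobnicate(void)
{
	return -EINVAL;		/* reports this very line number */
}

int main(void)
{
	long err = frobnicate();

	printf("failed at line %ld\n", -err);
	/* a return from a line past 4095 would no longer look like an
	 * error value, which is the "big file" caveat */
	printf("IS_ERR_VALUE: %d\n", (int)IS_ERR_VALUE(err));
	return 0;
}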
Linus
On Wed, Oct 1, 2014 at 9:01 AM, Linus Torvalds
<[email protected]> wrote:
>
> We need to get rid of it, and just make it the same as pte_protnone().
> And then the real protnone is in the vma flags, and if you actually
> ever get to a pte that is marked protnone, you know it's a numa page.
So I'd really suggest we do exactly that. Get rid of "pte_numa()"
entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
"pte_protnone()" helper to check for the "protnone" case (which on x86
is testing the _PAGE_PROTNONE bit, and on most other architectures is
just testing that the page has no access rights).
Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
because they are brainless sh*t, and we just use
ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);
ptep_modify_prot_commit(mm, addr, pte, ptent);
reliably instead (where for the mknuma case "newprot" is PROT_NONE,
and for mknonnuma() it is vma->vm_page_prot). Yes, that means that you
have to pass in the vma to those functions, but that just makes sense
anyway.
And if that means that we lose the numa flag on mprotect etc, nobody sane cares.
Seriously, why can't we just do this, and throw away all the crap that
is "numa special case". This would make all the random games in
change_pte_range() just go away entirely, because the whole NUMA thing
really wouldn't be a special case for the pte AT ALL any more. All it
would be is that a pte could be marked PROT_NONE even if the
vma->vm_flags aren't.
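The fault path would then need just one test, something like this (a
sketch; the actual do_numa_page() arguments differ):

	/* in handle_pte_fault(): a protnone pte in an accessible vma
	 * can only mean a NUMA hinting fault */
	if (pte_protnone(entry))
		return do_numa_page(mm, vma, address, entry, pte, pmd);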
Please, please, please? The current _PAGE_NUMA really is a horrible
horrible thing, and may well be the source of this bug.
The fact that it took DaveJ a long time to trigger his lockup would be
entirely consistent with "you have to split a PROTNONE large page due
to memory pressure", so the problem with our current pte_mknuma() that
Hugh points out looks entirely possible to me.
Now, there may be some reason why it can't happen, but even in the
absence of this bug, I really think that _PAGE_NUMA has been a huge
mistake from day one.
Linus
On 10/01/2014 12:18 PM, Linus Torvalds wrote:
> Seriously, why can't we just do this, and throw away all the crap that
> is "numa special case". This would make all the random games in
> change_pte_range() just go away entirely, because the whole NUMA thing
> really wouldn't be a special case for the pte AT ALL any more. All it
> would be is that a pte could be marked PROT_NONE even if the
> vma->vm_flags aren't.
That's what I suggested quite a while back, but IIRC either
Peter or Mel brought up a reason why this was not possible.
Unfortunately I do not remember what it was, just that I
was convinced by it at the time...
On Wed, Oct 1, 2014 at 9:18 AM, Linus Torvalds
<[email protected]> wrote:
>
> So I'd really suggest we do exactly that. Get rid of "pte_numa()"
> entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
> "pte_protnone()" helper to check for the "protnone" case (which on x86
> is testing the _PAGE_PROTNONE bit, and on most other architectures is
> just testing that the page has no access rights).
>
> Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
> because they are brainless sh*t, and we just use
>
> ptent = ptep_modify_prot_start(mm, addr, pte);
> ptent = pte_modify(ptent, newprot);
> ptep_modify_prot_commit(mm, addr, pte, ptent);
>
> reliably instead (where for the mknuma case "newprot" is PROT_NONE,
> and for mknonnuma() it is vma->vm_page_prot). Yes, that means that you
> have to pass in the vma to those functions, but that just makes sense
> anyway.
>
> And if that means that we lose the numa flag on mprotect etc, nobody sane cares.
So here is a *COMPLETELY UNTESTED* and probably seriously buggy first
version of such a patch. It doesn't do the powerpc conversion, so
somebody would need to check that eventually, but aside from that
obvious issue, can people fix this up? Or comment on why it doesn't
work.
Now, I like this because it gets rid of the horrible PAGE_NUMA special
cases, but it really seems to simplify things in general. Lookie here:
13 files changed, 74 insertions(+), 268 deletions(-)
that's really mainly just the removal of odd and broken numa pte/pmd
helper functions from <asm-generic/pgtable.h> that aren't needed any
more because the normal "change protections" functions just DTRT
automatically. Although there are actually a few other cases that got
simpler too, so it's not *just* removal of those _PAGE_NUMA-specific
helpers.
One thing this does *not* remove is the special pte locking rule in
the "change_*_range()" functions: they still take that broken
"prot_numa" argument. HOWEVER, it isn't actually used for any page
table modifications, the only reason for it existing is the hacky
locking issue (see lock_pte_protection(), and the comment about races
with the transhuge accesses).
Now, I'll be honest: this patch *might* just work, but I expect it to
have some stupid problem. It compiles. I haven't even dared boot it,
much less try any numa benchmarks that wouldn't show anything sane on
my machine anyway.
So I'm really sending this patch out in the hope that it will get
comments, fixup and possibly even testing by people who actually know
the NUMA balancing code. Rik? Anybody?
Linus
On 10/01/2014 04:20 PM, Linus Torvalds wrote:
> Now, I'll be honest: this patch *might* just work, but I expect it to
> have some stupid problem. It compiles. I haven't even dared boot it,
> much less try any numa benchmarks that wouldn't show anything sane on
> my machine anyway.
>
> So I'm really sending this patch out in the hope that it will get
> comments, fixup and possibly even testing by people who actually know
> the NUMA balancing code. Rik? Anybody?
I'll add it to the list. No guarantees I'll be able to get
to it before LinuxCon / KVM Forum though. I have a bit of
a backlog to get through this coming week...
On 10/01/2014 04:20 PM, Linus Torvalds wrote:
> So I'm really sending this patch out in the hope that it will get
> comments, fixup and possibly even testing by people who actually know
> the NUMA balancing code. Rik? Anybody?
Hi Linus,
I've tried this patch on the same configuration that was triggering
the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
ran fine for ~20 minutes before exploding with:
[ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
[ 2781.566953] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 2781.568054] Dumping ftrace buffer:
[ 2781.568826] (ftrace buffer empty)
[ 2781.569392] Modules linked in:
[ 2781.569909] CPU: 61 PID: 13111 Comm: trinity-c61 Not tainted 3.17.0-rc7-sasha-00040-g65e1cb2 #1259
[ 2781.571077] task: ffff88050ba80000 ti: ffff880418ecc000 task.ti: ffff880418ecc000
[ 2781.571077] RIP: do_huge_pmd_numa_page (mm/huge_memory.c:1293 (discriminator 1))
[ 2781.571077] RSP: 0000:ffff880418ecfc60 EFLAGS: 00010246
[ 2781.571077] RAX: ffffea0074c60000 RBX: ffffea0074c60000 RCX: 0000001d318009e0
[ 2781.571077] RDX: ffffea0000000000 RSI: ffffffffb5706ef3 RDI: 0000001d318009e0
[ 2781.571077] RBP: ffff880418ecfcc8 R08: 0000000000000038 R09: 0000000000000001
[ 2781.571077] R10: 0000000000000038 R11: 0000000000000001 R12: ffff8804f9b52000
[ 2781.571077] R13: 0000001d318009e0 R14: ffff880508a1f840 R15: 0000000000000028
[ 2781.571077] FS: 00007f5502fc9700(0000) GS:ffff881d77e00000(0000) knlGS:0000000000000000
[ 2781.571077] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2781.571077] CR2: 0000000000000000 CR3: 00000004bfac4000 CR4: 00000000000006a0
[ 2781.571077] Stack:
[ 2781.571077] ffff880418ecfc98 0000000000000282 ffff88050ba80000 000000000000000b
[ 2781.571077] ffff88060d2ab000 ffff88060000001d 0000000000000000 ffff881d30b3ec00
[ 2781.571077] 0000000000000000 ffff881d30b3ec00 ffff88060d2ab000 0000000000000100
[ 2781.571077] Call Trace:
[ 2781.571077] handle_mm_fault (mm/memory.c:3304 mm/memory.c:3368)
[ 2781.571077] __do_page_fault (arch/x86/mm/fault.c:1231)
[ 2781.571077] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 2781.571077] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
[ 2781.571077] ? sched_clock_local (kernel/sched/clock.c:214)
[ 2781.571077] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 2781.571077] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 2781.571077] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2641 (discriminator 8))
[ 2781.571077] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[ 2781.571077] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[ 2781.571077] async_page_fault (arch/x86/kernel/entry_64.S:1314)
[ 2781.571077] ? copy_user_generic_unrolled (arch/x86/lib/copy_user_64.S:166)
[ 2781.571077] ? sys32_mmap (arch/x86/ia32/sys_ia32.c:159)
[ 2781.571077] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
[ 2781.571077] Code: b4 eb e0 0f 1f 84 00 00 00 00 00 4c 89 f7 e8 88 2f 0c 03 48 8b 45 d0 4c 89 e6 48 8b b8 88 00 00 00 e8 85 c7 ff ff e9 90 fe ff ff <0f> 0b 66 0f 1f 44 00 00 48 89 df e8 90 e9 f9 ff 84 c0 0f 85 be
All code
========
0: b4 eb mov $0xeb,%ah
2: e0 0f loopne 0x13
4: 1f (bad)
5: 84 00 test %al,(%rax)
7: 00 00 add %al,(%rax)
9: 00 00 add %al,(%rax)
b: 4c 89 f7 mov %r14,%rdi
e: e8 88 2f 0c 03 callq 0x30c2f9b
13: 48 8b 45 d0 mov -0x30(%rbp),%rax
17: 4c 89 e6 mov %r12,%rsi
1a: 48 8b b8 88 00 00 00 mov 0x88(%rax),%rdi
21: e8 85 c7 ff ff callq 0xffffffffffffc7ab
26: e9 90 fe ff ff jmpq 0xfffffffffffffebb
2b:* 0f 0b ud2 <-- trapping instruction
2d: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
33: 48 89 df mov %rbx,%rdi
36: e8 90 e9 f9 ff callq 0xfffffffffff9e9cb
3b: 84 c0 test %al,%al
3d: 0f .byte 0xf
3e: 85 .byte 0x85
3f: be .byte 0xbe
...
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
8: 48 89 df mov %rbx,%rdi
b: e8 90 e9 f9 ff callq 0xfffffffffff9e9a0
10: 84 c0 test %al,%al
12: 0f .byte 0xf
13: 85 .byte 0x85
14: be .byte 0xbe
...
[ 2781.571077] RIP do_huge_pmd_numa_page (mm/huge_memory.c:1293 (discriminator 1))
[ 2781.571077] RSP <ffff880418ecfc60>
Thanks,
Sasha
On Wed, 01 Oct 2014 18:08:30 -0400
Sasha Levin <[email protected]> wrote:
> On 10/01/2014 04:20 PM, Linus Torvalds wrote:
> > So I'm really sending this patch out in the hope that it will get
> > comments, fixup and possibly even testing by people who actually know
> > the NUMA balancing code. Rik? Anybody?
>
> Hi Linus,
>
> I've tried this patch on the same configuration that was triggering
> the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
> ran fine for ~20 minutes before exploding with:
>
> [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
That's:
BUG_ON(is_huge_zero_page(page));
Can you change your scripts to show the source code line when
the error is a BUG_ON()? The machine code disassembly after the
oops message doesn't really help.
> [ rest of quoted oops and list footer snipped ]
On Wed, Oct 1, 2014 at 3:08 PM, Sasha Levin <[email protected]> wrote:
>
> I've tried this patch on the same configuration that was triggering
> the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
> ran fine for ~20 minutes before exploding with:
Well, that's somewhat encouraging. I didn't expect it to be perfect.
That said, "ran fine" isn't necessarily the same thing as "worked".
Who knows how buggy it was without showing overt symptoms until the
BUG_ON() triggered. But hey, I'll be optimistic.
> [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
So that's
BUG_ON(is_huge_zero_page(page));
and the reason is trivial: the old code used to have a magical special
case for the zero-page hugepage (see change_huge_pmd()) and I got rid
of that (because now it's just about setting protections, and the
zero-page hugepage is in no way special).
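(The special case in question was the prot_numa branch of
change_huge_pmd(), roughly this, quoted from memory:)

	struct page *page = pmd_page(*pmd);

	/* do not trap NUMA hinting faults against the huge zero page */
	if (!is_huge_zero_page(page) && !pmd_numa(*pmd)) {
		pmdp_set_numa(mm, addr, pmd);
		ret = HPAGE_PMD_NR;
	}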
So I think the solution is equally trivial: just accept that the
zero-page can happen, and ignore it (just un-numa it).
Appended is an incremental diff on top of the previous one. Even less
tested than the last case, but I think you get the idea if it doesn't
work as-is.
Linus
On 10/01/2014 06:28 PM, Chuck Ebbert wrote:
> On Wed, 01 Oct 2014 18:08:30 -0400
> Sasha Levin <[email protected]> wrote:
>
>> > On 10/01/2014 04:20 PM, Linus Torvalds wrote:
>>> > > So I'm really sending this patch out in the hope that it will get
>>> > > comments, fixup and possibly even testing by people who actually know
>>> > > the NUMA balancing code. Rik? Anybody?
>> >
>> > Hi Linus,
>> >
>> > I've tried this patch on the same configuration that was triggering
>> > the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
>> > ran fine for ~20 minutes before exploding with:
>> >
>> > [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
> That's:
> BUG_ON(is_huge_zero_page(page));
>
> Can you change your scripts to show the source code line when
> the error is a BUG_ON()? The machine code disassembly after the
> oops message doesn't really help.
>
Hum? The source code line is the first line in the trace:
[ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
Thanks,
Sasha
On Wed, 01 Oct 2014 23:32:15 -0400
Sasha Levin <[email protected]> wrote:
> On 10/01/2014 06:28 PM, Chuck Ebbert wrote:
> > On Wed, 01 Oct 2014 18:08:30 -0400
> > Sasha Levin <[email protected]> wrote:
> >
> >> > On 10/01/2014 04:20 PM, Linus Torvalds wrote:
> >>> > > So I'm really sending this patch out in the hope that it will get
> >>> > > comments, fixup and possibly even testing by people who actually know
> >>> > > the NUMA balancing code. Rik? Anybody?
> >> >
> >> > Hi Linus,
> >> >
> >> > I've tried this patch on the same configuration that was triggering
> >> > the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
> >> > ran fine for ~20 minutes before exploding with:
> >> >
> >> > [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
> > That's:
> > BUG_ON(is_huge_zero_page(page));
> >
> > Can you change your scripts to show the source code line when
> > the error is a BUG_ON()? The machine code disassembly after the
> > oops message doesn't really help.
> >
>
> Hum? The source code line is the first line in the trace:
>
> [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
>
I meant, display the contents of that line so we can see what the
BUG_ON() was triggered by. In some cases you might have a custom patch
applied or be running a version that some people don't have handy.
On Wed, Oct 01, 2014 at 01:29:04PM -0400, Rik van Riel wrote:
> On 10/01/2014 12:18 PM, Linus Torvalds wrote:
>
> > Seriously, why can't we just do this, and throw away all the crap that
> > is "numa special case". This would make all the random games in
> > change_pte_range() just go away entirely, because the whole NUMA thing
> > really wouldn't be a special case for the pte AT ALL any more. All it
> > would be is that a pte could be marked PROT_NONE even if the
> > vma->vm_flags aren't.
>
> That's what I suggested quite a while back, but IIRC either
> Peter or Mel brought up a reason why this was not possible.
Hey, I'm the one that did the numa faulting with PROTNONE to begin with.
Andrea and Mel (and possibly you) all didn't like that and did the per
arch pagetable nonsense :-) I've never understood why.
https://lkml.org/lkml/2012/9/27/76
On Wed, 1 Oct 2014, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 1:19 AM, Hugh Dickins <[email protected]> wrote:
>
> Can we please just get rid of _PAGE_NUMA. There is no excuse for it.
I'm no lover of _PAGE_NUMA, and hope that it can be simplified away
as you outline. What we have in 3.16+3.17 is already an attempt to
improve on what you hated before, but not obviously an improvement.
Mel is the one who knows these issues inside out: maybe he's been
blinkered, but I wouldn't dare to pull it apart without his input.
Myself, I'm not looking beyond fixing whatever is the bug for 3.17.
> > However, that would still not explain Dave's endless refaulting;
>
> Why not? You start out with a PROTNONE, trigger shrink_page_list() on
> a hugepage, which calls add_to_swap(), which does
> split_huge_page_to_list(), which in turn calls __split_huge_page(),
> and that turns (_PAGE_PROTNONE) into (_PAGE_PROTNONE|_PAGE_NUMA),
> which you will then fault on forever, because the kernel thinks the
> page is present, but not a NUMA page.
I hesitate to admit, I still don't see it: please illuminate further.
We're talking about the loop in __split_huge_page_map(), where it does
entry = mk_pte(page + i, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
if (pmd_numa(*pmd))
entry = pte_mknuma(entry);
, right? I only see that adding _PAGE_NUMA to _PAGE_PROTNONE if
pmd_numa(*pmd): but that would mean we had already gone wrong, setting
pmd_numa in a PROT_NONE vma, which task_numa_work takes care not to do;
or have mprotected an area to PROT_NONE without doing the pmd_mknonnuma.
Ah, we won't have mmap_sem in the add_to_swap case; so we could be
racing with an mprotect which already updated vm_flags and vm_page_prot,
but has not yet reached this pmd: is that a part of how you see it?
Or are you noticing a deficiency in the pmd locking? I have not
worked my way through that, so cannot guarantee it, but please
point me to the weakness where you see it.
But when you convince me on that, then I still don't see how we get to
doing the repeated write fault, instead of hitting access_error(), as
you pointed out originally. That still seems to require a PROTNONE
pte to be left behind in a VM_WRITE vma, which I do not see happening
here. pte_modify leaves _PAGE_NUMA alone, but updates _PAGE_PROTNONE.
The pte_numa<->pte_special confusion is messy, but I don't yet get
how it would manifest in the manner observed.
But certainly, a bug in the THP splitting feels just right to
match the frequency of the sightings: I hope you've got it.
Hugh
On Wed, Oct 01, 2014 at 09:18:25AM -0700, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 9:01 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > We need to get rid of it, and just make it the same as pte_protnone().
> > And then the real protnone is in the vma flags, and if you actually
> > ever get to a pte that is marked protnone, you know it's a numa page.
>
> So I'd really suggest we do exactly that. Get rid of "pte_numa()"
> entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
> "pte_protnone()" helper to check for the "protnone" case (which on x86
> is testing the _PAGE_PROTNONE bit, and on most other architectures is
> just testing that the page has no access rights).
>
Do not interpret the following as being against the idea of taking the
pte_protnone approach. This is intended to give background.
At the time the changes were made to the _PAGE_NUMA bits it was acknowledged
that a full move to prot_none was an option but it was not the preferred
solution at the time. It replaced one set of corner cases with another and
the last time, like this time, there was considerable time pressure. The
VMA would be required to distinguish between a NUMA hinting fault and a
real prot_none bit. In most cases, we have the VMA now with the exception
of GUP. GUP would have to unconditionally go into the slow path to do the
VMA lookup. That is not likely to be that big of a problem but it was a concern.
In early implementations based on prot_none there were some VMA-based
protection checks that had higher overhead. At the time, there were severe
problems with overhead due to NUMA balancing and adding more was not
desirable. This has been addressed since with changes in multiple other
areas so it's much less of a concern now than it was. In the current shape,
this is probably not as much of a problem as long as any check on pte_numa
was first guarded by a VMA check. One way of handling the corner cases
would be to pass in the VMA where available and have a VM_BUG_ON that
fires if it's a PROT_NONE VMA. That would catch problems during debugging
without adding overhead in the !debug case.
Going back to the start, the PTE bit was used as the approach due to
concerns that a pte_protnone helper would not work on all architectures,
ppc64 in particular. There was no PROT_NONE bit there and instead prot_none
protections rely on PAGE_USER not being set so it's inaccessible from
userspace. There was discussion at the time that this could conceivably be
broken on some sub-architectures but I don't recall the details. Looking
at the current shape and your patch, it's conceivable that the pte_protnone
could be implemented as a _PAGE_PRESENT && !_PAGE_USER check as long
as it was guarded by a VMA check which x86 requires anyway. Not sure
if that would work for PMDs as I'm not familiar enough with ppc64 to tell
offhand. Alternatively, ppc64 would potentially use the bit currently used
for _PAGE_NUMA as a _PROT_NONE bit.
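In other words, the first option might look like this sketch, built
only from ppc64's existing bits:

#define pte_protnone pte_protnone
static inline int pte_protnone(pte_t pte)
{
	/* present, but not accessible from userspace */
	return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
}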
I finished a small series that cleans up a number of issues discovered
recently due to Sasha's testing. It passed light testing and NUMA balancing
works but I cannot comment if they help Dave or Sasha's bugs as I never
managed to reproduce them. I'll post it shortly after I send this mail.
Again, I'm not opposed to the pte_protnone approach as many of the concerns
I had before have been addressed since or rendered redundant. However,
I won't be able to digest and/or finish your patch in a reasonable time
frame. I'm only partially at work at the moment due to sickness (going back
to bed after this mail) and out for a good chunk of next week and the week
after. Minor comments on the patch though:
1. ppc64 needs work. Added Aneesh to the cc so he's at least aware
2. That check in pte_offset_kernel can probably go away
3. Ideally, VM_BUG_ON checks should be added to the pte_protnone check to
ensure the VMA checks have already completed to avoid confusion between
NUMA hints and real prot_none protections
4. GUP on prot_none now always hits the slow path, but I can't think of a
case where that really matters.
5. SWP_OFFSET_SHIFT should be readjusted back if _PAGE_BIT_NUMA is
removed.
6. It probably should at least be rebased on top of "mm: remove
misleading ARCH_USES_NUMA_PROT_NONE" simply on the grounds that
it cleans up some cruft
7. At the risk of pissing you off, pte_protnone_numa might be clearer
at indicating why protnone is being checked at all
> Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
> because they are brainless sh*t, and we just use
>
> ptent = ptep_modify_prot_start(mm, addr, pte);
> ptent = pte_modify(ptent, newprot);
> ptep_modify_prot_commit(mm, addr, pte, ptent);
>
> reliably instead (where for the mknuma case "newprot" is PROT_NONE,
> and for mknonnuma() it is vma->vm_page_prot). Yes, that means that you
> have to pass in the vma to those functions, but that just makes sense
> anyway.
>
> And if that means that we lose the numa flag on mprotect etc, nobody sane cares.
>
Losing a NUMA hinting fault due to operations like mprotect is not a concern.
--
Mel Gorman
SUSE Labs
On Wed, Oct 01, 2014 at 03:42:53PM -0700, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 3:08 PM, Sasha Levin <[email protected]> wrote:
> >
> > I've tried this patch on the same configuration that was triggering
> > the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
> > ran fine for ~20 minutes before exploding with:
>
> Well, that's somewhat encouraging. I didn't expect it to be perfect.
>
> That said, "ran fine" isn't necessarily the same thing as "worked".
> Who knows how buggy it was without showing overt symptoms until the
> BUG_ON() triggered. But hey, I'll be optimistic.
>
> > [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
>
> So that's
>
> BUG_ON(is_huge_zero_page(page));
>
> and the reason is trivial: the old code used to have a magical special
> case for the zero-page hugepage (see change_huge_pmd()) and I got rid
> of that (because now it's just about setting protections, and the
> zero-page hugepage is in no way special).
>
> So I think the solution is equally trivial: just accept that the
> zero-page can happen, and ignore it (just un-numa it).
>
> Appended is an incremental diff on top of the previous one. Even less
> tested than the last case, but I think you get the idea if it doesn't
> work as-is.
>
> Linus
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 14de54af6c38..fc33952d59c4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1290,7 +1290,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> }
>
> page = pmd_page(pmd);
> - BUG_ON(is_huge_zero_page(page));
> + if (is_huge_zero_page(page))
> + goto huge_zero_page;
> +
> page_nid = page_to_nid(page);
> last_cpupid = page_cpupid_last(page);
> count_vm_numa_event(NUMA_HINT_FAULTS);
> @@ -1381,6 +1383,11 @@ out:
> task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
>
> return 0;
> +huge_zero_page:
> + pmd = pmd_modify(pmd, vma->vm_page_prot);
> + set_pmd_at(mm, haddr, pmdp, pmd);
> + update_mmu_cache_pmd(vma, addr, pmdp);
> + goto out_unlock;
I don't see what prevents the code from making the zero page writable here.
We need at least pmd = pmd_wrprotect(pmd) before set_pmd_at();
--
Kirill A. Shutemov
On 10/02/2014 04:03 AM, Chuck Ebbert wrote:
> On Wed, 01 Oct 2014 23:32:15 -0400
> Sasha Levin <[email protected]> wrote:
>
>> On 10/01/2014 06:28 PM, Chuck Ebbert wrote:
>>> On Wed, 01 Oct 2014 18:08:30 -0400
>>> Sasha Levin <[email protected]> wrote:
>>>
>>>>> On 10/01/2014 04:20 PM, Linus Torvalds wrote:
>>>>>>> So I'm really sending this patch out in the hope that it will get
>>>>>>> comments, fixup and possibly even testing by people who actually know
>>>>>>> the NUMA balancing code. Rik? Anybody?
>>>>>
>>>>> Hi Linus,
>>>>>
>>>>> I've tried this patch on the same configuration that was triggering
>>>>> the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
>>>>> ran fine for ~20 minutes before exploding with:
>>>>>
>>>>> [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
>>> That's:
>>> BUG_ON(is_huge_zero_page(page));
>>>
>>> Can you change your scripts to show the source code line when
>>> the error is a BUG_ON()? The machine code disassembly after the
>>> oops message doesn't really help.
>>>
>>
>> Hum? The source code line is the first line in the trace:
>>
>> [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
>>
>
> I meant, display the contents of that line so we can see what the
> BUG_ON() was triggered by. In some cases you might have a custom patch
> applied or be running a version that some people don't have handy.
Ah, the actual content? That's a good point - let me look into that.
Thanks,
Sasha
On 10/01/2014 06:42 PM, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 3:08 PM, Sasha Levin <[email protected]> wrote:
>> >
>> > I've tried this patch on the same configuration that was triggering
>> > the VM_BUG_ON that Hugh mentioned previously. Surprisingly enough it
>> > ran fine for ~20 minutes before exploding with:
> Well, that's somewhat encouraging. I didn't expect it to be perfect.
>
> That said, "ran fine" isn't necessarily the same thing as "worked".
> Who knows how buggy it was without showing overt symptoms until the
> BUG_ON() triggered. But hey, I'll be optimistic.
>
>> > [ 2781.566206] kernel BUG at mm/huge_memory.c:1293!
> So that's
>
> BUG_ON(is_huge_zero_page(page));
>
> and the reason is trivial: the old code used to have a magical special
> case for the zero-page hugepage (see change_huge_pmd()) and I got rid
> of that (because now it's just about setting protections, and the
> zero-page hugepage is in no way special).
>
> So I think the solution is equally trivial: just accept that the
> zero-page can happen, and ignore it (just un-numa it).
>
> Appended is an incremental diff on top of the previous one. Even less
> tested than the last case, but I think you get the idea if it doesn't
> work as-is.
I have a new one for you. I know it doesn't say "numa" anywhere, but I
haven't ever seen that trace before so I'll just go ahead and blame it
on your patch...
[ 2838.403382] BUG: unable to handle kernel paging request at 000000055d996e80
[ 2838.405740] IP: task_curr (kernel/sched/core.c:1010)
[ 2838.407076] PGD dba2c6067 PUD 0
[ 2838.407926] Thread overran stack, or stack corrupted
[ 2838.409093] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 2838.411454] Dumping ftrace buffer:
[ 2838.412602] (ftrace buffer empty)
[ 2838.413187] Modules linked in:
[ 2838.413187] CPU: 38 PID: 9342 Comm: trinity-c38 Not tainted 3.17.0-rc7-sasha-00041-g6c9c81b #1260
[ 2838.413187] task: ffff880dba2f0000 ti: ffff880dba2ec000 task.ti: ffff880dba2ec000
[ 2838.413187] RIP: task_curr (kernel/sched/core.c:1010)
[ 2838.413187] RSP: 0018:ffff880dba2ebf48 EFLAGS: 00010046
[ 2838.413187] RAX: 000000000000f080 RBX: ffff880dba2f0000 RCX: 000000000000000a
[ 2838.413187] RDX: 00000000ba1a9560 RSI: ffff880dba2f0000 RDI: ffff880dba2f0000
[ 2838.413187] RBP: ffff880dba2ebf98 R08: 000000000004862a R09: 0000000000000000
[ 2838.413187] R10: 0000000000000038 R11: 000000000000001f R12: ffff880dba2f0000
[ 2838.413187] R13: ffff880dd5420740 R14: 000000000000000b R15: ffffffff8cc92000
[ 2838.413187] FS: 00007f05f3dbc700(0000) GS:ffff880701e00000(0000) knlGS:0000000000000000
[ 2838.413187] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2838.413187] CR2: 000000055d996e80 CR3: 0000000dba2c5000 CR4: 00000000000006a0
[ 2838.413187] DR0: 00000000006ee000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2838.413187] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
[ 2838.413187] Stack:
[ 2838.413187] ffffffff8816218b 0000000000000000 ffff880d0000000a 000000000000000b
[ 2838.413187] 0000000000000082 ffff880dba2f0000 000000000000000b ffff880dba2ec070
[ 2838.413187] 0000000000000000 ffffffff8cc92000 ffff880dba2ebff8 ffffffff88162a84
[ 2838.413187] Call Trace:
[ 2838.413187] <UNK>
[ 2838.413187] Code: 87 60 09 00 00 01 e8 8d ee ff ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 57 08 55 48 c7 c0 80 f0 00 00 48 89 e5 5d 8b 52 18 <48> 8b 14 d5 80 c3 c4 8c 48 39 bc 10 68 09 00 00 0f 94 c0 0f b6
All code
========
0: 87 60 09 xchg %esp,0x9(%rax)
3: 00 00 add %al,(%rax)
5: 01 e8 add %ebp,%eax
7: 8d (bad)
8: ee out %al,(%dx)
9: ff (bad)
a: ff 5d c3 lcallq *-0x3d(%rbp)
d: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
14: 00 00 00 00
18: 48 8b 57 08 mov 0x8(%rdi),%rdx
1c: 55 push %rbp
1d: 48 c7 c0 80 f0 00 00 mov $0xf080,%rax
24: 48 89 e5 mov %rsp,%rbp
27: 5d pop %rbp
28: 8b 52 18 mov 0x18(%rdx),%edx
2b:* 48 8b 14 d5 80 c3 c4 mov -0x733b3c80(,%rdx,8),%rdx <-- trapping instruction
32: 8c
33: 48 39 bc 10 68 09 00 cmp %rdi,0x968(%rax,%rdx,1)
3a: 00
3b: 0f 94 c0 sete %al
3e: 0f b6 00 movzbl (%rax),%eax
Code starting with the faulting instruction
===========================================
0: 48 8b 14 d5 80 c3 c4 mov -0x733b3c80(,%rdx,8),%rdx
7: 8c
8: 48 39 bc 10 68 09 00 cmp %rdi,0x968(%rax,%rdx,1)
f: 00
10: 0f 94 c0 sete %al
13: 0f b6 00 movzbl (%rax),%eax
[ 2838.413187] RIP task_curr (kernel/sched/core.c:1010)
[ 2838.413187] RSP <ffff880dba2ebf48>
[ 2838.413187] CR2: 000000055d996e80
Thanks,
Sasha
On Thu, Oct 2, 2014 at 1:47 AM, Hugh Dickins <[email protected]> wrote:
>
> I hesitate to admit, I still don't see it: please illuminate further.
No, you're looking at what I was looking at.
> We're talking about the loop in __split_huge_page_map(), where it does
Yes.
> entry = mk_pte(page + i, vma->vm_page_prot);
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (!pmd_write(*pmd))
> entry = pte_wrprotect(entry);
> if (!pmd_young(*pmd))
> entry = pte_mkold(entry);
> if (pmd_numa(*pmd))
> entry = pte_mknuma(entry);
>
> , right? I only see that adding _PAGE_NUMA to _PAGE_PROTNONE if
> pmd_numa(*pmd): but that would mean we had already gone wrong, setting
> pmd_numa in a PROT_NONE vma, which task_numa_work takes care not to do;
> or have mprotected an area to PROT_NONE without doing the pmd_mknonnuma.
Fair enough. Except this code has no locking that I see, so if we
*ever* see that numa entry in the pmd while walking the page tables in
vmscan, we're basically screwed.
> Or are you noticing a deficiency in the pmd locking? I have not
> worked my way through that, so cannot guarantee it, but please
> point me to the weakness where you see it.
So I don't see any locking at all wrt mprotect (or new mmap). That's
kind of the whole point for page-out - it bypasses all the normal VM
locks, and only uses the last pte locking.
So the whole use of vma->vm_page_prot here is a bit scary. That gets
modified outside of the page table locks. So how do you know it's not
already PROT_NONE, but mprotect just hasn't gotten to actually take
the page table locks yet?
I dunno. It all makes me just very nervous. The whole "numa bit is
separate from the protections, has different locking, and is just
oddly and subtly different" is really what I fundamentally object to.
And it seems so _unnecessary_. All this odd complexity for no actual
gain - just extra code, and extra room for subtle bugs. Which is
exactly why I hate that magic NUMA bit so much.
Linus
On Thu, Oct 2, 2014 at 7:25 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> I don't see what prevents the code from making the zero page writable here.
> We need at least pmd = pmd_wrprotect(pmd) before set_pmd_at();
Do we? If it's the zero page, it had better be an anonymous mapping,
and vm_page_prot had better not be writable.
Anonymous pages don't _start_ out writable, we explicitly make them so
with code like
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
so it should be fine to just use "pmd_modify(pmd, vma->vm_page_prot);" directly.
But hey, this is the kind of thing that maybe I'm missing something on..
Linus
On Thu, Oct 2, 2014 at 8:04 AM, Sasha Levin <[email protected]> wrote:
>
> I have a new one for you. I know it doesn't say "numa" anywhere, but I
> haven't ever seen that trace before so I'll just go ahead and blame it
> on your patch...
Fair enough, but the oops doesn't really give even a hint of what
could be wrong.
The stack is clearly too deep:
Thread overran stack, or stack corrupted
task.ti: ffff880dba2ec000
RSP: ffff880dba2ebf48
but my patch shouldn't have added any deeper call-chains anywhere.
Anyway, unless you can get some other interesting oops with more hints
about where we go wrong, you might want to try Mel's four cleanup
patches instead. I love how you're testing my quick-and-dirty hack,
and I think we'll need to do this eventually, but in the short term
Mel's patches are probably worth applying.
In particular, his 4/4 patch removes the code path that I was
particularly nervous about, so it looks worth a try, because that may
just remove the bug with much smaller effort.
Linus
On Thu, Oct 02, 2014 at 09:01:38AM -0700, Linus Torvalds wrote:
> On Thu, Oct 2, 2014 at 7:25 AM, Kirill A. Shutemov <[email protected]> wrote:
> >
> > I don't see what prevents the code from making the zero page writable here.
> > We need at least pmd = pmd_wrprotect(pmd) before set_pmd_at();
>
> Do we? If it's the zero page, it had better be an anonymous mapping,
> and vm_page_prot had better not be writable.
>
> Anonymous pages don't _start_ out writable, we explicitly make them so
> with code like
>
> if (vma->vm_flags & VM_WRITE)
> entry = pte_mkwrite(pte_mkdirty(entry));
>
> so it should be fine to just use "pmd_modify(pmd, vma->vm_page_prot);" directly.
>
> But hey, this is the kind of thing that maybe I'm missing something on..
You're right.
It means we have redundant pmd_wrprotect() in set_huge_zero_page().
==========================================================================
Subject: [PATCH] thp: do not mark zero-page pmd write-protected explicitly
Zero pages can be used only in anonymous mappings, which never have
writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X
definitions.
Let's drop redundant pmd_wrprotect() in set_huge_zero_page().
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06b862..2c17d184b56d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -784,7 +784,6 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
if (!pmd_none(*pmd))
return false;
entry = mk_pmd(zero_page, vma->vm_page_prot);
- entry = pmd_wrprotect(entry);
entry = pmd_mkhuge(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
--
Kirill A. Shutemov
On 10/02/2014 12:10 PM, Linus Torvalds wrote:
> On Thu, Oct 2, 2014 at 8:04 AM, Sasha Levin <[email protected]> wrote:
>> >
>> > I have a new one for you. I know it doesn't say "numa" anywhere, but I
>> > haven't ever seen that trace before so I'll just go ahead and blame it
>> > on your patch...
> Fair enough, but the oops doesn't really give even a hint of what
> could be wrong.
>
> The stack is clearly too deep:
>
> Thread overran stack, or stack corrupted
> task.ti: ffff880dba2ec000
> RSP: ffff880dba2ebf48
>
> but my patch shouldn't have added any deeper call-chains anywhere.
For the record, I tweaked the environment to put some more pressure on the
scheduler and found out what broke (which is not related to this thread at
all).
For the curious ones: https://lkml.org/lkml/2014/10/3/5
Thanks,
Sasha
On Thu, Oct 2, 2014 at 10:00 PM, Sasha Levin <[email protected]> wrote:
>
> For the record, I tweaked the environment to put some more pressure on the
> scheduler and found out what broke (which is not related to this thread at
> all).
Ok. It's probably still worth testing Mel's patches, since that's what
goes into 3.17. But it would be interesting to eventually go back to
stability-testing the protnone set, since it *looked* like it was
working well aside from this issue. I might decide to just do it
during the 3.18 merge window..
Linus
On Fri, Oct 03, 2014 at 08:43:57AM -0700, Linus Torvalds wrote:
> On Thu, Oct 2, 2014 at 10:00 PM, Sasha Levin <[email protected]> wrote:
> >
> > For the record, I tweaked the environment to put some more pressure on the
> > scheduler and found out what broke (which is not related to this thread at
> > all).
>
> Ok. It's probably still worth testing Mel's patches, since that's what
> goes into 3.17. But it would be interesting to eventually go back to
> stability-testing the protnone set, since it *looked* like it was
> working well aside from this issue. I might decide to just do it
> during the 3.18 merge window..
FYI, I've been trying to reproduce that initial bug all week, and
haven't seen a single recurrence of it, just other bugs.
Kind of a pain in the ass for confirming this numa stuff was indeed
the cause, but that's been true of so many of the more obscure bugs
trinity has shaken out.
Dave
On 10/03/2014 11:58 AM, Dave Jones wrote:
> On Fri, Oct 03, 2014 at 08:43:57AM -0700, Linus Torvalds wrote:
> > On Thu, Oct 2, 2014 at 10:00 PM, Sasha Levin <[email protected]> wrote:
> > >
> > > For the record, I tweaked the environment to put some more pressure on the
> > > scheduler and found out what broke (which is not related to this thread at
> > > all).
> >
> > Ok. It's probably still worth testing Mel's patches, since that's what
> > goes into 3.17. But it would be interesting to eventually go back to
> > stability-testing the protnone set, since it *looked* like it was
> > working well aside from this issue. I might decide to just do it
> > during the 3.18 merge window..
Linus, I've been running with Mel's patches since yesterday, nothing interesting
to report.
> FYI, I've been trying to reproduce that initial bug all week, and
> haven't seen a single recurrence of it, just other bugs.
> Kind of a pain in the ass for confirming this numa stuff was indeed
> the cause, but that's been true of so many of the more obscure bugs
> trinity has shaken out.
+1, it seems to have "disappeared", quite annoying :/
Thanks,
Sasha
Mel Gorman <[email protected]> writes:
> [ background snipped ]
> Going back to the start, the PTE bit was used as the approach due to
> concerns that a pte_protnone helper would not work on all architectures,
> ppc64 in particular. There was no PROT_NONE bit there and instead prot_none
> protections rely on PAGE_USER not being set so it's inaccessible from
> userspace. There was discussion at the time that this could conceivably be
> broken on some sub-architectures but I don't recall the details. Looking
> at the current shape and your patch, it's conceivable that the pte_protnone
> could be implemented as a _PAGE_PRESENT && !_PAGE_USER check as long
> as it was guarded by a VMA check which x86 requires anyway. Not sure
> if that would work for PMDs as I'm not familiar enough with ppc64 to tell
> offhand. Alternatively, ppc64 would potentially use the bit currently used
> for _PAGE_NUMA as a _PROT_NONE bit.
Are we still looking at these options? I could look at implementing the
first option which will also enable us to free up one pte bit.
Note: Freeing up one bit will enable us to implement soft dirty tracking
needed for CRIU.
-aneesh
On Mon, Oct 6, 2014 at 3:18 PM, Aneesh Kumar K.V
<[email protected]> wrote:
>
> Are we still looking at these options? I could look at implementing the
> first option which will also enable us to free up one pte bit.
We definitely are. If you can test my patch (with the small follow-up
fix), and do the necessary changes for ppc64, that would be good.
I looked quickly at the ppc64 side, and it didn't look too painful.
Using pte_protnone() instead of pte_numa() should remove more lines
than it adds there too.
Linus
Linus Torvalds <[email protected]> writes:
> On Mon, Oct 6, 2014 at 3:18 PM, Aneesh Kumar K.V
> <[email protected]> wrote:
>>
>> Are we still looking at these options ? I could look at implementing the
>> first option which will also enable us to free up one pte bit.
>
> We definitely are. If you can test my patch (with the small follow-up
> fix), and do the necessary changes for ppc64, that would be good.
>
> I looked quickly at the ppc64 side, and it didn't look too painful.
> Using pte_protnone() instead of pte_numa() should remove more lines
> than it adds there too.
>
This is a quick hack and gets it running. With perf bench numa mem:
-bash-4.2# grep numa /proc/vmstat
numa_hit 3310633
numa_miss 0
numa_foreign 0
numa_interleave 6451
numa_local 3162369
numa_other 148264
numa_pte_updates 27708982
numa_huge_pte_updates 76987
numa_hint_faults 268439275
numa_hint_faults_local 5359216
numa_pages_migrated 3349573
-bash-4.2#
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index d98c1ecc3266..2a9bbe4d2364 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -41,7 +41,7 @@ static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PA
static inline int pte_present(pte_t pte)
{
- return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
+ return pte_val(pte) & _PAGE_PRESENT;
}
#define pte_present_nonuma pte_present_nonuma
@@ -50,78 +50,20 @@ static inline int pte_present_nonuma(pte_t pte)
return pte_val(pte) & (_PAGE_PRESENT);
}
-#define pte_numa pte_numa
-static inline int pte_numa(pte_t pte)
+#define pte_protnone pte_protnone
+static inline int pte_protnone(pte_t pte)
{
return (pte_val(pte) &
- (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+ (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
}
-#define pte_mknonnuma pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
+#define pmd_protnone pmd_protnone
+static inline int pmd_protnone(pmd_t pmd)
{
- pte_val(pte) &= ~_PAGE_NUMA;
- pte_val(pte) |= _PAGE_PRESENT | _PAGE_ACCESSED;
- return pte;
-}
-
-#define pte_mknuma pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
- /*
- * We should not set _PAGE_NUMA on non present ptes. Also clear the
- * present bit so that hash_page will return 1 and we collect this
- * as numa fault.
- */
- if (pte_present(pte)) {
- pte_val(pte) |= _PAGE_NUMA;
- pte_val(pte) &= ~_PAGE_PRESENT;
- } else
- VM_BUG_ON(1);
- return pte;
+ return pte_protnone(pmd_pte(pmd));
}
-#define ptep_set_numa ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep)
-{
- if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
- VM_BUG_ON(1);
-
- pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
- return;
-}
-
-#define pmd_numa pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
- return pte_numa(pmd_pte(pmd));
-}
-
-#define pmdp_set_numa pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
- pmd_t *pmdp)
-{
- if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
- VM_BUG_ON(1);
-
- pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
- return;
-}
-
-#define pmd_mknonnuma pmd_mknonnuma
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
- return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
-}
-
-#define pmd_mknuma pmd_mknuma
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
- return pte_pmd(pte_mknuma(pmd_pte(pmd)));
-}
-
-# else
+#else
static inline int pte_present(pte_t pte)
{
diff --git a/arch/powerpc/include/asm/pte-hash64.h b/arch/powerpc/include/asm/pte-hash64.h
index 2505d8eab15c..68d0a5d01dc3 100644
--- a/arch/powerpc/include/asm/pte-hash64.h
+++ b/arch/powerpc/include/asm/pte-hash64.h
@@ -30,7 +30,7 @@
/*
* Used for tracking numa faults
*/
-#define _PAGE_NUMA 0x00000010 /* Gather numa placement stats */
+//#define _PAGE_NUMA 0x00000010 /* Gather numa placement stats */
/* No separate kernel read-only */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 084ad54c73cd..27004431b576 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -235,7 +235,11 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing,
&pte_size);
- if (pte_present(pte) && !pte_numa(pte)) {
+ /*
+ * Skip the ptes marked for numa fault tracking in
+ * host page table.
+ */
+ if (pte_present(pte) && !pte_protnone(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 51ab9e7e6c39..b77ecac7e61f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -393,8 +393,6 @@ good_area:
* processors use the same I/D cache coherency mechanism
* as embedded.
*/
- if (error_code & DSISR_PROTFAULT)
- goto bad_area;
#endif /* CONFIG_PPC_STD_MMU */
/*
@@ -418,9 +416,6 @@ good_area:
flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
- /* protection fault */
- if (error_code & 0x08000000)
- goto bad_area;
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
}
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d8746684f606..89f8568bf0b5 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -39,7 +39,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
/*
* Similar to the PMD case, NUMA hinting must take slow path
*/
- if (pte_numa(pte))
+ if (pte_protnone(pte))
return 0;
if ((pte_val(pte) & mask) != result)
@@ -85,7 +85,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
* slowpath for accounting purposes and so that they
* can be serialised against THP migration.
*/
- if (pmd_numa(pmd))
+ if (pmd_protnone(pmd))
return 0;
if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next,