On Thu, 2012-06-28 at 01:02 +0200, Peter Zijlstra wrote:
> On Wed, 2012-06-27 at 15:26 -0700, Linus Torvalds wrote:
> > On Wed, Jun 27, 2012 at 2:15 PM, Peter Zijlstra <[email protected]> wrote:
> > > This originated from s390 which does something similar and would allow
> > > s390 to use the generic TLB flushing code.
> > >
> > > The idea is to flush the mm-wide cache and TLB a priori and not bother
> > > with multiple flushes if the batching isn't large enough.
> > >
> > > This can be safely done since there cannot be any concurrency on this
> > > mm; it's either after the process died (exit) or in the middle of
> > > execve where the thread switched to the new mm.
> >
> > I think we actually *used* to do the final TLB flush from within the
> > context of the process that died. That doesn't seem to ever be the
> > case any more, but it does worry me a bit. Maybe a
> >
> > VM_BUG_ON(current->active_mm == mm);
> >
> > or something for the fullmm case?
>
> OK, added it and am rebooting the test box..
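For reference, a rough sketch of what the combination looks like; the exact
tlb_gather_mmu() signature and placement here are from memory, not the actual
patch:

void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, bool fullmm)
{
	tlb->mm = mm;
	tlb->fullmm = fullmm;

	if (fullmm) {
		/* nothing should still be running on this mm (Linus' check) */
		VM_BUG_ON(current->active_mm == mm);
		/*
		 * Flush the whole mm once up front and skip the per-batch
		 * flushes later on; only valid if nothing can refill the TLB.
		 */
		flush_tlb_mm(mm);
	}

	/* ... remaining mmu_gather setup (batching state etc.) ... */
}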
That triggered.. is this a problem though? At this point userspace is
very dead, so it shouldn't matter, right?
I'll have to think about it properly tomorrow; it's 1am and my brain is
mostly asleep already.
------------[ cut here ]------------
kernel BUG at /home/root/src/linux-2.6/mm/memory.c:221!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU 13
Pid: 132, comm: modprobe Not tainted 3.5.0-rc4-01507-g912ca15-dirty #180 Supermicro X8DTN/X8DTN
RIP: 0010:[<ffffffff811511bf>] [<ffffffff811511bf>] tlb_gather_mmu+0x9f/0xb0
RSP: 0018:ffff880235b2bd78 EFLAGS: 00010246
RAX: ffff880235b18000 RBX: ffff880235b2bdc0 RCX: ffff880235b18000
RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000
RBP: ffff880235b2bd98 R08: 0000000000000018 R09: 0000000000000004
R10: ffffffff81eedfc0 R11: 0000000000000084 R12: ffff8804356b8000
R13: 0000000000000001 R14: ffff880235b185f0 R15: ffff880235b18000
FS: 0000000000000000(0000) GS:ffff880237ce0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000038ce8ae150 CR3: 0000000436ad6000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 132, threadinfo ffff880235b2a000, task ffff880235b18000)
Stack:
ffff880235b2bd98 0000000000000000 ffff8804356b8000 ffff8804356b8060
ffff880235b2be38 ffffffff8115ad38 ffff880235b2be38 ffff880235b4e000
ffff880235b4e630 ffff8804356b8000 0000000100000000 ffff880235b2bdd8
Call Trace:
[<ffffffff8115ad38>] exit_mmap+0x98/0x150
[<ffffffff810bf98e>] ? exit_numa+0xae/0xe0
[<ffffffff81078b74>] mmput+0x84/0x120
[<ffffffff81080ce8>] exit_mm+0x108/0x130
[<ffffffff81081388>] do_exit+0x678/0x950
[<ffffffff811a3ad6>] ? alloc_fd+0xd6/0x120
[<ffffffff811791c0>] ? kmem_cache_free+0x20/0x130
[<ffffffff810819af>] do_group_exit+0x3f/0xa0
[<ffffffff81081a27>] sys_exit_group+0x17/0x20
[<ffffffff81980ed2>] system_call_fastpath+0x16/0x1b
Code: 10 74 1a 65 48 8b 04 25 80 ba 00 00 4c 3b a0 90 02 00 00 74 16 4c 89 e7 e8 5f 39 f2 ff 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 c9 c3 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
RIP [<ffffffff811511bf>] tlb_gather_mmu+0x9f/0xb0
RSP <ffff880235b2bd78>
---[ end trace f99f121b09c974f8 ]---
On Wed, Jun 27, 2012 at 4:13 PM, Peter Zijlstra <[email protected]> wrote:
>
> That triggered.. is this a problem though? At this point userspace is
> very dead, so it shouldn't matter, right?
It still matters. Even if user space is dead, kernel-space accesses can
result in TLB fills for user-space addresses, exactly because of things
like speculative fills etc.
So what can happen - for example - is that the kernel does an indirect
jump, and the CPU predicts the destination of the jump using the
branch prediction tables.
But the branch prediction tables are obviously just predictions, and
they easily contain user addresses etc in them. So the kernel may well
end up speculatively doing a TLB fill on a user access.
And your whole optimization depends on this not happening, unless I
read the logic wrong. The whole "invalidate the TLB just once
up-front" approach is *only* valid if you know that nothing is going
to ever fill that TLB again. But see above - if we're still running
within that TLB context, we have no idea what speculative execution
may or may not end up filling.
That said, maybe I misread your patch?
Linus
On Wed, Jun 27, 2012 at 4:23 PM, Linus Torvalds
<[email protected]> wrote:
>
> But the branch prediction tables are obviously just predictions, and
> they easily contain user addresses etc in them. So the kernel may well
> end up speculatively doing a TLB fill on a user access.
That should be ".. on a user *address*", hopefully that was clear from
the context, if not from the text.
IOW, the point I'm trying to make is that even if there are zero
*actual* accesses of user space (because user space is dead, and the
kernel hopefully does no "get_user()/put_user()" stuff at this point
any more), the CPU may speculatively use user addresses for the
bog-standard kernel accesses that happen.
Taking a user address from the BTB is just one example. Speculative
memory accesses might happen after a mis-predicted branch, where we
test a pointer against NULL, and after the branch we access it. So
doing a speculative TLB walk of the NULL address would not necessarily
even be unusual. Obviously normally nothing is actually mapped there,
but these kinds of things can *easily* result in the page tables
themselves being cached, even if the final page doesn't exist.
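Concretely, the pattern can be as innocent as this (hypothetical snippet;
foo, lookup_foo() and get_foo_count() are made-up names):

static int get_foo_count(unsigned long key)
{
	struct foo *p = lookup_foo(key);	/* may legitimately return NULL */

	/*
	 * If the predictor guesses "taken" while p is NULL, the load of
	 * p->count can be issued speculatively and the CPU may walk the
	 * page tables for address 0, warming the paging-structure caches
	 * even though nothing is (normally) mapped there.
	 */
	if (p)
		return p->count;
	return 0;
}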
Also, all of this obviously depends on how aggressive the speculation
is. It's entirely possible that effects like these are really hard to
see in practice, and you'll almost never hit them. But stale TLB
contents (or stale page directory caches) are *really* nasty when they
do happen, and almost impossible to debug. So we want to be insanely
anal in this area.
Linus
On Wed, 2012-06-27 at 16:33 -0700, Linus Torvalds wrote:
> IOW, the point I'm trying to make is that even if there are zero
> *actual* accesses of user space (because user space is dead, and the
> kernel hopefully does no "get_user()/put_user()" stuff at this point
> any more), the CPU may speculatively use user addresses for the
> bog-standard kernel accesses that happen.
Right.. and s390 having done this only says that s390 appears to be ok
with it. Martin, does s390 hardware guarantee no speculative stuff like
Linus explained, or might there even be a latent issue on s390?
But it looks like we cannot do this in general, and esp. ARM (as already
noted by Catalin) has very aggressive speculative behaviour.
The alternative is that we do a switch_mm() to init_mm instead of the
TLB flush. On x86 that should be about the same cost, but I've not
looked at other architectures yet.
The second and least favourite alternative is of course special-casing
this for s390 if it turns out it's a safe thing to do for them.
/me goes look through arch code.
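For the switch_mm()-to-init_mm alternative above, the fullmm setup would look
roughly like the sketch below; the mm_count juggling (taking a reference on
init_mm, mmdrop() of the old mm) and per-architecture lazy-mm details are
glossed over:

	if (fullmm && current->active_mm == mm) {
		/*
		 * Stop lazily borrowing the dying mm. On x86 the CR3 reload
		 * in switch_mm() implicitly flushes the old (non-global) TLB
		 * entries, and once we run on init_mm the CPU can no longer
		 * speculatively refill entries for the dead address space.
		 */
		switch_mm(mm, &init_mm, current);
		current->active_mm = &init_mm;
	}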
On Thu, 28 Jun 2012 12:55:04 +0200
Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-06-27 at 16:33 -0700, Linus Torvalds wrote:
> > IOW, the point I'm trying to make is that even if there are zero
> > *actual* accesses of user space (because user space is dead, and the
> > kernel hopefully does no "get_user()/put_user()" stuff at this point
> > any more), the CPU may speculatively use user addresses for the
> > bog-standard kernel accesses that happen.
>
> Right.. and s390 having done this only says that s390 appears to be ok
> with it. Martin, does s390 hardware guarantee no speculative stuff like
> Linus explained, or might there even be a latent issue on s390?
The cpu can create speculative TLB entries, but only if it runs in the
mode that uses the respective mm. We have two mm's active at the same
time, the kernel mm (init_mm) and the user mm. While the cpu runs only
in kernel mode it is not allowed to create TLBs for the user mm.
While running in user mode it is allowed to speculatively create TLBs.
> But it looks like we cannot do this in general, and esp. ARM (as already
> noted by Catalin) has very aggressive speculative behaviour.
>
> The alternative is that we do a switch_mm() to init_mm instead of the
> TLB flush. On x86 that should be about the same cost, but I've not
> looked at other architectures yet.
>
> The second and least favourite alternative is of course special-casing
> this for s390 if it turns out it's a safe thing to do for them.
>
> /me goes look through arch code.
Basically we have two special requirements on s390:
1) do not modify ptes while attached to another cpu except with the
special IPTE / IDTE instructions
2) do a TLB flush before freeing any kind of page table page, s390
needs a flush for pud, pmd & pte tables.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Thu, 2012-06-28 at 13:19 +0200, Martin Schwidefsky wrote:
> The cpu can create speculative TLB entries, but only if it runs in the
> mode that uses the respective mm. We have two mm's active at the same
> time, the kernel mm (init_mm) and the user mm. While the cpu runs only
> in kernel mode it is not allowed to create TLBs for the user mm.
> While running in user mode it is allowed to speculatively create TLBs.
OK, that's neat.
> Basically we have two special requirements on s390:
> 1) do not modify ptes while attached to another cpu except with the
> special IPTE / IDTE instructions
Right, and your fullmm case works by doing a global invalidate after all
threads have ceased userspace execution; this allows you to do away with
the IPTE/IDTE instructions since there are no other active CPUs on the
userspace mm anymore.
> 2) do a TLB flush before freeing any kind of page table page, s390
> needs a flush for pud, pmd & pte tables.
Right, we do that (now)..
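For requirement 2, the generic mmu_gather keeps that ordering: it invalidates
the TLB before freeing anything it gathered, page-table pages included. A
condensed sketch of that flow (simplified; RCU table freeing and batch
overflow handling omitted):

static void tlb_flush_mmu(struct mmu_gather *tlb)
{
	struct mmu_gather_batch *batch;

	if (!tlb->need_flush)
		return;
	tlb->need_flush = 0;

	/* 1) invalidate the TLB for this mm first ... */
	tlb_flush(tlb);

	/*
	 * 2) ... only then hand the gathered pages, including page-table
	 *    pages queued via pte_free_tlb() and friends, back to the
	 *    allocator, so no CPU can still hold stale translations
	 *    derived from them.
	 */
	for (batch = &tlb->local; batch; batch = batch->next) {
		free_pages_and_swap_cache(batch->pages, batch->nr);
		batch->nr = 0;
	}
}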