2010-07-14 15:59:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: [patch 1/2] x86_64 page fault NMI-safe

> I think you're vastly overestimating what is sane to do from an NMI
> context. It is utterly and totally insane to assume vmalloc is available
> in NMI.
>
> -hpa
>

Ok, please tell me where I am wrong then.. by looking into
arch/x86/mm/fault.c, I see that vmalloc_sync_all() touches pgd_list
entries while the pgd_lock spinlock is taken, with interrupts disabled.
So it's protected against concurrent pgd_list modification from

a - vmalloc_sync_all() on other CPUs
b - local interrupts

However, a completely normal interrupt can come in on a remote CPU, run
vmalloc_fault() and issue a set_pgd() concurrently. Therefore I conclude
this interrupt disable is not there to ensure any kind of protection
against concurrent updates.
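
For reference, the sync loop in question looks roughly like this (a simplified
sketch; the outer per-address loop and error checks are elided):

	spin_lock_irqsave(&pgd_lock, flags);
	list_for_each_entry(page, &pgd_list, lru) {
		pgd_t *pgd = (pgd_t *)page_address(page) + pgd_index(address);

		if (pgd_none(*pgd))
			set_pgd(pgd, *pgd_ref);	/* copy the kernel entry */
	}
	spin_unlock_irqrestore(&pgd_lock, flags);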

Also, we see that vmalloc_fault has comments such as :

(for x86_32)
* Do _not_ use "current" here. We might be inside
* an interrupt in the middle of a task switch..

So it takes the pgd_addr from cr3, not from current. Using only the
stack/registers makes this NMI-safe even if "current" is invalid when
the NMI comes, which can happen because __switch_to() updates the
registers before updating current_task without disabling interrupts.

You are right in that x86_64 does not seem to play as safely as x86_32
on this matter; it uses current->mm. Probably it shouldn't assume
"current" is valid. Actually, I don't see where x86_64 disables
interrupts around __switch_to, so this would seem to be a race
condition. Or have I missed something ?

(Ingo)
> > the scheduler disables interrupts around __switch_to(). (x86 does
> > not set __ARCH_WANT_INTERRUPTS_ON_CTXSW)
>
(Mathieu)
> Ok, so I guess it's only useful to NMIs then. However, it makes me
> wonder why this comment was there in the first place on x86_32
> vmalloc_fault() and why it uses read_cr3() :
>
> * Do _not_ use "current" here. We might be inside
> * an interrupt in the middle of a task switch..
(Ingo)
hm, i guess it's still useful to keep the
__ARCH_WANT_INTERRUPTS_ON_CTXSW case working too. On -rt we used to
enable it to squeeze a tiny bit more latency out of the system.


Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: [email protected]
CC: [email protected]
CC: "H. Peter Anvin" <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Frank Ch. Eigler" <[email protected]>
---
arch/x86/mm/fault.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/x86/mm/fault.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/mm/fault.c 2010-03-13 16:56:46.000000000 -0500
+++ linux-2.6-lttng/arch/x86/mm/fault.c 2010-03-13 16:57:53.000000000 -0500
@@ -360,6 +360,7 @@ void vmalloc_sync_all(void)
*/
static noinline __kprobes int vmalloc_fault(unsigned long address)
{
+ unsigned long pgd_paddr;
pgd_t *pgd, *pgd_ref;
pud_t *pud, *pud_ref;
pmd_t *pmd, *pmd_ref;
@@ -374,7 +375,8 @@ static noinline __kprobes int vmalloc_fa
* happen within a race in page table update. In the later
* case just flush:
*/
- pgd = pgd_offset(current->active_mm, address);
+ pgd_paddr = read_cr3();
+ pgd = __va(pgd_paddr) + pgd_index(address);
pgd_ref = pgd_offset_k(address);
if (pgd_none(*pgd_ref))
return -1;


2010-07-14 16:29:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
<[email protected]> wrote:
>> I think you're vastly overestimating what is sane to do from an NMI
>> context. It is utterly and totally insane to assume vmalloc is available
>> in NMI.

I agree that NMI handlers shouldn't touch vmalloc space. But now that
percpu data is mapped through the VM, I do agree that other CPU's may
potentially need to touch that data, and an interrupt (including an
NMI) might be the first to create the mapping.

And that's why the faulting code needs to be interrupt-safe for the
vmalloc area.

However, it does look like the current scheduler should make it safe
to access "current->mm->pgd" from regular interrupts, so the problem
is apparently only an NMI issue. So exactly what are the circumstances
that create and expose percpu data on a CPU _without_ mapping it on
that CPU?

IOW, I'm missing some background here. I agree that at least some
basic percpu data should generally be available for an NMI handler,
but at the same time I wonder how come that basic percpu data wasn't
already mapped?

The traditional irq vmalloc race was something like:
- one CPU does a "fork()", which copies the basic kernel mappings
- in another thread a driver does a vmalloc(), which creates a _new_
mapping that didn't get copied.
- later on a switch_to() switches to the newly forked process that
missed the vmalloc initialization
- we take an interrupt for the driver that needed the new vmalloc
space, but now it doesn't have it, and we fill it in at run-time for
the (rare) race.

and I'm simply not seeing how fork() could ever race with percpu data setup.

So please just document the sequence that actually needs the page
table setup for the NMI/percpu case.

This patch (1/2) doesn't look horrible per se. I have no problems with
it. I just want to understand why it is needed.

Linus

2010-07-14 17:06:23

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
> <[email protected]> wrote:

(I was quoting Peter Anvin below) ;)

> >> I think you're vastly overestimating what is sane to do from an NMI
> >> context. It is utterly and totally insane to assume vmalloc is available
> >> in NMI.
>
> I agree that NMI handlers shouldn't touch vmalloc space. But now that
> percpu data is mapped through the VM, I do agree that other CPU's may
> potentially need to touch that data, and an interrupt (including an
> NMI) might be the first to create the mapping.
>
[...]
> So please just document the sequence that actually needs the page
> table setup for the NMI/percpu case.
>
> This patch (1/2) doesn't look horrible per se. I have no problems with
> it. I just want to understand why it is needed.

The problem originally addressed by this patch is the case where an NMI handler
tries to access vmalloc'd per-cpu data, which goes as follows:

- One CPU does a fork(), which copies the basic kernel mappings.
- Perf allocates percpu memory for buffer control data structures.
This mapping does not get copied.
- Tracing is activated.
- switch_to() to the newly forked process which missed the new percpu
allocation.
- We take an NMI, which touches the vmalloc'd percpu memory in the Perf tracing
handler, therefore leading to a page fault in NMI context. Here, we might be
in the middle of switch_to(), where ->current might not be in sync with the
current cr3 register.

The three choices we have to handle this that I am aware of are:
1) supporting page faults in NMI context, which implies removing the ->current
dependency and supporting an iret-less return path.
2) duplicating the percpu alloc API with a variant that maps to kmalloc.
3) using vmalloc_sync_all() after creating the mapping. (only works for x86_64,
not x86_32).

Choice 3 seems like a no-go on x86_32, choice 2 seems like a last resort (it
involves API duplication and reservation of a fixed amount of per-cpu memory at
boot). Hence the proposal of choice 1.
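
For reference, choice 3 boils down to something like this right after the
allocation (just a sketch; the buffer type here is a made-up example):

	struct ring_buffer_ctl __percpu *ctl;	/* hypothetical control structure */

	ctl = alloc_percpu(struct ring_buffer_ctl);	/* may map new vmalloc-space pages */
	if (!ctl)
		return -ENOMEM;
	vmalloc_sync_all();	/* propagate the new kernel mappings to all page tables */
	/* only then let NMI-context code (tracing) touch the buffers */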

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 18:10:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 10:06 AM, Mathieu Desnoyers
<[email protected]> wrote:
>>
>> This patch (1/2) doesn't look horrible per se. I have no problems with
>> it. I just want to understand why it is needed.

[ And patch 2/2 is much more intrusive, and touches a critical path
too.. If it was just the 1/2 series, I don't think I would care. For
the 2/2, I think I'd want to explore all the alternative options ]

> The problem originally addressed by this patch is the case where a NMI handler
> try to access vmalloc'd per-cpu data, which goes as follow:
>
> - One CPU does a fork(), which copies the basic kernel mappings.
> - Perf allocates percpu memory for buffer control data structures.
>  This mapping does not get copied.
> - Tracing is activated.
> - switch_to() to the newly forked process which missed the new percpu
>  allocation.
> - We take a NMI, which touches the vmalloc'd percpu memory in the Perf tracing
>  handler, therefore leading to a page fault in NMI context. Here, we might be
>  in the middle of switch_to(), where ->current might not be in sync with the
>  current cr3 register.

Ok. I was wondering why anybody would allocate core percpu variables
so late that this would ever be an issue, but I guess perf is a
reasonable such case. And reasonable to do from NMI.

That said - grr. I really wish there was some other alternative than
adding yet more complexity to the exception return path. That "iret
re-enables NMI's unconditionally" thing annoys me.

In fact, I wonder if we couldn't just do a software NMI disable
instead? Have a per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like

NMI entry:
- load percpu NMI stack frame pointer
- if non-zero we know we're nested, and should ignore this NMI:
- we're returning to kernel mode, so return immediately by using
"popf/ret", which also keeps NMI's disabled in the hardware until the
"real" NMI iret happens.
- before the popf/iret, use the NMI stack pointer to make the NMI
return stack be invalid and cause a fault
- set the NMI stack pointer to the current stack pointer

NMI exit (not the above "immediate exit because we nested"):
clear the percpu NMI stack pointer
Just do the iret.

Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
we'll take a fault, and that re-does our "delayed" NMI - and NMI's
will stay masked.

And if we didn't have a nested NMI, that iret will now unmask NMI's,
and everything is happy.

Doesn't the above sound like a good solution? In other words, we solve
the whole problem by simply _fixing_ the crazy Intel "iret-vs-NMI"
semantics. And we don't need to change the hotpath, and we'll just
_allow_ nested faults within NMI's.

Am I missing something? Maybe I'm not as clever as I think I am... But
I _feel_ clever.

Linus

2010-07-14 18:47:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe


* Linus Torvalds <[email protected]> wrote:

> Ok. I was wondering why anybody would allocate core percpu variables so late
> that this would ever be an issue, but I guess perf is a reasonable such
> case. And reasonable to do from NMI.

Yeah.

Frederic (re-)discovered this problem via very hard to debug crashes when he
extended perf call-graph tracing to have a bit larger buffer and used
percpu_alloc() for it (which is entirely reasonable in itself).

> That said - grr. I really wish there was some other alternative than adding
> yet more complexity to the exception return path. That "iret re-enables
> NMI's unconditionally" thing annoys me.

Ok. We can solve it by allocating the space from the non-vmalloc percpu area -
8K per CPU.

> In fact, I wonder if we couldn't just do a software NMI disable
> instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
> allocated statically) that points to the NMI stack frame, and just
> make the NMI code itself do something like
>
> NMI entry:

I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
stack already, due to entering via the IST stack mechanism, which is
non-nesting and which enters at the same point - right?

We could solve that by copying that small stack frame off before entering the
'generic' NMI routine - but it all feels a bit pulled in by the hair.

I feel uneasy about taking pagefaults from the NMI handler. Even if we
implemented it all correctly, who knows what CPU errata are waiting there to
be discovered, etc ...

I think we should try to muddle through by preventing these situations from
happening (and adding a WARN_ONCE() to the vmalloc page-fault handler would
certainly help as well), and only go to more clever schemes if no other option
looks sane anymore?

Thanks,

Ingo

2010-07-14 19:15:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <[email protected]> wrote:
>>  NMI entry:
>
> I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
> stack already, due to entering via the IST stack mechanism, which is
> non-nesting and which enters at the same point - right?

Yeah, you're right, but we could easily fix that up. We know we don't
need any stack for the nested case, so all we would need to do is to
just subtract a small bit off %rsp, and copy the three words or so to
create a "new" stack for the non-nested case.

> We could solve that by copying that small stack frame off before entering the
> 'generic' NMI routine - but it all feels a bit pulled in by the hair.

Why? It's much cleaner than making the _real_ codepaths much worse.

Linus

2010-07-14 19:37:01

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 12:14:01PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <[email protected]> wrote:
> >>  NMI entry:
> >
> > I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
> > stack already, due to entering via the IST stack mechanism, which is
> > non-nesting and which enters at the same point - right?
>
> Yeah, you're right, but we could easily fix that up. We know we don't
> need any stack for the nested case, so all we would need to do is to
> just subtract a small bit off %rsp, and copy the three words or so to
> create a "new" stack for the non-nested case.
>
> > We could solve that by copying that small stack frame off before entering the
> > 'generic' NMI routine - but it all feels a bit pulled in by the hair.
>
> Why? It's much cleaner than making the _real_ codepaths much worse.



There is also the fact that we need to handle the lost NMI, by deferring its
treatment or so. That adds even more complexity.

2010-07-14 19:44:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 12:14 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <[email protected]> wrote:
>> We could solve that by copying that small stack frame off before entering the
>> 'generic' NMI routine - but it all feels a bit pulled in by the hair.
>
> Why? It's much cleaner than making the _real_ codepaths much worse.

.. but if the option is to never take a fault at all from the NMI
handler, and that is doable, then that would be good, of course.

But that may not be fixable. At least not without much more pain than
just adding a fairly simple hack to the NMI path itself, and keeping
all the NMI pain away from all the other cases.

And doing the per-cpu NMI nesting hack would actually also work as a
way to essentially block NMI's from critical regions. With my NMI
nesting avoidance suggestion, you could literally do something like
this to block NMI's:

/* This is just a fake stack structure */
struct nmi_blocker {
	unsigned long rflags;
	unsigned long cs;
	unsigned long rip;
};

void block_nmi_on_this_cpu(struct nmi_blocker *blocker)
{
	get_cpu();
	memset(blocker, 0, sizeof(*blocker));
	per_cpu_nmi_stack_frame = blocker;
}

void unblock_nmi_on_this_cpu(struct nmi_blocker *blocker)
{
	per_cpu_nmi_stack_frame = NULL;
	barrier();
	/*
	 * Did an NMI happen? If so, we're now running NMI-blocked by hardware,
	 * we need to emulate the NMI and do a real 'iret' here
	 */
	if (blocker->cs == INVALID_CS)
		asm volatile(".. build stack frame, call NMI routine ..");
	put_cpu();
}

or similar. Wouldn't that be nice to have as a capability?
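
A hypothetical caller (using the same made-up helpers as above) would then
just do:

	struct nmi_blocker blocker;

	block_nmi_on_this_cpu(&blocker);
	/* ... touch state that this CPU's NMI handler must not observe ... */
	unblock_nmi_on_this_cpu(&blocker);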

Linus

2010-07-14 19:55:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
<[email protected]> wrote:
>
> There is also the fact we need to handle the lost NMI, by defering its
> treatment or so. That adds even more complexity.

I don't think you read my proposal very deeply. It already handles
them by taking a fault on the iret of the first one (that's why we
point to the stack frame - so that we can corrupt it and force a
fault).

Linus

2010-07-14 19:56:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

> or similar. Wouldn't that be nice to have as a capability?

It means the NMI watchdog would get useless if these areas
become common.

Again I suspect all of this is not really needed anyways if
vmalloc_sync_all() works properly. That would solve the original
problem Mathieu was trying to solve for per_cpu data. The rule
would be just to call vmalloc_sync_all() properly when changing
per CPU data too.

In fact I'm pretty sure it worked originally. Perhaps it regressed?

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-14 20:05:58

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Andi Kleen ([email protected]) wrote:
> > or similar. Wouldn't that be nice to have as a capability?
>
> It means the NMI watchdog would get useless if these areas
> become common.
>
> Again I suspect all of this is not really needed anyways if
> vmalloc_sync_all() works properly. That would solve the original
> problem Mathieu was trying to solve for per_cpu data. The rule
> would be just to call vmalloc_sync_all() properly when changing
> per CPU data too.

Yep, that would solve the page fault in nmi problem altogether without adding
complexity.

>
> In fact I'm pretty sure it worked originally. Perhaps it regressed?

I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
between percpu data allocation and tracing activation ? The generic ring buffer
library I posted last week does it already as a precaution for this very
specific reason (making sure NMIs never trigger page faults).

Thanks,

Mathieu

>
> -Andi
> --
> [email protected] -- Speaking for myself only.

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 20:07:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

> > In fact I'm pretty sure it worked originally. Perhaps it regressed?
>
> I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
> between percpu data allocation and tracing activation ? The generic ring buffer
> library I posted last week does it already as a precaution for this very
> specific reason (making sure NMIs never trigger page faults).

I suspect the low level per cpu allocation functions should
just call it.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-14 20:10:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/14/2010 01:07 PM, Andi Kleen wrote:
>>> In fact I'm pretty sure it worked originally. Perhaps it regressed?
>>
>> I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
>> between percpu data allocation and tracing activation ? The generic ring buffer
>> library I posted last week does it already as a precaution for this very
>> specific reason (making sure NMIs never trigger page faults).
>
> I suspect the low level per cpu allocation functions should
> just call it.
>

Yes, specifically the point at which we allocate new per cpu memory blocks.

-hpa

2010-07-14 20:17:57

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> <[email protected]> wrote:
> >
> > There is also the fact we need to handle the lost NMI, by defering its
> > treatment or so. That adds even more complexity.
>
> I don't think your read my proposal very deeply. It already handles
> them by taking a fault on the iret of the first one (that's why we
> point to the stack frame - so that we can corrupt it and force a
> fault).

It only handles the case of a single NMI coming in. What happens in this
scenario?

- NMI (1) comes in.
- takes a fault
  - iret
- NMI (2) comes in.
  - nesting detected, popf/ret
- takes another fault
- NMI (3) comes in.
  - nesting detected, popf/ret
- iret faults
  - executes only one extra NMI handler

We miss NMI (3) here. I think this is an important change from a semantic where,
AFAIK, the hardware should be allowed to assume that the CPU will execute as
many NMI handlers as there are NMIs acknowledged.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 20:39:45

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote
[...]
> In fact, I wonder if we couldn't just do a software NMI disable
> instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
> allocated statically) that points to the NMI stack frame, and just
> make the NMI code itself do something like
>
> NMI entry:

Let's try to figure out how far we can go with this idea. First, to answer
Ingo's criticism, let's assume we do a stack frame copy before entering the
"generic" nmi handler routine.

> - load percpu NMI stack frame pointer
> - if non-zero we know we're nested, and should ignore this NMI:
> - we're returning to kernel mode, so return immediately by using
> "popf/ret", which also keeps NMI's disabled in the hardware until the
> "real" NMI iret happens.

Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we
know how many NMIs should be replayed at iret ?

> - before the popf/iret, use the NMI stack pointer to make the NMI
> return stack be invalid and cause a fault

I assume you mean "popf/ret" here. So assuming we use a frame copy, we should
change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0
NMI iret will trigger the fault.

> - set the NMI stack pointer to the current stack pointer

That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi
stack copy.

>
> NMI exit (not the above "immediate exit because we nested"):
> clear the percpu NMI stack pointer

This would be rather:
- Copy the nesting 0 stack copy back onto the real nmi stack.
- clear the percpu nmi stack pointer

** !

> Just do the iret.

Which presumably faults if we changed the return stack for an invalid one and
executes as many NMIs as there are "missed nmis" counted (missed nmis should
probably be read with an xchg() instruction).

So, one question persists, regarding the "** !" comment: what do we do if an NMI
comes in exactly at that point ? I'm afraid it will overwrite the "real" nmi
stack, and therefore drop all the "pending" nmis by setting the nmi stack return
address to a valid one.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 20:56:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 1:17 PM, Mathieu Desnoyers
<[email protected]> wrote:
>
> It only handles the case of a single NMI coming in. What happens in this
> scenario?

[ two nested NMI's ]

The _right_ thing happens.

What do you think the hardware would have done itself? The NMI was
blocked. It wouldn't get replayed twice. If you have two NMI's
happening while another NMI is active, you will get a single NMI after
the first NMI has completed.

So stop these _idiotic_ complaints. And face the music:

- NMI's aren't that important. They are a _hell_ of a lot less
important than the standard page fault path, for example.

- We do _not_ want to add more NMI magic outside of the NMI
codepaths. It's much better to handle NMI special cases in the NMI
code, rather than sprinkle them in random other codepaths (like percpu
allocators) that have NOTHING WHAT-SO-EVER to do with NMI's!

Linus

>
> - NMI (1) comes in.
> - takes a fault
>   - iret
> - NMI (2) comes in.
>   - nesting detected, popf/ret
> - takes another fault
> - NMI (3) comes in.
>   - nesting detected, popf/ret
> - iret faults
>   - executes only one extra NMI handler
>
> We miss NMI (3) here. I think this is an important change from a semantic where,
> AFAIK, the hardware should be allowed to assume that the CPU will execute as
> many nmi handlers are there are NMIs acknowledged.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
>

2010-07-14 21:18:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe


* Linus Torvalds <[email protected]> wrote:

> On Wed, Jul 14, 2010 at 1:17 PM, Mathieu Desnoyers
> <[email protected]> wrote:
> >
> > It only handles the case of a single NMI coming in. What happens in this
> > scenario?
>
> [ two nested NMI's ]
>
> The _right_ thing happens.
>
> What do you think the hardware would have done itself? The NMI was blocked.
> It wouldn't get replayed twice. If you have two NMI's happening while
> another NMI is active, you will get a single NMI after the first NMI has
> completed.

If it ever became an issue, we could even do what softirqs do and re-execute
the NMI handler. At least for things like PMU NMIs we have to handle them once
they have been (re-)issued, or we'd get a stuck PMU.

But in any case it should be a non-issue.

Ingo

2010-07-14 21:23:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 1:39 PM, Mathieu Desnoyers
<[email protected]> wrote:
>
>>  - load percpu NMI stack frame pointer
>>  - if non-zero we know we're nested, and should ignore this NMI:
>>     - we're returning to kernel mode, so return immediately by using
>> "popf/ret", which also keeps NMI's disabled in the hardware until the
>> "real" NMI iret happens.
>
> Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we
> know how many NMIs should be replayed at iret ?

No. As mentioned, there is no such counter in real hardware either.

Look at what happens for the not-nested case:

- NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
NMI's disabled

- NMI2 triggers. Nothing happens, the NMI's are disabled.

- NMI3 triggers. Again, nothing happens, the NMI's are still disabled

- the NMI handler returns.

- What happens now?

How many NMI interrupts do you get? ONE. Exactly like my "emulate it
in software" approach. The hardware doesn't have any counters for
pending NMI's either. Why should the software emulation have them?

>>     - before the popf/iret, use the NMI stack pointer to make the NMI
>> return stack be invalid and cause a fault
>
> I assume you mean "popf/ret" here.

Yes, that was a typo. The whole point of using popf was obviously to
_avoid_ the iret ;)

> So assuming we use a frame copy, we should
> change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0
> NMI iret will trigger the fault
>
>>   - set the NMI stack pointer to the current stack pointer
>
> That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi
> stack copy.

I think you're confused. Or I am by your question.

The NMI code would literally just do:

- check if the NMI was nested, by looking at whether the percpu
nmi-stack-pointer is non-NULL

- if it was nested, do nothing, and return with a popf/ret. The only
stack this sequence might need is to save/restore the register that
we use for the percpu value (although maybe we can just do a "cmpl
$0,%_percpu_seg:nmi_stack_ptr" and not even need that), and it's
atomic because at this point we know that NMI's are disabled (we've
not _yet_ taken any nested faults)

- if it's a regular (non-nesting) NMI, we'd basically do

6* pushq 48(%rsp)

to copy the five words that the NMI pushed (ss/esp/eflags/cs/eip)
and the one we saved ourselves (if we needed any, maybe we can make do
with just 5 words).

- then we just save that new stack pointer to the percpu thing with a simple

movq %rsp,%__percpu_seg:nmi_stack_ptr

and we're all done. The final "iret" will do the right thing (either
fault or return), and there are no races that I can see exactly
because we use a single nmi-atomic instruction (the "iret" itself) to
either re-enable NMI's _or_ test whether we should re-do an NMI.

There is a single-instruction window that is interesting in the return
path, which is the window between the two final instructions:

movl $0,%__percpu_seg:nmi_stack_ptr
iret

where I wonder what happens if we have re-enabled NMI (due to a fault
in the NMI handler), but we haven't actually taken the NMI itself yet,
so now we _will_ re-use the stack. Hmm. I suspect we need another of
those horrible "if the NMI happens at this particular %rip" cases that
we already have for the sysenter code on x86-32 for the NMI/DEBUG trap
case of fixing up the stack pointer.

And maybe I missed something else. But it does look reasonably simple.
Subtle, but not a lot of code. And the code is all very much about the
NMI itself, not about other random sequences. No?

Linus

2010-07-14 21:45:11

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, 14 Jul 2010, Linus Torvalds wrote:

> No. As mentioned, there is no such counter in real hardware either.

There is a 1-bit counter or actually a latch.

> Look at what happens for the not-nested case:
>
> - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
> NMI's disabled

Correct.

> - NMI2 triggers. Nothing happens, the NMI's are disabled.

The NMI latch records the second NMI. Note this is edge-sensitive like
the NMI line itself.

> - NMI3 triggers. Again, nothing happens, the NMI's are still disabled

Correct.

> - the NMI handler returns.
>
> - What happens now?

NMI2 latched above causes the NMI handler to be invoked as the next
instruction after IRET. The latch is cleared as the interrupt is taken.

> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
> in software" approach. The hardware doesn't have any counters for
> pending NMI's either. Why should the software emulation have them?

Two. :)

Maciej

2010-07-14 21:53:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 2:45 PM, Maciej W. Rozycki <[email protected]> wrote:
> On Wed, 14 Jul 2010, Linus Torvalds wrote:
>
>> No. As mentioned, there is no such counter in real hardware either.
>
>  There is a 1-bit counter or actually a latch.

Well, that's what our single-word flag is too.

>> Look at what happens for the not-nested case:
>>
>>  - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
>> NMI's disabled
>
>  Correct.
>
>>  - NMI2 triggers. Nothing happens, the NMI's are disabled.
>
>  The NMI latch records the second NMI.  Note this is edge-sensitive like
> the NMI line itself.
>
>>  - NMI3 triggers. Again, nothing happens, the NMI's are still disabled
>
>  Correct.
>
>>  - the NMI handler returns.
>>
>>  - What happens now?
>
>  NMI2 latched above causes the NMI handler to be invoked as the next
> instruction after IRET.  The latch is cleared as the interrupt is taken.
>
>> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
>> in software" approach. The hardware doesn't have any counters for
>> pending NMI's either. Why should the software emulation have them?
>
>  Two. :)

You just count differently. I don't count the first one (the "real"
NMI). That obviously happens. So I only count how many interrupts we
need to fake. That's my "one". That's the one that happens as a result
of the fault that we take on the iret in the emulated model.

So there is no need to count anything. We take a fault on the iret if
we got a nested NMI (regardless of how _many_ such nested NMI's we
took). That's the "latch", exactly like in the hardware. No counter.

(Yeah, yeah, you can call it a "one-bit counter", but I don't think
that's a counter. It's just a bit of information).

Linus

2010-07-14 22:14:25

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> <[email protected]> wrote:
> >
> > There is also the fact we need to handle the lost NMI, by defering its
> > treatment or so. That adds even more complexity.
>
> I don't think your read my proposal very deeply. It already handles
> them by taking a fault on the iret of the first one (that's why we
> point to the stack frame - so that we can corrupt it and force a
> fault).


Ah right, I missed this part.

2010-07-14 22:21:20

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Wed, Jul 14, 2010 at 1:39 PM, Mathieu Desnoyers
> <[email protected]> wrote:
> >
> >>  - load percpu NMI stack frame pointer
> >>  - if non-zero we know we're nested, and should ignore this NMI:
> >>     - we're returning to kernel mode, so return immediately by using
> >> "popf/ret", which also keeps NMI's disabled in the hardware until the
> >> "real" NMI iret happens.
> >
> > Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we
> > know how many NMIs should be replayed at iret ?
>
> No. As mentioned, there is no such counter in real hardware either.
>
> Look at what happens for the not-nested case:
>
> - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
> NMI's disabled
>
> - NMI2 triggers. Nothing happens, the NMI's are disabled.
>
> - NMI3 triggers. Again, nothing happens, the NMI's are still disabled
>
> - the NMI handler returns.
>
> - What happens now?
>
> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
> in software" approach. The hardware doesn't have any counters for
> pending NMI's either. Why should the software emulation have them?

So I figure that given Maciej's response, we can get at most 2 nested nmis, no
more, no less. So I was probably going too far with the counter, but we need 2.
However, failure to deliver the second NMI in this case would not match the
hardware expectations (see below).

>
> >>     - before the popf/iret, use the NMI stack pointer to make the NMI
> >> return stack be invalid and cause a fault
> >
> > I assume you mean "popf/ret" here.
>
> Yes, that was as typo. The whole point of using popf was obviously to
> _avoid_ the iret ;)
>
> > So assuming we use a frame copy, we should
> > change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0
> > NMI iret will trigger the fault
> >
> >>   - set the NMI stack pointer to the current stack pointer
> >
> > That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi
> > stack copy.
>
> I think you're confused. Or I am by your question.
>
> The NMI code would literally just do:
>
> - check if the NMI was nested, by looking at whether the percpu
> nmi-stack-pointer is non-NULL
>
> - if it was nested, do nothing, an return with a popf/ret. The only
> stack this sequence might needs is to save/restore the register that
> we use for the percpu value (although maybe we can just co a "cmpl
> $0,%_percpu_seg:nmi_stack_ptr" and not even need that), and it's
> atomic because at this point we know that NMI's are disabled (we've
> not _yet_ taken any nested faults)
>
> - if it's a regular (non-nesting) NMI, we'd basically do
>
> 6* pushq 48(%rsp)
>
> to copy the five words that the NMI pushed (ss/esp/eflags/cs/eip)
> and the one we saved ourselves (if we needed any, maybe we can make do
> with just 5 words).

Ah, right, you only need to do the copy and use the copy for the nesting level 0
NMI handler. The nested NMI can work on the "real" nmi stack because we never
expect it to fault.

>
> - then we just save that new stack pointer to the percpu thing with a simple
>
> movq %rsp,%__percpu_seg:nmi_stack_ptr
>
> and we're all done. The final "iret" will do the right thing (either
> fault or return), and there are no races that I can see exactly
> because we use a single nmi-atomic instruction (the "iret" itself) to
> either re-enable NMI's _or_ test whether we should re-do an NMI.
>
> There is a single-instruction window that is interestign in the return
> path, which is the window between the two final instructions:
>
> movl $0,%__percpu_seg:nmi_stack_ptr
> iret
>
> where I wonder what happens if we have re-enabled NMI (due to a fault
> in the NMI handler), but we haven't actually taken the NMI itself yet,
> so now we _will_ re-use the stack. Hmm. I suspect we need another of
> those horrible "if the NMI happens at this particular %rip" cases that
> we already have for the sysenter code on x86-32 for the NMI/DEBUG trap
> case of fixing up the stack pointer.

Yes, this was this exact instruction window I was worried about. I see another
possible failure mode:

- NMI
- page fault
- iret
- NMI
- set nmi_stack_ptr to 0, popf/lret.
- page fault (yep, another one!)
- iret
- movl $0,%__percpu_seg:nmi_stack_ptr
- iret

So in this case, movl/iret are executed with NMIs enabled. So if an NMI comes in
after the movl instruction, it will not detect that it is nested and will re-use
the percpu "nmi_stack_ptr" stack, squashing the "faulty" stack ptr with a brand
new one which won't trigger a fault. I'm afraid that in this case, the last NMI
handler will iret to the "nesting 0" handler at the iret instruction, which will
in turn return to itself, breaking all hell loose in the meantime (endless iret
loop).

So this also calls for special-casing an NMI nested on top of the following iret

- movl $0,%__percpu_seg:nmi_stack_ptr
- iret <-----

instruction. At the beginning of the NMI handler, we could detect if we are
either nested over an NMI (checking nmi_stack_ptr != NULL) or if we are at this
specific %rip, and assume we are nested in both cases.

>
> And maybe I missed something else. But it does look reasonably simple.
> Subtle, but not a lot of code. And the code is all very much about the
> NMI itself, not about other random sequences. No?

If we can find a clean way to handle this NMI vs iret problem outside of the
entry_*.S code, within NMI-specific code, I'm indeed all for it. entry_*.S is
already complicated enough as it is. I think checking the %rip at NMI entry
could work out.

Thanks!

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 22:31:11

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Frederic Weisbecker ([email protected]) wrote:
> On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > <[email protected]> wrote:
> > >
> > > There is also the fact we need to handle the lost NMI, by defering its
> > > treatment or so. That adds even more complexity.
> >
> > I don't think your read my proposal very deeply. It already handles
> > them by taking a fault on the iret of the first one (that's why we
> > point to the stack frame - so that we can corrupt it and force a
> > fault).
>
>
> Ah right, I missed this part.

Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
data structures exactly ? :)

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 22:31:17

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, 14 Jul 2010, Linus Torvalds wrote:

> You just count differently. I don't count the first one (the "real"
> NMI). That obviously happens. So I only count how many interrupts we
> need to fake. That's my "one". That's the one that happens as a result
> of the fault that we take on the iret in the emulated model.

Ah, I see -- so we are on the same page after all.

> (Yeah, yeah, you can call it a "one-bit counter", but I don't think
> that's a counter. It's just a bit of information).

Hardware has something like a strapped-high D flip-flop (NMI goes to the
clock input) with an extra reset input I presume -- this dates back to
8086 when the transistor count mattered with accuracy higher than 1e6. ;)

Maciej

2010-07-14 22:37:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 3:21 PM, Mathieu Desnoyers
<[email protected]> wrote:
>
> If we can find a clean way to handle this NMI vs iret problem outside of the
> entry_*.S code, within NMI-specific code, I'm indeed all for it. entry_*.s is
> already complicated enough as it is. I think checking the %rip at NMI entry
> could work out.

I think the %rip check should be pretty simple - exactly because there
is only a single point where the race is open between that 'mov' and
the 'iret'. So it's simpler than the (similar) thing we do for
debug/nmi stack fixup for sysenter that has to check a range.

The only worry is if that crazy paravirt code wants to paravirtualize
the iretq. Afaik, paravirt does that exactly because they screw up
iret handling themselves. Maybe we could stop doing that stupid iretq
paravirtualization, and just tell the paravirt people to do the same
thing I propose, and just allow nesting.

Linus

2010-07-14 22:42:28

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 04:05:52PM -0400, Mathieu Desnoyers wrote:
> * Andi Kleen ([email protected]) wrote:
> > > or similar. Wouldn't that be nice to have as a capability?
> >
> > It means the NMI watchdog would get useless if these areas
> > become common.
> >
> > Again I suspect all of this is not really needed anyways if
> > vmalloc_sync_all() works properly. That would solve the original
> > problem Mathieu was trying to solve for per_cpu data. The rule
> > would be just to call vmalloc_sync_all() properly when changing
> > per CPU data too.
>
> Yep, that would solve the page fault in nmi problem altogether without adding
> complexity.
>
> >
> > In fact I'm pretty sure it worked originally. Perhaps it regressed?
>
> I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
> between percpu data allocation and tracing activation ? The generic ring buffer
> library I posted last week does it already as a precaution for this very
> specific reason (making sure NMIs never trigger page faults).


Ok, I should try.

Until now I didn't, because I clearly misunderstand the vmalloc internals. I'm
not even quite sure why memory allocated with vmalloc can sometimes be left
unmapped (and then fault once for this to sync). Some people have tried to
explain it to me, but the picture is still vague to me.

And moreover, I'm not quite sure whether vmalloc_sync_all() syncs the pgd
for every task or so... Tejun seemed to say it's not necessarily the case on
x86-32... Again I think I haven't totally understood the details.

Thanks.

2010-07-14 22:48:59

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> * Frederic Weisbecker ([email protected]) wrote:
> > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > <[email protected]> wrote:
> > > >
> > > > There is also the fact we need to handle the lost NMI, by defering its
> > > > treatment or so. That adds even more complexity.
> > >
> > > I don't think your read my proposal very deeply. It already handles
> > > them by taking a fault on the iret of the first one (that's why we
> > > point to the stack frame - so that we can corrupt it and force a
> > > fault).
> >
> >
> > Ah right, I missed this part.
>
> Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> data structures exactly ? :)
>
> Mathieu



So, when an event triggers in perf, we sometimes want to capture the stacktrace
that led to the event.

We want this stacktrace (here we call that a callchain) to be recorded
locklessly. So we want this callchain buffer per cpu, with the following
type:

#define PERF_MAX_STACK_DEPTH 255

struct perf_callchain_entry {
	__u64 nr;
	__u64 ip[PERF_MAX_STACK_DEPTH];
};


That makes 2048 bytes. But one buffer per cpu is not enough for the callchain
to be recorded locklessly; we also need one buffer per context: task, softirq,
hardirq, nmi, as an event can trigger in any of these.
Since we disable preemption, none of these contexts can nest locally. In
fact hardirqs can nest but we just don't care about this corner case.

So, it makes 2048 * 4 = 8192 bytes. And that per cpu.
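
Roughly, the allocation then looks like this (sketch only; the names here are
illustrative, not the actual perf code):

enum { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, NR_CTX };

struct callchain_cpu_entries {
	struct perf_callchain_entry entries[NR_CTX];	/* 4 * 2048 = 8192 bytes */
};

static struct callchain_cpu_entries __percpu *callchain_buffers;

static int alloc_callchain_buffers(void)
{
	/* dynamic percpu memory lives in vmalloc space, hence this thread */
	callchain_buffers = alloc_percpu(struct callchain_cpu_entries);
	return callchain_buffers ? 0 : -ENOMEM;
}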

2010-07-14 22:51:39

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/14/2010 03:37 PM, Linus Torvalds wrote:
> I think the %rip check should be pretty simple - exactly because there
> is only a single point where the race is open between that 'mov' and
> the 'iret'. So it's simpler than the (similar) thing we do for
> debug/nmi stack fixup for sysenter that has to check a range.
>
> The only worry is if that crazy paravirt code wants to paravirtualize
> the iretq. Afaik, paravirt does that exactly because they screw up
> iret handling themselves. Maybe we could stop doing that stupid iretq
> paravirtualization, and just tell the paravirt people to do the same
> thing I propose, and just allow nesting.
>

We screw around with iret because there's a separate interrupt mask flag
which can't be set atomically with respect to a stack/ring change (well,
there's more to it, but I won't confuse matters).

But it only really matters if the PV guest can also get NMI-like
interrupts. While Xen supports NMI for PV guests, we don't use it much
and I haven't looked into implementing support for it yet.

J

2010-07-14 22:57:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 3:31 PM, Frederic Weisbecker <[email protected]> wrote:
>
> Until now I didn't because I clearly misunderstand the vmalloc internals. I'm
> not even quite sure why a memory allocated with vmalloc sometimes can be not
> mapped (and then fault once for this to sync). Some people have tried to explain
> me but the picture is still vague to me.

So the issue is that the system can have thousands and thousands of
page tables (one for each process), and what do you do when you add a
new kernel virtual mapping?

You can:

- make sure that you only ever use _one_ single top-level entry for
all vmalloc issues, and can make sure that all processes are created
with that static entry filled in. This is optimal, but it just doesn't
work on all architectures (eg on 32-bit x86, it would limit the
vmalloc space to 4MB in non-PAE, whatever)

- at vmalloc time, when adding a new page directory entry, walk all
the tens of thousands of existing page tables under a lock that
guarantees that we don't add any new ones (ie it will lock out fork())
and add the required pgd entry to them.

- or just take the fault and do the "fill the page tables" on demand.

Quite frankly, most of the time it's probably better to make that last
choice (unless your hardware makes it easy to make the first choice,
which is obviously simplest for everybody). It makes it _much_ cheaper
to do vmalloc. It also avoids that nasty latency issue. And it's just
simpler too, and has no interesting locking issues with how/when you
expose the page tables in fork() etc.

So the only downside is that you do end up taking a fault in the
(rare) case where you have a newly created task that didn't get an
even newer vmalloc entry. And that fault can sometimes be in an
interrupt or an NMI. Normally it's trivial to handle that fairly
simple nested fault. But NMI has that inconvenient "iret unblocks
NMI's, because there is no dedicated 'nmiret' instruction" problem on
x86.
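
Concretely, the demand fill on x86-64 is just copying the missing kernel pgd
entry from the reference (init_mm) page tables, along the lines of the
vmalloc_fault() change at the top of this thread (simplified):

	pgd_t *pgd = (pgd_t *)__va(read_cr3()) + pgd_index(address);
	pgd_t *pgd_ref = pgd_offset_k(address);	/* reference kernel mapping */

	if (pgd_none(*pgd_ref))
		return -1;		/* not a valid vmalloc address */
	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref);	/* fill in the entry on demand */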

Linus

2010-07-14 23:02:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 3:51 PM, Jeremy Fitzhardinge <[email protected]> wrote:
>
> We screw around with iret because there's a separate interrupt mask flag
> which can't be set atomically with respect to a stack/ring change (well,
> there's more to it, but I won't confuse matters).

Umm, I know. It's what this whole discussion (non-paravirtualized) is
all about. And I have a suggestion that should fix the
non-paravirtualized case _without_ actually touching anything but the
NMI code itself.

What I tried to say is that the paravirtualized people should take a
look at my suggestion, and see if they can apply the logic to their
NMI handling too. And in the process totally remove the need for
paravirtualizing iret, exactly because the approach handles the magic
NMI lock logic entirely in the NMI handler itself.

Because I think the same thing that would make us not need to worry
about nested page faults _during_ NMI (because we could make the NMI
code do the right thing even when the hardware doesn't lock out NMI's
for us) is also the exact same logic that the paravirt monitor could
do for its own NMI handling.

Wouldn't it be nice to be able to remove the need to paravirtualize iret?

Linus

2010-07-14 23:09:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

> - at vmalloc time, when adding a new page directory entry, walk all
> the tens of thousands of existing page tables under a lock that
> guarantees that we don't add any new ones (ie it will lock out fork())
> and add the required pgd entry to them.
>
> - or just take the fault and do the "fill the page tables" on demand.
>
> Quite frankly, most of the time it's probably better to make that last
> choice (unless your hardware makes it easy to make the first choice,

Adding new PGDs should happen only very rarely (in fact
at most once per boot on i386-PAE36 with only 4 entries, most used by user
space); most of the time when you do a vmalloc it changes only lower level
tables.

The PGD for the kernel mappings is already set up. On x86-64 it can happen
more often in theory, but in practice it should also be extremely rare because
a PGD is so large.

That's why I'm not sure this problem even happened. It should
be extremely rare that you add exactly that PGD during the per cpu
allocation.

It can happen in theory, but for such a rare case taking a lock
and walking everything should be fine.

-Andi

2010-07-14 23:11:21

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Frederic Weisbecker ([email protected]) wrote:
> On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> > * Frederic Weisbecker ([email protected]) wrote:
> > > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > > <[email protected]> wrote:
> > > > >
> > > > > There is also the fact we need to handle the lost NMI, by defering its
> > > > > treatment or so. That adds even more complexity.
> > > >
> > > > I don't think your read my proposal very deeply. It already handles
> > > > them by taking a fault on the iret of the first one (that's why we
> > > > point to the stack frame - so that we can corrupt it and force a
> > > > fault).
> > >
> > >
> > > Ah right, I missed this part.
> >
> > Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> > data structures exactly ? :)
> >
> > Mathieu
>
>
>
> So, when an event triggers in perf, we sometimes want to capture the stacktrace
> that led to the event.
>
> We want this stacktrace (here we call that a callchain) to be recorded
> locklessly. So we want this callchain buffer per cpu, with the following
> type:

Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
How about using a single one that is generic enough to handle perf and ftrace
needs instead ?

(/me runs away quickly before the lightning strikes) ;)

Mathieu


>
> #define PERF_MAX_STACK_DEPTH 255
>
> struct perf_callchain_entry {
> __u64 nr;
> __u64 ip[PERF_MAX_STACK_DEPTH];
> };
>
>
> That makes 2048 bytes. But per cpu is not enough for the callchain to be recorded
> locklessly, we also need one buffer per context: task, softirq, hardirq, nmi, as
> an event can trigger in any of these.
> Since we disable preemption, none of these contexts can nest locally. In
> fact hardirqs can nest but we just don't care about this corner case.
>
> So, it makes 2048 * 4 = 8192 bytes. And that per cpu.
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 23:28:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 4:09 PM, Andi Kleen <[email protected]> wrote:
>
> It can happen in theory, but for such a rare case take a lock
> and walking everything should be fine.

Actually, that's _exactly_ the wrong kind of thinking.

Bad latency is bad latency, even when it happens rarely. So latency
problems kill - even when they are rare. So you want to avoid them.
And walking every possible page table is a _huge_ latency problem when
it happens.

In contrast, what's the advantage of doing things synchronously while
holding a lock? It's that you can avoid a few page faults, and get
better CPU use. But that's _stupid_ if it's something that is very
rare to begin with.

So the very rarity argues for the lazy approach. If it wasn't rare,
there would be a much stronger argument for trying to do things
up-front.

Linus

2010-07-14 23:32:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

Hello,

On 07/14/2010 10:08 PM, H. Peter Anvin wrote:
>> I suspect the low level per cpu allocation functions should
>> just call it.
>>
>
> Yes, specifically the point at which we allocate new per cpu memory
> blocks.

We can definitely do that, but walking the whole page table list is too
heavy to do automatically at that level, especially when all users
other than NMI would be fine w/ the default lazy approach. If Linus'
approach doesn't pan out, I think the right thing to do would be
adding a wrapper to vmalloc_sync_all() and letting perf code call it after
percpu allocation.

Thanks.

--
tejun

2010-07-14 23:38:49

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 07:11:17PM -0400, Mathieu Desnoyers wrote:
> * Frederic Weisbecker ([email protected]) wrote:
> > On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> > > * Frederic Weisbecker ([email protected]) wrote:
> > > > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > There is also the fact we need to handle the lost NMI, by defering its
> > > > > > treatment or so. That adds even more complexity.
> > > > >
> > > > > I don't think your read my proposal very deeply. It already handles
> > > > > them by taking a fault on the iret of the first one (that's why we
> > > > > point to the stack frame - so that we can corrupt it and force a
> > > > > fault).
> > > >
> > > >
> > > > Ah right, I missed this part.
> > >
> > > Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> > > data structures exactly ? :)
> > >
> > > Mathieu
> >
> >
> >
> > So, when an event triggers in perf, we sometimes want to capture the stacktrace
> > that led to the event.
> >
> > We want this stacktrace (here we call that a callchain) to be recorded
> > locklessly. So we want this callchain buffer per cpu, with the following
> > type:
>
> Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
> How about using a single one that is generic enough to handle perf and ftrace
> needs instead ?
>
> (/me runs away quickly before the lightning strikes) ;)
>
> Mathieu


:-)

That's no ring buffer. It's a temporary linear buffer to fill a stacktrace,
and get its effective size before committing it to the real ring buffer.

Sure that involves two copies.

But I don't have a better solution in mind than using a pre-buffer for that,
since we can't know the size of the stacktrace in advance. We could
always reserve the max stacktrace size, but that would be wasteful.
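
Roughly, the flow being described is this (a simplified sketch, not the actual
perf code; the structure layout and the helper names are illustrative only):

#include <linux/percpu.h>
#include <linux/ptrace.h>

struct callchain_buf {
	u64 nr;			/* effective depth, known only after unwinding */
	u64 ip[255];		/* worst-case room for the return addresses    */
};

/* one temporary buffer per cpu (per execution context in practice) */
static DEFINE_PER_CPU(struct callchain_buf, callchain_tmp);

static void record_callchain(struct pt_regs *regs)
{
	struct callchain_buf *buf = &__get_cpu_var(callchain_tmp);

	buf->nr = 0;
	fill_callchain(buf, regs);	/* hypothetical: unwind into the temp buffer */
	/* now that the effective size is known, copy it to the real ring buffer */
	output_copy(buf->ip, buf->nr * sizeof(u64));	/* hypothetical helper */
}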

2010-07-14 23:40:53

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

[ /me removes the duplicate email of himself! ]

On Wed, 2010-07-14 at 19:11 -0400, Mathieu Desnoyers wrote:
> >
> > So, when an event triggers in perf, we sometimes want to capture the stacktrace
> > that led to the event.
> >
> > We want this stacktrace (here we call that a callchain) to be recorded
> > locklessly. So we want this callchain buffer per cpu, with the following
> > type:
>
> Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
> How about using a single one that is generic enough to handle perf and ftrace
> needs instead ?
>
> (/me runs away quickly before the lightning strikes) ;)
>

To be fair, that's just a temp buffer.

-- Steve

(/me sends this to try to remove the dup email he's getting )

2010-07-14 23:54:11

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/14/2010 04:02 PM, Linus Torvalds wrote:
> Umm, I know. It's what this whole discussion (non-paravirtualized) is
> all about. And I have a suggestion that should fix the
> non-paravirtualized case _without_ actually touching anything but the
> NMI code itself.
>
> What I tried to say is that the paravirtualized people should take a
> look at my suggestion, and see if they can apply the logic to their
> NMI handling too.

My point is that it's moot (for now) because there is no NMI handling.

> And in the process totally remove the need for
> paravirtualizing iret, exactly because the approach handles the magic
> NMI lock logic entirely in the NMI handler itself.
>

Nothing in this thread is ringing any alarm bells from that perspective,
so I don't much mind either way. When I get around to dealing with
paravirtualized NMI, I'll look at the state of things and see how to go
from there. (Xen's iret hypercall takes a flag to say whether to unmask
NMIs, which will probably come in handy.)

I don't think any of the other pure PV implementations have NMI either,
so I don't think it affects them.

> Wouldn't it be nice to be able to remove the need to paravirtualize iret?
>

Of course. But we also need to do an iret in a hypercall to handle ring
switching in some cases, so we still need it aside from the interrupt issue.

J

2010-07-15 01:24:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
<[email protected]> wrote:
>
> I think the %rip check should be pretty simple - exactly because there
> is only a single point where the race is open between that 'mov' and
> the 'iret'. So it's simpler than the (similar) thing we do for
> debug/nmi stack fixup for sysenter that has to check a range.

So this is what I think it might look like, with the %rip in place.
And I changed the "nmi_stack_ptr" thing to have both the pointer and a
flag - because it turns out that in the single-instruction race case,
we actually want the old pointer.

Totally untested, of course. But _something_ like this might work:

#
# Two per-cpu variables: a "are we nested" flag (one byte), and
# a "if we're nested, what is the %rsp for the nested case".
#
# The reason for why we can't just clear the saved-rsp field and
# use that as the flag is that we actually want to know the saved
# rsp for the special case of having a nested NMI happen on the
# final iret of the unnested case.
#
nmi:
cmpb $0,%__percpu_seg:nmi_stack_nesting
jne nmi_nested_corrupt_and_return
cmpq $nmi_iret_address,0(%rsp)
je nmi_might_be_nested
# create new stack
is_unnested_nmi:
# Save some space for nested NMI's. The exception itself
# will never use more space, but it might use less (since
# it will be a kernel-kernel transition). But the nested
# exception will want two save registers and a place to
# save the original CS that it will corrupt
subq $64,%rsp

# copy the five words of stack info. 96 = 64 + stack
# offset of ss.
pushq 96(%rsp) # ss
pushq 96(%rsp) # rsp
pushq 96(%rsp) # eflags
pushq 96(%rsp) # cs
pushq 96(%rsp) # rip

# and set the nesting flags
movq %rsp,%__percpu_seg:nmi_stack_ptr
movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
...
# regular NMI code goes here, and can take faults,
# because this sequence now has proper nested-nmi
# handling
...
nmi_exit:
movb $0,%__percpu_seg:nmi_stack_nesting
nmi_iret_address:
iret

# The saved rip points to the final NMI iret, after we've cleared
# nmi_stack_ptr. Check the CS segment to make sure.
nmi_might_be_nested:
cmpw $__KERNEL_CS,8(%rsp)
jne is_unnested_nmi

# This is the case when we hit just as we're supposed to do the final
# iret of a previous nmi. We run the NMI using the old return address
# that is still on the stack, rather than copy the new one that is bogus
# and points to where the nested NMI interrupted the original NMI
# handler!
# Easy: just reset the stack pointer to the saved one (this is why
# we use a separate "valid" flag, so that we can still use the saved
# stack pointer)
movq %__percpu_seg:nmi_stack_ptr,%rsp
jmp regular_nmi_code

# This is the actual nested case. Make sure we fault on iret by setting
# CS to zero and saving the old CS. %rax contains the stack pointer to
# the original code.
nmi_nested_corrupt_and_return:
pushq %rax
pushq %rdx
movq %__percpu_seg:nmi_stack_ptr,%rax
movq 8(%rax),%rdx # CS of original NMI
testq %rdx,%rdx # CS already zero?
je nmi_nested_and_already_corrupted
movq %rdx,40(%rax) # save old CS away
movq $0,8(%rax)
nmi_nested_and_already_corrupted:
popq %rdx
popq %rax
popfq
jmp *(%rsp)

Hmm?

Linus

2010-07-15 01:51:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 6:23 PM, Linus Torvalds
<[email protected]> wrote:
>
> So this is what I think it might look like, with the %rip in place.
> [ ...]
> Hmm?

I didn't fill in the iret fault details, because I thought that would
be trivial. We get an exception, it's a slow case, how hard can it be
to just call the NMI code?

But thinking some more about it, it doesn't feel as trivial any more.
We want to set up that same nesting code for the faked NMI call, but
now I made it be two separate variables, and they need to be set in an
NMI-safe way without us actually having access to the whole NMI
blocking that the CPU does for a real NMI.

So there's a few subtleties there too. Probably need to make the two
percpu values adjacent, and use cmpxchg16b in the "emulate NMI on
exception" code to set them both atomically. Or something. So I think
it's doable, but it's admittedly more complicated than I thought it
would be.
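
Concretely, the layout would be something like this (names invented, just to
show why adjacency and 16-byte alignment matter for cmpxchg16b):

struct nmi_nest_state {
	u64 nesting;		/* the "are we nested" flag, widened for alignment */
	u64 saved_rsp;		/* stack pointer saved for the nested case         */
} __attribute__((aligned(16)));

/* both values live side by side in one 16-byte-aligned per-cpu slot, so a
 * single cmpxchg16b can update them together atomically */
static DEFINE_PER_CPU(struct nmi_nest_state, nmi_nest_state);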

.. and obviously there's nothing that guarantees that the code I
already posted is correct either. The whole concept might be total
crap.

Linus

2010-07-15 14:11:35

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 3:31 PM, Frederic Weisbecker <[email protected]> wrote:
> >
> > Until now I didn't because I clearly misunderstand the vmalloc internals. I'm
> > not even quite sure why a memory allocated with vmalloc sometimes can be not
> > mapped (and then fault once for this to sync). Some people have tried to explain
> > me but the picture is still vague to me.
>
> So the issue is that the system can have thousands and thousands of
> page tables (one for each process), and what do you do when you add a
> new kernel virtual mapping?
>
> You can:
>
> - make sure that you only ever use _one_ single top-level entry for
> all vmalloc issues, and can make sure that all processes are created
> with that static entry filled in. This is optimal, but it just doesn't
> work on all architectures (eg on 32-bit x86, it would limit the
> vmalloc space to 4MB in non-PAE, whatever)


But then, even if you ensure that, don't we need to also fill the lower level
entries for the new mapping?

Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
risk adding a new memory mapping for new memory allocated with kmalloc?



> - at vmalloc time, when adding a new page directory entry, walk all
> the tens of thousands of existing page tables under a lock that
> guarantees that we don't add any new ones (ie it will lock out fork())
> and add the required pgd entry to them.
>
> - or just take the fault and do the "fill the page tables" on demand.
>
> Quite frankly, most of the time it's probably better to make that last
> choice (unless your hardware makes it easy to make the first choice,
> which is obviously simplest for everybody). It makes it _much_ cheaper
> to do vmalloc. It also avoids that nasty latency issue. And it's just
> simpler too, and has no interesting locking issues with how/when you
> expose the page tables in fork() etc.
>
> So the only downside is that you do end up taking a fault in the
> (rare) case where you have a newly created task that didn't get an
> even newer vmalloc entry.


But then how did the previous tasks get this new mapping? You said
we don't walk through every process page tables for vmalloc.

I would understand this race if we were to walk on every processes page
tables and add the new mapping on them, but we missed one new task that
forked or so, because we didn't lock (or just rcu).



> And that fault can sometimes be in an
> interrupt or an NMI. Normally it's trivial to handle that fairly
> simple nested fault. But NMI has that inconvenient "iret unblocks
> NMI's, because there is no dedicated 'nmiret' instruction" problem on
> x86.


Yeah.


So the parts of the problem I don't understand are:

- why don't we have this problem with kmalloc() ?
- did I understand well the race that makes the fault necessary,
ie: we walk the tasklist lockless, add the new mapping if
not present, but we might miss a task lately forked, but
the fault will fix that.

Thanks.

2010-07-15 14:35:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process page tables for vmalloc.

No, because those are always shared for the kernel and have already been
filled in for init_mm.

Also, most updates only touch the lower tables anyway; top-level
updates are extremely rare. In fact, on PAE36 they should only happen
at most once, if at all, and most likely at early boot anyway,
where you only have a single task.

On x86-64 they will only happen once every 512GB of vmalloc.
So for most systems also at most once at early boot.
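(With 4-level paging, a single top-level entry covers 2^39 bytes = 512 GB,
which is where that figure comes from.)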
>
> I would understand this race if we were to walk on every processes page
> tables and add the new mapping on them, but we missed one new task that
> forked or so, because we didn't lock (or just rcu).

The new task will always get a copy of the reference init_mm, which
was already updated.
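
A rough sketch of what that copy amounts to (simplified; the real code lives
in the arch pgd setup path, so treat the details here as illustrative):

/*
 * A newly created mm inherits the kernel half of its pgd from the reference
 * init_mm, so every kernel mapping that already exists there -- including
 * older vmalloc areas -- comes along for free.
 */
static void copy_kernel_mappings(pgd_t *new_pgd)
{
	memcpy(new_pgd + KERNEL_PGD_BOUNDARY,
	       init_mm.pgd + KERNEL_PGD_BOUNDARY,
	       KERNEL_PGD_PTRS * sizeof(pgd_t));
}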

-Andi

2010-07-15 14:46:29

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:

> > - make sure that you only ever use _one_ single top-level entry for
> > all vmalloc issues, and can make sure that all processes are created
> > with that static entry filled in. This is optimal, but it just doesn't
> > work on all architectures (eg on 32-bit x86, it would limit the
> > vmalloc space to 4MB in non-PAE, whatever)
>
>
> But then, even if you ensure that, don't we need to also fill lower level
> entries for the new mapping.

If I understand your question, you do not need to worry about the lower
level entries because all the processes will share the same top level.

process 1's PGD ------,
                      |
                      +------> PMD --> ...
                      |
process 2's PGD ------'

Thus we have one page table entry shared by all processes. The issue happens
when the vm space crosses the PMD boundary and we need to update all the
PGDs of all processes to point to the new PMD we need to add to handle
the spread of the vm space.

>
> Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> risk to add a new memory mapping for new memory allocated with kmalloc?

Because all of memory (well, 800-some megs on 32-bit) is mapped into
the kernel address space of all processes. That is, kmalloc only uses this
memory (as does get_free_page()). All processes have a PMD (or PUD, whatever)
that maps this memory. The issue only arises when we use new virtual memory,
which vmalloc does. Vmalloc may map to physical memory that is already
mapped for all processes, but the address that vmalloc uses to access
that memory is not yet mapped.

The usual reason the kernel uses vmalloc is to get a contiguous range of
memory. The vmalloc can map several pages as one contiguous piece of
memory that in reality is several different pages scattered around
physical memory. kmalloc can only map pages that are contiguous in
physical memory. That is, if kmalloc gets 8192 bytes on an arch with
4096-byte pages, it will allocate two consecutive pages in physical
memory. If two contiguous pages are not available, even if thousands of
single pages are, the kmalloc will fail, whereas the vmalloc will not.
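
A trivial illustration of that difference (a throwaway example, with error
handling kept to a minimum):

#include <linux/slab.h>
#include <linux/vmalloc.h>

static void contiguity_demo(void)
{
	void *a = kmalloc(8192, GFP_KERNEL);	/* needs 2 physically adjacent pages   */
	void *b = vmalloc(8192);		/* any 2 pages, made virtually adjacent */
						/* through new vmalloc page tables      */

	kfree(a);	/* kfree()/vfree() accept NULL, so no checks needed here */
	vfree(b);
}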

A vmalloc allocation can use two different pages and just set up the
page tables to make them contiguous in the view of the kernel. Note, this
comes at a cost. One is that when we do this, we may need to update a
bunch of page tables. The other is that we must waste
TLB entries to point to these separate pages. Kmalloc and
get_free_page() use the big memory mappings. That is, if the TLB allows
us to map large pages, we can do that for kernel memory since we just
want the contiguous memory as it is in physical memory.

Thus the kernel maps the physical memory with the fewest TLB entries as
needed (large pages and large TLB entries). If we can map 64K pages, we
do that. Then kmalloc just allocates within this range, it does not need
to map any pages. They are already mapped.

Does this make a bit more sense?

>
>
>
> > - at vmalloc time, when adding a new page directory entry, walk all
> > the tens of thousands of existing page tables under a lock that
> > guarantees that we don't add any new ones (ie it will lock out fork())
> > and add the required pgd entry to them.
> >
> > - or just take the fault and do the "fill the page tables" on demand.
> >
> > Quite frankly, most of the time it's probably better to make that last
> > choice (unless your hardware makes it easy to make the first choice,
> > which is obviously simplest for everybody). It makes it _much_ cheaper
> > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > simpler too, and has no interesting locking issues with how/when you
> > expose the page tables in fork() etc.
> >
> > So the only downside is that you do end up taking a fault in the
> > (rare) case where you have a newly created task that didn't get an
> > even newer vmalloc entry.
>
>
> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process page tables for vmalloc.

Actually we don't even need to walk the page tables in the first task
(although we might do that). When the kernel accesses that memory we
take the page fault, the page fault will see that this memory is vmalloc
data and fill in the page tables for the task at that time.

>
> I would understand this race if we were to walk on every processes page
> tables and add the new mapping on them, but we missed one new task that
> forked or so, because we didn't lock (or just rcu).
>
>
>
> > And that fault can sometimes be in an
> > interrupt or an NMI. Normally it's trivial to handle that fairly
> > simple nested fault. But NMI has that inconvenient "iret unblocks
> > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > x86.
>
>
> Yeah.
>
>
> So the parts of the problem I don't understand are:
>
> - why don't we have this problem with kmalloc() ?

I hope I explained that above.

> - did I understand well the race that makes the fault necessary,
> ie: we walk the tasklist lockless, add the new mapping if
> not present, but we might miss a task lately forked, but
> the fault will fix that.

I'm lost on this race. If we do a lock and walk all page tables I think
the race goes away. So I don't understand this either?

-- Steve

2010-07-15 14:59:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 7:11 AM, Frederic Weisbecker <[email protected]> wrote:
> On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
>> You can:
>>
>> - make sure that you only ever use _one_ single top-level entry for
>> all vmalloc issues, and can make sure that all processes are created
>> with that static entry filled in. This is optimal, but it just doesn't
>> work on all architectures (eg on 32-bit x86, it would limit the
>> vmalloc space to 4MB in non-PAE, whatever)
>
> But then, even if you ensure that, don't we need to also fill lower level
> entries for the new mapping.

Yes, but now they are all mapped by the one *shared* top-level entry.

Think about it.

[ Time passes ]

End result: if you can map the whole vmalloc area with a single
top-level entry that is shared by all processes, and can then just
fill in the lower levels when doing actual allocations, it means that
all processes will automatically get the entries added, and do not
need any fixups.

In other words, the page tables will be automatically correct and
filled in for everybody - without having to traverse any lists,
without any extra locking, and without any races. So this is efficient
and simple, and never needs any faulting to fill in page tables later
on.

(Side note: "single top-level entry" could equally well be "multiple
preallocated entries covering the whole region": the important part is
not really the "single entry", but the "preallocated and filled into
every page directory from the start" part)

> Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> risk to add a new memory mapping for new memory allocated with kmalloc?

No. The kmalloc space is all in the 1:1 kernel mapping, and is always
mapped. Even with PAGEALLOC_DEBUG, it's always mapped at the top
level, and even if a particular page is unmapped/remapped for
debugging, it is done so in the shared kernel page tables (which ends
up being the above trivial case - there is just a single set of page
directory entries that are shared by everybody).
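
One way to see the difference: for the 1:1 (direct/lowmem) mapping the
translation is pure arithmetic, so there is never a missing page table entry
to fill in lazily (a throwaway example):

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/io.h>

static void direct_map_demo(void)
{
	void *v = kmalloc(64, GFP_KERNEL);
	phys_addr_t p;

	if (!v)
		return;
	/* for the direct mapping this is just an offset from PAGE_OFFSET */
	p = virt_to_phys(v);
	pr_info("virt %p -> phys %llx\n", v, (unsigned long long)p);
	kfree(v);
}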

>> - at vmalloc time, when adding a new page directory entry, walk all
>> the tens of thousands of existing page tables under a lock that
>> guarantees that we don't add any new ones (ie it will lock out fork())
>> and add the required pgd entry to them.
>>
>> - or just take the fault and do the "fill the page tables" on demand.
>>
>> Quite frankly, most of the time it's probably better to make that last
>> choice (unless your hardware makes it easy to make the first choice,
>> which is obviously simplest for everybody). It makes it _much_ cheaper
>> to do vmalloc. It also avoids that nasty latency issue. And it's just
>> simpler too, and has no interesting locking issues with how/when you
>> expose the page tables in fork() etc.
>>
>> So the only downside is that you do end up taking a fault in the
>> (rare) case where you have a newly created task that didn't get an
>> even newer vmalloc entry.
>
> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process page tables for vmalloc.

We always add the mapping to the "init_mm" page tables when it is
created (just a single mm), and when fork creates a new page table, it
will always copy the kernel mapping parts from the old one. So the
_common_ case is that all normal mappings are already set up in page
tables, including newly created page tables.

The uncommon case is when there is a new page table created _and_ a
new vmalloc mapping, and the race that happens between those events.
When that new page table is then later used (and it can be _much_
later, of course: we're now talking process scheduling, so IO delays
etc are relevant), it won't necessarily have the page table entries
for vmalloc stuff that was created since the page tables were created.
So we fill _those_ in dynamically.

But vmalloc mappings should be reasonably rare, and the actual "fill
them in" cases are much rarer still (since we fill them in page
directory entries at a time: so even if you make a lot of vmalloc()
calls, we only _fill_ at most once per page directory entry, which is
usually a pretty big chunk). On 32-bit x86, for example, we'd fill
once every 4MB (or 2MB if PAE), and you cannot have a lot of vmalloc
mappings that large (since the VM space is limited).
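(Those numbers are just the reach of one page directory entry on 32-bit x86:
1024 entries x 4 KB = 4 MB without PAE, and 512 x 4 KB = 2 MB per pmd entry
with PAE.)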

So the cost of filling things in is basically zero, because it happens
so seldom. And by _allowing_ things to be done lazily, we avoid all
the locking costs, and all the costs of traversing every single
possible mm (and there can be many many thousands of those).

> I would understand this race if we were to walk on every processes page
> tables and add the new mapping on them, but we missed one new task that
> forked or so, because we didn't lock (or just rcu).

.. and how do you keep track of which tasks you missed? And no, it's
not just the new tasks - you have old tasks that have their page
tables built up too, but need to be updated. They may never need the
mapping since they may be sleeping and never using the driver or data
structures that created it (in fact, that's a common case), so filling
them would be pointless. But if we don't do the lazy fill, we'd have
to fill them all, because WE DO NOT KNOW.

> So the parts of the problem I don't understand are:
>
> - why don't we have this problem with kmalloc() ?

Hopefully clarified.

> - did I understand well the race that makes the fault necessary,
> ie: we walk the tasklist lockless, add the new mapping if
> not present, but we might miss a task lately forked, but
> the fault will fix that.

But the _fundamental_ issue is that we do not want to walk the
tasklist (or the mm_list) AT ALL. It's a f*cking waste of time. It's a
long list, and nobody cares. In many cases it won't be needed.

The lazy algorithm is _better_. It's way simpler (we take nested
faults all the time in the kernel, and it's a particularly _easy_ page
fault to handle with no IO or no locking needed), and it does less
work. It really boils down to that.

So it's not the lazy page table fill that is the problem. Never has
been. We've been doing the lazy fill for a long time, and it was
simple and useful way back when.

The problem has always been NMI, and nothing else. NMI's are nasty,
and the x86 NMI blocking is insane and crazy.

Which is why I'm so adamant that this should be fixed in the NMI code,
and we should _not_ talk about trying to screw up other, totally
unrelated, code. The lazy fill really was never the problem.

Linus

2010-07-15 15:47:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 7:51 AM, Linus Torvalds
<[email protected]> wrote:
>
> So it's not the lazy page table fill that is the problem. Never has
> been. We've been doing the lazy fill for a long time, and it was
> simple and useful way back when.

Btw, this is true to the degree that I would _much_ rather just get
rid of the crazy "vmalloc_sync_all()" crap entirely, and make it clear
that non-lazy vmalloc page table fill is a bug.

Because quite frankly, it _is_ a bug to depend on non-lazy vmalloc.
The whole function is only implemented on 32-bit x86, so any code that
thinks it needs it is either just wrong, or will only work on 32-bit
x86 anyway (and on other architectures by pure chance, likely because
their VM fill granularity is so big that they never saw the problem).

So getting rid of vmalloc_sync_all() entirely would be a good thing.
Then we wouldn't have that silly and pointless interface, and we
wouldn't have that crazy "this only does something on x86-32,
everywhere else it's a placebo".

Linus

2010-07-15 16:26:36

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Frederic Weisbecker ([email protected]) wrote:
> On Wed, Jul 14, 2010 at 07:11:17PM -0400, Mathieu Desnoyers wrote:
> > * Frederic Weisbecker ([email protected]) wrote:
> > > On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> > > > * Frederic Weisbecker ([email protected]) wrote:
> > > > > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > > > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > There is also the fact we need to handle the lost NMI, by defering its
> > > > > > > treatment or so. That adds even more complexity.
> > > > > >
> > > > > > I don't think your read my proposal very deeply. It already handles
> > > > > > them by taking a fault on the iret of the first one (that's why we
> > > > > > point to the stack frame - so that we can corrupt it and force a
> > > > > > fault).
> > > > >
> > > > >
> > > > > Ah right, I missed this part.
> > > >
> > > > Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> > > > data structures exactly ? :)
> > > >
> > > > Mathieu
> > >
> > >
> > >
> > > So, when an event triggers in perf, we sometimes want to capture the stacktrace
> > > that led to the event.
> > >
> > > We want this stacktrace (here we call that a callchain) to be recorded
> > > locklessly. So we want this callchain buffer per cpu, with the following
> > > type:
> >
> > Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
> > How about using a single one that is generic enough to handle perf and ftrace
> > needs instead ?
> >
> > (/me runs away quickly before the lightning strikes) ;)
> >
> > Mathieu
>
>
> :-)
>
> That's no ring buffer. It's a temporary linear buffer to fill a stacktrace,
> and get its effective size before committing it to the real ring buffer.

I was more thinking along the lines of making sure a ring buffer has the proper
support for your use-case. It shares a lot of requirements with a standard ring
buffer:

- Need to be lock-less
- Need to reserve space, write data in a buffer

By configuring a ring buffer with a 4k sub-buffer size (that's configurable
dynamically), all we need to add is the ability to squash a previously saved
record from the buffer. I am confident we can provide a clean API for this that
would allow discarding a previously committed entry as long as we are still on
the same non-fully-committed sub-buffer. This fits your use-case exactly, so
that's fine.
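
As an API sketch, the flow would be something like this (all names here are
hypothetical, only meant to show the reserve / commit-or-discard idea):

static void save_callchain(struct rb_cpu_buffer *cpu_buf, struct pt_regs *regs)
{
	/* reserve worst-case space in the current 4k sub-buffer */
	void *slot = rb_reserve(cpu_buf, MAX_STACK_BYTES);
	size_t used;

	if (!slot)
		return;
	used = unwind_stack(slot, MAX_STACK_BYTES, regs);	/* fill the callchain     */
	if (used)
		rb_commit(cpu_buf, slot, used);			/* shrink to the real size */
	else
		rb_discard(cpu_buf, slot);			/* squash the reservation  */
}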

You could have one 4k ring buffer per cpu per execution context. I wonder if
each Linux architecture has support for separate thread vs softirq vs irq vs
nmi stacks? Even then, given that you have only one stack shared by all irqs,
you need something that is concurrency-aware at the ring buffer level.

These small stack-like ring buffers could be used to save your temporary stack
copy. When you really need to save it to a larger ring buffer along with a
trace, then you just take a snapshot of the stack ring buffers.

So you get to use one single ring buffer synchronization and memory allocation
mechanism, that everyone has reviewed. The advantage is that we would not be
having this nmi race discussion in the first place: the generic ring buffer uses
"get page" directly rather than relying on vmalloc, because these bugs have
already been identified and dealt with years ago.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-15 16:44:55

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > I think the %rip check should be pretty simple - exactly because there
> > is only a single point where the race is open between that 'mov' and
> > the 'iret'. So it's simpler than the (similar) thing we do for
> > debug/nmi stack fixup for sysenter that has to check a range.
>
> So this is what I think it might look like, with the %rip in place.
> And I changed the "nmi_stack_ptr" thing to have both the pointer and a
> flag - because it turns out that in the single-instruction race case,
> we actually want the old pointer.
>
> Totally untested, of course. But _something_ like this might work:
>
> #
> # Two per-cpu variables: a "are we nested" flag (one byte), and
> # a "if we're nested, what is the %rsp for the nested case".
> #
> # The reason for why we can't just clear the saved-rsp field and
> # use that as the flag is that we actually want to know the saved
> # rsp for the special case of having a nested NMI happen on the
> # final iret of the unnested case.
> #
> nmi:
> cmpb $0,%__percpu_seg:nmi_stack_nesting
> jne nmi_nested_corrupt_and_return
> cmpq $nmi_iret_address,0(%rsp)
> je nmi_might_be_nested
> # create new stack
> is_unnested_nmi:
> # Save some space for nested NMI's. The exception itself
> # will never use more space, but it might use less (since
> # it will be a kernel-kernel transition). But the nested
> # exception will want two save registers and a place to
> # save the original CS that it will corrupt
> subq $64,%rsp
>
> # copy the five words of stack info. 96 = 64 + stack
> # offset of ss.
> pushq 96(%rsp) # ss
> pushq 96(%rsp) # rsp
> pushq 96(%rsp) # eflags
> pushq 96(%rsp) # cs
> pushq 96(%rsp) # rip
>
> # and set the nesting flags
> movq %rsp,%__percpu_seg:nmi_stack_ptr
> movb $0xff,%__percpu_seg:nmi_stack_nesting
>
> regular_nmi_code:
> ...
> # regular NMI code goes here, and can take faults,
> # because this sequence now has proper nested-nmi
> # handling
> ...
> nmi_exit:
> movb $0,%__percpu_seg:nmi_stack_nesting

The first thing that strikes me is that we could be interrupted by a standard
interrupt on top of the iret instruction below. This interrupt handler could in
turn be interrupted by an NMI, so the NMI handler would not know that it is
nested over nmi_iret_address. Maybe we could simply disable interrupts
explicitly at the beginning of the handler, so they get re-enabled by the iret
below upon return from the NMI?

Doing that would ensure that only NMIs can interrupt us.

I'll look a bit more at the code and come back with more comments if things come
up.

Thanks,

Mathieu

> nmi_iret_address:
> iret
>
> # The saved rip points to the final NMI iret, after we've cleared
> # nmi_stack_ptr. Check the CS segment to make sure.
> nmi_might_be_nested:
> cmpw $__KERNEL_CS,8(%rsp)
> jne is_unnested_nmi
>
> # This is the case when we hit just as we're supposed to do the final
> # iret of a previous nmi. We run the NMI using the old return address
> # that is still on the stack, rather than copy the new one that is bogus
> # and points to where the nested NMI interrupted the original NMI
> # handler!
> # Easy: just reset the stack pointer to the saved one (this is why
> # we use a separate "valid" flag, so that we can still use the saved
> # stack pointer)
> movq %__percpu_seg:nmi_stack_ptr,%rsp
> jmp regular_nmi_code
>
> # This is the actual nested case. Make sure we fault on iret by setting
> # CS to zero and saving the old CS. %rax contains the stack pointer to
> # the original code.
> nmi_nested_corrupt_and_return:
> pushq %rax
> pushq %rdx
> movq %__percpu_seg:nmi_stack_ptr,%rax
> movq 8(%rax),%rdx # CS of original NMI
> testq %rdx,%rdx # CS already zero?
> je nmi_nested_and_already_corrupted
> movq %rdx,40(%rax) # save old CS away
> movq $0,8(%rax)
> nmi_nested_and_already_corrupted:
> popq %rdx
> popq %rax
> popfq
> jmp *(%rsp)
>
> Hmm?
>
> Linus

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-15 16:49:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 9:44 AM, Mathieu Desnoyers
<[email protected]> wrote:
>
> The first thing that strikes me is that we could be interrupted by a standard
> interrupt on top of the iret instruction below.

No, that can never happen.

Why? Simple: regular interrupts aren't ever enabled in eflags. So the
only kinds of traps we can get are NMI's (that don't follow the normal
rules), and exceptions.

Of course, if there is some trap that re-enables interrupts even if
the trap happened in an interrupt-disabled region, then that would
change things, but that would be a bad bug regardless (and totally
independently of any NMI issues). So in that sense it's a "could
happen", but it's something that would be a totally separate bug.

Linus

2010-07-15 17:38:41

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Thu, Jul 15, 2010 at 9:44 AM, Mathieu Desnoyers
> <[email protected]> wrote:
> >
> > The first thing that strikes me is that we could be interrupted by a standard
> > interrupt on top of the iret instruction below.
>
> No, that can never happen.
>
> Why? Simple: regular interrupts aren't ever enabled in eflags. So the
> only kinds of traps we can get are NMI's (that don't follow the normal
> rules), and exceptions.

Ah, you're right: since NMIs come in through an interrupt gate, IF is disabled
in EFLAGS all along.

>
> Of course, if there is some trap that re-enables interrupts even if
> the trap happened in an interrupt-disabled region, then that would
> change things, but that would be a bad bug regardless (and totally
> independently of any NMI issues). So in that sense it's a "could
> happen", but it's something that would be a totally separate bug.

Yep, this kind of scenario would therefore be a bug that does not belong to the
specific realm of nmis.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-15 18:31:57

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Wed, Jul 14, 2010 at 6:23 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So this is what I think it might look like, with the %rip in place.
> > [ ...]
> > Hmm?
>
> I didn't fill in the iret fault details, because I thought that would
> be trivial. We get an exception, it's a slow case, how hard can it be
> to just call the NMI code?

I'm wondering if we really have to handle this with a fault. Couldn't we just
point the iret at the following nmi handler instead? (chaining NMIs)

I think we can even find a way to handle the fact that the fake nmi does not run
with nmis disabled. We could keep the nmi nested bit set if we find out that
iret will branch to the fake nmi. We would then have to make sure the fake nmi
entry point is a little further than the standard nmi entry point: somewhere
after the initial nmi nesting check.

>
> But thinking some more about it, it doesn't feel as trivial any more.
> We want to set up that same nesting code for the faked NMI call, but
> now I made it be two separate variables, and they need to be set in an
> NMI-safe way without us actually having access to the whole NMI
> blocking that the CPU does for a real NMI.
>
> So there's a few subtleties there too. Probably need to make the two
> percpu values adjacent, and use cmpxchg16b in the "emulate NMI on
> exception" code to set them both atomically. Or something. So I think
> it's doable, but it's admittedly more complicated than I thought it
> would be.

Hrm, we could probably get away with only keeping the nmi_stack_nested per-cpu
variable. The nmi_stack_ptr could be known statically if we set it at a fixed
offset from the bottom of the stack rather than using an offset relative to the
top (which can change depending on whether we are nested over the kernel or
userspace). We just have to reserve enough space at the bottom of the stack.

>
> .. and obviously there's nothing that guarantees that the code I
> already posted is correct either. The whole concept might be total
> crap.

Call me optimistic if you want, but I think we're really getting somewhere. :)

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-15 18:49:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 11:43 AM, Linus Torvalds
<[email protected]> wrote:
>
> But maybe I'm missing something.

Hmm. Of course - one way of solving this might be to just make the
32-bit case switch stacks in software. That might be a good idea
regardless, and would not be complicated. We already do that for
sysenter, but the NMI case would be simpler because we don't need to
worry about being re-entered by NMI/DEBUG during the stack switch.

And since we have to play some games with moving the data on the stack
around _anyway_, doing the whole "switch stacks entirely rather than
just subtract a bit from the old stack" would be fairly logical.

So I think you may end up being right: we don't need to save the
original NMI stack pointer, because we can make sure that the
replacement stack (that we need anyway) is always deterministic.

Linus

2010-07-15 19:08:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 11:31 AM, Mathieu Desnoyers
<[email protected]> wrote:
>
> Hrm, we could probably get away with only keeping the nmi_stack_nested per-cpu
> variable. The nmi_stack_ptr could be known statically if we set it at a fixed
> offset from the bottom of stack rather than using an offset relative to the top
> (which can change depending if we are nested over the kernel or userspace).
> We just have to reserve enough space for the bottom of stack.

I thought about trying that, but I don't think it's true. At least not
for the 32-bit case.

The thing is, the 32-bit case will re-use the kernel stack if it
happens in kernel space, and will thus start from a random place (and
won't push all the information anyway). So a nested NMI really doesn't
know where the original NMI stack is to be found unless we save it
off.

In the case of x86-64, I think the stack will always be at a fixed
address, and push a fixed amount of data (because we use the IST
thing). So there you could probably just use the flag, but you'd still
have to handle the 32-bit case, and quite frankly, I think it would be
much nicer if the logic could be shared for the 32-bit and 64-bit
cases.

But maybe I'm missing something.

Linus

2010-07-15 20:46:29

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/15/2010 10:38 AM, Mathieu Desnoyers wrote:
>>
>> Of course, if there is some trap that re-enables interrupts even if
>> the trap happened in an interrupt-disabled region, then that would
>> change things, but that would be a bad bug regardless (and totally
>> independently of any NMI issues). So in that sense it's a "could
>> happen", but it's something that would be a totally separate bug.
>
> Yep, this kind of scenario would therefore be a bug that does not belong to the
> specific realm of nmis.
>

Yes, the only specific issue here is NMI -> trap -> IRET -> [nested
NMI], which is what this whole thing is about.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-15 22:01:22

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

Hi Linus,

I modified your code, intending to handle the fake NMI entry gracefully given
that NMIs are not necessarily disabled at the entry point. It uses a "need fake
NMI" flag rather than playing games with CS and faults. When a fake NMI is
needed, it simply jumps back to the beginning of regular nmi code. NMI exit code
and fake NMI entry are made reentrant with respect to NMI handler interruption
by testing, at the very beginning of the NMI handler, if an NMI is nested over
the whole nmi_atomic .. nmi_atomic_end code region. This code assumes NMIs have
a separate stack.

This code is still utterly untested and might eat your Doritos, only provided
for general enjoyment.

Thanks,

Mathieu

#
# Two per-cpu variables: a "are we nested" flag (one byte).
# a "do we need to execute a fake NMI" flag (one byte).
# The %rsp at which the stack copy is saved is at a fixed address, which leaves
# enough room at the bottom of NMI stack for the "real" NMI entry stack. This
# assumes we have a separate NMI stack.
# The NMI stack copy top of stack is at nmi_stack_copy.
# The NMI stack copy "rip" is at nmi_stack_copy_rip, which is set to
# nmi_stack_copy-32.
#
nmi:
# Test if nested over atomic code.
cmpq $nmi_atomic,0(%rsp)
jae nmi_addr_is_ae
# Test if nested over general NMI code.
cmpb $0,%__percpu_seg:nmi_stack_nesting
jne nmi_nested_set_fake_and_return
# create new stack
is_unnested_nmi:
# Save some space for nested NMI's. The exception itself
# will never use more space, but it might use less (since
# it will be a kernel-kernel transition).

# Save %rax on top of the stack (need to temporarily use it)
pushq %rax
movq %rsp, %rax
movq $nmi_stack_copy,%rsp

# copy the five words of stack info. rip starts at 8+0(%rax).
pushq 8+32(%rax) # ss
pushq 8+24(%rax) # rsp
pushq 8+16(%rax) # eflags
pushq 8+8(%rax) # cs
pushq 8+0(%rax) # rip
movq 0(%rax),%rax # restore %rax

set_nmi_nesting:
# and set the nesting flags
movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
...
# regular NMI code goes here, and can take faults,
# because this sequence now has proper nested-nmi
# handling
...

nmi_atomic:
# An NMI nesting over the whole nmi_atomic .. nmi_atomic_end region will
# be handled specially. This includes the fake NMI entry point.
cmpb $0,%__percpu_seg:need_fake_nmi
jne fake_nmi
movb $0,%__percpu_seg:nmi_stack_nesting
iret

# This is the fake NMI entry point.
fake_nmi:
movb $0x0,%__percpu_seg:need_fake_nmi
jmp regular_nmi_code
nmi_atomic_end:

# Make sure the address is in the nmi_atomic range and in CS segment.
nmi_addr_is_ae:
cmpq $nmi_atomic_end,0(%rsp)
jae is_unnested_nmi
# The saved rip points to the final NMI iret. Check the CS segment to
# make sure.
cmpw $__KERNEL_CS,8(%rsp)
jne is_unnested_nmi

# This is the case when we hit just as we're supposed to do the atomic code
# of a previous nmi. We run the NMI using the old return address that is still
# on the stack, rather than copy the new one that is bogus and points to where
# the nested NMI interrupted the original NMI handler!
# Easy: just set the stack pointer to point to the stack copy, clear
# need_fake_nmi (because we are directly going to execute the requested NMI) and
# jump to "nesting flag set" (which is followed by regular nmi code execution).
movq $nmi_stack_copy_rip,%rsp
movb $0x0,%__percpu_seg:need_fake_nmi
jmp set_nmi_nesting

# This is the actual nested case. Make sure we branch to the fake NMI handler
# after this handler is done.
nmi_nested_set_fake_and_return:
movb $0xff,%__percpu_seg:need_fake_nmi
popfq
jmp *(%rsp)


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-15 22:16:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 3:01 PM, Mathieu Desnoyers
<[email protected]> wrote:
>
> . NMI exit code
> and fake NMI entry are made reentrant with respect to NMI handler interruption
> by testing, at the very beginning of the NMI handler, if a NMI is nested over
> the whole nmi_atomic .. nmi_atomic_end code region.

That is totally bogus. The NMI can be nested by exceptions and
function calls - the whole _point_ of this thing. So testing "rip" for
anything other than the specific final "iret" is meaningless. You will
be in an NMI region regardless of what rip is.

> This code assumes NMIs have a separate stack.

It also needs to be made per-cpu (and the flags be per-cpu).

Then you could in fact possibly test the stack pointer for whether it
is in the NMI stack area, and use the value of %rsp itself as the
flag. So you could avoid the flag entirely. Because testing %rsp is
valid - testing %rip is not.

That would also avoid the race, because %rsp (as a flag) now gets
cleared atomically by the "iret". So that might actually solve things.

Linus

2010-07-15 22:26:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/15/2010 03:16 PM, Linus Torvalds wrote:
>
>> This code assumes NMIs have a separate stack.
>
> It also needs to be made per-cpu (and the flags be per-cpu).
>
> Then you could in fact possibly test the stack pointer for whether it
> is in the NMI stack area, and use the value of %rsp itself as the
> flag. So you could avoid the flag entirely. Because testing %rsp is
> valid - testing %rip is not.
>
> That would also avoid the race, because %rsp (as a flag) now gets
> cleared atomically by the "iret". So that might actually solve things.
>

This seems really clean to me.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-15 22:27:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 3:16 PM, Linus Torvalds
<[email protected]> wrote:
>
> Then you could in fact possibly test the stack pointer for whether it
> is in the NMI stack area, and use the value of %rsp itself as the
> flag. So you could avoid the flag entirely. Because testing %rsp is
> valid - testing %rip is not.
>
> That would also avoid the race, because %rsp (as a flag) now gets
> cleared atomically by the "iret". So that might actually solve things.

Hmm. So on x86-32, it's easy: if the NMI is nested, you can literally
look at the current %rsp value, and see if it's within the NMI stack
region.

But on x86-64, due to IST, you need to look at the saved-rsp value on
the stack, since the %rsp always gets reset to the NMI stack region
regardless of where it was before.
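
In C terms the check would boil down to something like this (a sketch only;
the per-cpu symbol and the size constant are made up for illustration):

#include <linux/percpu.h>
#include <linux/ptrace.h>

static DEFINE_PER_CPU(unsigned long, nmi_stack_top);	/* assumed: set up at boot */
#define NMI_STACK_SIZE	4096				/* assumed size            */

/*
 * On x86-64 the IST has already switched %rsp to the NMI stack, so the
 * interesting value is the rsp the CPU saved in the exception frame
 * (regs->sp): if it already lies inside this cpu's NMI stack region, we
 * were nested inside an earlier NMI.
 */
static bool nmi_was_nested(struct pt_regs *regs)
{
	unsigned long top = __get_cpu_var(nmi_stack_top);
	unsigned long bot = top - NMI_STACK_SIZE;

	return regs->sp >= bot && regs->sp < top;
}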

Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
the normal kernel stack mechanisms?

Linus

2010-07-15 22:30:49

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Thu, Jul 15, 2010 at 3:01 PM, Mathieu Desnoyers
> <[email protected]> wrote:
> >
> > . NMI exit code
> > and fake NMI entry are made reentrant with respect to NMI handler interruption
> > by testing, at the very beginning of the NMI handler, if a NMI is nested over
> > the whole nmi_atomic .. nmi_atomic_end code region.
>
> That is totally bogus. The NMI can be nested by exceptions and
> function calls - the whole _point_ of this thing. So testing "rip" for
> anything else than the specific final "iret" is meaningless. You will
> be in an NMI region regardless of what rip is.

There are 2 tests done on NMI handler entry:

1) test if nested over the nmi_atomic region (which is a very small region
around nmi_exit that does not make any function calls nor take traps).
2) test if the per-cpu nmi_nesting flag is set.

Test #2 takes care of NMIs nested over functions called and traps.

>
> > This code assumes NMIs have a separate stack.
>
> It also needs to be made per-cpu (and the flags be per-cpu).

Sure, that was implied ;)

>
> Then you could in fact possibly test the stack pointer for whether it
> is in the NMI stack area, and use the value of %rsp itself as the
> flag. So you could avoid the flag entirely. Because testing %rsp is
> valid - testing %rip is not.

That could be used as a way to detect "nesting over NMI", but I'm not entirely
sure it would deal with the "we need a fake NMI" flag set/clear (more or less
equivalent to setting CS to 0 in your implementation and then back to some other
value). The "set" is done with NMIs disabled, but the "clear" is done at fake
NMI entry, where NMIs are active.

>
> That would also avoid the race, because %rsp (as a flag) now gets
> cleared atomically by the "iret". So that might actually solve things.

Well, I'm still unconvinced there is anything to solve, as I built my NMI entry
with 2 tests: one for "nmi_atomic" code range and the other for per-cpu nesting
flag. Given that I set/clear the per-cpu nesting flag either with NMIs off or
within the nmi_atomic code range, this should all work fine.

Unless I am missing something else ?

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-15 22:47:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/15/2010 03:26 PM, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 3:16 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Then you could in fact possibly test the stack pointer for whether it
>> is in the NMI stack area, and use the value of %rsp itself as the
>> flag. So you could avoid the flag entirely. Because testing %rsp is
>> valid - testing %rip is not.
>>
>> That would also avoid the race, because %rsp (as a flag) now gets
>> cleared atomically by the "iret". So that might actually solve things.
>
> Hmm. So on x86-32, it's easy: if the NMI is nested, you can literally
> look at the current %rsp value, and see if it's within the NMI stack
> region.
>
> But on x86-64, due to IST, you need to look at the saved-rsp value on
> the stack, since the %rsp always gets reset to the NMI stack region
> regardless of where it was before.
>
> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
> the normal kernel stack mechanisms?
>

The reasons for using TSS (32 bits) or IST (64 bits) are: concern about
the size of the regular kernel stack, and a concern that the kernel
stack pointer may not be in a usable state. The former is not a problem
here: we're doing a stack switch anyway, and so the additional overhead
on the main stack is pretty minimal, but the latter may be.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-15 22:58:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
> the normal kernel stack mechanisms?

If you don't use IST the SYSCALL entry is racy during the window
when RSP is not set up yet (same for MCE etc.)

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-15 23:21:22

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/15/2010 03:58 PM, Andi Kleen wrote:
>> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
>> the normal kernel stack mechanisms?
>
> If you don't use IST the SYSCALL entry is racy during the window
> when RSP is not set up yet (same for MCE etc.)
>

Right, the kernel stack is not ready.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-15 23:24:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 4:20 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/15/2010 03:58 PM, Andi Kleen wrote:
>>> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
>>> the normal kernel stack mechanisms?
>>
>> If you don't use IST the SYSCALL entry is racy during the window
>> when RSP is not set up yet (same for MCE etc.)
>>
>
> Right, the kernel stack is not ready.

Well, it may not be ready for the _current_ NMI handler, but if we're
going to do a stack switch in sw on NMI anyway... ?

Linus

2010-07-15 23:43:05

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/15/2010 04:23 PM, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 4:20 PM, H. Peter Anvin <[email protected]> wrote:
>> On 07/15/2010 03:58 PM, Andi Kleen wrote:
>>>> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
>>>> the normal kernel stack mechanisms?
>>>
>>> If you don't use IST the SYSCALL entry is racy during the window
>>> when RSP is not set up yet (same for MCE etc.)
>>>
>>
>> Right, the kernel stack is not ready.
>
> Well, it may not be ready for the _current_ NMI handler, but if we're
> going to do a stack switch in sw on NMI anyway... ?
>

No, the problem is that without IST it'll try to drop the NMI stack
frame itself *on the user stack*.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-15 23:45:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 4:41 PM, H. Peter Anvin <[email protected]> wrote:
>
> No, the problem is that without IST it'll try to drop the NMI stack
> frame itself *on the user stack*.

Oh, because SS has already been cleared, but rsp still points to the
user stack? Ok, that does seem insurmountable.

Linus

2010-07-15 23:48:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 04:23:20PM -0700, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 4:20 PM, H. Peter Anvin <[email protected]> wrote:
> > On 07/15/2010 03:58 PM, Andi Kleen wrote:
> >>> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
> >>> the normal kernel stack mechanisms?
> >>
> >> If you don't use IST the SYSCALL entry is racy during the window
> >> when RSP is not set up yet (same for MCE etc.)
> >>
> >
> > Right, the kernel stack is not ready.
>
> Well, it may not be ready for the _current_ NMI handler, but if we're
> going to do a stack switch in sw on NMI anyway... ?

The CPU-written initial stack frame would still go on the wrong stack.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-15 23:48:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/15/2010 04:44 PM, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 4:41 PM, H. Peter Anvin <[email protected]> wrote:
>>
>> No, the problem is that without IST it'll try to drop the NMI stack
>> frame itself *on the user stack*.
>
> Oh, because SS has already been cleared, but rsp still points to the
> user stack? Ok, that does seem insurmountable.
>

Well, SS doesn't matter for 64 bits, but yes, RSP still points to the
user stack.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-16 10:47:32

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 10:46:13AM -0400, Steven Rostedt wrote:
> On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:
>
> > > - make sure that you only ever use _one_ single top-level entry for
> > > all vmalloc issues, and can make sure that all processes are created
> > > with that static entry filled in. This is optimal, but it just doesn't
> > > work on all architectures (eg on 32-bit x86, it would limit the
> > > vmalloc space to 4MB in non-PAE, whatever)
> >
> >
> > But then, even if you ensure that, don't we need to also fill lower level
> > entries for the new mapping.
>
> If I understand your question, you do not need to worry about the lower
> level entries because all the processes will share the same top level.
>
> process 1's PGD ------,
>                       |
>                       +------> PMD --> ...
>                       |
> process 2's PGD ------'
>
> Thus we have one page table entry shared by all processes. The issue happens
> when the vm space crosses the PMD boundary and we need to update all the
> PGDs of all processes to point to the new PMD we need to add to handle
> the spread of the vm space.




Oh right. We point to that PMD, and the update itself is made inside
the lower-level entries pointed to by the PMD. Indeed.



>
> >
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk to add a new memory mapping for new memory allocated with kmalloc?
>
> Because all of memory (well 800 some megs on 32 bit) is mapped into
> memory for all processes. That is, kmalloc only uses this memory (as
> does get_free_page()). All processes have a PMD (or PUD, whatever) that
> maps this memory. The issue only arises when we use new virtual memory,
> which vmalloc does. Vmalloc may map to physical memory that is already
> mapped to all processes but the address that the vmalloc uses to access
> that memory is not yet mapped.



Ok I see.




>
> The usual reason the kernel uses vmalloc is to get a contiguous range of
> memory. The vmalloc can map several pages as one contiguous piece of
> memory that in reality is several different pages scattered around
> physical memory. kmalloc can only map pages that are contiguous in
> physical memory. That is, if kmalloc gets 8192 bytes on an arch with
> 4096 byte pages, it will allocate two consecutive pages in physical
> memory. If two contiguous pages are not available, even if thousands of
> single pages are, the kmalloc will fail, whereas the vmalloc will not.
>
> An allocation of vmalloc can use two different pages and just map the
> page table to make them contiguous in view of the kernel. Note, this
> comes at a cost. One is when we do this, we suffer the case where we
> need to update a bunch of page tables. The other is that we must waste
> TLB entries to point to these separate pages. Kmalloc and
> get_free_page() use the big memory mappings. That is, if the TLB allows
> us to map large pages, we can do that for kernel memory since we just
> want the contiguous memory as it is in physical memory.
>
> Thus the kernel maps the physical memory with the fewest TLB entries as
> needed (large pages and large TLB entries). If we can map 64K pages, we
> do that. Then kmalloc just allocates within this range, it does not need
> to map any pages. They are already mapped.
>
> Does this make a bit more sense?



Totally! You've made it very clear to me.
Moreover I did not know we could have such variable page sizes. I mean, I thought
we could have variable page sizes but that it would apply to every page.





>
> >
> >
> >
> > > - at vmalloc time, when adding a new page directory entry, walk all
> > > the tens of thousands of existing page tables under a lock that
> > > guarantees that we don't add any new ones (ie it will lock out fork())
> > > and add the required pgd entry to them.
> > >
> > > - or just take the fault and do the "fill the page tables" on demand.
> > >
> > > Quite frankly, most of the time it's probably better to make that last
> > > choice (unless your hardware makes it easy to make the first choice,
> > > which is obviously simplest for everybody). It makes it _much_ cheaper
> > > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > > simpler too, and has no interesting locking issues with how/when you
> > > expose the page tables in fork() etc.
> > >
> > > So the only downside is that you do end up taking a fault in the
> > > (rare) case where you have a newly created task that didn't get an
> > > even newer vmalloc entry.
> >
> >
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process page tables for vmalloc.
>
> Actually we don't even need to walk the page tables in the first task
> (although we might do that). When the kernel accesses that memory we
> take the page fault; the fault handler will see that this memory is vmalloc
> data and fill in the page tables for the task at that time.



Right.




> >
> > I would understand this race if we were to walk on every processes page
> > tables and add the new mapping on them, but we missed one new task that
> > forked or so, because we didn't lock (or just rcu).
> >
> >
> >
> > > And that fault can sometimes be in an
> > > interrupt or an NMI. Normally it's trivial to handle that fairly
> > > simple nested fault. But NMI has that inconvenient "iret unblocks
> > > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > > x86.
> >
> >
> > Yeah.
> >
> >
> > So the parts of the problem I don't understand are:
> >
> > - why don't we have this problem with kmalloc() ?
>
> I hope I explained that above.



Yeah :)

Thanks a lot for your explanations!
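
(To make the lazy fill Steve describes above concrete, here is a toy C model of
what the fault handler conceptually does. The types and names below are
illustrative only, not the kernel's real page-table code; the real logic lives
in vmalloc_fault() in arch/x86/mm/fault.c.)

#include <stddef.h>

#define TOP_ENTRIES 512

/* Stand-in for a top-level page table (pgd). Lower-level tables are
 * shared between all processes, so a slot just points at them. */
struct toy_top_table {
	void *slot[TOP_ENTRIES];
};

/* Stand-in for init_mm's reference page table, filled at vmalloc() time. */
static struct toy_top_table toy_init_ref;

/* Called from the fault path with the top-level index of the faulting
 * vmalloc address. Copying the single reference entry is enough: the
 * lower levels are shared, so every later access through this process's
 * table now works. */
static int toy_lazy_fill(struct toy_top_table *pgd, size_t index)
{
	if (!toy_init_ref.slot[index])
		return -1;	/* not a valid vmalloc mapping: real fault */
	pgd->slot[index] = toy_init_ref.slot[index];
	return 0;
}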

2010-07-16 11:21:14

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 04:35:18PM +0200, Andi Kleen wrote:
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process page tables for vmalloc.
>
> No because those are always shared for the kernel and have been
> filled in for init_mm.
>
> Also, most updates only update the lower tables anyway; top level
> updates are extremely rare. In fact on PAE36 they should only happen
> at most once, if at all, and most likely at early boot anyways
> where you only have a single task.
>
> On x86-64 they will only happen once every 512GB of vmalloc.
> So for most systems also at most once at early boot.
> >
> > I would understand this race if we were to walk on every processes page
> > tables and add the new mapping on them, but we missed one new task that
> > forked or so, because we didn't lock (or just rcu).
>
> The new task will always get a copy of the reference init_mm, which
> was already updated.
>
> -Andi


Ok, got it.

But then, in the example here with perf, I'm allocating 8192 bytes per cpu
and my total memory amount is 2 GB.

And it always faults at least once on access, after the allocation.
I really doubt it's because we are adding a new top level page table,
considering the amount of memory I have.

It seems to me that the mapping of a newly allocated vmalloc area is
always inserted the lazy way (updated on fault), or there is
something I'm missing.

Thanks.

2010-07-16 11:43:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Fri, 2010-07-16 at 12:47 +0200, Frederic Weisbecker wrote:
> > Thus the kernel maps the physical memory with the fewest TLB entries as
> > needed (large pages and large TLB entries). If we can map 64K pages, we
> > do that. Then kmalloc just allocates within this range, it does not need
> > to map any pages. They are already mapped.
> >
> > Does this make a bit more sense?
>
>
>
> Totally! You've made it very clear to me.
> Moreover I did not know we can have such variable page size. I mean I thought
> we can have variable page size but that would apply to every pages.

On x86_64, if bit 7 in the PDE (Page Directory Entry) is set then it
points to a 2 Meg page. Otherwise it points to a page table which will
have 512 PTEs pointing to 4K pages.

Download:

http://support.amd.com/us/Processor_TechDocs/24593.pdf

It has nice diagrams that explain this. Check out page 207 (fig 5-17)
and 210 (fig 5-22).

The phys_pmd_init() in arch/x86/mm/init_64.c will try to map memory
using 2M pages if it can, otherwise it falls back to 4K pages.

-- Steve
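
(As a standalone sketch of the bit-7 check described above, not the kernel's
actual page-table helpers: a page-directory entry on x86_64 maps a 2 MB page
when both its present bit and its PS bit are set.)

#include <stdbool.h>
#include <stdint.h>

#define PDE_PRESENT	(1ULL << 0)	/* entry is valid */
#define PDE_PS		(1ULL << 7)	/* page-size bit: 1 => 2 MB page */

/* Returns true when the entry maps a 2 MB page directly instead of
 * pointing to a page table of 512 4K PTEs. */
static bool pde_maps_2mb_page(uint64_t pde)
{
	return (pde & PDE_PRESENT) && (pde & PDE_PS);
}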

2010-07-16 12:00:19

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 07:51:55AM -0700, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 7:11 AM, Frederic Weisbecker <[email protected]> wrote:
> > On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
> >> You can:
> >>
> >> - make sure that you only ever use _one_ single top-level entry for
> >> all vmalloc issues, and can make sure that all processes are created
> >> with that static entry filled in. This is optimal, but it just doesn't
> >> work on all architectures (eg on 32-bit x86, it would limit the
> >> vmalloc space to 4MB in non-PAE, whatever)
> >
> > But then, even if you ensure that, don't we need to also fill lower level
> > entries for the new mapping.
>
> Yes, but now they are all mapped by the one *shared* top-level entry.
>
> Think about it.
>
> [ Time passes ]
>
> End result: if you can map the whole vmalloc area with a single
> top-level entry that is shared by all processes, and can then just
> fill in the lower levels when doing actual allocations, it means that
> all processes will automatically get the entries added, and do not
> need any fixups.
>
> In other words, the page tables will be automatically correct and
> filled in for everybody - without having to traverse any lists,
> without any extra locking, and without any races. So this is efficient
> and simple, and never needs any faulting to fill in page tables later
> on.
>
> (Side note: "single top-level entry" could equally well be "multiple
> preallocated entries covering the whole region": the important part is
> not really the "single entry", but the "preallocated and filled into
> every page directory from the start" part)



Right, I got it. Thanks for these explanations.



>
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk to add a new memory mapping for new memory allocated with kmalloc?
>
> No. The kmalloc space is all in the 1:1 kernel mapping, and is always
> mapped. Even with PAGEALLOC_DEBUG, it's always mapped at the top
> level, and even if a particular page is unmapped/remapped for
> debugging, it is done so in the shared kernel page tables (which ends
> up being the above trivial case - there is just a single set of page
> directory entries that are shared by everybody).



Ok.



> >> - at vmalloc time, when adding a new page directory entry, walk all
> >> the tens of thousands of existing page tables under a lock that
> >> guarantees that we don't add any new ones (ie it will lock out fork())
> >> and add the required pgd entry to them.
> >>
> >> - or just take the fault and do the "fill the page tables" on demand.
> >>
> >> Quite frankly, most of the time it's probably better to make that last
> >> choice (unless your hardware makes it easy to make the first choice,
> >> which is obviously simplest for everybody). It makes it _much_ cheaper
> >> to do vmalloc. It also avoids that nasty latency issue. And it's just
> >> simpler too, and has no interesting locking issues with how/when you
> >> expose the page tables in fork() etc.
> >>
> >> So the only downside is that you do end up taking a fault in the
> >> (rare) case where you have a newly created task that didn't get an
> >> even newer vmalloc entry.
> >
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process page tables for vmalloc.
>
> We always add the mapping to the "init_mm" page tables when it is
> created (just a single mm), and when fork creates a new page table, it
> will always copy the kernel mapping parts from the old one. So the
> _common_ case is that all normal mappings are already set up in page
> tables, including newly created page tables.
>
> The uncommon case is when there is a new page table created _and_ a
> new vmalloc mapping, and the race that happens between those events.
> When that new page table is then later used (and it can be _much_
> later, of course: we're now talking process scheduling, so IO delays
> etc are relevant), it won't necessarily have the page table entries
> for vmalloc stuff that was created since the page tables were created.
> So we fill _those_ in dynamically.



Such a new page table that might race is only about the top level page,
right? Otherwise it wouldn't race, since the top level entries are shared
and updates inside lower level pages are naturally propagated, if
I understood you well.

So, if only top level pages that get added can generate such a lazy
mapping update, I wonder why I experienced this fault every time with
my patches.

I allocated 8192 bytes per cpu on an x86-32 system that has only 2 GB.
I doubt there is a top level page table update there at this time with
such a small amount of available memory. But still it faults once on
access.

I have trouble visualizing the race and the problem here.



>
> But vmalloc mappings should be reasonably rare, and the actual "fill
> them in" cases are much rarer still (since we fill them in page
> directory entries at a time: so even if you make a lot of vmalloc()
> calls, we only _fill_ at most once per page directory entry, which is
> usually a pretty big chunk). On 32-bit x86, for example, we'd fill
> once every 4MB (or 2MB if PAE), and you cannot have a lot of vmalloc
> mappings that large (since the VM space is limited).
>
> So the cost of filling things in is basically zero, because it happens
> so seldom. And by _allowing_ things to be done lazily, we avoid all
> the locking costs, and all the costs of traversing every single
> possible mm (and there can be many many thousands of those).



Ok.



> > I would understand this race if we were to walk on every processes page
> > tables and add the new mapping on them, but we missed one new task that
> > forked or so, because we didn't lock (or just rcu).
>
> .. and how do you keep track of which tasks you missed? And no, it's
> not just the new tasks - you have old tasks that have their page
> tables built up too, but need to be updated. They may never need the
> mapping since they may be sleeping and never using the driver or data
> structures that created it (in fact, that's a common case), so filling
> them would be pointless. But if we don't do the lazy fill, we'd have
> to fill them all, because WE DO NOT KNOW.



Right.



>
> > So the parts of the problem I don't understand are:
> >
> > - why don't we have this problem with kmalloc() ?
>
> Hopefully clarified.


Indeed.



> > - did I understand well the race that makes the fault necessary,
> >   ie: we walk the tasklist lockless, add the new mapping if
> >   not present, but we might miss a task lately forked, but
> >   the fault will fix that.
>
> But the _fundamental_ issue is that we do not want to walk the
> tasklist (or the mm_list) AT ALL. It's a f*cking waste of time. It's a
> long list, and nobody cares. In many cases it won't be needed.
>
> The lazy algorithm is _better_. It's way simpler (we take nested
> faults all the time in the kernel, and it's a particularly _easy_ page
> fault to handle with no IO or no locking needed), and it does less
> work. It really boils down to that.


Yeah, agreed. But I'm still confused about when exactly we need to fault
(doubts I have detailed in my question above).



> So it's not the lazy page table fill that is the problem. Never has
> been. We've been doing the lazy fill for a long time, and it was
> simple and useful way back when.
>
> The problem has always been NMI, and nothing else. NMI's are nasty,
> and the x86 NMI blocking is insane and crazy.
>
> Which is why I'm so adamant that this should be fixed in the NMI code,
> and we should _not_ talk about trying to screw up other, totally
> unrelated, code. The lazy fill really was never the problem.


Yeah agreed.

Thanks for your explanations!

2010-07-16 12:55:07

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Fri, 2010-07-16 at 14:00 +0200, Frederic Weisbecker wrote:
> On Thu, Jul 15, 2010 at 07:51:55AM -0700, Linus Torvalds wrote:

>
> Such a new page table that might race is only about the top level page,
> right? Otherwise it wouldn't race, since the top level entries are shared
> and updates inside lower level pages are naturally propagated, if
> I understood you well.
>
> So, if only top level pages that get added can generate such a lazy
> mapping update, I wonder why I experienced this fault every time with
> my patches.
>
> I allocated 8192 bytes per cpu on an x86-32 system that has only 2 GB.
> I doubt there is a top level page table update there at this time with
> such a small amount of available memory. But still it faults once on
> access.
>
> I have trouble visualizing the race and the problem here.
>

A few trace_printks and a tracing_off() on fault would probably show
exactly what was happening ;-)

-- Steve

2010-07-16 19:13:10

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

Hi Linus,

What I omitted in my original description paragraph is that I also test for NMIs
nested over NMI "regular code" with a "nesting" per-cpu flag, which deals with
the concerns you expressed in your reply about function calls and traps.

I'm self-replying to keep track of Avi's comment about the need to save/restore
cr2 at the beginning/end of the NMI handler, so we don't end up corrupting a VM
CR2 if we have the following scenario: trap in VM, NMI, trap in NMI. So I added
cr2 awareness to the code snippet below, so we should be close to having something
that starts to make sense. (although I'm not saying it's bug-free yet) ;)

Please note that I'll be off on vacation for 2 weeks starting this evening (back
on August 2) without Internet access, so my answers might be delayed.

Thanks !

Mathieu


Code originally written by Linus Torvalds, modified by Mathieu Desnoyers
intending to handle the fake NMI entry gracefully given that NMIs are not
necessarily disabled at the entry point. It uses a "need fake NMI" flag rather
than playing games with CS and faults. When a fake NMI is needed, it simply
jumps back to the beginning of regular nmi code. NMI exit code and fake NMI
entry are made reentrant with respect to NMI handler interruption by testing, at
the very beginning of the NMI handler, if an NMI is nested over the whole
nmi_atomic .. nmi_atomic_end code region. It also tests for nested NMIs by
keeping a per-cpu "nmi nested" flag; this deals with detection of nesting over
the "regular nmi" execution. This code assumes NMIs have a separate stack.

#
# Two per-cpu variables: a "are we nested" flag (one byte).
# a "do we need to execute a fake NMI" flag (one byte).
# The %rsp at which the stack copy is saved is at a fixed address, which leaves
# enough room at the bottom of NMI stack for the "real" NMI entry stack. This
# assumes we have a separate NMI stack.
# The NMI stack copy top of stack is at nmi_stack_copy.
# The NMI stack copy "rip" is at nmi_stack_copy_rip, which is set to
# nmi_stack_copy-40.
#
nmi:
# Test if nested over atomic code.
cmpq $nmi_atomic,0(%rsp)
jae nmi_addr_is_ae
# Test if nested over general NMI code.
cmpb $0,%__percpu_seg:nmi_stack_nesting
jne nmi_nested_set_fake_and_return
# create new stack
is_unnested_nmi:
# Save some space for nested NMI's. The exception itself
# will never use more space, but it might use less (since
# it will be a kernel-kernel transition).

# Save %rax on top of the stack (need to temporarily use it)
pushq %rax
movq %rsp, %rax
movq $nmi_stack_copy,%rsp

# copy the five words of stack info. rip starts at 8+0(%rax).
# cr2 is saved at nmi_stack_copy_rip+40
pushq %cr2 # save cr2 to handle nesting over page faults
pushq 8+32(%rax) # ss
pushq 8+24(%rax) # rsp
pushq 8+16(%rax) # eflags
pushq 8+8(%rax) # cs
pushq 8+0(%rax) # rip
movq 0(%rax),%rax # restore %rax

set_nmi_nesting:
# and set the nesting flags
movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
...
# regular NMI code goes here, and can take faults,
# because this sequence now has proper nested-nmi
# handling
...

nmi_atomic:
# An NMI nesting over the whole nmi_atomic .. nmi_atomic_end region will
# be handled specially. This includes the fake NMI entry point.
cmpb $0,%__percpu_seg:need_fake_nmi
jne fake_nmi
movb $0,%__percpu_seg:nmi_stack_nesting
# restore cr2
movq %nmi_stack_copy_rip+40,%cr2
iret

# This is the fake NMI entry point.
fake_nmi:
movb $0x0,%__percpu_seg:need_fake_nmi
jmp regular_nmi_code
nmi_atomic_end:

# Make sure the address is in the nmi_atomic range and in CS segment.
nmi_addr_is_ae:
cmpq $nmi_atomic_end,0(%rsp)
jae is_unnested_nmi
# The saved rip points to the final NMI iret. Check the CS segment to
# make sure.
cmpw $__KERNEL_CS,8(%rsp)
jne is_unnested_nmi

# This is the case when we hit just as we're supposed to do the atomic code
# of a previous nmi. We run the NMI using the old return address that is still
# on the stack, rather than copy the new one that is bogus and points to where
# the nested NMI interrupted the original NMI handler!
# Easy: just set the stack pointer to point to the stack copy, clear
# need_fake_nmi (because we are directly going to execute the requested NMI) and
# jump to "nesting flag set" (which is followed by regular nmi code execution).
movq $nmi_stack_copy_rip,%rsp
movb $0x0,%__percpu_seg:need_fake_nmi
jmp set_nmi_nesting

# This is the actual nested case. Make sure we branch to the fake NMI handler
# after this handler is done.
nmi_nested_set_fake_and_return:
movb $0xff,%__percpu_seg:need_fake_nmi
popfq
jmp *(%rsp)


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-18 11:05:09

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/15/2010 04:23 AM, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
> <[email protected]> wrote:
>
>> I think the %rip check should be pretty simple - exactly because there
>> is only a single point where the race is open between that 'mov' and
>> the 'iret'. So it's simpler than the (similar) thing we do for
>> debug/nmi stack fixup for sysenter that has to check a range.
>>
> So this is what I think it might look like, with the %rip in place.
> And I changed the "nmi_stack_ptr" thing to have both the pointer and a
> flag - because it turns out that in the single-instruction race case,
> we actually want the old pointer.
>
> Totally untested, of course. But _something_ like this might work:
>
> #
> # Two per-cpu variables: a "are we nested" flag (one byte), and
> # a "if we're nested, what is the %rsp for the nested case".
> #
> # The reason for why we can't just clear the saved-rsp field and
> # use that as the flag is that we actually want to know the saved
> # rsp for the special case of having a nested NMI happen on the
> # final iret of the unnested case.
> #
> nmi:
> cmpb $0,%__percpu_seg:nmi_stack_nesting
> jne nmi_nested_corrupt_and_return
> cmpq $nmi_iret_address,0(%rsp)
> je nmi_might_be_nested
> # create new stack
> is_unnested_nmi:
> # Save some space for nested NMI's. The exception itself
> # will never use more space, but it might use less (since
> # it will be a kernel-kernel transition). But the nested
> # exception will want two save registers and a place to
> # save the original CS that it will corrupt
> subq $64,%rsp
>
> # copy the five words of stack info. 96 = 64 + stack
> # offset of ss.
> pushq 96(%rsp) # ss
> pushq 96(%rsp) # rsp
> pushq 96(%rsp) # eflags
> pushq 96(%rsp) # cs
> pushq 96(%rsp) # rip
>
> # and set the nesting flags
> movq %rsp,%__percpu_seg:nmi_stack_ptr
> movb $0xff,%__percpu_seg:nmi_stack_nesting
>
>

By trading off some memory, we don't need this trickery. We can
allocate two nmi stacks, so the code becomes:

nmi:
cmpb $0, %__percpu_seg:nmi_stack_nesting
je unnested_nmi
cmpq $nmi_iret,(%rsp)
jne unnested_nmi
cmpw $__KERNEL_CS,8(%rsp)
jne unnested_nmi
popf
retfq
unnested_nmi:
xorq $(nmi_stack_1 ^ nmi_stack_2),%__percpu_seg:tss_nmi_ist_entry
movb $1, __percpu_seg:nmi_stack_nesting
regular_nmi:
...
regular_nmi_end:
movb $0, __percpu_seg:nmi_stack_nesting
nmi_iret:
iretq




--
error compiling committee.c: too many arguments to function

2010-07-18 17:37:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Sun, Jul 18, 2010 at 4:03 AM, Avi Kivity <[email protected]> wrote:
>
> By trading off some memory, we don't need this trickery. We can allocate
> two nmi stacks, so the code becomes:

I really don't think you need even that. See earlier in the discussion
about how we could just test %rsp itself. Which makes all the %rip
testing totally unnecessary, because we don't even need any flags, and
we have no races because %rsp is atomically changed with taking the
exception.

Lookie here, the %rsp comparison really isn't that hard:

nmi:
pushq %rax
pushq %rdx
movq %rsp,%rdx # current stack top
movq 40(%rsp),%rax # old stack top
xor %rax,%rdx # same 8kB aligned area?
shrq $13,%rdx # ignore low 13 bits
je it_is_a_nested_nmi # looks nested..
non_nested:
...
... ok, we're not nested, do normal NMI handling ...
...
popq %rdx
popq %rax
iret

it_is_a_nested_nmi:
cmpw $0,48(%rsp) # double-check that it really was a nested exception
jne non_nested # from user space or something..
# this is the nested case
# NOTE! NMI's are blocked, we don't take any exceptions etc etc
addq $-160,%rax # 128-byte redzone on the old stack + 4 words
movq (%rsp),%rdx
movq %rdx,(%rax) # old %rdx
movq 8(%rsp),%rdx
movq %rdx,8(%rax) # old %rax
movq 32(%rsp),%rdx
movq %rdx,16(%rax) # old %rflags
movq 16(%rsp),%rdx
movq %rdx,24(%rax) # old %rip
movq %rax,%rsp
popq %rdx
popq %rax
popf
ret $128 # restore %rip and %rsp

doesn't that look pretty simple?

NOTE! OBVIOUSLY TOTALLY UNTESTED!

Linus
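
(For reference, the xor/shr test in the asm above amounts to the following C
check. It is only a sketch: the 8 kB size and the alignment of the NMI stack
are assumptions taken from the code above, not guaranteed properties of any
particular kernel configuration.)

#include <stdbool.h>
#include <stdint.h>

#define NMI_STACK_SHIFT	13	/* assumed 8 kB, 8 kB-aligned NMI stack */

/* Two stack pointers land on the same NMI stack iff they agree in all
 * bits above the stack-size bits, i.e. their xor shifted right is zero. */
static bool same_nmi_stack(uint64_t current_rsp, uint64_t saved_rsp)
{
	return ((current_rsp ^ saved_rsp) >> NMI_STACK_SHIFT) == 0;
}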

2010-07-18 18:05:26

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/18/2010 08:36 PM, Linus Torvalds wrote:
> On Sun, Jul 18, 2010 at 4:03 AM, Avi Kivity<[email protected]> wrote:
>
>> By trading off some memory, we don't need this trickery. We can allocate
>> two nmi stacks, so the code becomes:
>>
> I really don't think you need even that. See earlier in the discussion
> about how we could just test %rsp itself. Which makes all the %rip
> testing totally unnecessary, because we don't even need any flags, and
> we have no races because %rsp is atomically changed with taking the
> exception.
>
> Lookie here, the %rsp comparison really isn't that hard:
>
> nmi:
> pushq %rax
> pushq %rdx
> movq %rsp,%rdx # current stack top
> movq 40(%rsp),%rax # old stack top
> xor %rax,%rdx # same 8kB aligned area?
> shrq $13,%rdx # ignore low 13 bits
> je it_is_a_nested_nmi # looks nested..
>
>

...

> doesn't that look pretty simple?
>
>

Too simple - an MCE will switch to its own stack, failing the test. Now
that we have correctable MCEs, that's not a good idea.

So the plain everyday sequence

NMI
#PF
MCE (uncompleted)
NMI

will fail.

Plus, even in the non-nested case, you have to copy the stack frame, or
the nested NMI will corrupt it. With stack switching, the nested NMI is
allocated its own frame.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-18 18:18:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Sun, Jul 18, 2010 at 10:36 AM, Linus Torvalds
<[email protected]> wrote:
>
> Lookie here, the %rsp comparison really isn't that hard:

A few notes on that (still) untested code suggestion:

>  nmi:
>      pushq %rax
>      pushq %rdx
>      movq %rsp,%rdx          # current stack top
>      movq 40(%rsp),%rax      # old stack top
>      xor %rax,%rdx           # same 8kB aligned area?
>      shrq $13,%rdx           # ignore low 13 bits
>      je it_is_a_nested_nmi   # looks nested..
>  non_nested:
>      ...
>      ... ok, we're not nested, do normal NMI handling ...
>      ...

The non_nested case still needs to start off with moving its stack
frame to a safe area that won't be overwritten by any nesting NMI's
(note that they cannot nest at this point, since we've done nothing
that can fault). So we'd still need that

7* pushq 48(%rsp)

which copies the five words that got pushed by hardware, and the two
register-save locations that we used for the nesting check and special
return.

After we've done those 7 pushes, we can then run code that may take a
fault. Because when the fault returns with an "iret" and re-enables
NMI's, our nesting code is ready.

So all told, we need a maximum of about 216 bytes of stack for the
nested NMI case: 56 bytes for the seven copied words, and the 160
bytes that we build up _under_ the stack pointer for the nested case.
And we need the NMI stack itself to be aligned in order for that
"ignore low bits" check to work. Although we don't actually have to do
that "xor+shr", we could do the test equally well with a "sub+unsigned
compare against stack size".

Other than that, I think the extra test that we're really nested might
better be done differently:

>  it_is_a_nested_nmi:
>      cmpw $0,48(%rsp)     # double-check that it really was a nested exception
>      jne non_nested       # from user space or something..
>      # this is the nested case

It might be safer to check the saved CS rather than the saved SS on
the stack to see that we really are in kernel mode. It's possible that
somebody could load a NULL SS in user mode and then just not use the
stack - and try to make it look like they are in kernel mode for when
the NMI happens. Now, I _think_ that loading a zero SS is supposed to
trap, but checking CS is still likely to be the better test for "were
we in kernel mode". That's where the CPL is really encoded, after all.

So that "cmpw $0,48(%rsp)" is probably ok, but it would likely be
better to do it as

testl $3,24(%rsp)
jne non_nested

instead. That's what entry_64.S does everywhere else.

Oh, and the non-nested case obviously needs all the regular "make the
kernel state look right" code. Like the swapgs stuff etc if required.
My example code was really meant to just document the nesting
handling, not the existing stuff we already need to do with
save_paranoid etc.

And I really think it should work, but I'd again like to stress that
it's just a RFD code sequence with no testing what-so-ever etc.

Linus
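
(The CS-based test suggested above boils down to looking at the low two bits of
the saved CS selector, which hold the CPL: 0 for kernel mode, 3 for user mode.
A minimal sketch, with the selector passed in as a plain integer:)

#include <stdbool.h>
#include <stdint.h>

/* "testl $3,24(%rsp)" performs exactly this check on the CS slot of the
 * hardware-pushed exception frame. */
static bool frame_is_from_kernel(uint64_t saved_cs)
{
	return (saved_cs & 3) == 0;	/* CPL 0 means we were in the kernel */
}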

2010-07-18 18:23:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Sun, Jul 18, 2010 at 11:04 AM, Avi Kivity <[email protected]> wrote:
>
> Too simple - an MCE will switch to its own stack, failing the test. Now
> that we have correctable MCEs, that's not a good idea.

Ahh, true. And I think we do DEBUG traps with IST too.

So we do need the explicit flag over the region. Too bad. I was hoping
to handle the nested case without having to set up the percpu segment
(that whole conditional swapgs thing, which is extra painful in NMI).

And at that point, if you require the separate flag anyway, the %rsp
range test is equivalent to the %rip range test.

Linus

2010-07-18 18:44:05

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Sun, 2010-07-18 at 11:17 -0700, Linus Torvalds wrote:

> Oh, and the non-nested case obviously needs all the regular "make the
> kernel state look right" code. Like the swapgs stuff etc if required.
> My example code was really meant to just document the nesting
> handling, not the existing stuff we already need to do with
> save_paranoid etc.
>
> And I really think it should work, but I'd again like to stress that
> it's just a RFD code sequence with no testing what-so-ever etc.
>

Are you sure you don't want to use Mathieu's 2/2 patch? We are fixing
the x86 problem that iret re-enables NMIs, and you don't want to touch
anything else but the NMI code. But it may be saner to just fix the
places that call iret. We can perhaps encapsulate those into a single
macro that we can get right and will be correct everywhere it is used.

The ugliest part of Mathieu's code is dealing with paravirt, but
paravirt is ugly to begin with.

Doing this prevents nested NMIs as well as all the unknowns that will
come with dealing with nested NMIs. Whereas handling all irets should
be straightforward, although a bit more intrusive than what we would
like.

Just saying,

-- Steve

2010-07-18 19:27:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Sun, Jul 18, 2010 at 11:43 AM, Steven Rostedt <[email protected]> wrote:
>
> Are you sure you don't want to use Mathieu's 2/2 patch?

Yeah, I'm pretty sure. Unless somebody can show that it's faster, I
really don't want to muck with regular iret's. Also, as shown during
the discussion, even with Mathieu's 2/2 patch, we'd _still_ need NMI
to also save cr2 etc.

So the sane thing to do is to put all the NMI crap where it belongs.
NMI's need to know about the fact that them taking exceptions is
special. That whole "vmalloc_sync_all()" is simply pure brokenness.

In other words, it is _not_ just about 'iret' fixup. It's a bigger
thing. NMI's are special, and we don't want to spread that specialness
around.

Linus

2010-07-19 07:32:54

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 07/18/2010 09:22 PM, Linus Torvalds wrote:
> On Sun, Jul 18, 2010 at 11:04 AM, Avi Kivity<[email protected]> wrote:
>
>> Too simple - an MCE will switch to its own stack, failing the test. Now
>> that we have correctable MCEs, that's not a good idea.
>>
> Ahh, true. And I think we do DEBUG traps with IST too.
>
> So we do need the explicit flag over the region. Too bad. I was hoping
> to handle the nested case without having to set up the percpu segment
> (that whole conditional swapgs thing, which is extra painful in NMI).
>

Well, we have to do that anyway for the non-nested case. So we just do
it before checking whether we're nested or not, and undo it on the popf;
retf path.

--
error compiling committee.c: too many arguments to function

2010-08-03 17:19:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:

> I was more thinking along the lines of making sure a ring buffer has the proper
> support for your use-case. It shares a lot of requirements with a standard ring
> buffer:
>
> - Need to be lock-less
> - Need to reserve space, write data in a buffer
>
> By configuring a ring buffer with 4k sub-buffer size (that's configurable
> dynamically),

FWIW I really utterly detest the whole concept of sub-buffers.

> all we need to add is the ability to squash a previously saved
> record from the buffer. I am confident we can provide a clean API for this that
> would allow discard of previously committed entry as long as we are still on the
> same non-fully-committed sub-buffer. This fits your use-case exactly, so that's
> fine.

squash? truncate you mean? So we can allocate/reserve the largest
possible event size and write the actual event and then truncate to the
actually used size?

I really dislike how that will end up with huge holes in the buffer when
you get nested events.

Also, I think you're forgetting that doing the stack unwind is a very
costly pointer chase, adding a simple linear copy really doesn't seem
like a problem.

Additionally, if you have multiple consumers you can simply copy the
stacktrace again, avoiding the whole pointer chase exercise. While you
could conceivably copy from one ringbuffer into another that will result
in very nasty serialization issues.

> You could have one 4k ring buffer per cpu per execution context.

Why?

> I wonder if
> each Linux architecture has support for separate thread vs softirq vs irq vs
> nmi stacks?

Why would that be relevant? We can have NMI inside IRQ inside soft-IRQ
inside task context in general (dismissing the nested IRQ mess). You
don't need to have a separate stack for each context in order to nest
them.

> Even then, given you have only one stack for all shared irqs, you
> need something that is concurrency-aware at the ring buffer level.

I'm failing to see your point.

> These small stack-like ring buffers could be used to save your temporary stack
> copy. When you really need to save it to a larger ring buffer along with a
> trace, then you just take a snapshot of the stack ring buffers.

OK, why? Your proposal includes the exact same extra copy but introduces
a ton of extra code to effect the same, not a win.

> So you get to use one single ring buffer synchronization and memory allocation
> mechanism, that everyone has reviewed. The advantage is that we would not be
> having this nmi race discussion in the first place: the generic ring buffer uses
> "get page" directly rather than relying on vmalloc, because these bugs have
> already been identified and dealt with years ago.

That's like saying don't use percpu_alloc() but open-code the thing
using kmalloc()/get_pages().. I really don't see any merit in that.

2010-08-03 18:26:06

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
>
> > I was more thinking along the lines of making sure a ring buffer has the proper
> > support for your use-case. It shares a lot of requirements with a standard ring
> > buffer:
> >
> > - Need to be lock-less
> > - Need to reserve space, write data in a buffer
> >
> > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > dynamically),
>
> FWIW I really utterly detest the whole concept of sub-buffers.

This reluctance against splitting a buffer into sub-buffers might help explain
the poor performance experienced with the Perf ring buffer. These
"sub-buffers" are really nothing new: they are called "periods" in the audio
world. They help lower the ring buffer performance overhead because:

1) They allow writing into the ring buffer without SMP-safe synchronization
primitives and memory barriers for each record. Synchronization is only needed
across sub-buffer boundaries, which amortizes the cost over a large number of
events.

2) They are much more splice (and, in general, page-exchange) friendly, because
records written after a synchronization point start at the beginning of a page.
This removes the need for extra copies.

So I have to ask: do you detest the sub-buffer concept only because you are tied
to the current Perf userspace ABI, which cannot support this without an ABI
change?

I'm trying to help out here, but it does not make the task easy if we have both
hands tied behind our backs because we have to keep backward ABI compatibility for a
tool (perf) forever, even considering its sources are shipped with the kernel.

>
> > all we need to add is the ability to squash a previously saved
> > record from the buffer. I am confident we can provide a clean API for this that
> > would allow discard of previously committed entry as long as we are still on the
> > same non-fully-committed sub-buffer. This fits your use-case exactly, so that's
> > fine.
>
> squash? truncate you mean? So we can allocate/reserve the largest
> possible event size and write the actual event and then truncate to the
> actually used size?

Nope. I'm thinking that we can use a buffer just to save the stack as we call
functions and return, e.g.

call X -> reserve space to save "X" and arguments.
call Y -> same for Y.
call Z -> same for Z.
return -> discard event for Z.
return -> discard event for Y.

if we grab the buffer content at that point, then we have X and its arguments,
which is the function currently executed. That would require the ability to
uncommit and unreserve an event, which is not a problem as long as we have not
committed a full sub-buffer.
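
(A minimal, single-context sketch of this reserve/discard idea, to make the
call/return example above concrete. The structure and names are made up for
illustration; they are not the LTTng or perf ring buffer API, and all the
concurrency and sub-buffer handling discussed in this thread is left out.)

#include <stddef.h>
#include <string.h>

#define STACKBUF_SIZE 4096

struct stackbuf {
	size_t offset;			/* top of the stack-like buffer */
	unsigned char data[STACKBUF_SIZE];
};

/* "call X": reserve space and save the record describing X. */
static int stackbuf_push(struct stackbuf *b, const void *rec, size_t len)
{
	if (b->offset + len > STACKBUF_SIZE)
		return -1;		/* would need a sub-buffer switch */
	memcpy(b->data + b->offset, rec, len);
	b->offset += len;
	return 0;
}

/* "return": discard the most recently reserved record. */
static void stackbuf_pop(struct stackbuf *b, size_t len)
{
	if (len <= b->offset)
		b->offset -= len;
}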

>
> I really dislike how that will end up with huge holes in the buffer when
> you get nested events.

This buffer only works like a stack. I don't think your comment applies.

>
> Also, I think you're forgetting that doing the stack unwind is a very
> costly pointer chase, adding a simple linear copy really doesn't seem
> like a problem.

I thought that this buffer was tracking function entries/exits rather than
doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
about his use-case?

>
> Additionally, if you have multiple consumers you can simply copy the
> stacktrace again, avoiding the whole pointer chase exercise. While you
> could conceivably copy from one ringbuffer into another that will result
> in very nasty serialization issues.

Assuming Frederic is saving information to this stack-like ring buffer at each
function entry and discarding at each function return, then we don't have the
pointer chase.

What I am proposing does not even involve a copy: when we want to take a
snapshot, we just have to force a sub-buffer switch on the ring buffer. The
"returns" happening at the beginning of the next (empty) sub-buffer would
clearly fail to discard records (expecting non-existing entry records). We would
then have to save a small record saying that a function return occurred. The
current stack frame at the end of the next sub-buffer could be deduced from the
complete collection of stack frame samples.

>
> > You could have one 4k ring buffer per cpu per execution context.
>
> Why?

This seems to fit what Frederic described he needed: he uses one separate buffer
per cpu per execution context at the moment. But we could arguably save
all this stack-shaped information in per-cpu buffers.

>
> > I wonder if
> > each Linux architecture has support for separate thread vs softirq vs irq vs
> > nmi stacks?
>
> Why would that be relevant? We can have NMI inside IRQ inside soft-IRQ
> inside task context in general (dismissing the nested IRQ mess). You
> don't need to have a separate stack for each context in order to nest
> them.

I was asking this because Frederic seems to rely on having separate buffers per
cpu and per execution context to deal with concurrency (so not expecting
concurrency from interrupts or NMIs when writing into the softirq per-cpu stack
buffer).

>
> > Even then, given you have only one stack for all shared irqs, you
> > need something that is concurrency-aware at the ring buffer level.
>
> I'm failing to see you point.

My point is that we might need to expect concurrency from local execution
contexts (e.g. interrupts nested over other interrupt handlers) in the design of
this stack-like ring buffer. I'm not sure Frederic's approach of using one
buffer per execution context per cpu makes sense for all cases. The memory vs
context isolation trade-off seems rather odd if we have to create e.g. one
buffer per IRQ number.

>
> > These small stack-like ring buffers could be used to save your temporary stack
> > copy. When you really need to save it to a larger ring buffer along with a
> > trace, then you just take a snapshot of the stack ring buffers.
>
> OK, why? Your proposal includes the exact same extra copy but introduces
> a ton of extra code to effect the same, not a win.

Please refer to the "no extra copy" solution I explain in the reply here (see
above). I did not want to go into too much detail regarding performance
optimization in the initial mail to Frederic, as these things can be done
incrementally. But given that you insist... :)

>
> > So you get to use one single ring buffer synchronization and memory allocation
> > mechanism, that everyone has reviewed. The advantage is that we would not be
> > having this nmi race discussion in the first place: the generic ring buffer uses
> > "get page" directly rather than relying on vmalloc, because these bugs have
> > already been identified and dealt with years ago.
>
> That's like saying don't use percpu_alloc() but open-code the thing
> using kmalloc()/get_pages().. I really don't see any merit in that.

I'm not saying "open-code this". I'm saying "use a specialized library that does
this get_pages() allocation and execution context synchronization for you", so
we stop the code duplication madness.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-03 19:02:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <[email protected]> wrote:
>
> FWIW I really utterly detest the whole concept of sub-buffers.

I'm not quite sure why. Is it something fundamental, or just an
implementation issue?

One thing that I think could easily make sense in a _lot_ of buffering
areas is the notion of a "continuation" buffer. We know we have cases
where we want to attach a lot of data to one particular event, but the
buffering itself is inevitably always going to have some limits on
atomicity etc. And quite often, the event that _generates_ the data is
not necessarily going to have all that data in one contiguous region,
and doing a scatter-gather memcpy to get it that way is not good
either.

At the same time, I do _not_ believe that the kernel ring-buffer code
should handle pointers to sub-buffers etc, or worry about iovec-like
arrays of smaller ranges. So if _that_ is what you mean by "concept of
sub-buffers", then I agree with you.

But what I do think might make a lot of sense is to allow buffer
fragments, and just teach user space to do de-fragmentation. Where it
would be important that the de-fragmentation really is all in user
space, and not really ever visible to the ring-buffer implementation
itself (and there would not, for example, be any guarantees that the
fragments would be contiguous - there could be other events in the
buffer in between fragments). Maybe we could even say that fragments
might be across different CPU ring-buffers, and user-space needs to
sort it out if it wants to (where "sort it out" literally would mean
having to sort and re-attach them in the right order, since there
wouldn't be any ordering between them).

From a kernel perspective, the only thing you need for fragment
handling would be to have a buffer entry that just says "I'm fragment
number X of event ID Y". Nothing more. Everything else would be up to
the parser in user space to work out.

In other words - if you have something like the current situation,
where you want to save a whole back-trace, INSTEAD of allocating a
large max-sized buffer for it and "linearizing" the back-trace in
order to then create a backtrace ring event, maybe we could just fill
the ring buffer with lots of small fragments, and do the whole
linearizing in the code that reads it in user space. No temporary
allocations in kernel space at all, no memcpy, let user space sort it
out. Each stack level would just add its own event, and increment the
fragment count it uses.

It's going to be a fairly rare case, so some user space parsers might
just decide to ignore fragmented packets, because they know they
aren't interested in such "complex" events.

I dunno. This thread has kind of devolved into many different details,
and I reacted to just one very small fragment of it. Maybe not even a
very interesting fragment.

Linus
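
(The fragment idea sketched above needs very little structure on the kernel
side. A hypothetical record layout, purely illustrative and not an existing
perf event type, could look like this; user space would collect records
sharing an event id and reassemble them by fragment number:)

#include <stdint.h>

struct frag_record {
	uint64_t event_id;	/* which logical event this piece belongs to */
	uint32_t frag_nr;	/* position of this piece within the event */
	uint32_t len;		/* number of payload bytes that follow */
	/* 'len' bytes of payload follow in the stream */
};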

2010-08-03 19:45:58

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
>
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?

The real issue here, IMHO, is that Perf has tied gory ring buffer implementation
details to the userspace perf ABI, and there is now strong unwillingness from
Perf developers to break this ABI.

About the sub-buffer definition: it only means that a buffer is split into
many regions. Their boundaries are synchronization points between the data
producer and consumer. This involves padding the end of regions when records do
not fit in the remaining space.

I think that the problem lies in the fact that Peter wants all his ring-buffer data
to be back-to-back, without padding. He needs this because the perf ABI, presented to
the user-space perf program, requires this: every implementation detail is
exposed to user-space through a mmap'd memory region (yeah, even the control
data is touched by both the kernel and userland through that shared page).

When Perf was initially proposed, I thought that because the perf
user-space tool is shipped along with the kernel sources, we could change the
ABI easily afterward, but Peter seems to disagree and wants it to stay as it
is for backward compatibility and to avoid offending contributors. If I had known
this when the ABI first came in, I would surely have nack'd it.

Now we are stuck with this ABI, which exposes every tiny ring buffer
implementation detail to userspace, which simply kills any future enhancement.

Thanks,

Mathieu

P.S.: I'm holding back reply to the rest of your email to increase focus on the
fundamental perf ABI problem.

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-03 20:12:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe


* Linus Torvalds <[email protected]> wrote:

> On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> <[email protected]> wrote:
> >
> > The real issue here, IMHO, is that Perf has tied gory ring buffer
> > implementation details to the userspace perf ABI, and there is now strong
> > unwillingness from Perf developers to break this ABI.

(Wrong.)

> The thing is - I think my outlined buffer fragmentation model would work
> fine with the perf ABI too. Exactly because there is no deep structure,
> just the same "stream of small events" both from a kernel and a user model
> standpoint. Sure, the stream would now contain a new event type, but that's
> trivial. It would still be _entirely_ reasonable to have the actual data in
> the exact same ring buffer, including the whole mmap'ed area.

Yeah.

> Of course, when user space actually parses it, user space would have to
> eventually defragment the event by allocating a new area and copying the
> fragments together in the right order, but that's pretty trivial to do. It
> certainly doesn't affect the current mmap'ed interface in the least.
>
> Now, whether the perf people feel they want that kind of functionality, I
> don't know. It's possible that they simply do not want to handle events that
> are complex enough that they would have arbitrary size.

Looks useful. There's a steady trickle of new events and we already use type
encapsulation for things like trace events - which are only made sense of
later on in user-space.

We may want to add things like a NOP event to pad out the end of page

Thanks,

Ingo

2010-08-03 20:21:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe


* Ingo Molnar <[email protected]> wrote:

>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> > <[email protected]> wrote:
> > >
> > > The real issue here, IMHO, is that Perf has tied gory ring buffer
> > > implementation details to the userspace perf ABI, and there is now strong
> > > unwillingness from Perf developers to break this ABI.
>
> (Wrong.)
>
> > The thing is - I think my outlined buffer fragmentation model would work
> > fine with the perf ABI too. Exactly because there is no deep structure,
> > just the same "stream of small events" both from a kernel and a user model
> > standpoint. Sure, the stream would now contain a new event type, but that's
> > trivial. It would still be _entirely_ reasonable to have the actual data in
> > the exact same ring buffer, including the whole mmap'ed area.
>
> Yeah.
>
> > Of course, when user space actually parses it, user space would have to
> > eventually defragment the event by allocating a new area and copying the
> > fragments together in the right order, but that's pretty trivial to do. It
> > certainly doesn't affect the current mmap'ed interface in the least.
> >
> > Now, whether the perf people feel they want that kind of functionality, I
> > don't know. It's possible that they simply do not want to handle events that
> > are complex enough that they would have arbitrary size.
>
> Looks useful. There's a steady trickle of new events and we already use type
> encapsulation for things like trace events - which are only made sense of
> later on in user-space.
>
> We may want to add things like a NOP event to pad out the end of page

/me once again experiences the subtle difference between 'Y' and 'N' when postponing a mail

So adding fragments would be possible as well. We've got the space for such
extensions in the ABI and the basic model of streaming information is not
affected.

[ The control structure of the mmap area is there for performance/wakeup
optimizations (and to allow the kernel to lose information on producer
overload, while still giving user-space an idea that we lost data and how
much) - it does not affect semantics and does not limit us. ]

So there's no design limitation - Peter simply prefers one possible solution
over another and outlined his reasons - we should hash that out based on the
technical arguments.

Thanks,

Ingo

2010-08-03 20:32:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
<[email protected]> wrote:
>
> The real issue here, IMHO, is that Perf has tied gory ring buffer implementation
> details to the userspace perf ABI, and there is now strong unwillingness from
> Perf developers to break this ABI.

The thing is - I think my outlined buffer fragmentation model would
work fine with the perf ABI too. Exactly because there is no deep
structure, just the same "stream of small events" both from a kernel
and a user model standpoint. Sure, the stream would now contain a new
event type, but that's trivial. It would still be _entirely_
reasonable to have the actual data in the exact same ring buffer,
including the whole mmap'ed area.

Of course, when user space actually parses it, user space would have
to eventually defragment the event by allocating a new area and
copying the fragments together in the right order, but that's pretty
trivial to do. It certainly doesn't affect the current mmap'ed
interface in the least.

Now, whether the perf people feel they want that kind of
functionality, I don't know. It's possible that they simply do not
want to handle events that are complex enough that they would have
arbitrary size.

Linus

2010-08-03 20:54:28

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Linus Torvalds ([email protected]) wrote:
> On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> <[email protected]> wrote:
> >
> > The real issue here, IMHO, is that Perf has tied gory ring buffer implementation
> > details to the userspace perf ABI, and there is now strong unwillingness from
> > Perf developers to break this ABI.
>
> The thing is - I think my outlined buffer fragmentation model would
> work fine with the perf ABI too. Exactly because there is no deep
> structure, just the same "stream of small events" both from a kernel
> and a user model standpoint. Sure, the stream would now contain a new
> event type, but that's trivial. It would still be _entirely_
> reasonable to have the actual data in the exact same ring buffer,
> including the whole mmap'ed area.

Yes, indeed. Your scheme (using a "cookie" to identify multiple related events,
each of them being the "continuation" of the previous event with the same
cookie) would work on top of basically all ring buffer implementations. We
already use something similar to follow socket buffers and block device buffers
across the kernel in LTTng.

>
> Of course, when user space actually parses it, user space would have
> to eventually defragment the event by allocating a new area and
> copying the fragments together in the right order, but that's pretty
> trivial to do. It certainly doesn't affect the current mmap'ed
> interface in the least.
>
> Now, whether the perf people feel they want that kind of
> functionality, I don't know. It's possible that they simply do not
> want to handle events that are complex enough that they would have
> arbitrary size.

I agree. Although I think the scheme you propose can sit on top of the ring
buffer and does not necessarily need to be at the bottom layer. The sub-buffer
disagreement Peter and I have is related to the ring buffer core.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-03 21:16:50

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Ingo Molnar ([email protected]) wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> > > <[email protected]> wrote:
> > > >
> > > > The real issue here, IMHO, is that Perf has tied gory ring buffer
> > > > implementation details to the userspace perf ABI, and there is now strong
> > > > unwillingness from Perf developers to break this ABI.
> >
> > (Wrong.)

I am glad to hear this. So should I understand that if we show that the current
perf ABI imposes significant design constraints and results in poor performance
and inability to support flight recorder mode (which is needed to unify the ring
buffers), we can deprecate the ABI?

[...]


> > We may want to add things like a NOP event to pad out the end of page

Or simply write the page (or sub-buffer) size information in a page (or
sub-buffer) header. The gain here is that by doing so we don't have to reserve
an event ID for the NOP event, which adds one extra ID reserved in _each_ event
header. You might be tempted to say "oh, it's just a single value, who cares?",
but with the amount of data we're moving, being able to represent the event
header on a very small number of bits really makes a difference. Bloat creeps in
one single bit at a time until we start not caring about adding whole integers,
and by the time we're there the game was over long ago: performance suffers deeply.

The large size of the perf event headers is, by the way, another factor that
might explain its poor performance.
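
For illustration, a rough sketch of the kind of layout I have in mind (hypothetical field sizes, not the actual LTTng or perf formats): the sub-buffer header carries the amount of valid data, so no padding/NOP event ID is needed and the per-event header stays tiny:

#include <linux/types.h>

/* Per sub-buffer: written once, amortized over all the events it contains. */
struct subbuf_header {
	u64	timestamp_begin;	/* full clock value at sub-buffer start */
	u32	data_size;		/* bytes of valid event data in this sub-buffer */
};

/* Per event: no bits wasted on a NOP/padding event ID. */
struct compact_event_header {
	u32	id:5;			/* event type (one value kept as an escape code) */
	u32	tsc_delta:27;		/* compressed timestamp delta */
};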

[...]

> [ The control structure of the mmap area is there for performance/wakeup
> optimizations

I am doubtful about an "optimization" that affects what should be a slow path:
the user-space wakeup that delivers multiple events at once. Have you checked
whether this leads to an actual, noticeable performance increase at all ?

> (and to allow the kernel to lose information on producer
> overload, while still giving user-space an idea that we lost data and how
> much)

This can be performed with a standard system call rather than playing games
with a shared page into which both the kernel and user-space write. The
advantage is that by letting user-space call the kernel (rather than just
write "I'm done" into that page by updating the consumer value), we can let the
kernel perform tasks that might enable us to implement flight recorder mode all
within the same ring buffer implementation.
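
Purely as an illustration (hypothetical ioctl numbers, not an existing ABI), the user-space consumption loop could look like this; each release gives the kernel a natural hook point:

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical ioctls, for illustration only -- not an existing ABI. */
#define TRACE_IOC_GET_SUBBUF	_IOR('T', 0x00, uint64_t)
#define TRACE_IOC_PUT_SUBBUF	_IO('T', 0x01)

static int consume_one_subbuf(int buf_fd, void (*parse)(uint64_t offset))
{
	uint64_t offset;

	if (ioctl(buf_fd, TRACE_IOC_GET_SUBBUF, &offset) < 0)
		return -1;	/* no fully written sub-buffer available yet */
	parse(offset);		/* read the mmap()ed data starting at 'offset' */
	/* tell the kernel we are done: it can now recycle this sub-buffer */
	return ioctl(buf_fd, TRACE_IOC_PUT_SUBBUF);
}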

> - it does not affect semantics and does not limit us. ]

Well, so far, the main limitation I can see is that it does not allow us to do
flight recorder tracing (a.k.a. overwrite mode).

>
> So there's no design limitation - Peter simply prefers one possible solution
> over another and outlined his reasons - we should hash that out based on the
> technical arguments.

Another argument I've seen from Peter is that he prefers the perf
kernel-userspace interaction to happen through this shared page to diminish the
number of traced events generated by perf activity. But I find this argument
unconvincing, because it really only applies to system call tracing: the rest of
tracing will be affected by the perf user-space process activity. So we might as
well just bite the bullet and accept that the trace is "polluted" by user-space
perf events. It _is_ using up CPU time anyway, so I think it's actually _better_
to know about it, rather than to try to hide the tracer activity. If one really
wants to filter out the tracer activity, it can be done at post-processing
without problem. But at least the information is there.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-04 06:29:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
>
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?

The sub-buffer thing that both ftrace and lttng have is creating a large
buffer from a lot of small buffers, I simply don't see the point of
doing that. It adds complexity and limitations for very little gain.

Their benefit is known synchronization points into the stream, you can
parse each sub-buffer independently, but you can always break up a
continuous stream into smaller parts or use a transport that includes
index points or whatever.

Their down side is that you can never have individual events larger than
the sub-buffer, you need to be aware of the sub-buffer when reserving
space etc..

2010-08-04 06:47:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Tue, 2010-08-03 at 14:25 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
> >
> > > I was more thinking along the lines of making sure a ring buffer has the proper
> > > support for your use-case. It shares a lot of requirements with a standard ring
> > > buffer:
> > >
> > > - Need to be lock-less
> > > - Need to reserve space, write data in a buffer
> > >
> > > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > > dynamically),
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
>
> This reluctance against splitting a buffer into sub-buffers might contribute to
> explain the poor performance experienced with the Perf ring buffer.

That's just unsubstantiated FUD.

> These
> "sub-buffers" are really nothing new: these are called "periods" in the audio
> world. They help lowering the ring buffer performance overhead because:
>
> 1) They allow writing into the ring buffer without SMP-safe synchronization
> primitives and memory barriers for each record. Synchronization is only needed
> across sub-buffer boundaries, which amortizes the cost over a large number of
> events.

The only SMP barrier we (should) have is when we update the user visible
head pointer. The buffer code itself uses local{,64}_t for all other
atomic ops.

If you want to amortize that barrier, simply hold off the head update
for a while, no need to introduce sub-buffers.
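
Something along these lines would do it (rough sketch, made-up names, not the actual perf code):

#include <linux/types.h>
#include <asm/local.h>
#include <asm/system.h>		/* smp_wmb() */

#define PUBLISH_BATCH	4096	/* bytes between head publications (arbitrary) */

struct sketch_rb {
	local_t		commit;		/* writer-local commit counter */
	unsigned long	last_published;	/* last value made visible to user space */
	u64		*data_head;	/* user-visible head in the mmap()ed control page */
};

/* Publish the head only every PUBLISH_BATCH bytes, so the smp_wmb() that
 * orders the data stores before the head store is paid once per batch. */
static void maybe_publish_head(struct sketch_rb *rb)
{
	unsigned long commit = local_read(&rb->commit);

	if (commit - rb->last_published < PUBLISH_BATCH)
		return;

	smp_wmb();			/* data writes before head update */
	*rb->data_head = commit;
	rb->last_published = commit;
}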

> 2) They are much more splice (and, in general, page-exchange) friendly, because
> records written after a synchronization point start at the beginning of a page.
> This removes the need for extra copies.

This just doesn't make any sense at all, I could splice full pages just
fine, splice keeps page order so these synchronization points aren't
critical in any way.

The only problem I have with splice atm is that we don't have a buffer
interface without mmap() and we cannot splice pages out from under
mmap() on all architectures in a sane manner.

> So I have to ask: do you detest the sub-buffer concept only because you are tied
> to the current Perf userspace ABI which cannot support this without an ABI
> change ?

No because I don't see the point.

> I'm trying to help out here, but it does not make the task easy if we have both
> hands tied in our back because we have to keep backward ABI compatibility for a
> tool (perf) forever, even considering its sources are shipped with the kernel.

Dude, its a published user<->kernel ABI, also you're not saying why you
would want to break it. In your other email you allude to things like
flight recorder mode, that could be done with the current set-up, no
need to break the ABI at all. All you need to do is track the tail
pointer and publish it.

> Nope. I'm thinking that we can use a buffer just to save the stack as we call
> functions and return, e.g.

We don't have a callback on function entry, and I'm not going to use
mcount for that, that's simply insane.

> call X -> reserve space to save "X" and arguments.
> call Y -> same for Y.
> call Z -> same for Z.
> return -> discard event for Z.
> return -> discard event for Y.
>
> if we grab the buffer content at that point, then we have X and its arguments,
> which is the function currently executed. That would require the ability to
> uncommit and unreserve an event, which is not a problem as long as we have not
> committed a full sub-buffer.

Again, I'm not really seeing the point of using sub-buffers at all.

Also, what happens when we write an event after Y? Then the discard must
fail or turn Y into a NOP, leaving a hole in the buffer.

> I thought that this buffer was chasing the function entry/exits rather than
> doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
> about his use-case ?

No, its a pure stack unwind from NMI context. When we get an event (PMI,
tracepoint, whatever) we write out event, if the consumer asked for a
stacktrace with each event, we unwind the stack for him.

> > Additionally, if you have multiple consumers you can simply copy the
> > stacktrace again, avoiding the whole pointer chase exercise. While you
> > could conceivably copy from one ringbuffer into another that will result
> > in very nasty serialization issues.
>
> Assuming Frederic is saving information to this stack-like ring buffer at each
> function entry and discarding at each function return, then we don't have the
> pointer chase.
>
> What I am proposing does not even involve a copy: when we want to take a
> snapshot, we just have to force a sub-buffer switch on the ring buffer. The
> "returns" happening at the beginning of the next (empty) sub-buffer would
> clearly fail to discard records (expecting non-existing entry records). We would
> then have to save a small record saying that a function return occurred. The
> current stack frame at the end of the next sub-buffer could be deduced from the
> complete collection of stack frame samples.

And suppose the stack-trace was all of 16 entries (not uncommon for a
kernel stack), then you waste a whole page for 128 bytes (assuming your
sub-buffer is page sized). I'll take the memcopy, thank you.

2010-08-04 06:49:19

by Dave Chinner

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Tue, Aug 03, 2010 at 11:56:11AM -0700, Linus Torvalds wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
>
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?
>
> One thing that I think could easily make sense in a _lot_ of buffering
> areas is the notion of a "continuation" buffer. We know we have cases
> where we want to attach a lot of data to one particular event, but the
> buffering itself is inevitably always going to have some limits on
> atomicity etc. And quite often, the event that _generates_ the data is
> not necessarily going to have all that data in one contiguous region,
> and doing a scatter-gather memcpy to get it that way is not good
> either.
>
> At the same time, I do _not_ believe that the kernel ring-buffer code
> should handle pointers to sub-buffers etc, or worry about iovec-like
> arrays of smaller ranges. So if _that_ is what you mean by "concept of
> sub-buffers", then I agree with you.
>
> But what I do think might make a lot of sense is to allow buffer
> fragments, and just teach user space to do de-fragmentation. Where it
> would be important that the de-fragmentation really is all in user
> space, and not really ever visible to the ring-buffer implementation
> itself (and there would not, for example, be any guarantees that the
> fragments would be contiguous - there could be other events in the
> buffer in between fragments). Maybe we could even say that fragments
> might be across different CPU ring-buffers, and user-space needs to
> sort it out if it wants to (where "sort it out" literally would mean
> having to sort and re-attach them in the right order, since there
> wouldn't be any ordering between them).
>
> From a kernel perspective, the only thing you need for fragment
> handling would be to have a buffer entry that just says "I'm fragment
> number X of event ID Y". Nothing more. Everything else would be up to
> the parser in user space to work out.

Heh. For a moment there I thought you were describing the way
XFS writes transactions into its log. Replace "CPU ring-buffers"
with "in-core log buffers", "userspace parsing" with "log recovery"
and "event ID" with "transaction ID", and the concept you describe
is eerily similar. That includes the fact that transactions are not
contiguous in the log, can interleave fragments between concurrent
transaction commits and they can span multiple log buffers, too. It
works pretty well for scaling concurrent writers....

I'll get back in my box now ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-08-04 07:15:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe


* Peter Zijlstra <[email protected]> wrote:

> > What I am proposing does not even involve a copy: when we want to take a
> > snapshot, we just have to force a sub-buffer switch on the ring buffer.
> > The "returns" happening at the beginning of the next (empty) sub-buffer
> > would clearly fail to discard records (expecting non-existing entry
> > records). We would then have to save a small record saying that a function
> > return occurred. The current stack frame at the end of the next sub-buffer
> > could be deduced from the complete collection of stack frame samples.
>
> And suppose the stack-trace was all of 16 entries (not uncommon for a kernel
> stack), then you waste a whole page for 128 bytes (assuming your sub-buffer
> is page sized). I'll take the memcopy, thank you.

To throw some hard numbers into the discussion, I found two random callgraph
perf.data's on my boxes (both created prior to the start of this discussion) and
here is the distribution of their call-chain lengths:

aldebaran:~> perf report -D | grep 'chain: nr:' | cut -d: -f3- | sort -n | uniq -c
2 4
21 6
23 8
13 9
20 10
29 11
21 12
25 13
54 14
112 15
72 16
77 17
35 18
38 19
48 20
29 21
10 22
97 23
3 24
1 25
2 26
2 28
2 29
1 30
2 31

So the peak/average here is around 15 entries.

The other one:

phoenix:~> perf report -D | grep 'chain: nr:' | cut -d: -f3- | sort -n | uniq -c
1 2
70 3
222 4
112 5
116 6
329 7
241 8
163 9
203 10
287 11
159 12
4 13
6 14
22 15
2 16
11 17
5 18

Here the average is even lower - around 8 entries.

Thanks,

Ingo

2010-08-04 07:22:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe


* Dave Chinner <[email protected]> wrote:

> On Tue, Aug 03, 2010 at 11:56:11AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> >
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
> >
> > One thing that I think could easily make sense in a _lot_ of buffering
> > areas is the notion of a "continuation" buffer. We know we have cases
> > where we want to attach a lot of data to one particular event, but the
> > buffering itself is inevitably always going to have some limits on
> > atomicity etc. And quite often, the event that _generates_ the data is
> > not necessarily going to have all that data in one contiguous region,
> > and doing a scatter-gather memcpy to get it that way is not good
> > either.
> >
> > At the same time, I do _not_ believe that the kernel ring-buffer code
> > should handle pointers to sub-buffers etc, or worry about iovec-like
> > arrays of smaller ranges. So if _that_ is what you mean by "concept of
> > sub-buffers", then I agree with you.
> >
> > But what I do think might make a lot of sense is to allow buffer
> > fragments, and just teach user space to do de-fragmentation. Where it
> > would be important that the de-fragmentation really is all in user
> > space, and not really ever visible to the ring-buffer implementation
> > itself (and there would not, for example, be any guarantees that the
> > fragments would be contiguous - there could be other events in the
> > buffer in between fragments). Maybe we could even say that fragments
> > might be across different CPU ring-buffers, and user-space needs to
> > sort it out if it wants to (where "sort it out" literally would mean
> > having to sort and re-attach them in the right order, since there
> > wouldn't be any ordering between them).
> >
> > From a kernel perspective, the only thing you need for fragment
> > handling would be to have a buffer entry that just says "I'm fragment
> > number X of event ID Y". Nothing more. Everything else would be up to
> > the parser in user space to work out.
>
> Heh. For a moment there I thought you were describing the the way XFS writes
> transactions into it's log. Replace "CPU ring-buffers" with "in-core log
> buffers", "userspace parsing" with "log recovery" and "event ID" with
> "transaction ID", and the concept you describe is eerily similar. That
> includes the fact that transactions are not contiguous in the log, can
> interleave fragments between concurrent transaction commits and they can
> span multiple log buffers, too. It works pretty well for scaling concurrent
> writers....

That's certainly a good model when you have to stream into a
persistent-storage transaction log space with multiple writers.

The difference is that with instrumentation we are generally able to make
things per task or per cpu so there's no real multi-CPU 'concurrent writers'
concurrency.

You don't have that luxury/simplicity when logging to storage, of course!

Thanks,

Ingo

2010-08-04 14:06:11

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> >
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
>
> The sub-buffer thing that both ftrace and lttng have is creating a large
> buffer from a lot of small buffers, I simply don't see the point of
> doing that. It adds complexity and limitations for very little gain.

The first major gain is the ability to implement flight recorder tracing
(overwrite mode), which Perf still lacks.

A second major gain: having these sub-buffers lets the trace analyzer seek in
the trace very efficiently by allowing it to perform a binary search for time to
find the appropriate sub-buffer. It becomes immensely useful with large traces.

The third major gain: for live streaming of traces, having sub-buffer lets you
"package" the event data you send over the network into sub-buffers. So the
trace analyzer, receiving this information live while the trace is being
recorded, can start using the information when the full sub-buffer is received.
It does not have to play games with the last event (or event header) perhaps
being incompletely sent, which implies that you absolutely _need_ to save the
event size along with each event header (you cannot simply let the analyzer
parse the event payload to determine the size). Here again, space wasted.
Furthermore, this deals with information loss: a trace is still readable even if
a sub-buffer must be discarded.

Making sure events don't cross sub-buffer boundaries simplifies a lot of things,
starting with dealing with "overwritten" sub-buffers in flight recorder mode.
Trying to deal with a partially overwritten event is just insane.
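
To illustrate the seek-by-time gain: assuming each sub-buffer header carries its begin timestamp, the analyzer side is a plain binary search over per-sub-buffer index entries instead of a linear scan of the event stream (user-space sketch, hypothetical index layout):

#include <stdint.h>
#include <stddef.h>

struct subbuf_index {
	uint64_t timestamp_begin;	/* taken from the sub-buffer header */
	uint64_t file_offset;		/* where that sub-buffer starts in the trace file */
};

/* Return the index of the sub-buffer containing time 't', assuming the
 * entries are sorted by timestamp_begin (they are, by construction). */
static size_t find_subbuf(const struct subbuf_index *idx, size_t nr, uint64_t t)
{
	size_t lo = 0, hi = nr;		/* invariant: answer lies in [lo, hi) */

	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (idx[mid].timestamp_begin <= t)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}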

>
> Their benefit is known synchronization points into the stream, you can
> parse each sub-buffer independently, but you can always break up a
> continuous stream into smaller parts or use a transport that includes
> index points or whatever.

I understand that you could perform amortized synchronization without
sub-buffers. I however don't see how flight recorder, efficient seek on multi-GB
traces (without reading the whole event stream), and live streaming can be
achieved.

> Their down side is that you can never have individual events larger than
> the sub-buffer,

True. But with configurable sub-buffer size (can be from 4kB to many MB), I
don't see the problem.

> you need to be aware of the sub-buffer when reserving
> space

Only the ring buffer needs to be aware of that. It returns an error if the event
is larger than the sub-buffer size.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-04 14:45:45

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Tue, 2010-08-03 at 14:25 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra ([email protected]) wrote:
> > > On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
> > >
> > > > I was more thinking along the lines of making sure a ring buffer has the proper
> > > > support for your use-case. It shares a lot of requirements with a standard ring
> > > > buffer:
> > > >
> > > > - Need to be lock-less
> > > > - Need to reserve space, write data in a buffer
> > > >
> > > > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > > > dynamically),
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> >
> > This reluctance against splitting a buffer into sub-buffers might contribute to
> > explain the poor performance experienced with the Perf ring buffer.
>
> That's just unsubstantiated FUD.

Extracted from:
http://lkml.org/lkml/2010/7/9/368

(executive summary)

* Throughput

* Flight recorder mode

Ring Buffer Library 83 ns/entry (512kB sub-buffers, no reader)
89 ns/entry (512kB sub-buffers: read 0.3M entries/s)


Ftrace Ring Buffer: 103 ns/entry (no reader)
187 ns/entry (read by event: read 0.4M entries/s)

Perf record (flight recorder mode unavailable)


* Discard mode

Ring Buffer Library: 96 ns/entry discarded
257 ns/entry written (read: 2.8M entries/s)

Perf Ring Buffer: 423 ns/entry written (read: 2.3M entries/s)
(Note that this number is based on the perf event approximation output (based on
a 24 bytes/entry estimation) rather than the benchmark module count due to its
inaccuracy, which is caused by perf not letting the benchmark module know about
discarded events.)

It is really hard to get a clear picture of the data write overhead with perf,
because you _need_ to consume data. Making perf support flight recorder mode
would really help getting benchmarks that are easier to compare.

>
> > These
> > "sub-buffers" are really nothing new: these are called "periods" in the audio
> > world. They help lowering the ring buffer performance overhead because:
> >
> > 1) They allow writing into the ring buffer without SMP-safe synchronization
> > primitives and memory barriers for each record. Synchronization is only needed
> > across sub-buffer boundaries, which amortizes the cost over a large number of
> > events.
>
> The only SMP barrier we (should) have is when we update the user visible
> head pointer. The buffer code itself uses local{,64}_t for all other
> atomic ops.
>
> If you want to amortize that barrier, simply hold off the head update
> for a while, no need to introduce sub-buffers.

I understand your point about amortized synchronization. However I still don't
see how you can achieve flight recorder mode, efficient seek on multi-GB traces
without reading the whole event stream, and live streaming without sub-buffers
(and, ideally, without too many headaches involved). ;)

>
> > 2) They are much more splice (and, in general, page-exchange) friendly, because
> > records written after a synchronization point start at the beginning of a page.
> > This removes the need for extra copies.
>
> This just doesn't make any sense at all, I could splice full pages just
> fine, splice keeps page order so these synchronization points aren't
> critical in any way.

If you need to read non-filled pages, then you need to splice pages piece-wise.
This does not fit well with flight recorder tracing, for which the solution
Steven and I have found is to atomically exchange pages (for Ftrace) or
sub-buffers (for the generic ring buffer library) between the reader and writer.

>
> The only problem I have with splice atm is that we don't have a buffer
> interface without mmap() and we cannot splice pages out from under
> mmap() on all architectures in a sane manner.

The problem Perf has is probably more with flight recorder (overwrite) tracing
support than with splice() per se; in this you are right.

>
> > So I have to ask: do you detest the sub-buffer concept only because you are tied
> > to the current Perf userspace ABI which cannot support this without an ABI
> > change ?
>
> No because I don't see the point.

OK, good to know you are open to ABI changes if I present convincing arguments.

>
> > I'm trying to help out here, but it does not make the task easy if we have both
> > hands tied in our back because we have to keep backward ABI compatibility for a
> > tool (perf) forever, even considering its sources are shipped with the kernel.
>
> Dude, its a published user<->kernel ABI, also you're not saying why you
> would want to break it. In your other email you allude to things like
> flight recorder mode, that could be done with the current set-up, no
> need to break the ABI at all. All you need to do is track the tail
> pointer and publish it.

How do you plan to read the data concurrently with the writer overwriting the
data while you are reading it without corruption ?

>
> > Nope. I'm thinking that we can use a buffer just to save the stack as we call
> > functions and return, e.g.
>
> We don't have a callback on function entry, and I'm not going to use
> mcount for that, that's simply insane.

OK, now I get a clearer picture of what Frederic is trying to do.

>
> > call X -> reserve space to save "X" and arguments.
> > call Y -> same for Y.
> > call Z -> same for Z.
> > return -> discard event for Z.
> > return -> discard event for Y.
> >
> > if we grab the buffer content at that point, then we have X and its arguments,
> > which is the function currently executed. That would require the ability to
> > uncommit and unreserve an event, which is not a problem as long as we have not
> > committed a full sub-buffer.
>
> Again, I'm not really seeing the point of using sub-buffers at all.

This part of the email is unrelated to sub-buffers.

>
> Also, what happens when we write an event after Y? Then the discard must
> fail or turn Y into a NOP, leaving a hole in the buffer.

Given that this buffer is simply used to dump the stack unwind result, I
think my scenario above was simply misled.

>
> > I thought that this buffer was chasing the function entry/exits rather than
> > doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
> > about his use-case ?
>
> No, its a pure stack unwind from NMI context. When we get an event (PMI,
> tracepoint, whatever) we write out event, if the consumer asked for a
> stacktrace with each event, we unwind the stack for him.

So why the copy ? Frederic seems to put the stack unwind in a special temporary
buffer. Why is it not saved directly into the trace buffers ?

> > > Additionally, if you have multiple consumers you can simply copy the
> > > stacktrace again, avoiding the whole pointer chase exercise. While you
> > > could conceivably copy from one ringbuffer into another that will result
> > > in very nasty serialization issues.
> >
> > Assuming Frederic is saving information to this stack-like ring buffer at each
> > function entry and discarding at each function return, then we don't have the
> > pointer chase.
> >
> > What I am proposing does not even involve a copy: when we want to take a
> > snapshot, we just have to force a sub-buffer switch on the ring buffer. The
> > "returns" happening at the beginning of the next (empty) sub-buffer would
> > clearly fail to discard records (expecting non-existing entry records). We would
> > then have to save a small record saying that a function return occurred. The
> > current stack frame at the end of the next sub-buffer could be deduced from the
> > complete collection of stack frame samples.
>
> And suppose the stack-trace was all of 16 entries (not uncommon for a
> kernel stack), then you waste a whole page for 128 bytes (assuming your
> sub-buffer is page sized). I'll take the memcopy, thank you.

Well, now that I understand what you are trying to achieve, I retract my
proposal of using a stack-like ring buffer for this. I think that the stack dump
should simply be saved directly to the ring buffer, without copy. The
dump_stack() functions might have to be extended so they don't just save text
dumbly, but can also be used to save events into the trace in binary format,
perhaps with the continuation cookie Linus was proposing.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-04 14:50:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:

> The first major gain is the ability to implement flight recorder tracing
> (overwrite mode), which Perf still lacks.

http://lkml.org/lkml/2009/7/6/178

I've send out something like that several times, but nobody took it
(that is, tested it and provided a user). Note how it doesn't require
anything like sub-buffers.

> A second major gain: having these sub-buffers lets the trace analyzer seek in
> the trace very efficiently by allowing it to perform a binary search for time to
> find the appropriate sub-buffer. It becomes immensely useful with large traces.

You can add sync events with a specific magic cookie in. Once you find
the cookie you can sync and start reading it reliably -- the advantage
is that sync events are very easy to have as an option and don't
complicate the reserve path.
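
Roughly (hypothetical format, just to illustrate the idea):

#include <stdint.h>
#include <string.h>

#define SYNC_COOKIE	0xf00dfacedeadbeefULL	/* made-up magic value */

/* Scan forward from an arbitrary offset until a sync cookie is found; a
 * real format would validate the surrounding record before trusting it. */
static long find_sync(const uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i + sizeof(uint64_t) <= len; i++) {
		uint64_t v;

		memcpy(&v, buf + i, sizeof(v));
		if (v == SYNC_COOKIE)
			return (long)i;
	}
	return -1;	/* no sync point in this window */
}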

> The third major gain: for live streaming of traces, having sub-buffer lets you
> "package" the event data you send over the network into sub-buffers.

See the sync events. Also, a transport can rewrite the stream any which
way it pretty well wants to, as long as the kernel<->user interface is
reliable, an unreliable user<->user transport can repackage it to suit
its needs.

> Making sure events don't cross sub-buffer boundaries simplify a lot of things,
> starting with dealing with "overwritten" sub-buffers in flight recorder mode.
> Trying to deal with a partially overwritten event is just insane.

See the above patch, simply parse the events and push the tail pointer
ahead of the reservation before you trample on it.

If you worry about the cost of parsing the events, you can amortize that
by things like keeping the offset of the first event in every page in
the pageframe, or the offset of the next sync event or whatever scheme
you want.

Again, no need for sub-buffers.

Also, not having sub-buffers makes reservation easier since you don't
need to worry about those empty tails.

2010-08-04 14:56:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:

> How do you plan to read the data concurrently with the writer overwriting the
> data while you are reading it without corruption ?

I don't consider reading while writing (in overwrite mode) a valid case.

If you want to use overwrite, stop the writer before reading it.

> I think that the stack dump
> should simply be saved directly to the ring buffer, without copy. The
> dump_stack() functions might have to be extended so they don't just save text
> dumbly, but can also be used to save events into the trace in binary format,
> perhaps with the continuation cookie Linus was proposing.

Because I don't want to support truncating reservations (because that
leads to large nops for nested events) and when the event needs to go to
multiple buffers you can re-use the stack-dump without having to do the
unwind again.

The problem with the continuation thing Linus suggested is that it would
bloat the output 3 fold. A stack entry is a single u64. If you want to
wrap that in a continuation event you need: a header (u64), a cookie
(u64) and the entry (u64).

Continuation events might make heaps of sense for larger data pieces,
but I don't see them being practical for such small pieces.

2010-08-06 01:42:39

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
>
> > The first major gain is the ability to implement flight recorder tracing
> > (overwrite mode), which Perf still lacks.
>
> http://lkml.org/lkml/2009/7/6/178
>
> I've send out something like that several times, but nobody took it
> (that is, tested it and provided a user). Note how it doesn't require
> anything like sub-buffers.

+static void perf_output_tail(struct perf_mmap_data *data, unsigned int head)
...
+ unsigned long tail, new;
...
+ unsigned long size;

+ while (tail + size - head < 0) {
....
+ }

How is the while condition ever supposed to be true ? All the operands are
unsigned, so the expression can never be negative. I guess nobody took it
because it simply was not ready for testing.

>
> > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > the trace very efficiently by allowing it to perform a binary search for time to
> > find the appropriate sub-buffer. It becomes immensely useful with large traces.
>
> You can add sync events with a specific magic cookie in. Once you find
> the cookie you can sync and start reading it reliably

You need to read the whole trace to find these cookies (even if it is just once
at the beginning if you create an index). My experience with users has shown me
that the delay between stopping trace gathering and having the data shown to the
user is very important, because this is done repeatedly while debugging a
problem, and it is time the user spends sitting in front of his screen, waiting.

> -- the advantage
> is that sync events are very easy to have as an option and don't
> complicate the reserve path.

Perf, on its reserve/commit fast paths:

perf_output_begin: 543 bytes
(perf_output_get_handle is inlined)

perf_output_put_handle: 201 bytes
perf_output_end: 77 bytes
calls perf_output_put_handle

Total for perf: 821 bytes

Generic Ring Buffer Library reserve/commit fast paths:

Reserve: 511 bytes
Commit: 266 bytes
Total for Generic Ring Buffer: 777 bytes

So the generic ring buffer is not only faster and supports sub-buffers (along
with all the nice features this brings); its reserve and commit hot paths
fit in less instructions: it is *less* complicated than Perf's.


>
> > The third major gain: for live streaming of traces, having sub-buffer lets you
> > "package" the event data you send over the network into sub-buffers.
>
> See the sync events.

I am guessing you plan to rely on these sync events to know which data "blocks"
are fully received. This could possibly be made to work.

> Also, a transport can rewrite the stream any which
> way it pretty well wants to, as long as the kernel<->user interface is
> reliable an unreliable user<->user transport can repackage it to suit
> its needs.

repackage = copy = poor performance. No thanks.

>
> > Making sure events don't cross sub-buffer boundaries simplify a lot of things,
> > starting with dealing with "overwritten" sub-buffers in flight recorder mode.
> > Trying to deal with a partially overwritten event is just insane.
>
> See the above patch, simply parse the events and push the tail pointer
> ahead of the reservation before you trample on it.

I'm not sure that patch is ready for prime-time yet. As you point out in your
following email, you need to stop tracing to consume data, which does not fit my
users' use-cases.

>
> If you worry about the cost of parsing the events, you can amortize that
> by things like keeping the offset of the first event in every page in
> the pageframe, or the offset of the next sync event or whatever scheme
> you want.

Hrm ? AFAIK, the page-frame is an internal kernel-only data structure. That
won't be exported to user-space, so how is the parser supposed to see this
information exactly to help it speed up parsing ?

>
> Again, no need for sub-buffers.

I don't see this claim as satisfactorily supported here, sorry.

>
> Also, not having sub-buffers makes reservation easier since you don't
> need to worry about those empty tails.

So far I've shown that your sub-buffer-less implementation is not even simpler
than an implementation using sub-buffers.

By the way, even with your sub-buffer-free scheme, you cannot write an event
bigger than your buffer size. So you have a similar limitation in terms of
maximum event size (and you already have to test for this on your fast path).

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-06 01:49:34

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
>
> > How do you plan to read the data concurrently with the writer overwriting the
> > data while you are reading it without corruption ?
>
> I don't consider reading while writing (in overwrite mode) a valid case.
>
> If you want to use overwrite, stop the writer before reading it.

How inconvenient. It happens that the relatively large group of users I am
working for do care for this use-case. They cannot afford to stop tracing as
soon as they hit "one bug". This "bug" could be a simple odd scenario that they
want to snapshot, but in all cases they want tracing to continue.

>
> > I think that the stack dump
> > should simply be saved directly to the ring buffer, without copy. The
> > dump_stack() functions might have to be extended so they don't just save text
> > dumbly, but can also be used to save events into the trace in binary format,
> > perhaps with the continuation cookie Linus was proposing.
>
> Because I don't want to support truncating reservations (because that
> leads to large nops for nested events)

Agreed in this case. Truncating reservations might make sense for filtering, but
even there I have a strong preference for filtering directly on the information
received as parameter, before performing buffer space reservation, whenever
possible.

> and when the event needs to go to
> multiple buffers you can re-use the stack-dump without having to do the
> unwind again.
>
> The problem with the continuation thing Linus suggested is that it would
> bloat the output 3 fold. A stack entry is a single u64. If you want to
> wrap that in a continuation event you need: a header (u64), a cookie
> (u64) and the entry (u64).

Agreed, it's probably not such a good fit for these small pieces of information.

>
> Continuation events might make heaps of sense for larger data pieces,
> but I don't see them being practical for such small pieces.

Yep.

What I did in a past life in earlier LTTng versions was to use a 2-pass unwind.
The first pass is the most costly because it brings all the data into the L1
cache. This first pass is used to compute the array size you need to save the
whole stack frame, but it does not copy anything. The second pass performs the
copy. This was surprisingly efficient.
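
Roughly, the idea was the following (illustrative sketch using a naive frame-pointer walk; the real code obviously validated each frame and used the architecture's unwind helpers):

struct stack_frame {
	struct stack_frame	*next_fp;	/* saved frame pointer */
	unsigned long		return_ip;	/* return address */
};

/* Pass 1: walk the frames to count them (and pull them into L1). */
static unsigned int count_frames(struct stack_frame *fp, unsigned long stack_top)
{
	unsigned int nr = 0;

	while ((unsigned long)fp < stack_top && fp->return_ip) {
		nr++;
		fp = fp->next_fp;
	}
	return nr;
}

/* Pass 2: copy exactly 'nr' return addresses into the reserved slot. */
static void copy_frames(struct stack_frame *fp, unsigned long stack_top,
			unsigned long *dst, unsigned int nr)
{
	while (nr-- && (unsigned long)fp < stack_top) {
		*dst++ = fp->return_ip;
		fp = fp->next_fp;
	}
}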

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

Peter Zijlstra wrote:
> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
>
>> How do you plan to read the data concurrently with the writer overwriting the
>> data while you are reading it without corruption ?
>
> I don't consider reading while writing (in overwrite mode) a valid case.
>
> If you want to use overwrite, stop the writer before reading it.

For example, would you like to have to stop the audit before you can
read the system audit log?

NO, that's a most important requirement for tracers, especially for
system admins (they're the most important users of Linux), to check
system health and catch system troubles.

For performance measurement and finding hotspots, one-shot tracing
is enough. But that's just for developers. For real-world computing,
Linux is just an OS: users want to run their system, middleware and
applications without trouble. But when they do hit a problem, they
wanna shoot it ASAP.
The flight recorder mode is mainly for those users.

Thank you,

--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2010-08-06 09:51:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> Peter Zijlstra wrote:
> > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> >
> >> How do you plan to read the data concurrently with the writer overwriting the
> >> data while you are reading it without corruption ?
> >
> > I don't consider reading while writing (in overwrite mode) a valid case.
> >
> > If you want to use overwrite, stop the writer before reading it.
>
> For example, would you like to read system audit log always after
> stop the audit?
>
> NO, that's a most important requirement for tracers, especially for
> system admins (they're the most important users of Linux) to check
> the system health and catch system troubles.
>
> For performance measurement and checking hotspot, one-shot tracing
> is enough. But it's just for developers. But for the real world
> computing, Linux is just an OS, users want to run their system,
> middleware and applications, without troubles. But when they hit
> a trouble, they wanna shoot it ASAP.
> The flight recorder mode is mainly for those users.

You cannot over-write and consistently read the buffer, that's plain
impossible. With sub-buffers you can swivel a sub-buffer and
consistently read that, but there is no guarantee the next sub-buffer
you steal was indeed adjacent to the previous buffer you stole as that
might have gotten over-written by the active writer while you were
stealing the previous one.

If you want to snapshot buffers, do that, simply swivel the whole trace
buffer, and continue tracing in a new one, then consume the old trace in
a consistent manner.
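
In its simplest form that is just a pointer swap (sketch, made-up names; a real version also has to wait for writers still inside the old buffer before reading it):

#include <asm/system.h>		/* xchg() */

struct trace_buffer;		/* opaque here */

static struct trace_buffer *snapshot_swap(struct trace_buffer **active,
					  struct trace_buffer *spare)
{
	/* writers pick up 'spare' from here on; the old buffer goes to the reader */
	return xchg(active, spare);
}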

I really see no value in being able to read unrelated bits and pieces of
a buffer.

So no, I will _not_ support reading an over-write buffer while there is
an active reader.

2010-08-06 09:52:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, 2010-08-05 at 21:49 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> >
> > > How do you plan to read the data concurrently with the writer overwriting the
> > > data while you are reading it without corruption ?
> >
> > I don't consider reading while writing (in overwrite mode) a valid case.
> >
> > If you want to use overwrite, stop the writer before reading it.
>
> How inconvenient. It happens that the relatively large group of users I am
> working for do care for this use-case. They cannot afford to stop tracing as
> soon as they hit "one bug". This "bug" could be a simple odd scenario that they
> want to snapshot, but in all cases they want tracing to continue.

Snapshot is fine, just swivel the whole buffer.

2010-08-06 10:11:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, 2010-08-05 at 21:42 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
> >
> > > The first major gain is the ability to implement flight recorder tracing
> > > (overwrite mode), which Perf still lacks.
> >
> > http://lkml.org/lkml/2009/7/6/178
> >
> > I've send out something like that several times, but nobody took it
> > (that is, tested it and provided a user). Note how it doesn't require
> > anything like sub-buffers.

> How is the while condition ever be supposed to be true ? I guess nobody took it
> because it simply was not ready for testing.

I know, I never claimed it was; it was always an illustration of how to
accomplish it. But then, nobody found it important enough to finish.

> > > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > > the trace very efficiently by allowing it to perform a binary search for time to
> > > find the appropriate sub-buffer. It becomes immensely useful with large traces.
> >
> > You can add sync events with a specific magic cookie in. Once you find
> > the cookie you can sync and start reading it reliably
>
> You need to read the whole trace to find these cookies (even if it is just once
> at the beginning if you create an index).

Depends on what you want to do: you can start reading at any point in
the stream and be guaranteed to find a sync point within sync-distance
+ max-event-size.

> My experience with users have shown me
> that the delay between stopping trace gathering having the data shown to the
> user is very important, because this is repeatedly done while debugging a
> problem, and this is time the user is sitting in front of his screen, waiting.

Yeah, because after having had to wait for 36h for the problem to
trigger that extra minute really kills.

All I can say is that in my experience brain throughput is the limiting
factor in debugging. Not some ability to draw fancy pictures.

> > -- the advantage
> > is that sync events are very easy to have as an option and don't
> > complicate the reserve path.
>
> Perf, on its reserve/commit fast paths:
>
> perf_output_begin: 543 bytes
> (perf_output_get_handle is inlined)
>
> perf_output_put_handle: 201 bytes
> perf_output_end: 77 bytes
> calls perf_output_put_handle
>
> Total for perf: 821 bytes
>
> Generic Ring Buffer Library reserve/commit fast paths:
>
> Reserve: 511 bytes
> Commit: 266 bytes
> Total for Generic Ring Buffer: 777 bytes
>
> So the generic ring buffer is not only faster and supports sub-buffers (along
> with all the nice features this brings); its reserve and commit hot paths
> fit in less instructions: it is *less* complicated than Perf's.

All I can say is that less code doesn't equal less complex (nor faster
per-se). Nor have I spent all my time on writing the ring-buffer;
there are more interesting things to do.

And the last time I ran perf on perf, the buffer wasn't the thing that
was taking most time.

And unlike what you claim below, it most certainly can deal with events
larger than a single page.

> > If you worry about the cost of parsing the events, you can amortize that
> > by things like keeping the offset of the first event in every page in
> > the pageframe, or the offset of the next sync event or whatever scheme
> > you want.
>
> Hrm ? AFAIK, the page-frame is an internal kernel-only data structure. That
> won't be exported to user-space, so how is the parser supposed to see this
> information exactly to help it speeding up parsing ?

Its about the kernel parsing the buffer to push the tail ahead of the
reserve window, so that you have a reliable point to start reading the
trace from -- or didn't you actually get the intent of that patch?

2010-08-06 11:15:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Fri, 2010-08-06 at 12:11 +0200, Peter Zijlstra wrote:
> > You need to read the whole trace to find these cookies (even if it is just once
> > at the beginning if you create an index).

Even if you want to index all sync points you can quickly skip through
the file using the sync-distance, after which you'll have, on average,
only 1/2 avg-event-size to read before you find your next sync point.

So suppose you have a 1M sync-distance, and an effective average event
size of 128 bytes, then for a 4G file, you can find all sync points by
only reading ~262144 bytes (not accounting for the fact that the pagecache
will bring in full pages, which would result in something like 16M to be
read in total or somesuch -- which, again, assumes read-ahead isn't going
to play tricks on you).
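
(Spelled out, under those assumptions: 4 GiB / 1 MiB sync-distance = 4096 seek
points, and reading on average half an event, 128 / 2 = 64 bytes, at each of
them gives 4096 * 64 = 262144 bytes.)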

2010-08-06 13:37:57

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> > Peter Zijlstra wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > >
> > >> How do you plan to read the data concurrently with the writer overwriting the
> > >> data while you are reading it without corruption ?
> > >
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > >
> > > If you want to use overwrite, stop the writer before reading it.
> >
> > For example, would you like to read system audit log always after
> > stop the audit?
> >
> > NO, that's a most important requirement for tracers, especially for
> > system admins (they're the most important users of Linux) to check
> > the system health and catch system troubles.
> >
> > For performance measurement and checking hotspot, one-shot tracing
> > is enough. But it's just for developers. But for the real world
> > computing, Linux is just an OS, users want to run their system,
> > middleware and applications, without troubles. But when they hit
> > a trouble, they wanna shoot it ASAP.
> > The flight recorder mode is mainly for those users.
>
> You cannot over-write and consistently read the buffer, that's plain
> impossible.

If you think it is impossible, then you should really go have a look at the
generic ring buffer library, at LTTng and at Ftrace. It looks like we're all
doing the "impossible".

> With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.

We don't care about taking the next adjacent sub-buffer. We care about always
grabbing the oldest sub-buffer that has been written, up to the most current
one.

>
> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.

So you need to allocate many trace buffers to accomplish the same thing, plus an
extra layer on top that does this buffer exchange; I don't call that "simple".
Note that only two trace buffers might not be enough if you have repeated
failures in a short time window; the consumer might take some time to extract
them all.

Compared to that, the sub-buffer scheme only needs a single buffer with 2 (or
more) sub-buffers, plus an extra sub-buffer owned by the reader that we exchange
with the sub-buffer we want to grab for reading. The reader always grabs the
sub-buffer containing the oldest data. The number of sub-buffers used is the
limit on the number of snapshots that can be taken in a relatively short time
window (the time it takes the reader to consume the data).
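
A minimal sketch of that exchange (hypothetical structure, not the actual library code; the real implementation also synchronizes with the writer through per-sub-buffer commit counters so that an in-flight sub-buffer is never handed out):

struct flight_buf {
	void		**subbuf;	/* nr_subbufs pointers into the buffer memory */
	unsigned int	nr_subbufs;
	unsigned long	consumed;	/* sub-buffers consumed so far */
	void		*reader_page;	/* extra sub-buffer owned by the reader */
};

/* 'produced' is the writer's count of completely written sub-buffers. */
static void *reader_swap_oldest(struct flight_buf *b, unsigned long produced)
{
	unsigned int idx;
	void *oldest;

	if (b->consumed == produced)
		return NULL;				/* nothing new to read */
	if (produced - b->consumed > b->nr_subbufs)
		b->consumed = produced - b->nr_subbufs;	/* writer lapped us: skip ahead */

	idx = b->consumed % b->nr_subbufs;
	oldest = b->subbuf[idx];		/* oldest fully written data */
	b->subbuf[idx] = b->reader_page;	/* give the writer a fresh sub-buffer */
	b->reader_page = oldest;		/* the reader now owns this data */
	b->consumed++;
	return oldest;
}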

>
> I really see no value in being able to read unrelated bits and pieces of
> a buffer.

Within a sub-buffer, events are adjacent, and between sub-buffers, events are
guaranteed to be in order (oldest to newest event). It is only in the case where
buffers are relatively small compared to the data throughput that the writer can
overwrite information that would have been useful for a snapshot (e.g.
overwriting relatively recent information while the reader reads the oldest
sub-buffer), but in that case users simply have to tune their buffer size
appropriately to match the trace data throughput.

>
> So no, I will _not_ support reading an over-write buffer while there is
> an active reader.

(I guess you mean active writer)

Here you argue that you don't need to support this feature at the ring buffer
level because you can have a group of ring buffers that does it instead.
How is your multiple-buffer scheme any simpler than sub-buffers ? Either you
have to allocate many of them up front, or, if you want to do it on-demand, you
have to perform memory allocation in NMI context. I don't find either of these
two solutions particularly appealing.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-06 13:46:39

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2010-08-05 at 21:49 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra ([email protected]) wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > >
> > > > How do you plan to read the data concurrently with the writer overwriting the
> > > > data while you are reading it without corruption ?
> > >
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > >
> > > If you want to use overwrite, stop the writer before reading it.
> >
> > How inconvenient. It happens that the relatively large group of users I am
> > working for do care for this use-case. They cannot afford to stop tracing as
> > soon as they hit "one bug". This "bug" could be a simple odd scenario that they
> > want to snapshot, but in all cases they want tracing to continue.
>
> Snapshot is fine, just swivel the whole buffer.

There is a very important trade-off between the amount of information that can
be kept around in memory to take as a snapshot and the amount of system memory
reserved for buffers. The sub-buffer scheme is pretty good at that: the whole
memory reserved (except the extra reader-owned sub-buffer) is available to save
the flight recorder trace.

With the multiple-buffer scheme you propose, only one of the buffers can be used
to save data. This is very limiting, especially for embedded systems in telecom
switches that do not have that much memory: all the memory reserved for the
buffer that is currently inactive is simply wasted. It does not even allow the
user to gather a longer snapshot.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-06 14:13:53

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2010-08-05 at 21:42 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra ([email protected]) wrote:
> > > On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
[...]
> > > > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > > > the trace very efficiently by allowing it to perform a binary search for time to
> > > > find the appropriate sub-buffer. It becomes immensely useful with large traces.
> > >
> > > You can add sync events with a specific magic cookie in. Once you find
> > > the cookie you can sync and start reading it reliably
> >
> > You need to read the whole trace to find these cookies (even if it is just once
> > at the beginning if you create an index).
>
> Depends on what you want to do, you can start reading at any point in
> the stream and be guaranteed to find a sync point within sync-distance
> +max-event-size.

At _any_ point in the stream ?

So if I take, let's say, a few kB of Perf ring buffer data and I choose to
encode them into an event in another buffer (e.g. we're tracing part of the
network traffic). Then we end up in a situation where the event payload will
contain your "so special" sync point data. Basically, you have no guarantee that
you won't mix up standard event data and your synchronization event headers.

Your sync point solution just kills all good encapsulation practices in one go.

> > My experience with users have shown me
> > that the delay between stopping trace gathering having the data shown to the
> > user is very important, because this is repeatedly done while debugging a
> > problem, and this is time the user is sitting in front of his screen, waiting.
>
> Yeah, because after having had to wait for 36h for the problem to
> trigger that extra minute really kills.
>
> All I can say is that in my experience brain throughput is the limiting
> factor in debugging. Not some ability to draw fancy pictures.

Here I have to bring up the fact that Linux kernel developers are not the only
tracer users.

Multi-GB traces can easily be generated within a few seconds/minutes on many
workloads, so we're not talking about many-hour traces here. But if we need to
read the whole trace before it can be shown, we're adding a significant delay
before the trace can be accessed.

In my experience, both brain and data gathering throughputs are limiting factors
to debugging. Drawing fancy pictures merely helps speeding up the brain process
in some cases.


>
> > > -- the advantage
> > > is that sync events are very easy to have as an option and don't
> > > complicate the reserve path.
> >
> > Perf, on its reserve/commit fast paths:
> >
> > perf_output_begin: 543 bytes
> > (perf_output_get_handle is inlined)
> >
> > perf_output_put_handle: 201 bytes
> > perf_output_end: 77 bytes
> > calls perf_output_put_handle
> >
> > Total for perf: 821 bytes
> >
> > Generic Ring Buffer Library reserve/commit fast paths:
> >
> > Reserve: 511 bytes
> > Commit: 266 bytes
> > Total for Generic Ring Buffer: 777 bytes
> >
> > So the generic ring buffer is not only faster and supports sub-buffers (along
> > with all the nice features this brings); its reserve and commit hot paths
> > fit in less instructions: it is *less* complicated than Perf's.
>
> All I can say is that less code doesn't equal less complex (nor faster
> per-se).

Less code = less instruction cache overhead. I've also shown that the LTTng code
is at least twice as fast. In terms of complexity, it is not much more complex; I
also took the extra care of doing the formal proofs to make sure the
corner-cases were dealt with, which I don't reckon either Steven or you
have done.

> Nor have I spend all my time on writing the ring-buffer,
> there's more interesting things to do.

I must admit that I probably spent much more time working on the ring buffer
than you did. It looks like one's interest can only focus on so many areas at
once. So if you are not that interested in ring buffers, can you at least stop
opposing the people who care deeply?

If we agree that we don't care about the same use-cases, there might be room for
many ring buffers in the kernel. It's just a shame that we have to multiply the
amount of code review. We should also note that this goes against Linus' request
for a shared, common ring buffer used by all tracers.


> And the last time I ran perf on perf, the buffer wasn't the thing that
> was taking most time.

Very interesting. I know the trace clock performance is terrible too, but let's
keep that for another discussion please.

>
> And unlike what you claim below, it most certainly can deal with events
> larger than a single page.

What I said below was: perf cannot write events larger than its buffer size. So
it already has to take that "test" branch for maximum event size. I said nothing
about page size in this context.

>
> > > If you worry about the cost of parsing the events, you can amortize that
> > > by things like keeping the offset of the first event in every page in
> > > the pageframe, or the offset of the next sync event or whatever scheme
> > > you want.
> >
> > Hrm ? AFAIK, the page-frame is an internal kernel-only data structure. That
> > won't be exported to user-space, so how is the parser supposed to see this
> > information exactly to help it speeding up parsing ?
>
> Its about the kernel parsing the buffer to push the tail ahead of the
> reserve window, so that you have a reliable point to start reading the
> trace from -- or didn't you actually get the intent of that patch?

I got the intent of the patch; I just somehow missed that this paragraph
applied to the patch specifically.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-06 14:15:27

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2010-08-06 at 12:11 +0200, Peter Zijlstra wrote:
> > > You need to read the whole trace to find these cookies (even if it is just once
> > > at the beginning if you create an index).
>
> Even if you want to index all sync points you can quickly skip through
> the file using the sync-distance, after which you'll have, on average,
> only 1/2 avg-event-size to read before you find your next sync point.
>
> So suppose you have a 1M sync-distance, and an effective average event
> size of 128 bytes, then for a 4G file, you can find all sync points by
> only reading ~262144 bytes (not counting for the fact that the pagecache
> will bring in full pages, which would result in something like 16M to be
> read in total or somesuch -- which, again assumes read-ahead isn't going
> to play tricks on you).
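
Spelling out the arithmetic in the quoted paragraph (with its own assumptions):

        sync points to index:  4 GiB / 1 MiB sync-distance       = 4096
        bytes read per seek:   ~1/2 * 128 B average event size   = 64 B
        total bytes read:      4096 * 64 B                       = 262144 B (256 KiB)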

How do you distinguish between sync events and random payload data ?

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

Peter Zijlstra wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
>> Peter Zijlstra wrote:
>>> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
>>>
>>>> How do you plan to read the data concurrently with the writer overwriting the
>>>> data while you are reading it without corruption ?
>>> I don't consider reading while writing (in overwrite mode) a valid case.
>>>
>>> If you want to use overwrite, stop the writer before reading it.
>> For example, would you like to read system audit log always after
>> stop the audit?
>>
>> NO, that's a most important requirement for tracers, especially for
>> system admins (they're the most important users of Linux) to check
>> the system health and catch system troubles.
>>
>> For performance measurement and checking hotspot, one-shot tracing
>> is enough. But it's just for developers. But for the real world
>> computing, Linux is just an OS, users want to run their system,
>> middleware and applications, without troubles. But when they hit
>> a trouble, they wanna shoot it ASAP.
>> The flight recorder mode is mainly for those users.
>
> You cannot over-write and consistently read the buffer, that's plain
> impossible. With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.

Right, we cannot ensure that. In overwrite mode, the reader could lose
some data because writers overwrite it (or the writer may fail
to write new data to the buffer in non-overwrite mode).
However, I think that doesn't mean this mode is completely useless.
If we can know when (and where) the data was lost, the rest of the data
is still useful in some cases.

> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.

Hmm, wouldn't that consume much more memory than the sub-buffer
ring buffer if we keep spare buffers?
Or, if we allocate it after the reader opens the buffer, won't that also
slow down the reader process?

> I really see no value in being able to read unrelated bits and pieces of
> a buffer.

I think there is a trade-off between a perfect snapshot and memory consumption,
and the right balance depends on the use-case.

>
> So no, I will _not_ support reading an over-write buffer while there is
> an active reader.
>

I hope you will reconsider how useful an overwrite buffer is, even if
it is far from perfect.

Thank you,

--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2010-08-09 16:53:20

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Fri, Aug 06, 2010 at 11:50:40AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> > Peter Zijlstra wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > >
> > >> How do you plan to read the data concurrently with the writer overwriting the
> > >> data while you are reading it without corruption ?
> > >
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > >
> > > If you want to use overwrite, stop the writer before reading it.
> >
> > For example, would you like to read system audit log always after
> > stop the audit?
> >
> > NO, that's a most important requirement for tracers, especially for
> > system admins (they're the most important users of Linux) to check
> > the system health and catch system troubles.
> >
> > For performance measurement and checking hotspot, one-shot tracing
> > is enough. But it's just for developers. But for the real world
> > computing, Linux is just an OS, users want to run their system,
> > middleware and applications, without troubles. But when they hit
> > a trouble, they wanna shoot it ASAP.
> > The flight recorder mode is mainly for those users.
>
> You cannot over-write and consistently read the buffer, that's plain
> impossible. With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.
>
> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.
>
> I really see no value in being able to read unrelated bits and pieces of
> a buffer.



It all depends on the frequency of your events and on the amount of memory
used for the buffer.

If you are tracing syscalls on a semi-idle box with a ring buffer of 500 MB
per cpu, you really don't care about the writer catching up with the reader: it
will simply not happen.

OTOH if you are tracing function graphs, no buffer size will ever be enough:
the writer will always be faster and catch up with the reader.

Using the sub-buffer scheme though, and allowing a concurrent writer and reader
in overwrite mode, we can easily tell the user that the writer was faster and
that content has been lost. Based on that information, the user can choose what
to do: try with a larger buffer, and so on.

See? It's not our role to say: the result might be unreliable if the user
chooses silly settings (not enough memory, a reader too slow for random reasons,
too-high-frequency events...). Let the user deal with that and just inform him
when the results are unreliable. This is what ftrace does currently.
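
A minimal sketch of how such lost-data reporting could work, assuming each
sub-buffer header carries a monotonically increasing sequence number (the
structure and names below are hypothetical, not ftrace's or LTTng's actual
layout):

struct subbuf_header {
        unsigned long seq;      /* increases by one per filled sub-buffer */
        unsigned long size;     /* payload bytes in this sub-buffer */
};

/*
 * Called by the reader for each sub-buffer it obtains in flight-recorder
 * (overwrite) mode; returns how many sub-buffers were overwritten, and
 * therefore lost, since the previous read.
 */
static unsigned long report_lost(const struct subbuf_header *hdr,
                                 unsigned long *expected_seq)
{
        unsigned long lost = hdr->seq - *expected_seq;

        *expected_seq = hdr->seq + 1;
        return lost;    /* 0 means no gap; > 0 means data was overwritten */
}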

Also, the snapshot thing doesn't look like a replacement. If you are
tracing on a low-memory embedded system, you consume a lot of memory
to keep the snapshot alive, which means the live buffer may be critically
shrunk and you might in turn lose traces there.
That said, it's an interesting feature that may fit other kinds of
environments or other needs.


Off-topic: it's sad that, when it comes to tracing, we often have to guess the
needs of the embedded world or learn about them from indirect sources. In the
end we rarely hear from them directly, except maybe at conferences...

2010-08-11 14:34:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

Egad! Go on vacation and the world falls apart.

On Wed, 2010-08-04 at 08:27 +0200, Peter Zijlstra wrote:
> On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> >
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
>
> The sub-buffer thing that both ftrace and lttng have is creating a large
> buffer from a lot of small buffers, I simply don't see the point of
> doing that. It adds complexity and limitations for very little gain.

So, I want to allocate a 10Meg buffer. I need to make sure the kernel
has 10megs of memory available. If the memory is quite fragmented, then
too bad, I lose out.

Oh wait, I could also use vmalloc. But then again, now I'm blasting
valuable TLB entries for a tracing utility, thus making the tracer have
an even bigger impact on the entire system.

BAH!

I originally wanted to go with the contiguous buffer, but I was
convinced after trying to implement it that it was a bad choice.
Specifically, because of needing to 1) get large amounts of memory that
are contiguous, or 2) eat up TLB entries and cause the system to
perform worse.

I chose page size "sub-buffers" to solve the above. It also made
implementing splice trivial. OK, I admit, I never thought about mmapping
the buffers, just because I figured splice was faster. But I do have
patches that allow a user to mmap the entire ring buffer, but only in a
"producer/consumer" mode.

Note, I use page size sub-buffers, but the design could work with any
size sub-buffers. I just never implemented that (even though, when I
wrote the code it was secretly on my todo list).


>
> Their benefit is known synchronization points into the stream, you can
> parse each sub-buffer independently, but you can always break up a
> continuous stream into smaller parts or use a transport that includes
> index points or whatever.
>
> Their down side is that you can never have individual events larger than
> the sub-buffer, you need to be aware of the sub-buffer when reserving
> space etc..

The answer to that is to make a macro to do the assignment of the event,
and add a new API.

event = ring_buffer_reserve_unlimited();

ring_buffer_assign(event, data1);
ring_buffer_assign(event, data2);

ring_buffer_commit(event);

The ring_buffer_reserve_unlimited() could reserve a bunch of space
beyond one sub-buffer. It could reserve the space in fragments. Then
ring_buffer_assign() could either copy directly to the event (if the
event fits in one sub-buffer) or do a copy if the space was fragmented.

Of course, userspace would need to know how to read it. And it can get
complex due to interrupts coming in and also reserving between
fragments, or what happens if a partial fragment is overwritten. But all
of these are not impossible to solve.
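
A rough sketch of what the assign step of this proposed API could look like if
a reservation is handed back as a list of fragments (all of the types below are
hypothetical; nothing like them exists in the current ring buffer code, and the
reserve step is assumed to have already guaranteed enough total space):

struct rb_fragment {
        void            *addr;  /* start of this fragment's reserved space */
        size_t          len;    /* bytes reserved in this fragment */
};

struct rb_event_handle {
        struct rb_fragment      *frags;
        unsigned int            nr_frags;
        unsigned int            cur;    /* fragment currently being filled */
        size_t                  off;    /* write offset within that fragment */
};

/* Copy 'len' bytes into the reservation, crossing fragments as needed. */
static void ring_buffer_assign(struct rb_event_handle *ev,
                               const void *data, size_t len)
{
        while (len) {
                struct rb_fragment *f = &ev->frags[ev->cur];
                size_t room = f->len - ev->off;
                size_t n = len < room ? len : room;

                memcpy((char *)f->addr + ev->off, data, n);
                ev->off += n;
                data = (const char *)data + n;
                len -= n;
                if (ev->off == f->len) {        /* fragment full, move on */
                        ev->cur++;
                        ev->off = 0;
                }
        }
}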

-- Steve


2010-08-11 14:44:17

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Fri, 2010-08-06 at 10:13 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra ([email protected]) wrote:

> Less code = less instruction cache overhead. I've also shown that the LTTng code
> is at least twice faster. In terms of complexity, it is not much more complex; I
> also took the extra care of doing the formal proofs to make sure the
> corner-cases were dealt with, which I don't reckon neither Steven nor yourself
> have done.

Yes Mathieu, you did a formal proof. Good for you. But honestly, it is
starting to get very annoying to hear you constantly stating that,
because, to most kernel developers, it is meaningless. Any slight
modification of your algorithm renders the proof invalid.

You are not the only one that has done a proof to an algorithm in the
kernel, but you are definitely the only one that constantly reminds
people that you have done so. Congrats on your PhD, and in academia,
proofs are important.

But this is a ring buffer, not a critical part of the workings of the
kernel. There are much more critical and fragile parts of the kernel
that work fine without a formal proof.

Paul McKenney did a proof for RCU not for us, but just to give himself
a warm fuzzy about it. RCU is much more complex than the ftrace ring
buffer, and it also is much more critical. If Paul gets it wrong, a
machine will crash. He's right to worry. And even Paul told me that no
formal proof makes up for large-scale testing -- which, BTW, the ftrace
ring buffer has gone through.

Someday I may go ahead and do that proof, but I did do a very intensive
state diagram, and I'm quite confident that it works. It's been deployed
for quite a bit, and the design has yet to be a factor in any bug report
of the ring buffer.

-- Steve

2010-08-15 15:30:14

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Steven Rostedt ([email protected]) wrote:
> Egad! Go on vacation and the world falls apart.
>
> On Wed, 2010-08-04 at 08:27 +0200, Peter Zijlstra wrote:
> > On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > FWIW I really utterly detest the whole concept of sub-buffers.
> > >
> > > I'm not quite sure why. Is it something fundamental, or just an
> > > implementation issue?
> >
> > The sub-buffer thing that both ftrace and lttng have is creating a large
> > buffer from a lot of small buffers, I simply don't see the point of
> > doing that. It adds complexity and limitations for very little gain.
>
> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
> has 10megs of memory available. If the memory is quite fragmented, then
> too bad, I lose out.
>
> Oh wait, I could also use vmalloc. But then again, now I'm blasting
> valuable TLB entries for a tracing utility, thus making the tracer have
> a even bigger impact on the entire system.
>
> BAH!
>
> I originally wanted to go with the continuous buffer, but I was
> convinced after trying to implement it, that it was a bad choice.
> Specifically, because of needing to 1) get large amounts of memory that
> is continuous, or 2) eating up TLB entries and causing the system to
> perform poorer.
>
> I chose page size "sub-buffers" to solve the above. It also made
> implementing splice trivial. OK, I admit, I never thought about mmapping
> the buffers, just because I figured splice was faster. But I do have
> patches that allow a user to mmap the entire ring buffer, but only in a
> "producer/consumer" mode.

FYI: the generic ring buffer also implements the mmap() interface for the flight
recorder mode.

>
> Note, I use page size sub-buffers, but the design could work with any
> size sub-buffers. I just never implemented that (even though, when I
> wrote the code it was secretly on my todo list).

The main difference between our designs is that ftrace uses a linked list and the
generic ring buffer library uses a sub-buffer/page table. Considering the use-case
of reading the available flight recorder pages in reverse order that I heard about
at LinuxCon (from both Peter and Masami; it actually makes a whole lot of sense,
because the data we care about the most and want to read ASAP is in the last
sub-buffers), I think the page table is more appropriate (and flexible) than a
single-direction linked list, because it allows picking a random page (or
sub-buffer) in the buffer without walking over all the pages.
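
A sketch of the difference (hypothetical structures, only to illustrate the
access pattern): with a table, picking "the i-th most recent sub-buffer" is a
constant-time index computation, while a single-direction list has to be
walked from its head.

struct subbuf;

struct subbuf_table {
        struct subbuf   **bufs;         /* table of sub-buffer pointers */
        unsigned long   nr_subbufs;
        unsigned long   newest;         /* index of the most recently filled */
};

/* O(1): pick the i-th most recent sub-buffer directly (i < nr_subbufs). */
static struct subbuf *table_nth_newest(struct subbuf_table *t, unsigned long i)
{
        return t->bufs[(t->newest + t->nr_subbufs - i) % t->nr_subbufs];
}

struct subbuf_node {
        struct subbuf_node      *next;  /* single direction only */
        struct subbuf           *buf;
};

/* O(n): a singly linked list must be walked from the head. */
static struct subbuf *list_nth(struct subbuf_node *head, unsigned long n)
{
        while (n-- && head)
                head = head->next;
        return head ? head->buf : NULL;
}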

>
>
> >
> > Their benefit is known synchronization points into the stream, you can
> > parse each sub-buffer independently, but you can always break up a
> > continuous stream into smaller parts or use a transport that includes
> > index points or whatever.
> >
> > Their down side is that you can never have individual events larger than
> > the sub-buffer, you need to be aware of the sub-buffer when reserving
> > space etc..
>
> The answer to that is to make a macro to do the assignment of the event,
> and add a new API.
>
> event = ring_buffer_reserve_unlimited();
>
> ring_buffer_assign(event, data1);
> ring_buffer_assign(event, data2);
>
> ring_buffer_commit(event);
>
> The ring_buffer_reserve_unlimited() could reserve a bunch of space
> beyond one ring buffer. It could reserve data in fragments. Then the
> ring_buffer_assgin() could either copy directly to the event (if the
> event exists on one sub buffer) or do a copy the space was fragmented.
>
> Of course, userspace would need to know how to read it. And it can get
> complex due to interrupts coming in and also reserving between
> fragments, or what happens if a partial fragment is overwritten. But all
> these are not impossible to solve.

Dealing with fragmentation, sub-buffer loss, etc. is then pushed up to the trace
analyzer. While I agree that we have to keep the burden of complexity out of the
kernel as much as possible, I also think that an elegant design at the data
producer level which keeps the trace reader/analyzer simple and reliable should
be favored if it keeps a similar level of complexity in the kernel code.

A good argument supporting this is that some tracing users want to mmap() the
trace buffers and analyze the data directly in user-space on the traced machine.
In these cases, any complexity/overhead added to the analyzer will impact the
overall performance of the system being traced.
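
As a sketch of what the user-space side of such a scheme could look like (the
file path and the idea of one buffer file per cpu are purely hypothetical here,
just to show the pattern):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* hypothetical per-cpu buffer file exported by the tracer */
        const char *path = "/sys/kernel/debug/tracing/percpu/cpu0/buffer";
        struct stat st;
        void *buf;
        int fd;

        fd = open(path, O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;

        buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* ... parse the events in place, without copying them around ... */
        printf("mapped %lld bytes of trace data\n", (long long)st.st_size);

        munmap(buf, st.st_size);
        close(fd);
        return 0;
}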

Thanks,

Mathieu

>
> -- Steve
>
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-15 16:34:41

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 08/11/2010 05:34 PM, Steven Rostedt wrote:
>
> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
> has 10megs of memory available. If the memory is quite fragmented, then
> too bad, I lose out.

With memory compaction, the cpu churns for a while, then you have your
buffer. Of course there's still no guarantee, just a significantly
higher probability of success.

> Oh wait, I could also use vmalloc. But then again, now I'm blasting
> valuable TLB entries for a tracing utility, thus making the tracer have
> a even bigger impact on the entire system.

Most trace entries will occupy much less than a page, and are accessed
sequentially, so I don't think this will have a large impact.

--
error compiling committee.c: too many arguments to function

2010-08-15 16:44:17

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Avi Kivity ([email protected]) wrote:
> On 08/11/2010 05:34 PM, Steven Rostedt wrote:
>>
>> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
>> has 10megs of memory available. If the memory is quite fragmented, then
>> too bad, I lose out.
>
> With memory compaction, the cpu churns for a while, then you have your
> buffer. Of course there's still no guarantee, just a significantly
> higher probability of success.

The bigger the buffers, the lower the probability of success. My users
often allocate buffers as large as a few GB per cpu. Relying on compaction does
not seem like a viable solution in this case.

>
>> Oh wait, I could also use vmalloc. But then again, now I'm blasting
>> valuable TLB entries for a tracing utility, thus making the tracer have
>> a even bigger impact on the entire system.
>
> Most trace entries will occupy much less than a page, and are accessed
> sequentially, so I don't think this will have a large impact.

You seem to underestimate the frequency at which trace events can be generated.
E.g., by the time you run the scheduler once (which we can consider a very hot
kernel path), some tracing modes will generate thousands of events, which will
touch a very significant amount of TLB entries.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-15 17:10:35

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 08/15/2010 07:44 PM, Mathieu Desnoyers wrote:
> * Avi Kivity ([email protected]) wrote:
>> On 08/11/2010 05:34 PM, Steven Rostedt wrote:
>>> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
>>> has 10megs of memory available. If the memory is quite fragmented, then
>>> too bad, I lose out.
>> With memory compaction, the cpu churns for a while, then you have your
>> buffer. Of course there's still no guarantee, just a significantly
>> higher probability of success.
> The bigger the buffers, the lower the probabilities of success are. My users
> often allocate buffers as large as a few GB per cpu. Relying on compaction does
> not seem like a viable solution in this case.

Wow. Even if you could compact that much memory, it would take quite a
bit of time.

>>> Oh wait, I could also use vmalloc. But then again, now I'm blasting
>>> valuable TLB entries for a tracing utility, thus making the tracer have
>>> a even bigger impact on the entire system.
>> Most trace entries will occupy much less than a page, and are accessed
>> sequentially, so I don't think this will have a large impact.
> You seem to underestimate the frequency at which trace events can be generated.
> E.g., by the time you run the scheduler once (which we can consider a very hot
> kernel path), some tracing modes will generate thousands of events, which will
> touch a very significant amount of TLB entries.

Let's say a trace entry occupies 40 bytes and a TLB miss costs 200
cycles on average. So we have 100 entries per page costing 200 cycles;
amortized each entry costs 2 cycles.

There's an additional cost caused by the need to re-fill the TLB later,
but you incur that anyway if the scheduler caused a context switch.

Of course, my assumptions may be completely off (likely larger entries
but smaller miss costs). Has a vmalloc based implementation been
tested? It seems so much easier than the other alternatives.

--
error compiling committee.c: too many arguments to function

2010-08-15 18:31:26

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Avi Kivity ([email protected]) wrote:
> On 08/15/2010 07:44 PM, Mathieu Desnoyers wrote:
>> * Avi Kivity ([email protected]) wrote:
>>> On 08/11/2010 05:34 PM, Steven Rostedt wrote:
>>>> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
>>>> has 10megs of memory available. If the memory is quite fragmented, then
>>>> too bad, I lose out.
>>> With memory compaction, the cpu churns for a while, then you have your
>>> buffer. Of course there's still no guarantee, just a significantly
>>> higher probability of success.
>> The bigger the buffers, the lower the probabilities of success are. My users
>> often allocate buffers as large as a few GB per cpu. Relying on compaction does
>> not seem like a viable solution in this case.
>
> Wow. Even if you could compact that much memory, it would take quite a
> bit of time.

Yep.

>
>>>> Oh wait, I could also use vmalloc. But then again, now I'm blasting
>>>> valuable TLB entries for a tracing utility, thus making the tracer have
>>>> a even bigger impact on the entire system.
>>> Most trace entries will occupy much less than a page, and are accessed
>>> sequentially, so I don't think this will have a large impact.
>> You seem to underestimate the frequency at which trace events can be generated.
>> E.g., by the time you run the scheduler once (which we can consider a very hot
>> kernel path), some tracing modes will generate thousands of events, which will
>> touch a very significant amount of TLB entries.
>
> Let's say a trace entry occupies 40 bytes and a TLB miss costs 200
> cycles on average. So we have 100 entries per page costing 200 cycles;
> amortized each entry costs 2 cycles.

A quick test (shown below) gives the cost of a TLB miss on the Intel Xeon E5404:

Number of cycles added over test baseline:

tlb and cache hit:           12.42
tlb hit, l2 hit, l1 miss:    17.88
tlb hit, l2+l1 miss:         32.34
tlb and cache miss:         449.58

So it's closer to 500 cycles per TLB miss.

Also, your analysis does not seem to correctly represent the reality of the TLB
thrashing cost. On a workload walking over a large number of random pages (e.g. a
large hash table) all the time, eating just a few more TLB entries will increase
the number of misses over the entire workload.

So it's not so much the misses we see at the tracing site that are the problem,
but the extra misses taken by the application because of the added pressure
on the TLB. So just a few more TLB entries taken by the tracer will likely hurt
these workloads.

>
> There's an additional cost caused by the need to re-fill the TLB later,
> but you incur that anyway if the scheduler caused a context switch.

The performance hit is not taken if the scheduler schedules another thread with
the same mapping, only when it schedules a different process.

>
> Of course, my assumptions may be completely off (likely larger entries
> but smaller miss costs).

Depending on the tracer design, the average event size can range from 12 bytes
(LTTng is very aggressive in event size compaction) to about 40 bytes (perf); so
for this you are mostly right. However, as explained above, the TLB miss cost is
higher than you expected.

> Has a vmalloc based implementation been
> tested? It seems so much easier than the other alternatives.

I tested it in the past, and must admit that I changed from a vmalloc-based
implementation to a page-based one using software cross-page write primitives,
based on feedback from Steven and Ingo. Diminishing TLB thrashing seemed like a
good approach, and using vmalloc on 32-bit machines is a pain, because users have
to tweak the vmalloc region size at boot. So all in all, I moved to a vmalloc-less
implementation without much more thought.

If you feel we should test the performance of both approaches, we could do it in
the generic ring buffer library (it allows both types of allocation backends).
However, we'd have to find the right type of TLB-thrashing real-world workload to
get meaningful results. This might be the hardest part.

Thanks,

Mathieu

# tlbmiss.c

#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

typedef unsigned long long cycles_t;

#define barrier() __asm__ __volatile__("": : :"memory")

/*
 * Serialize core instruction execution. Also acts as a compiler barrier.
 * On PIC ebx cannot be clobbered.
 */
#ifdef __PIC__
#define sync_core() \
        asm volatile("push %%ebx; cpuid; pop %%ebx" \
                     : : : "memory", "eax", "ecx", "edx");
#endif
#ifndef __PIC__
#define sync_core() \
        asm volatile("cpuid" : : : "memory", "eax", "ebx", "ecx", "edx");
#endif

#define mb()            asm volatile("mfence":::"memory")
#define smp_mb()        mb()

#define rdtsc(low,high) \
        __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))

#define rdtscl(low) \
        __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")

#define rdtscll(val) \
        __asm__ __volatile__("rdtsc" : "=A" (val))

/* Read the TSC with serialization and memory barriers on both sides. */
static inline cycles_t get_cycles_sync(void)
{
        unsigned long long ret = 0;

        smp_mb();
        sync_core();
        rdtscll(ret);
        sync_core();
        smp_mb();
        return ret;
}

#define PAGE_SIZE               4096ULL         /* 4k */
#define L1_CACHELINE_SIZE       64
#define L2_CACHELINE_SIZE       128
#define ARRAY_SIZE              262144ULL       /* 262144 pages = 1 GB */

static char testpage[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));

static unsigned int idx[ARRAY_SIZE];

#define NR_TESTS        100

int main(int argc, char **argv)
{
        struct timeval tv;
        struct timezone tz;
        cycles_t time1, time2;
        double cycles_per_iter;
        unsigned int i, j;
        pid_t pid;
        char *array;
        double baseline;

        printf("number of tests : %d\n", NR_TESTS);

        srandom(get_cycles_sync());

        array = malloc(sizeof(char) * ARRAY_SIZE * PAGE_SIZE);

        for (i = 0; i < ARRAY_SIZE; i++)
                idx[i] = random() % ARRAY_SIZE;

        testpage[0] = 1;

        printf("Nothing (baseline)\n");
        cycles_per_iter = 0.0;
        for (i = 0; i < NR_TESTS; i++) {
                /* Walk 1 GB of random pages to evict testpage's TLB/cache entries. */
                for (j = 0; j < ARRAY_SIZE; j++)
                        array[idx[j] * PAGE_SIZE] = 1;
                testpage[0] = 1;
                time1 = get_cycles_sync();
                time2 = get_cycles_sync();
                cycles_per_iter += (time2 - time1);
        }
        cycles_per_iter /= (double)NR_TESTS;

        baseline = (double)cycles_per_iter;
        printf("Baseline takes %g cycles\n", baseline);

        printf("TLB and caches hit\n");
        cycles_per_iter = 0.0;
        for (i = 0; i < NR_TESTS; i++) {
                for (j = 0; j < ARRAY_SIZE; j++)
                        array[idx[j] * PAGE_SIZE] = 1;
                testpage[0] = 1;        /* warm the TLB entry and cacheline */
                time1 = get_cycles_sync();
                testpage[0] = 1;        /* timed: everything hits */
                time2 = get_cycles_sync();
                cycles_per_iter += (time2 - time1);
        }
        cycles_per_iter /= (double)NR_TESTS;

        printf("tlb and cache hit %g cycles (adds %g)\n",
               (double)cycles_per_iter,
               (double)cycles_per_iter - baseline);

        printf("TLB hit, l2 cache hit, l1 cache miss\n");
        cycles_per_iter = 0.0;
        for (i = 0; i < NR_TESTS; i++) {
                for (j = 0; j < ARRAY_SIZE; j++)
                        array[idx[j] * PAGE_SIZE] = 1;
                testpage[0] = 1;        /* warm the TLB entry and first cacheline */
                time1 = get_cycles_sync();
                testpage[L1_CACHELINE_SIZE] = 1;        /* timed: same page, other L1 line */
                time2 = get_cycles_sync();
                cycles_per_iter += (time2 - time1);
        }
        cycles_per_iter /= (double)NR_TESTS;

        printf("tlb hit, l2 hit, l1 miss %g cycles (adds %g)\n",
               (double)cycles_per_iter,
               (double)cycles_per_iter - baseline);

        printf("TLB hit, l2 cache miss, l1 cache miss\n");
        cycles_per_iter = 0.0;
        for (i = 0; i < NR_TESTS; i++) {
                for (j = 0; j < ARRAY_SIZE; j++)
                        array[idx[j] * PAGE_SIZE] = 1;
                testpage[0] = 1;        /* warm the TLB entry and first cacheline */
                time1 = get_cycles_sync();
                testpage[L2_CACHELINE_SIZE] = 1;        /* timed: same page, other L2 line */
                time2 = get_cycles_sync();
                cycles_per_iter += (time2 - time1);
        }
        cycles_per_iter /= (double)NR_TESTS;

        printf("tlb hit,l2+l1 miss %g cycles (adds %g)\n",
               (double)cycles_per_iter,
               (double)cycles_per_iter - baseline);

        printf("TLB and cache miss\n");
        cycles_per_iter = 0.0;
        for (i = 0; i < NR_TESTS; i++) {
                for (j = 0; j < ARRAY_SIZE; j++)
                        array[idx[j] * PAGE_SIZE] = 1;
                time1 = get_cycles_sync();
                testpage[0] = 1;        /* timed: TLB and cache cold */
                time2 = get_cycles_sync();
                cycles_per_iter += (time2 - time1);
        }
        cycles_per_iter /= (double)NR_TESTS;

        printf("tlb and cache miss %g cycles (adds %g)\n",
               (double)cycles_per_iter,
               (double)cycles_per_iter - baseline);
        free(array);

        return 0;
}

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-16 10:50:01

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 08/15/2010 09:31 PM, Mathieu Desnoyers wrote:
>>>
>>> You seem to underestimate the frequency at which trace events can be generated.
>>> E.g., by the time you run the scheduler once (which we can consider a very hot
>>> kernel path), some tracing modes will generate thousands of events, which will
>>> touch a very significant amount of TLB entries.
>> Let's say a trace entry occupies 40 bytes and a TLB miss costs 200
>> cycles on average. So we have 100 entries per page costing 200 cycles;
>> amortized each entry costs 2 cycles.
> A quick test (shown below) gives the cost of a TLB miss on the Intel Xeon E5404:
>
> Number of cycles added over test baseline:
>
> tlb and cache hit: 12.42
> tlb hit, l2 hit, l1 miss 17.88
> tlb hit,l2+l1 miss 32.34
> tlb and cache miss 449.58
>
> So it's closer to 500 per tlb miss.

The cache miss would not be avoided if the TLB was hit, so it should not
be accounted as part of the costs (though a TLB miss will increase cache
pressure). Also, your test does not allow the cpu to pipeline anything;
in reality, different workloads have different TLB miss costs:

- random reads (pointer chasing) incur almost the full impact since the
processor is stalled
- sequential writes can be completely pipelined and suffer almost no impact

Even taking your numbers, it's still 5 cycles per trace entry.


> Also, your analysis does not seem to correctly represent reality of the TLB
> trashing cost. On a workload walking over a large number of random pages (e.g. a
> large hash table) all the time, eating just a few more TLB entries will impact
> the number of misses over the entire workload.

Let's say this doubles the impact. So 10 cycles per trace entry. Will
a non-vmap solution cost less?


> So it's not much the misses that we see at the tracing site that is the problem,
> but also the extra misses taken by the application caused by the extra pressure
> on TLB. So just a few more TLB entries taken by the tracer will likely hurt
> these workloads.
>

I really think this should be benchmarked.

If the user workload thrashes the TLB, it should use huge pages itself;
that will make it immune to kernel TLB thrashing and give it a nice
boost besides.
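
For illustration, a sketch of how an application could back its own large
working set (e.g. the big hash table mentioned above) with huge pages, using
MAP_HUGETLB (merged in 2.6.32; the admin has to reserve huge pages first, e.g.
via /proc/sys/vm/nr_hugepages -- the size below is just an example):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000     /* in case the libc headers predate it */
#endif

#define TABLE_SIZE (1UL << 30)  /* e.g. a 1 GiB hash table */

int main(void)
{
        void *table = mmap(NULL, TABLE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (table == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB)");
                return 1;
        }
        /* ... build the TLB-hungry data structure inside 'table' ... */
        munmap(table, TABLE_SIZE);
        return 0;
}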


>> There's an additional cost caused by the need to re-fill the TLB later,
>> but you incur that anyway if the scheduler caused a context switch.
> The performance hit is not taken if the scheduler schedules another thread with
> the same mapping, only when it schedules a different process.

True.

>> Of course, my assumptions may be completely off (likely larger entries
>> but smaller miss costs).
> Depending on the tracer design, the avg. event size can range from 12 bytes
> (lttng is very agressive in event size compaction) to about 40 bytes (perf); so
> for this you are mostly right. However, as explained above, the TLB miss cost is
> higher than you expected.

For the vmalloc area hit, it's lower. For the user application, it may
indeed be higher.

>> Has a vmalloc based implementation been
>> tested? It seems so much easier than the other alternatives.
> I tested it in the past, and must admit that I changed from a vmalloc-based
> implementation to page-based using software cross-page write primitives based on
> feedback from Steven and Ingo. Diminishing TLB trashing seemed like a good
> approach, and using vmalloc on 32-bit machines is a pain, because users have to
> tweak the vmalloc region size at boot. So all in all, I moved to a vmalloc-less
> implementation without much more thought.
>
> If you feel we should test the performance of both approaches, we could do it in
> the generic ring buffer library (it allows both type of allocation backends).
> However, we'd have to find the right type of TLB-trashing real-world workload to
> have meaningful results. This might be the hardest part.

SPECjbb is a well-known TLB-intensive workload, known to benefit greatly
from large pages.

<snip test>

For a similar test see http://people.redhat.com/akivity/largepage.c.


--
error compiling committee.c: too many arguments to function

2010-08-16 11:31:08

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On 08/15/2010 09:31 PM, Mathieu Desnoyers wrote:
>
> I tested it in the past, and must admit that I changed from a vmalloc-based
> implementation to page-based using software cross-page write primitives based on
> feedback from Steven and Ingo. Diminishing TLB trashing seemed like a good
> approach, and using vmalloc on 32-bit machines is a pain, because users have to
> tweak the vmalloc region size at boot. So all in all, I moved to a vmalloc-less
> implementation without much more thought.


Forgot to comment about the i386 issue - that really is a blocker if you
absolutely need to support large trace buffers on 32-bit machines. I
would urge all those people to move to x86_64 and be done with it, but I
don't know all the use cases.

It's possible to hack this to work by having a private mm_struct and
switching to it temporarily, but it will be horribly slow.

--
error compiling committee.c: too many arguments to function