2009-09-01 11:18:25

by Metzger, Markus T

[permalink] [raw]
Subject: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on perf_counter_ctx

Ingo, Peter,

>>>> Currently, I'm not sure that this (i.e. that the interrupt
>>>> handling takes too long) is the underlying problem of the hangs
>>>> that I'm seeing.
>>>
>>>I havent seen a plausible theory yet about why an actual lockup
>>>would happen on your box.
>>
>>So you do not think that taking too long in the ISR could cause this?
>>
>>And is it working on your box?

My current theory is that the BTS buffer fills up so quickly when tracing
the kernel, that the kernel is busy handling overflows and reacting on
other interrupts that pile up while we're handling the BTS overflow.

When I trace user-mode branches, it works.

When I do not copy the trace during overflow handling, the kernel does not hang.

When I attach a jtag debugger to a hung system (perf top and perf record
-e branches -c 1), I find that one core is waiting for an smp call
response, while the other core is busy emptying the BTS buffer.

When I then disable branch tracing (the debugger prevents the kernel
from changing DEBUGCTL to enable tracing again), the system recovers.


I have a patch that switches buffers during overflow handling and leaves
the draining for later (which currently never happens) - the kernel does
not hang, in that case.

I do need 3 buffers of 2048 entries = 3x48 pages per cpu, though. One buffer
to switch in during overflow handling; another to switch in during sched_out
(assuming that we need to schedule out the traced task before we may start
the draining task). Even then, there's a chance that we will lose trace
when the draining task may not start immediately. I would even say that
this is quite likely.


What I do not have, yet, is the actual draining. Draining needs to start
after the counter has been disabled. But draining needs the perf_counter
to drain the trace into. The counter will thus be busy after it has been
disabled - ugly.

There already seems to be something in place regarding deferring work, i.e.
perf_counter_do_pending(). Would it be OK if I added the deferred BTS buffer
draining to that?

Looks like this would guarantee that the counter does not go away as long
as there is work pending. Is this correct?



In any case, this is getting late for the upcoming merge window.
Would you rather drop the BTS patch or disable kernel tracing?

thanks and regards,
markus.

---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


2009-09-01 13:00:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on perf_counter_ctx

On Tue, 2009-09-01 at 12:17 +0100, Metzger, Markus T wrote:

> My current theory is that the BTS buffer fills up so quickly when tracing
> the kernel, that the kernel is busy handling overflows and reacting on
> other interrupts that pile up while we're handling the BTS overflow.
>
> When I trace user-mode branches, it works.
>
> When I do not copy the trace during overflow handling, the kernel does not hang.

Agreed, that was my suspicion as well. Would you happen to know where to
get these USB debug port cables, and how to find out if a machine
supports this?

> When I attach a jtag debugger to a hung system

Sweet, x86 JTAG.. want.. ;-)

> (perf top and perf record
> -e branches -c 1), I find that one core is waiting for an smp call
> response, while the other core is busy emptying the BTS buffer.
>
> When I then disable branch tracing (the debugger prevents the kernel
> from changing DEBUGCTL to enable tracing again), the system recovers.
>
>
> I have a patch that switches buffers during overflow handling and leaves
> the draining for later (which currently never happens) - the kernel does
> not hang, in that case.
>
> I do need 3 buffers of 2048 entries = 3x48 pages per cpu, though.

And those pages have to be contiguous too, right? That's an order-6
alloc, painful.

> One buffer
> to switch in during overflow handling; another to switch in during sched_out
> (assuming that we need to schedule out the traced task before we may start
> the draining task). Even then, there's a chance that we will lose trace
> when the draining task may not start immediately. I would even say that
> this is quite likely.

Right, is it possible to detect this loss?

This makes me wonder how much time it takes to drain these buffers, it
is at all possible to optimize that code path into oblivion, or will
nothing be fast enough?

> What I do not have, yet, is the actual draining. Draining needs to start
> after the counter has been disabled. But draining needs the perf_counter
> to drain the trace into. The counter will thus be busy after it has been
> disabled - ugly.

Yes, this is a tad weird...

> There already seems to be something in place regarding deferring work, i.e.
> perf_counter_do_pending(). Would it be OK if I added the deferred BTS buffer
> draining to that?

Yes, note that this pending work runs from hardirq context as well. On
x86 we self-ipi to get into IRQ context ASAP after the NMI.

So if the remote cpu is blocked waiting on an SMP call, doing the work
from hardirq context won't really help things.

> Looks like this would guarantee that the counter does not go away as long
> as there is work pending. Is this correct?

Agreed, it waits for all pending bits to complete before destroying the
counter.

> In any case, this is getting late for the upcoming merge window.
> Would you rather drop the BTS patch or disable kernel tracing?

I don't think we need to drop it, at worst we could defer the patch
to .33, but I think we can indeed get away with disabling the kernel
tracing for now.

2009-09-01 13:09:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on perf_counter_ctx


* Peter Zijlstra <[email protected]> wrote:

> > In any case, this is getting late for the upcoming merge window.
> > Would you rather drop the BTS patch or disable kernel tracing?
>
> I don't think we need to drop it, at worst we could defer the
> patch to .33, but I think we can indeed get away with disabling
> the kernel tracing for now.

That could be done via a simple patch that does not enable BTS
tracing if the .kernel attribute bit is set. I.e. we enable BTS on a
narrower set of config settings than in the past - but still keep
the code largely untouched.

Ingo

2009-09-01 13:33:40

by Metzger, Markus T

[permalink] [raw]
Subject: RE: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on perf_counter_ctx

>-----Original Message-----
>From: Peter Zijlstra [mailto:[email protected]]
>Sent: Tuesday, September 01, 2009 3:01 PM
>To: Metzger, Markus T
>Cc: Ingo Molnar; [email protected]; [email protected]; [email protected]; linux-
>[email protected]; Paul Mackerras
>Subject: Re: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on
>perf_counter_ctx
>
>On Tue, 2009-09-01 at 12:17 +0100, Metzger, Markus T wrote:
>
>> My current theory is that the BTS buffer fills up so quickly when tracing
>> the kernel, that the kernel is busy handling overflows and reacting on
>> other interrupts that pile up while we're handling the BTS overflow.
>>
>> When I trace user-mode branches, it works.
>>
>> When I do not copy the trace during overflow handling, the kernel does not hang.
>
>Agreed, that was my suspicion as well. Would you happen to know where to
>get these USB debug port cables, and how to find out if a machine
>supports this?


I'm sorry but I don't understand what you mean with "these USB debug port cables".


>> I do need 3 buffers of 2048 entries = 3x48 pages per cpu, though.
>
>And those pages have to be contiguous too, right? That's an order-6
>alloc, painful.


According to an earlier discussion with Roland, they don't have to.
They still need to be locked, though.
According to some other discussion with Andrew and Ingo, I still use
kmalloc to allocate those buffers.


>> One buffer
>> to switch in during overflow handling; another to switch in during sched_out
>> (assuming that we need to schedule out the traced task before we may start
>> the draining task). Even then, there's a chance that we will lose trace
>> when the draining task may not start immediately. I would even say that
>> this is quite likely.
>
>Right, is it possible to detect this loss?

It is. But in order to get the PERF_EVENT_LOST record into the correct place,
I need to defer logging the lost trace;-)

And we would loose this very nice feature of fixed-size entries.


>This makes me wonder how much time it takes to drain these buffers, it
>is at all possible to optimize that code path into oblivion, or will
>nothing be fast enough?


Are you saying that we should rather speed up that code path than try to
defer all the work? There definitely is a lot of redundant work done on
the generic path.

I did a few experiments where I would drain only parts of the buffer.
I could not drain too much before the system would hang.
Besides, that does not sound too robust to me. Would it sill work on
a slower system? Or on a faster one? Or on a fully loaded one?


>> What I do not have, yet, is the actual draining. Draining needs to start
>> after the counter has been disabled. But draining needs the perf_counter
>> to drain the trace into. The counter will thus be busy after it has been
>> disabled - ugly.
>
>Yes, this is a tad weird...


Hmmm, since the counter is removed during sched_out, and I can't drain the
buffer in x86_pmu_perf_disable(), I'm afraid we don't have much choice.


>> There already seems to be something in place regarding deferring work, i.e.
>> perf_counter_do_pending(). Would it be OK if I added the deferred BTS buffer
>> draining to that?
>
>Yes, note that this pending work runs from hardirq context as well. On
>x86 we self-ipi to get into IRQ context ASAP after the NMI.
>
>So if the remote cpu is blocked waiting on an SMP call, doing the work
>from hardirq context won't really help things.


I can't use that, then.

When I use schedule_work() instead, how would I ensure that the work is done
before the traced (or tracing) task is rescheduled?

I would need to ensure that the counter does not go away as long as draining
work is scheduled. I would store a pointer to the counter in the work struct.
Should perf_counter's be use-counted?


>> Looks like this would guarantee that the counter does not go away as long
>> as there is work pending. Is this correct?
>
>Agreed, it waits for all pending bits to complete before destroying the
>counter.

That would be an alternative to use-counting.


>> In any case, this is getting late for the upcoming merge window.
>> Would you rather drop the BTS patch or disable kernel tracing?
>
>I don't think we need to drop it, at worst we could defer the patch
>to .33, but I think we can indeed get away with disabling the kernel
>tracing for now.

There's some more review feedback from you that I have not integrated, yet.
One is that BTS should return an error instead of falling back to generic
counters.
Another is that BTS does not provide a counter value.

How important are those?


thanks and regards,
markus.

---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

2009-09-01 13:53:24

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on perf_counter_ctx

On Tue, 2009-09-01 at 14:32 +0100, Metzger, Markus T wrote:

> >> My current theory is that the BTS buffer fills up so quickly when tracing
> >> the kernel, that the kernel is busy handling overflows and reacting on
> >> other interrupts that pile up while we're handling the BTS overflow.
> >>
> >> When I trace user-mode branches, it works.
> >>
> >> When I do not copy the trace during overflow handling, the kernel does not hang.
> >
> >Agreed, that was my suspicion as well. Would you happen to know where to
> >get these USB debug port cables, and how to find out if a machine
> >supports this?
>
>
> I'm sorry but I don't understand what you mean with "these USB debug port cables".

Not sure either, whatever makes all the code from commit:
5c05917e7fe313a187ad6ebb94c1c6cf42862a0b usable I guess.

> >> I do need 3 buffers of 2048 entries = 3x48 pages per cpu, though.
> >
> >And those pages have to be contiguous too, right? That's an order-6
> >alloc, painful.
>
>
> According to an earlier discussion with Roland, they don't have to.
> They still need to be locked, though.
> According to some other discussion with Andrew and Ingo, I still use
> kmalloc to allocate those buffers.

Section 18.18.5 of book 3B says the DS buffer base is a linear address.
This suggests each buffer does need contiguous pages.

48 contiguous pages constitutes an order-6 allocation (64 pages), which
is unreliable at best.

> >> One buffer
> >> to switch in during overflow handling; another to switch in during sched_out
> >> (assuming that we need to schedule out the traced task before we may start
> >> the draining task). Even then, there's a chance that we will lose trace
> >> when the draining task may not start immediately. I would even say that
> >> this is quite likely.
> >
> >Right, is it possible to detect this loss?
>
> It is. But in order to get the PERF_EVENT_LOST record into the correct place,
> I need to defer logging the lost trace;-)
>
> And we would loose this very nice feature of fixed-size entries.

Well, if you don't defer there is no possibility of loosing data is
there? So we either defer and have to deal with lost data, or we do not.

> >This makes me wonder how much time it takes to drain these buffers, it
> >is at all possible to optimize that code path into oblivion, or will
> >nothing be fast enough?
>
>
> Are you saying that we should rather speed up that code path than try to
> defer all the work? There definitely is a lot of redundant work done on
> the generic path.
>
> I did a few experiments where I would drain only parts of the buffer.
> I could not drain too much before the system would hang.
> Besides, that does not sound too robust to me. Would it sill work on
> a slower system? Or on a faster one? Or on a fully loaded one?

Base cpu speed is what counts, load is not interesting.

Also it seems a normalizing property, the slower the cpu the less
branches it can process per time unit, so less data to process.

But yes, I was suggesting to optimize this, since the current way of
calling perf_counter_output() multiple times is massively bloated.

> >> There already seems to be something in place regarding deferring work, i..e.
> >> perf_counter_do_pending(). Would it be OK if I added the deferred BTS buffer
> >> draining to that?
> >
> >Yes, note that this pending work runs from hardirq context as well. On
> >x86 we self-ipi to get into IRQ context ASAP after the NMI.
> >
> >So if the remote cpu is blocked waiting on an SMP call, doing the work
> >from hardirq context won't really help things.
>
>
> I can't use that, then.
>
> When I use schedule_work() instead, how would I ensure that the work is done
> before the traced (or tracing) task is rescheduled?

No, basically the only thing left is softirqs, which can be preempted by
hardirqs, but that's a horrid hack too, esp since processing the BTS
outside of the handler will basically result in the BTS tracing its own
processing, generating even more data to process.

> I would need to ensure that the counter does not go away as long as draining
> work is scheduled. I would store a pointer to the counter in the work struct.
> Should perf_counter's be use-counted?

Yes, they currently rely on the reference counting of the filedesc
they're associated with, but that needs to be undone for in-kernel usage
anyway.

> There's some more review feedback from you that I have not integrated, yet.
> One is that BTS should return an error instead of falling back to generic
> counters.
> Another is that BTS does not provide a counter value.
>
> How important are those?

Would be very nice to have.

2009-09-01 14:27:23

by Metzger, Markus T

[permalink] [raw]
Subject: RE: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on perf_counter_ctx

>-----Original Message-----
>From: Peter Zijlstra [mailto:[email protected]]
>Sent: Tuesday, September 01, 2009 3:53 PM
>To: Metzger, Markus T
>Cc: Ingo Molnar; [email protected]; [email protected]; [email protected]; linux-
>[email protected]; Paul Mackerras
>Subject: RE: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on
>perf_counter_ctx
>
>On Tue, 2009-09-01 at 14:32 +0100, Metzger, Markus T wrote:
>
>> >> My current theory is that the BTS buffer fills up so quickly when tracing
>> >> the kernel, that the kernel is busy handling overflows and reacting on
>> >> other interrupts that pile up while we're handling the BTS overflow.
>> >>
>> >> When I trace user-mode branches, it works.
>> >>
>> >> When I do not copy the trace during overflow handling, the kernel does not hang.
>> >
>> >Agreed, that was my suspicion as well. Would you happen to know where to
>> >get these USB debug port cables, and how to find out if a machine
>> >supports this?
>>
>>
>> I'm sorry but I don't understand what you mean with "these USB debug port cables".
>
>Not sure either, whatever makes all the code from commit:
>5c05917e7fe313a187ad6ebb94c1c6cf42862a0b usable I guess.


Sorry, I don't know anything about that.


>> >> I do need 3 buffers of 2048 entries = 3x48 pages per cpu, though.
>> >
>> >And those pages have to be contiguous too, right? That's an order-6
>> >alloc, painful.
>>
>>
>> According to an earlier discussion with Roland, they don't have to.
>> They still need to be locked, though.
>> According to some other discussion with Andrew and Ingo, I still use
>> kmalloc to allocate those buffers.
>
>Section 18.18.5 of book 3B says the DS buffer base is a linear address.
>This suggests each buffer does need contiguous pages.
>
>48 contiguous pages constitutes an order-6 allocation (64 pages), which
>is unreliable at best.

Roland argued that this means virtually contiguous, not physically.


>> When I use schedule_work() instead, how would I ensure that the work is done
>> before the traced (or tracing) task is rescheduled?
>
>No, basically the only thing left is softirqs, which can be preempted by
>hardirqs, but that's a horrid hack too, esp since processing the BTS
>outside of the handler will basically result in the BTS tracing its own
>processing, generating even more data to process.

I would have disabled perf on that cpu; it won't work, otherwise,
since the draining code alone would generate more trace than fits into
a buffer. I would need to disable preemption, though.

Are you saying that schedule_work() won't work? It will quite likely be very
lossy, but why won't it work at all?

How would that softirq approach work? Could you point me to some reference
code?


thanks and regards,
markus.
---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

2009-09-01 14:35:51

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on perf_counter_ctx

On Tue, 2009-09-01 at 15:27 +0100, Metzger, Markus T wrote:

> >> >> I do need 3 buffers of 2048 entries = 3x48 pages per cpu, though.
> >> >
> >> >And those pages have to be contiguous too, right? That's an order-6
> >> >alloc, painful.
> >>
> >>
> >> According to an earlier discussion with Roland, they don't have to.
> >> They still need to be locked, though.
> >> According to some other discussion with Andrew and Ingo, I still use
> >> kmalloc to allocate those buffers.
> >
> >Section 18.18.5 of book 3B says the DS buffer base is a linear address.
> >This suggests each buffer does need contiguous pages.
> >
> >48 contiguous pages constitutes an order-6 allocation (64 pages), which
> >is unreliable at best.
>
> Roland argued that this means virtually contiguous, not physically.

Sure it does, but either you use the linear kernel map, or use vmap.

vmap doesn't sound like very good idea.

> >> When I use schedule_work() instead, how would I ensure that the work is done
> >> before the traced (or tracing) task is rescheduled?
> >
> >No, basically the only thing left is softirqs, which can be preempted by
> >hardirqs, but that's a horrid hack too, esp since processing the BTS
> >outside of the handler will basically result in the BTS tracing its own
> >processing, generating even more data to process.
>
> I would have disabled perf on that cpu; it won't work, otherwise,
> since the draining code alone would generate more trace than fits into
> a buffer. I would need to disable preemption, though.
>
> Are you saying that schedule_work() won't work? It will quite likely be very
> lossy, but why won't it work at all?
>
> How would that softirq approach work? Could you point me to some reference
> code?

Look at the tasklet stuff I guess. Look at tasklet_hi_schedule() and co.

2009-09-03 14:26:29

by Metzger, Markus T

[permalink] [raw]
Subject: RE: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on perf_counter_ctx

>-----Original Message-----
>From: Peter Zijlstra [mailto:[email protected]]
>Sent: Tuesday, September 01, 2009 3:53 PM
>To: Metzger, Markus T
>Cc: Ingo Molnar; [email protected]; [email protected]; [email protected]; linux-
>[email protected]; Paul Mackerras
>Subject: RE: [discuss] BTS overflow handling, was: [PATCH] perf_counter: Fix a race on
>perf_counter_ctx
>
>On Tue, 2009-09-01 at 14:32 +0100, Metzger, Markus T wrote:


>> >This makes me wonder how much time it takes to drain these buffers, it
>> >is at all possible to optimize that code path into oblivion, or will
>> >nothing be fast enough?
>>
>>
>> Are you saying that we should rather speed up that code path than try to
>> defer all the work? There definitely is a lot of redundant work done on
>> the generic path.
>>
>> I did a few experiments where I would drain only parts of the buffer.
>> I could not drain too much before the system would hang.
>> Besides, that does not sound too robust to me. Would it sill work on
>> a slower system? Or on a faster one? Or on a fully loaded one?
>
>Base cpu speed is what counts, load is not interesting.
>
>Also it seems a normalizing property, the slower the cpu the less
>branches it can process per time unit, so less data to process.
>
>But yes, I was suggesting to optimize this, since the current way of
>calling perf_counter_output() multiple times is massively bloated.


This seems to do the trick - at least on my box.

I prepare the header, then do a single perf_output_begin()/perf_output_end()
pair, and between those two, I drain the entire 2048 records BTS buffer -
pretty much the same way as perf_counter_output() does.

We could optimize this further by providing specialized draining functions,
one for each combination of PERF_SAMPLE_ bits, but it seems to be fast
enough the way it is.

Holding the output lock that long does not seem to be a problem.
I can do perf record -a -o /dev/null -e branches -c 1 and I don't get a
hrtimer warning in dmesg.

When I do a perf record -e branches -c 1 true in parallel, I do not get
any trace, though. And perf does not report an error, either.


I copied some of the generic sampling code; I'll try to restructure it
a bit so I can call a generic function to do the actual sampling - provided
this is still fast enough.

How would we make sure it works on other boxes, as well?
Is there a way for me to detect that I'm not handling the interrupt fast enough?


I found another "kernel hangs" bug that is reproducible with 'normal' profiling:

when I do sudo perf record -e instructions -c 1000000 -a -o /dev/null
then unplug and replug one cpu
then kill the perf record job
the kernel hangs


thanks and regards,
markus.

---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.