2013-07-04 22:30:04

by Stephane Eranian

[permalink] [raw]
Subject: [PATCH] perf: fix interrupt handler timing harness


This patch fixes a serious bug in:

commit 14c63f17b1fde5a575a28e96547a22b451c71fb5
Author: Dave Hansen <[email protected]>
Date: Fri Jun 21 08:51:36 2013 -0700

perf: Drop sample rate when sampling is too slow


There was an misunderstanding on the API of the do_div()
macro. It returns the remainder of the division and this
was not what the function expected leading to disabling the
interrupt latency watchdog.

This patch also remove a duplicate assignment in
perf_sample_event_took().

Signed-off-by: Stephane Eranian <[email protected]>
---

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1db3af9..1833bc5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -182,7 +182,7 @@ void update_perf_cpu_limits(void)
u64 tmp = perf_sample_period_ns;

tmp *= sysctl_perf_cpu_time_max_percent;
- tmp = do_div(tmp, 100);
+ do_div(tmp, 100);
atomic_set(&perf_sample_allowed_ns, tmp);
}

@@ -232,7 +232,7 @@ DEFINE_PER_CPU(u64, running_sample_length);
void perf_sample_event_took(u64 sample_len_ns)
{
u64 avg_local_sample_len;
- u64 local_samples_len = __get_cpu_var(running_sample_length);
+ u64 local_samples_len;

if (atomic_read(&perf_sample_allowed_ns) == 0)
return;


2013-07-05 06:54:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf: fix interrupt handler timing harness


* Stephane Eranian <[email protected]> wrote:

> This patch fixes a serious bug in:
>
> commit 14c63f17b1fde5a575a28e96547a22b451c71fb5
> Author: Dave Hansen <[email protected]>
> Date: Fri Jun 21 08:51:36 2013 -0700
>
> perf: Drop sample rate when sampling is too slow
>
>
> There was an misunderstanding on the API of the do_div()
> macro. It returns the remainder of the division and this
> was not what the function expected leading to disabling the
> interrupt latency watchdog.
>
> This patch also remove a duplicate assignment in
> perf_sample_event_took().
>
> Signed-off-by: Stephane Eranian <[email protected]>

Thanks Stephane - I've applied this to perf/urgent and will get it to
Linus ASAP.

Ingo

Subject: [tip:perf/urgent] perf: Fix interrupt handler timing harness

Commit-ID: e5302920da9ef23f9d19d4e9ac85704cc25bee7a
Gitweb: http://git.kernel.org/tip/e5302920da9ef23f9d19d4e9ac85704cc25bee7a
Author: Stephane Eranian <[email protected]>
AuthorDate: Fri, 5 Jul 2013 00:30:11 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 5 Jul 2013 08:54:43 +0200

perf: Fix interrupt handler timing harness

This patch fixes a serious bug in:

14c63f17b1fd perf: Drop sample rate when sampling is too slow

There was an misunderstanding on the API of the do_div()
macro. It returns the remainder of the division and this
was not what the function expected leading to disabling the
interrupt latency watchdog.

This patch also remove a duplicate assignment in
perf_sample_event_took().

Signed-off-by: Stephane Eranian <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/20130704223010.GA30625@quad
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1db3af9..1833bc5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -182,7 +182,7 @@ void update_perf_cpu_limits(void)
u64 tmp = perf_sample_period_ns;

tmp *= sysctl_perf_cpu_time_max_percent;
- tmp = do_div(tmp, 100);
+ do_div(tmp, 100);
atomic_set(&perf_sample_allowed_ns, tmp);
}

@@ -232,7 +232,7 @@ DEFINE_PER_CPU(u64, running_sample_length);
void perf_sample_event_took(u64 sample_len_ns)
{
u64 avg_local_sample_len;
- u64 local_samples_len = __get_cpu_var(running_sample_length);
+ u64 local_samples_len;

if (atomic_read(&perf_sample_allowed_ns) == 0)
return;

2013-07-08 14:34:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] perf: fix interrupt handler timing harness

On 07/04/2013 03:30 PM, Stephane Eranian wrote:
> There was an misunderstanding on the API of the do_div()
> macro. It returns the remainder of the division and this
> was not what the function expected leading to disabling the
> interrupt latency watchdog.

"Misunderstanding" is a very kind term for what I did there. Stephane,
were you actually running in to the new cpu limit, or were you just
trying to update kernel.perf_event_max_sample_rate?

BTW, I also did the same thing in 2ab00456:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2ab00456ea8a0d79acb1390659b98416111880b2

I'll have a patch out shortly for that one, but its damage would be
limited to a bogus printk.

2013-07-08 18:08:29

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH] perf: fix interrupt handler timing harness

On Mon, Jul 8, 2013 at 4:34 PM, Dave Hansen <[email protected]> wrote:
> On 07/04/2013 03:30 PM, Stephane Eranian wrote:
>> There was an misunderstanding on the API of the do_div()
>> macro. It returns the remainder of the division and this
>> was not what the function expected leading to disabling the
>> interrupt latency watchdog.
>
> "Misunderstanding" is a very kind term for what I did there. Stephane,
> were you actually running in to the new cpu limit, or were you just
> trying to update kernel.perf_event_max_sample_rate?
>
I am chasing a problem on HSW desktop where your code triggers the throttling.
It happens systematically on my system first time I run perf record usually the
first time.

I admit I have some issues with your patch and what it is trying to avoid.
There is already interrupt throttling. Your code seems to address latency
issues on the handler rather than rate issues. Yet to mitigate the latency
it is modify the throttling.

For some unknown reasons, my HSW interrupt handler goes crazy for
a while running a very simple:
$ perf record -e cycles branchy_loop

And I do see in the log:
perf samples too long (2546 > 2500), lowering
kernel.perf_event_max_sample_rate to 50000

Which is an enormous latency. I instrumented the code, and under
normal conditions the latency
of the handler for this perf run, is about 500ns and it is consistent
with what I see on SNB.
However, on HSW, it is sometimes 5x longer. I have no explanation for
this. Andi's patch
to delay the NMI ack is enabled but it does not alleviate this
problem. There is something
else going on possibly with the HW, and I don't know what it is. This
is not a one off. It lasts for
several calls because it fires your watchdog which averages out over
multiple calls.
So one cannot blame this on the case where the samping buffer gets
full, for instance.

My model is:
processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
stepping : 3
microcode : 0x7

So something is not right and I would like to know what it is.
I will keep investigating.

> BTW, I also did the same thing in 2ab00456:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2ab00456ea8a0d79acb1390659b98416111880b2
>
> I'll have a patch out shortly for that one, but its damage would be
> limited to a bogus printk.

2013-07-08 20:05:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] perf: fix interrupt handler timing harness

On 07/08/2013 11:08 AM, Stephane Eranian wrote:
> I admit I have some issues with your patch and what it is trying to avoid.
> There is already interrupt throttling. Your code seems to address latency
> issues on the handler rather than rate issues. Yet to mitigate the latency
> it is modify the throttling.

If we have too many interrupts, we need to drop the rate (existing
throttling).

If the interrupts _consistently_ take too long individually they can
starve out all the other CPU users. I saw no way to make them finish
faster, so the only recourse is to also drop the rate.

> For some unknown reasons, my HSW interrupt handler goes crazy for
> a while running a very simple:
> $ perf record -e cycles branchy_loop
>
> And I do see in the log:
> perf samples too long (2546 > 2500), lowering
> kernel.perf_event_max_sample_rate to 50000
>
> Which is an enormous latency. I instrumented the code, and under
> normal conditions the latency
> of the handler for this perf run, is about 500ns and it is consistent
> with what I see on SNB.

I was seeing latencies near 1 second from time to time, but
_consistently_ in the hundreds of milliseconds.

2013-07-08 20:20:24

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH] perf: fix interrupt handler timing harness

On Mon, Jul 8, 2013 at 10:05 PM, Dave Hansen <[email protected]> wrote:
> On 07/08/2013 11:08 AM, Stephane Eranian wrote:
>> I admit I have some issues with your patch and what it is trying to avoid.
>> There is already interrupt throttling. Your code seems to address latency
>> issues on the handler rather than rate issues. Yet to mitigate the latency
>> it is modify the throttling.
>
> If we have too many interrupts, we need to drop the rate (existing
> throttling).
>
> If the interrupts _consistently_ take too long individually they can
> starve out all the other CPU users. I saw no way to make them finish
> faster, so the only recourse is to also drop the rate.
>
I think we need to investigate why some interrupts take so much time.
Could be HW, could be SW. Not talking about old hardware here.
Once we understand this, then we know maybe adjust the timing on
our patch.

>> For some unknown reasons, my HSW interrupt handler goes crazy for
>> a while running a very simple:
>> $ perf record -e cycles branchy_loop
>>
>> And I do see in the log:
>> perf samples too long (2546 > 2500), lowering
>> kernel.perf_event_max_sample_rate to 50000
>>
>> Which is an enormous latency. I instrumented the code, and under
>> normal conditions the latency
>> of the handler for this perf run, is about 500ns and it is consistent
>> with what I see on SNB.
>
> I was seeing latencies near 1 second from time to time, but
> _consistently_ in the hundreds of milliseconds.

On my systems, I see 500ns with one session running. But on HSW,
something else is going on with bursts at 2500ns. That's not normal.
I want an explanation for this.

2013-07-08 20:35:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] perf: fix interrupt handler timing harness

On 07/08/2013 01:20 PM, Stephane Eranian wrote:
> On Mon, Jul 8, 2013 at 10:05 PM, Dave Hansen <[email protected]> wrote:
>> If the interrupts _consistently_ take too long individually they can
>> starve out all the other CPU users. I saw no way to make them finish
>> faster, so the only recourse is to also drop the rate.
>>
> I think we need to investigate why some interrupts take so much time.
> Could be HW, could be SW. Not talking about old hardware here.
> Once we understand this, then we know maybe adjust the timing on
> our patch.

I spent quite a while looking at it on my hardware. It's difficult to
profile in NMIs, but I'm fairly satisfied (for me) it is a NUMA issue
which gets worse as I add cores.

I did a quite a bit of ftracing to look for spots inside the handler
which were taking large amounts of time. There were none. The
execution time was spread very evenly over the entire nmi handler. It
didn't appear to be any individual hot cachelines or doing something
silly like sitting in a loop handling lots of PMU events.

2013-07-08 20:54:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] perf: fix interrupt handler timing harness

> I did a quite a bit of ftracing to look for spots inside the handler
> which were taking large amounts of time. There were none. The
> execution time was spread very evenly over the entire nmi handler. It
> didn't appear to be any individual hot cachelines or doing something
> silly like sitting in a loop handling lots of PMU events.

In some cases callgraphs seemed to be quite expensive (large fraction
of the handler time)

-Andi

2013-07-08 20:56:40

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH] perf: fix interrupt handler timing harness

On Mon, Jul 8, 2013 at 10:54 PM, Andi Kleen <[email protected]> wrote:
>> I did a quite a bit of ftracing to look for spots inside the handler
>> which were taking large amounts of time. There were none. The
>> execution time was spread very evenly over the entire nmi handler. It
>> didn't appear to be any individual hot cachelines or doing something
>> silly like sitting in a loop handling lots of PMU events.
>
> In some cases callgraphs seemed to be quite expensive (large fraction
> of the handler time)
>
In my case, I am not doing callgraphs, just regular cycles.