Subject: Scheduler regression: Too frequent timer interrupts(?)

Since 2.6.22 the Linux scheduler interrupts programs with increasing
frequency. The interrupts cause run time variances that are affecting HPC
jobs. Various low latency jobs show increasing runtimes because of these
additional interrupts by the scheduler.

In the following test a simple program was run that continually retrieves
the TSC and measures the times between the TSC retrievals. A run time
variance is noted whenever the time between two TSC retrievals is larger
than 1 usec (on a 3.3GHz quad-core, dual-processor Xeon box). The numbers
given are the interrupts occurring in a 10 second measurement period. The
tests can be downloaded from http://gentwo.org/ll .
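
The core of the measurement loop is roughly the following (a minimal
sketch, not the actual lldiag source; the rdtsc() helper and the
cycles-per-usec constant for a 3.3GHz clock are assumptions):

/* Minimal sketch of the measurement loop (not the actual lldiag
 * source). Assumes gcc on x86 and a 3.3GHz invariant TSC. */
#include <stdio.h>
#include <stdint.h>

#define CYCLES_PER_USEC 3300ULL         /* assumed 3.3GHz clock */

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        uint64_t last = rdtsc(), now;
        unsigned long variances = 0;
        long i;

        for (i = 0; i < 500000000L; i++) {
                now = rdtsc();
                /* a variance: two consecutive TSC reads more than
                 * 1 usec apart */
                if (now - last > CYCLES_PER_USEC)
                        variances++;
                last = now;
        }
        printf("%lu variances\n", variances);
        return 0;
}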


Kernel       Test 1  Test 2  Test 3  Variances (SUM)
2.6.22          383     540     667             1590
2.6.23         2738    2019    2303             7060
2.6.24         2503     573     583             3659
2.6.25          302     359     241              902
2.6.26         2503    2501    2503             7507
2.6.27         2502    2503    2478             7483
2.6.28         2502    2504    2502             7508
2.6.29         2502    2490    2503             7495
2.6.30-rc2     2504    2503    2502             7509

The kernel was compiled with high res timer support and a HZ of 250.

The 2.6.22 kernel has only about 38.3 disruptions per second. That likely
means that HRTIMER is able to eliminate the timer interrupt.

In .23 this is no longer possible and we get the full 250 timer
interrupts per second. It is fixed again in .25, but .26 loses it again.

Can we restore the .22 situation where a running task without any
competitor on the same cpu does not need the timer interrupt?


2009-04-17 07:00:39

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Thu, 2009-04-16 at 15:53 -0400, Christoph Lameter wrote:
> Since 2.6.22 the Linux scheduler interrupts programs with increasing
> frequency. The interrupts cause run time variances that are affecting HPC
> jobs. Various low latency jobs show increasing runtimes because of these
> additional interrupts by the scheduler.
>
> In the following test a simple program was run that continually retrieves
> the TSC and measures the times between the TSC retrievals. A run time
> variance is noted whenever the time between two TSC retrievals is larger
> than 1 usec (on a 3.3GHz quad-core, dual-processor Xeon box). The numbers
> given are the interrupts occurring in a 10 second measurement period. The
> tests can be downloaded from http://gentwo.org/ll .
>
>
> Kernel       Test 1  Test 2  Test 3  Variances (SUM)
> 2.6.22          383     540     667             1590
> 2.6.23         2738    2019    2303             7060
> 2.6.24         2503     573     583             3659
> 2.6.25          302     359     241              902
> 2.6.26         2503    2501    2503             7507
> 2.6.27         2502    2503    2478             7483
> 2.6.28         2502    2504    2502             7508
> 2.6.29         2502    2490    2503             7495
> 2.6.30-rc2     2504    2503    2502             7509
>
> The kernel was compiled with high res timer support and a HZ of 250.
>
> The 2.6.22 kernel has only about 38.3 disruptions per second. That likely
> means that HRTIMER is able to eliminate the timer interrupt.

This has never been true afaict; as long as we have a task running, we
take the interrupt. I just looked at the .22 code, and it certainly
expects scheduler_tick() to be called when there is a running
process.

Also, look at /proc/interrupts if you want to determine interrupt
frequency.
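
E.g. a quick sketch like the following (the "LOC:" line parsing is
simplistic and assumed) derives the local timer interrupt rate from
/proc/interrupts directly:

/* Rough sketch: sample the "LOC:" (local timer) row of
 * /proc/interrupts twice and derive an interrupts/sec rate.
 * The line parsing is simplistic, an assumption about the layout. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static unsigned long loc_total(void)
{
        char line[1024];
        unsigned long total = 0;
        FILE *f = fopen("/proc/interrupts", "r");

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f)) {
                char *p = line;

                while (*p == ' ')
                        p++;
                if (strncmp(p, "LOC:", 4) == 0) {
                        char *end;
                        unsigned long v;

                        /* sum the per-cpu columns after "LOC:" */
                        p += 4;
                        while ((v = strtoul(p, &end, 10)), end != p) {
                                total += v;
                                p = end;
                        }
                        break;
                }
        }
        fclose(f);
        return total;
}

int main(void)
{
        unsigned long before = loc_total();

        sleep(10);
        printf("%.1f local timer interrupts/sec (all cpus)\n",
               (loc_total() - before) / 10.0);
        return 0;
}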

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 17 Apr 2009, Peter Zijlstra wrote:

> This has never been true afaict; as long as we have a task running, we
> take the interrupt. I just looked at the .22 code, and it certainly
> expects scheduler_tick() to be called when there is a running
> process.
>
> Also, look at /proc/interrupts if you want to determine interrupt
> frequency.

I am mainly interested in limiting the number and length of cpu holdoffs.

If the timer interrupt is always taken then it must take less than 1 usec
in certain versions of the kernel in order to not show up in the
statistics gathered.

2009-04-17 14:16:34

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 2009-04-17 at 09:42 -0400, Christoph Lameter wrote:
> On Fri, 17 Apr 2009, Peter Zijlstra wrote:
>
> > This has never been true afaict; as long as we have a task running, we
> > take the interrupt. I just looked at the .22 code, and it certainly
> > expects scheduler_tick() to be called when there is a running
> > process.
> >
> > Also, look at /proc/interrupts if you want to determine interrupt
> > frequency.
>
> I am mainly interested in limiting the number and length of cpu holdoffs.
>
> If the timer interrupt is always taken then it must take less than 1 usec
> in certain versions of the kernel in order to not show up in the
> statistics gathered.

Such a test as you had is pretty useless for anything.

If you want to measure something I'd suggest making a histogram of tsc
values in 10ns buckets or something, and seeing if there are a few
predominant spikes above the noise.

With something like that you could say, the jiffy tick went from 0.8+-.1
to 1.1+-.1 us or somesuch.
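
Roughly along these lines (untested sketch; the 3.3GHz cycles-to-bucket
conversion and the rdtsc() helper are assumptions):

/* Untested sketch: histogram of TSC deltas in 10ns buckets,
 * assuming a 3.3GHz TSC (~33 cycles per 10ns). */
#include <stdio.h>
#include <stdint.h>

#define NBUCKETS 4096
#define CYCLES_PER_BUCKET 33            /* ~10ns at 3.3GHz, assumed */

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        static unsigned long hist[NBUCKETS];
        uint64_t last = rdtsc(), now, b;
        long i;

        for (i = 0; i < 100000000L; i++) {
                now = rdtsc();
                b = (now - last) / CYCLES_PER_BUCKET;
                if (b >= NBUCKETS)
                        b = NBUCKETS - 1;       /* overflow bucket */
                hist[b]++;
                last = now;
        }
        for (i = 0; i < NBUCKETS; i++)
                if (hist[i])
                        printf("%6ldns %lu\n", i * 10, hist[i]);
        return 0;
}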

After that, you could possibly use oprofile or readprofile or
perf-counters to get an idea where the time is spent. I did a quick
profile on one of my machines, and about half the kernel time spent in a
while(1) loop comes from __do_softirq().

Really, I should not have to tell you this...

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 17 Apr 2009, Peter Zijlstra wrote:

> If you want to measure something I'd suggest making a histogram of tsc
> values in 10ns buckets or something, and seeing if there are a few
> predominant spikes above the noise.

See
http://www.kernel.org/pub/linux/kernel/people/christoph/collab-spring-2009/Latencharts.ods
for more details. The tool we are using can produce histograms but they
were too voluminous to be included.

> With something like that you could say, the jiffy tick went from 0.8+-.1
> to 1.1+-.1 us or somesuch.

Well, yeah, we can look at this but there seem to be regressions in a lot
of other subsystems as well. Rescheduling is another thing that we tracked.
It's interesting that the holdoffs varied a lot during the scheduler
transition to CFS and then stayed high after that was complete.

> After that, you could possibly use oprofile or readprofile or
> perf-counters to get an idea where the time is spent. I did a quick
> profile on one of my machines, and about half the kernel time spent in a
> while(1) loop comes from __do_softirq().
>
> Really, I should not have to tell you this...

I can get down there but do you really want me to start hacking on the
scheduler again? This seems to be a regression from what we had working
fine before.

2009-04-17 14:52:56

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 2009-04-17 at 10:29 -0400, Christoph Lameter wrote:

> > With something like that you could say, the jiffy tick went from 0.8+-.1
> > to 1.1+-.1 us or somesuch.
>
> Well, yeah, we can look at this but there seem to be regressions in a lot
> of other subsystems as well. Rescheduling is another thing that we tracked.
> It's interesting that the holdoffs varied a lot during the scheduler
> transition to CFS and then stayed high after that was complete.
>
> > After that, you could possibly use oprofile or readprofile or
> > perf-counters to get an idea where the time is spent. I did a quick
> > profile on one of my machines, and about half the kernel time spent in a
> > while(1) loop comes from __do_softirq().
> >
> > Really, I should not have to tell you this...
>
> I can get down there but do you really want me to start hacking on the
> scheduler again? This seems to be a regression from what we had working
> fine before.

I won't mind you sending patches. But really, the first thing to do is
figuring out what is taking time.

And a random 1us cutoff is, well, random.

If you want to reduce interrupts, that's fine, but not counting an
interrupt because it's below the magic 1us marker sounds a bit, well,
magic -- might work for you, might not for me on another machine, might
even be compiler dependent.

So 5 <1us interruptions are not accounted at all, whereas a single >1us
interruption is. I'd rather get rid of those 5 than try and shave a bit
off the one, if you get what I mean.

I'm pretty sure if we run the current kernel on a 5GHz machine all
interrupts are under 1us again :-), problem fixed? I don't think so.

Furthermore, yes the scheduler is one of those jiffy tick users, but
there are more. We can do ntp/gtod things in there, there is process
accounting, there is some RCU machinery, timers etc..

Like I said, I did a profile on current -tip and __do_softirq was about
half the time spent in the kernel. I'm not sure why that would be; maybe
we're doing tons of cache misses there for some reason, I dunno.


Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 17 Apr 2009, Peter Zijlstra wrote:

> And a random 1us cutoff is, well, random.

It's got to be somewhere.

> If you want to reduce interrupts, that's fine, but not counting an
> interrupt because it's below the magic 1us marker sounds a bit, well,
> magic -- might work for you, might not for me on another machine, might
> even be compiler dependent.

The point is to reduce the interruptions of the user space application.
Hardware interrupts etc. are something else. Maybe I should not use the
term interrupt. CPU unavailability? Cache pollution?

> So 5 <1us interruptions are not accounted at all, whereas a single >1us
> interruption is. I'd rather get rid of those 5 than try and shave a bit
> off the one, if you get what I mean.

Ok. We can set the threshold lower and see how that goes.

> Furthermore, yes the scheduler is one of those jiffy tick users, but
> there are more. We can do ntp/gtod things in there, there is process
> accounting, there is some RCU machinery, timers etc..

Again things were fine before the scheduler changed.

Could we defer interrupts if a single process is running on a cpu and
there is no other process on the runqueue and no other use of the timer
interrupt?

2009-04-17 15:29:19

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 2009-04-17 at 11:04 -0400, Christoph Lameter wrote:
> On Fri, 17 Apr 2009, Peter Zijlstra wrote:
>
> > And a random 1us cutoff is, well, random.
>
> It's got to be somewhere.

No it doesn't, a cutoff is useless. If I steal cputime in units below
the cutoff I can, in the limit, steal up to 100% cpu time and you'd not
notice. E.g. back-to-back 0.99us interruptions would never trip a 1us
cutoff while consuming nearly all of the cpu.

> > If you want to reduce interrupts, that's fine, but not counting an
> > interrupt because it's below the magic 1us marker sounds a bit, well,
> > magic -- might work for you, might not for me on another machine, might
> > even be compiler dependent.
>
> The point is to reduce the interruptions of the user space application.
> Hardware interrupts etc. are something else. Maybe I should not use the
> term interrupt. CPU unavailability? Cache pollution?

The thing is, a 1us cut-off, or any other cut-off, doesn't tell you
_anything_ about availability whatsoever.

> > So 5 <1us interruptions are not accounted at all, whereas a single >1us
> > interruption is. I'd rather get rid of those 5 than try and shave a bit
> > off the one, if you get what I mean.
>
> Ok. We can set the threshold lower and see how that goes.

Pointless. See above.

> > Furthermore, yes the scheduler is one of those jiffy tick users, but
> > there are more. We can do ntp/gtod things in there, there is process
> > accounting, there is some RCU machinery, timers etc..
>
> Again things were fine before the scheduler changed.

Sure, it might be the scheduler, it might not be. I'm perfectly fine
with it being the scheduler, but show me where you're spending your time
and we can try and do something about it.

> Could we defer interrupts if a single process is running on a cpu and
> there is no other process on the runqueue and no other use of the timer
> interrupt?

If you can resolve all dependencies on the jiffy tick, maybe. Last time
I looked there were too many.

2009-04-17 15:35:56

by Ingo Molnar

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)


* Christoph Lameter <[email protected]> wrote:

> On Fri, 17 Apr 2009, Peter Zijlstra wrote:
>
> > And a random 1us cutoff is, well, random.
>
> It's got to be somewhere.

Sorry, that's not a rational answer that makes any sense.

Peter's point is statistics 101: please show absolute values not an
event distribution cutoff - how much total time do we spend in the
kernel in that workload?

Is the overhead 1%? 2%? 0.5%? And how did it change from 2.6.22
onwards? Did it go up by 0.1%, from 1% to 1.1%? Or did the average
go down by 0.05%, while increasing the spread of events (thus
fooling your cutoff)?

These are very simple, very basic, very straightforward questions -
and no straight answer was forthcoming from you. Are you not
interested in that answer?

Ingo

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 17 Apr 2009, Ingo Molnar wrote:

>
> * Christoph Lameter <[email protected]> wrote:
>
> > On Fri, 17 Apr 2009, Peter Zijlstra wrote:
> >
> > > And a random 1us cutoff is, well, random.
> >
> > It's got to be somewhere.
>
> Sorry, that's not a rational answer that makes any sense.

There must be criteria for what constitutes an interruption of the user
space application that we care about. There are numerous effects (like
cacheline fetches etc.) that also cause holdoffs, so it's not only the OS.
There are also other randomizing influences due to the exact location in
the measurement code path where the interrupt occurred. The criterion used
here was that the cpu was not executing application code for longer than 1
usec. We can discuss whether this makes sense or what else would make
sense. But the measurements here are entirely from the perspective of the
user space program experiencing OS noise.

> Peter's point is statistics 101: please show absolute values not an
> event distribution cutoff - how much total time do we spend in the
> kernel in that workload?

Further details are included in the document that I pointed him to. The
"workload" is a synthetic test: a busy loop continually reading the TSC
register.

http://www.kernel.org/pub/linux/kernel/people/christoph/collab-spring-2009/Latencharts.ods

> Is the overhead 1%? 2%? 0.5%? And how did it change from 2.6.22
> onwards? Did it go up by 0.1%, from 1% to 1.1%? Or did the average
> go down by 0.05%, while increasing the spread of events (thus
> fooling your cutoff)?

As you see in the diagrams provided there is a 4-fold increase in the
number of interrupts >1usec when going from 2.6.22 to 2.6.23. How would
you measure the overhead? Time spent in the OS? Disturbance of the caches
by the OS that causes the application to have to refetch data from RAM?

> These are very simple, very basic, very straightforward questions -
> and no straight answer was forthcoming from you. Are you not
> interested in that answer?

Is this goofiness going to continue? I provided measurements from a user
space perspective and you keep on demanding that I take other
measurements. I can dig deeper into the reasons for these regressions and
we can discuss what would be useful to measure, but that won't happen
overnight.

2009-04-17 16:24:18

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 2009-04-17 at 11:55 -0400, Christoph Lameter wrote:

> Further details are included in the document that I pointed him to. The
> "workload" is a synthetic test: a busy loop continually reading the TSC
> register.
>
> http://www.kernel.org/pub/linux/kernel/people/christoph/collab-spring-2009/Latencharts.ods

You're really going to make me open that thing, eh?

As we've established, graph 1 is useless.

I'm still not quite sure why you couldn't provide the data for the other
graphs in email. It is not all that much:

Graph 2: Noise Length

Kernel       Test 1  Test 2  Test 3  Interruption (AVG)
2.6.22         2.55    2.61    1.92     2.36
2.6.23         1.33    1.38    1.34     1.35
2.6.24         1.97    1.86    1.87     1.90
2.6.25         2.09    2.29    2.09     2.16
2.6.26         1.49    1.22    1.22     1.31
2.6.27         1.67    1.28    1.18     1.38
2.6.28         1.27    1.21    1.14     1.21
2.6.29         1.44    1.33    1.54     1.44
2.6.30-rc2     2.06    1.49    1.24     1.60

This is pretty useless too, since it only counts >1us events. Hence it
will always be biased.

Much better would have been a graph constructed from a histogram that
plots all lengths in (say) 10ns buckets. There you would have had a few
(at least 2) peaks: one around the time it takes to complete the
userspace loop, and one around the time it takes to do this interrupt
thing -- and maybe some other smaller ones. But now we're clueless.

Combined with graph 1, we can only compare .26+, and there we can see
there is some variance in how long a tick takes between kernels -- but a
std-dev along with this avg would have been even better.

For the past few releases I cannot remember anything that would
immediately affect the tick length. Nothing structural was changed
around that code.

> > Is the overhead 1%? 2%? 0.5%? And how did it change from 2.6.22
> > onwards? Did it go up by 0.1%, from 1% to 1.1%? Or did the average
> > go down by 0.05%, while increasing the spread of events (thus
> > fooling your cutoff)?
>
> As you see in the diagrams provided there is a 4-fold increase in the
> number of interrupts >1usec when going from 2.6.22 to 2.6.23. How would
> you measure the overhead? Time spent in the OS? Disturbance of the caches
> by the OS that causes the application to have to refetch data from RAM?

You could for example run an NMI profiler at 10000 Hz and collect
samples. Or use PMU hardware to collect numbers.

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 17 Apr 2009, Peter Zijlstra wrote:

> As we've established, graph 1 is useless.

It certainly shows that the number of >1usec interrupts increases.

> I'm still not quite sure why you couldn't provide the data for the other
> graphs in email. It is not all that much:

Could have, but I thought I'd better focus on one.

> Graph 2: Noise Length
>
> Kernel       Test 1  Test 2  Test 3  Interruption (AVG)
> 2.6.22         2.55    2.61    1.92     2.36
> 2.6.23         1.33    1.38    1.34     1.35
> 2.6.24         1.97    1.86    1.87     1.90
> 2.6.25         2.09    2.29    2.09     2.16
> 2.6.26         1.49    1.22    1.22     1.31
> 2.6.27         1.67    1.28    1.18     1.38
> 2.6.28         1.27    1.21    1.14     1.21
> 2.6.29         1.44    1.33    1.54     1.44
> 2.6.30-rc2     2.06    1.49    1.24     1.60
>
> This is pretty useless too, since it only counts >1us events. Hence it
> will always be biased.

So what would be useful is to have all the differences between
measurements in nanoseconds? I can just remove the cutoff and then
measure in tenths of usecs. Would that be okay?

> You could for example run an NMI profiler at 10000 Hz and collect
> samples. Or use PMU hardware to collect numbers.

The point is to have a test app that can be used to measure OS noise
that may have a variety of causes.

Such a test app needs to be as simple as possible. The TSC loop should be
fine as far as I can tell. The histogram is already there. I can just
remove the 1usec cutoff to get you full details.

2009-04-17 16:49:55

by Ingo Molnar

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)


* Peter Zijlstra <[email protected]> wrote:

> > > Is the overhead 1%? 2%? 0.5%? And how did it change from
> > > 2.6.22 onwards? Did it go up by 0.1%, from 1% to 1.1%? Or did
> > > the average go down by 0.05%, while increasing the spread of
> > > events (thus fooling your cutoff)?
> >
> > As you see in the diagrams provided there is a 4-fold increase
> > in the number of interrupts >1usec when going from 2.6.22 to
> > 2.6.23. How would you measure the overhead? Time spent in the
> > OS? Disturbance of the caches by the OS that causes the
> > application to have to refetch data from RAM?
>
> You could for example run an NMI profiler at 10000 Hz and collect
> samples. Or use PMU hardware to collect numbers.

it's even simpler than that: use a user-space loop and an alarm()
based signal-handler triggered measurement to count the number of
loops user-space can execute per second.

Such measurements can very easily be made much more precise than
0.1%, and in a tight loop there are basically no caching issues - it
measures pure 'cycles left for user-space' performance.
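
A minimal sketch of what I mean (the 1 second alarm() window is just an
example):

/* Count how many iterations of a pure user-space loop complete
 * in a 1 second alarm() window. */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t expired;

static void on_alarm(int sig)
{
        expired = 1;
}

int main(void)
{
        volatile unsigned long loops = 0;

        signal(SIGALRM, on_alarm);
        alarm(1);                       /* 1 second measurement window */
        while (!expired)
                loops++;                /* pure user-space work */
        printf("%lu loops/sec\n", loops);
        return 0;
}

Comparing loops/sec across kernels then shows the relative change in
cpu time left for user-space.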

Ingo

2009-04-17 17:19:22

by Chris Friesen

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

Ingo Molnar wrote:

> it's even simpler than that: use a user-space loop and an alarm()
> based signal-handler triggered measurement to count the number of
> loops user-space can execute per second.

Alternatively, just measure how long it takes to execute a given (large)
number of loops. This avoids the signal handler.

Chris

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

Ok, I made some changes to the latencytest program (gentwo.org/ll). I hope
you can tell me what exactly you would consider meaningful numbers.

Here is a sample execution with a granularity of 1usec, a bucket size
of 1usec, and the minimum set to zero (2.6.29, HZ=1000):

latencytest -d -m0 -b1000

CPUs: Freq=3.16Ghz Processors=8 Cores=4 cacheline_size=64 Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
522 involuntary context switches
207231729 (100.00000000%) variances in 10.00 seconds: minimum 0.04us
maximum 109.58us average 0.05us
usecs percent samples
-------------------------
0.0 100.0 207221180
2.0 0.0 35
3.0 0.0 3691
4.0 0.0 4637
5.0 0.0 891
6.0 0.0 464
7.0 0.0 90
8.0 0.0 83
9.0 0.0 44
10.0 0.0 42
11.0 0.0 42
12.0 0.0 171
13.0 0.0 150
14.0 0.0 44
15.0 0.0 19
16.0 0.0 23
17.0 0.0 16
18.0 0.0 9
19.0 0.0 15
20.0 0.0 2
21.0 0.0 6
22.0 0.0 4
23.0 0.0 4
24.0 0.0 13
25.0 0.0 2
27.0 0.0 2
29.0 0.0 2
30.0 0.0 3
31.0 0.0 1
32.0 0.0 3
33.0 0.0 2
34.0 0.0 1
36.0 0.0 2
37.0 0.0 1
38.0 0.0 1
39.0 0.0 7
40.0 0.0 7
41.0 0.0 9
42.0 0.0 3
43.0 0.0 1
44.0 0.0 1
46.0 0.0 1
49.0 0.0 1
50.0 0.0 1
57.0 0.0 1
61.0 0.0 1
109.0 0.0 1

Setting the minimum to 1usec yields:

christoph@JTCHITW00139:~/lldiag-0.2$ ./latencytest -d -m1000 -b1000
CPUs: Freq=3.16Ghz Processors=8 Cores=4 cacheline_size=64 Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
247597652 samples below 1000 nsec
28 involuntary context switches
10011 (0.00404309%) variances in 10.00 seconds: minimum 3.20us maximum 336.82us average 5.35us
usecs percent samples
-------------------------
3.0 6.4 645
4.0 28.6 2866
5.0 57.7 5781
6.0 4.5 448
7.0 1.1 113
8.0 0.4 39
9.0 0.6 59
10.0 0.2 19
11.0 0.0 1
12.0 0.0 2
13.0 0.0 4
14.0 0.0 1
15.0 0.0 1
16.0 0.0 2
20.0 0.0 3
26.0 0.0 1
28.0 0.0 1
30.0 0.0 1
37.0 0.0 1
38.0 0.0 1
43.0 0.0 1
45.0 0.0 3
46.0 0.0 1
47.0 0.0 2
48.0 0.0 1
53.0 0.0 1
55.0 0.0 1
56.0 0.0 4
57.0 0.0 4
61.0 0.0 1
76.0 0.0 1
93.0 0.0 1
336.0 0.0 1

More fine-grained resolution (100ns buckets; measurements below 100ns are not counted):

latencytest -d -m100 -b100

CPUs: Freq=3.16Ghz Processors=8 Cores=4 cacheline_size=64 Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
266619352 samples below 100 nsec
19 involuntary context switches
13254 (0.00497088%) variances in 10.00 seconds: minimum 0.10us maximum
143.63us average 3.63us
usecs percent samples
-------------------------
0.1 23.7 3141
0.2 0.7 87
0.3 0.1 12
0.5 0.0 1
0.6 0.0 1
1.9 0.1 17
2.0 0.4 47
2.1 1.0 127
2.2 2.3 309
2.3 0.3 35
2.4 0.2 25
2.5 0.2 22
2.6 0.2 23
2.7 2.0 267
2.8 4.2 559
2.9 0.4 52
3.0 1.1 143
3.1 1.5 198
3.2 1.1 149
3.3 0.7 97
3.4 0.6 78
3.5 0.6 83
3.6 0.9 121
3.7 0.8 112
3.8 4.1 547
3.9 1.5 194
4.0 0.3 42
4.1 0.1 15
4.2 0.2 23
4.3 0.8 108
4.4 0.5 63
4.5 0.2 20
4.6 0.2 25
4.7 0.2 28
4.8 0.3 44
4.9 0.2 24
5.0 1.0 126
5.1 3.1 414
5.2 18.0 2385
5.3 6.5 857
5.4 4.9 656
5.5 2.5 335
5.6 3.4 448
5.7 3.4 445
5.8 2.0 268
5.9 0.9 116
6.0 0.4 54
6.1 0.2 22
6.2 0.3 36
6.3 0.1 17
6.4 0.1 10
6.5 0.1 11
6.6 0.1 9
6.7 0.1 9
6.8 0.1 8
6.9 0.0 4
7.0 0.1 8
7.1 0.0 6
7.2 0.1 7
7.3 0.1 14
7.4 0.1 7
7.5 0.0 4
7.6 0.0 3
7.7 0.0 6
7.8 0.1 7
7.9 0.0 6
8.0 0.0 5
8.1 0.0 4
8.2 0.0 3
8.3 0.0 2
8.4 0.0 2

The group around 5usec is likely the timer interrupt.

2009-04-17 18:12:43

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 2009-04-17 at 13:45 -0400, Christoph Lameter wrote:
> Ok, I made some changes to the latencytest program (gentwo.org/ll). I hope
> you can tell me what exactly you would consider meaningful numbers.
>

> More fine-grained resolution (100ns buckets; measurements below 100ns are not counted):
>
> latencytest -d -m100 -b100
>
> CPUs: Freq=3.16Ghz Processors=8 Cores=4 cacheline_size=64 Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
> 266619352 samples below 100 nsec
> 19 involuntary context switches
> 13254 (0.00497088%) variances in 10.00 seconds: minimum 0.10us maximum
> 143.63us average 3.63us
> usecs percent samples
> -------------------------
> 0.1 23.7 3141
> 0.2 0.7 87
> 0.3 0.1 12
> 0.5 0.0 1
> 0.6 0.0 1
> 1.9 0.1 17
> 2.0 0.4 47
> 2.1 1.0 127
> 2.2 2.3 309
> 2.3 0.3 35
> 2.4 0.2 25
> 2.5 0.2 22
> 2.6 0.2 23
> 2.7 2.0 267
> 2.8 4.2 559
> 2.9 0.4 52
> 3.0 1.1 143
> 3.1 1.5 198
> 3.2 1.1 149
> 3.3 0.7 97
> 3.4 0.6 78
> 3.5 0.6 83
> 3.6 0.9 121
> 3.7 0.8 112
> 3.8 4.1 547
> 3.9 1.5 194
> 4.0 0.3 42
> 4.1 0.1 15
> 4.2 0.2 23
> 4.3 0.8 108
> 4.4 0.5 63
> 4.5 0.2 20
> 4.6 0.2 25
> 4.7 0.2 28
> 4.8 0.3 44
> 4.9 0.2 24
> 5.0 1.0 126
> 5.1 3.1 414
> 5.2 18.0 2385
> 5.3 6.5 857
> 5.4 4.9 656
> 5.5 2.5 335
> 5.6 3.4 448
> 5.7 3.4 445
> 5.8 2.0 268
> 5.9 0.9 116
> 6.0 0.4 54
> 6.1 0.2 22
> 6.2 0.3 36
> 6.3 0.1 17
> 6.4 0.1 10
> 6.5 0.1 11
> 6.6 0.1 9
> 6.7 0.1 9
> 6.8 0.1 8
> 6.9 0.0 4
> 7.0 0.1 8
> 7.1 0.0 6
> 7.2 0.1 7
> 7.3 0.1 14
> 7.4 0.1 7
> 7.5 0.0 4
> 7.6 0.0 3
> 7.7 0.0 6
> 7.8 0.1 7
> 7.9 0.0 6
> 8.0 0.0 5
> 8.1 0.0 4
> 8.2 0.0 3
> 8.3 0.0 2
> 8.4 0.0 2
>
> The group around 5usec is likely the timer interrupt.

Right, this is nice.

There are a number of peaks (aside from 0): 2.2, 2.8, 3.1, 3.8, 4.3 and
5.2 -- and I suspect your guess about 5.2 is right.

Something like this is nice to compare between kernels. Chris'
suggestion of timing a simple fixed loop:

$ time (let i=1000000; while [ $i -gt 0 ]; do let i--; done)

real 0m14.389s
user 0m13.787s
sys 0m0.498s

Is also useful, since it gives an absolute measure of time available to
user-space.

Although I suspect a simple C while(i--); might be better due to less
code involved.
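
E.g. (untested sketch; uses clock_gettime(), so link with -lrt on older
glibc):

/* Untested sketch: time a fixed amount of pure user-space work. */
#include <stdio.h>
#include <time.h>

int main(void)
{
        volatile long i = 1000000000L;  /* fixed loop count */
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (i--)
                ;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%.6f seconds\n", (t1.tv_sec - t0.tv_sec) +
               (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
}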


Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 17 Apr 2009, Peter Zijlstra wrote:

> Something like this is nice to compare between kernels. Chris'
> suggestion of timing a simple fixed loop:
>
> $ time (let i=1000000; while [ $i -gt 0 ]; do let i--; done)
>
> real 0m14.389s
> user 0m13.787s
> sys 0m0.498s
>
> Is also useful, since it gives an absolute measure of time available to
> user-space.
>
> Although I suspect a simple C while(i--); might be better due to less
> code involved.

The absolute time available to user space is not that important. It is
important that the processor stays available during latency critical
operations and is not taken away by the OS. The intervals that the OS
takes the processor away determine the minimum interval within which the
application has to react to events (e.g. RDMA transfers via InfiniBand,
or operations on requests coming in via shared memory). These operations
often must occur in parallel on multiple cores. Processing is delayed if
any of the cores encounters a delay due to OS noise.

The latencytest code simulates a busy processor (no system calls, all
memory is prefaulted). For some reason Linux is increasingly taking time
away from such processes (that intentionally run uncontended on a
dedicated processor). This causes regressions so that current upstream is
not usable for these applications.

It would be best for these applications if the processor would be left
undisturbed. There is likely not much that the OS needs to do on a busy
processor if there are no competing threads and if there is no I/O taking
place.

2009-04-17 18:58:52

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 2009-04-17 at 14:20 -0400, Christoph Lameter wrote:
> On Fri, 17 Apr 2009, Peter Zijlstra wrote:
>
> > Something like this is nice to compare between kernels. Chris'
> > suggestion of timing a simple fixed loop:
> >
> > $ time (let i=1000000; while [ $i -gt 0 ]; do let i--; done)
> >
> > real 0m14.389s
> > user 0m13.787s
> > sys 0m0.498s
> >
> > Is also useful, since it gives an absolute measure of time available to
> > user-space.
> >
> > Although I suspect a simple C while(i--); might be better due to less
> > code involved.
>
> The absolute time available to user space is not that important. It is
> important that the processor stays available during latency critical
> operations and is not taken away by the OS. The intervals that the OS
> takes the processor away determine the minimum interval within which the
> application has to react to events (e.g. RDMA transfers via InfiniBand,
> or operations on requests coming in via shared memory). These operations
> often must occur in parallel on multiple cores. Processing is delayed if
> any of the cores encounters a delay due to OS noise.

So you have hard deadlines in the order of us? Are these SCHED_FIFO
tasks or SCHED_OTHER?

> The latencytest code simulates a busy processor (no system calls, all
> memory is prefaulted). For some reason Linux is increasingly taking time
> away from such processes (that intentionally run uncontended on a
> dedicated processor). This causes regressions so that current upstream is
> not usable for these applications.
>
> It would be best for these applications if the processor would be left
> undisturbed. There is likely not much that the OS needs to do on a busy
> processor if there are no competing threads and if there is no I/O taking
> place.

Agreed -- that would be nice.

I can't really match the pattern to anything. The one thing that worried
me was sched_clock_tick(), which does a GTOD to sync the clocks between
cpus.

Your Xeon is a core2 class machine and should have a relatively stable
TSC; however, it's also dual socket, which I think defeats the
stableness.

What clocksource do you have?

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Thing is, that doesn't really match .23 being expensive while .25 isn't.

Also, looking over the rest of the scheduler tick code, I can't really
see what would be so expensive.

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 17 Apr 2009, Peter Zijlstra wrote:

> > often must occur in parallel on multiple cores. Processing is delayed if
> > any of the cores encounters a delay due to OS noise.
>
> So you have hard deadlines in the order of us? Are these SCHED_FIFO
> tasks or SCHED_OTHER?

SCHED_FIFO has the effect of removing all the involuntary context switches
but it does not affect the other interruptions.

> Your Xeon is a core2 class machine and should have a relatively stable
> TSC; however, it's also dual socket, which I think defeats the
> stableness.

> What clocksource do you have?
>
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

> Also, looking over the rest of the scheduler tick code, I can't really
> see what would be so expensive.

The expensiveness may be fine if we can limit the number of occurrences.
Maybe the histograms for those releases give more insight.

2009-04-17 20:53:31

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri, 2009-04-17 at 16:34 -0400, Christoph Lameter wrote:
> On Fri, 17 Apr 2009, Peter Zijlstra wrote:
>
> > > often must occur in parallel on multiple cores. Processing is delayed if
> > > any of the cores encounters a delay due to OS noise.
> >
> > So you have hard deadlines in the order of us? Are these SCHED_FIFO
> > tasks or SCHED_OTHER?
>
> SCHED_FIFO has the effect of removing all the involuntary context switches
> but it does not affect the other interruptions.

OK, that's good to know.

> > Your Xeon is a core2 class machine and should have a relatively stable
> > TSC; however, it's also dual socket, which I think defeats the
> > stableness.
>
> > What clocksource do you have?
> >
> > cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> tsc

Ah, good. I could measure a significant difference on my testbox between
tsc and acpi_pm.

> > Also, looking over the rest of the scheduler tick code, I can't really
> > see what would be so expensive.
>
> The expensiveness may be fine if we can limit the number of occurrences.
> Maybe the histograms for those releases give more insight.

Yeah, curious to see what .22 looks like -- readprofile/oprofile runs of
the kernel for those might be interesting as well.

2009-04-17 23:24:29

by Chris Friesen

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

Christoph Lameter wrote:

> The latencytest code simulates a busy processor (no system calls, all
> memory is prefaulted). For some reason Linux is increasingly taking time
> away from such processes (that intentionally run uncontended on a
> dedicated processor). This causes regressions so that current upstream is
> not usable for these applications.
>
> It would be best for these applications if the processor would be left
> undisturbed. There is likely not much that the OS needs to do on a busy
> processor if there are no competing threads and if there is no I/O taking
> place.

Peter/Ingo, could ftrace be used to determine where time is being spent
in the kernel with suitable accuracy? (This may be a dumb question, I
haven't played with ftrace much.)

Given that we're talking about tens of usecs of duration, statistical
sampling may not work all that well.

Chris

2009-04-18 07:35:58

by Ingo Molnar

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)


* Chris Friesen <[email protected]> wrote:

> Christoph Lameter wrote:
>
>> The latencytest code simulates a busy processor (no system calls, all
>> memory is prefaulted). For some reason Linux is increasingly taking time
>> away from such processes (that intentionally run uncontended on a
>> dedicated processor). This causes regressions so that current upstream is
>> not usable for these applications.
>>
>> It would be best for these applications if the processor would be
>> left undisturbed. There is likely not much that the OS needs to
>> do on a busy processor if there are no competing threads and if
>> there is no I/O taking place.
>
> Peter/Ingo, could ftrace be used to determine where time is being
> spent in the kernel with suitable accuracy? (This may be a dumb
> question, I haven't played with ftrace much.)

Yes, both ftrace and perfcounters can be used to directly trace or
sample any overhead/delay - no matter how sporadic it might be.

> Given that we're talking about tens of usecs of duration,
> statistical sampling may not work all that well.

Peter did rather convincing perfcounters profiles. Of course, if
only maximum latencies are concerned, tracing works better. But
mainline is pretty inadequate for that purpose anyway - there's a
whole PREEMPT_RT world with many stories to tell there ;-)

Ingo

2009-04-18 07:59:26

by Andi Kleen

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

"Chris Friesen" <[email protected]> writes:
>
> Given that we're talking about tens of usecs of duration, statistical
> sampling may not work all that well.

He could just use oprofile with a small period. The minimum period for
CPU_CLK_UNHALTED is 6000 (and it could probably be lowered by tweaking
the events). oprofile counters on x86 can be configured to only count
in ring 0, so you would get an entry every 6000 ring-0 cycles. That
should give a pretty good picture of where the kernel is spending time.

-Andi


--
[email protected] -- Speaking for myself only.

2009-04-25 08:16:23

by Pavel Machek

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Fri 2009-04-17 17:28:57, Peter Zijlstra wrote:
> On Fri, 2009-04-17 at 11:04 -0400, Christoph Lameter wrote:
> > On Fri, 17 Apr 2009, Peter Zijlstra wrote:
> >
> > > And a random 1us cutoff is, well, random.
> >
> > It's got to be somewhere.
>
> No it doesn't, a cutoff is useless. If I steal cputime in units below
> the cutoff I can, in the limit, steal up to 100% cpu time and you'd not
> notice.

Are you sure cutoff is useless?

Irq entry/exit is sure to take > 1usec, right?

> > > So 5 <1us interruptions are not accounted at all, whereas a single >1us
> > > interruption is. I'd rather get rid of those 5 than try and shave a bit
> > > off the one, if you get what I mean.
> >
> > Ok. We can set the threshold lower and see how that goes.
>
> Pointless. See above.

I'd say that taking an interrupt and returning in below 1000nsec is
impossible, even on very fast hw.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Thu, 23 Apr 2009, Pavel Machek wrote:

> I'd say that taking an interrupt and returning in below 1000nsec is
> impossible, even on very fast hw.

Hmmm.... Could it be that CONFIG_NO_HZ caused the timer interrupt to be
avoided in earlier releases? NO_HZ theory at least is to only schedule
interrupts when necessary. So an interrupt is needed at the end of the
scheduling intervals and not at HZ frequency.

2009-04-28 21:23:43

by Peter Zijlstra

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Tue, 2009-04-28 at 17:02 -0400, Christoph Lameter wrote:
> On Thu, 23 Apr 2009, Pavel Machek wrote:
>
> > I'd say that taking an interrupt and returning in below 1000nsec is
> > impossible, even on very fast hw.
>
> Hmmm.... Could it be that CONFIG_NO_HZ caused the timer interrupt to be
> avoided in earlier releases? NO_HZ theory at least is to only schedule
> interrupts when necessary. So an interrupt is needed at the end of the
> scheduling intervals and not at HZ frequency.

No, NOHZ only works on idle cpus.

Subject: Re: Scheduler regression: Too frequent timer interrupts(?)

On Tue, 28 Apr 2009, Peter Zijlstra wrote:

> On Tue, 2009-04-28 at 17:02 -0400, Christoph Lameter wrote:
> > On Thu, 23 Apr 2009, Pavel Machek wrote:
> >
> > > I'd say that taking an interrupt and returning in below 1000nsec is
> > > impossible, even on very fast hw.
> >
> > Hmmm.... Could it be that CONFIG_NO_HZ caused the timer interrupt to be
> > avoided in earlier releases? NO_HZ theory at least is to only schedule
> > interrupts when necessary. So an interrupt is needed at the end of the
> > scheduling intervals and not at HZ frequency.
>
> No, NOHZ only works on idle cpus.

It is advertised to do more:

config NO_HZ
bool "Tickless System (Dynamic Ticks)"
depends on GENERIC_TIME && GENERIC_CLOCKEVENTS
select TICK_ONESHOT
help
This option enables a tickless system: timer interrupts will
only trigger on an as-needed basis both when the system is
busy and when the system is idle.


So NO_HZ is really IDLE_NO_HZ. The system is not tickless.