2013-06-20 19:46:24

by Dave Chiluk

Subject: Scheduler accounting inflated for io bound processes.

Running the below testcase shows each process consuming 41-43% of its
respective cpu while per core idle numbers show 63-65%, a disparity of
roughly 4-8%. Is this a bug, known behaviour, or consequence of the
process being io bound?

1. run sudo taskset -c 0 netserver
2. run taskset -c 1 netperf -H localhost -l 3600 -t TCP_RR & (start
netperf pinned to cpu1)
3. run top, press 1 for multiple CPUs to be separated

The output below is from top. Notice that cpu0 is ~67% idle (i.e. ~33%
busy) while the processes each claim ~42% usage, a discrepancy of roughly 9%.

------------------------------------------------
top - 19:27:38 up 4:08, 2 users, load average: 0.45, 0.19, 0.13
Tasks: 85 total, 2 running, 83 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.8 us, 15.4 sy, 0.0 ni, 66.7 id, 0.0 wa, 0.0 hi, 17.1 si, 0.0 st
%Cpu1 : 0.8 us, 17.3 sy, 0.0 ni, 63.1 id, 0.0 wa, 0.0 hi, 18.8 si, 0.0 st
KiB Mem: 4049180 total, 252952 used, 3796228 free, 23108 buffers
KiB Swap: 0 total, 0 used, 0 free, 132932 cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6150 root 20 0 9756 700 536 R 42.6 0.0 0:48.90 netserver
6149 ubuntu 20 0 11848 1056 852 S 42.2 0.0 0:48.92 netperf
------------------------------------------------

The above testcase was run on 3.10-rc6.

So is this a bug or can someone explain to me why this isn't a bug?

The related ubuntu bug is
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1193073


2013-06-25 16:02:16

by Mike Galbraith

Subject: Re: Scheduler accounting inflated for io bound processes.

On Thu, 2013-06-20 at 14:46 -0500, Dave Chiluk wrote:
> Running the below testcase shows each process consuming 41-43% of its
> respective cpu while per core idle numbers show 63-65%, a disparity of
> roughly 4-8%. Is this a bug, known behaviour, or consequence of the
> process being io bound?

All three I suppose. Idle is indeed inflated when softirq load is
present. Exactly which numbers you see depends on the ACCOUNTING config.
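
To make the tick-accounting part concrete, a minimal sketch (illustrative
only, not kernel code) of what CONFIG_TICK_CPU_ACCOUNTING style accounting
does: the whole jiffy is charged to whatever the CPU happens to be doing at
the instant the tick fires, so a workload that runs in sub-tick bursts is
measured by sampling rather than by actually clocking it, and the per-task
and per-cpu totals need not add up.

/* tick_sketch.c - illustrative only, not kernel code: with tick based
 * accounting one whole jiffy is credited to whichever context the CPU
 * is in when the tick interrupt fires (user, system, softirq or idle),
 * no matter how little of that jiffy the context actually used.
 */
#include <stdio.h>

enum ctx { CTX_USER, CTX_SYSTEM, CTX_SOFTIRQ, CTX_IDLE, CTX_MAX };
static const char *name[CTX_MAX] = { "us", "sy", "si", "id" };

static void account_tick(unsigned long stat[CTX_MAX], enum ctx c)
{
        stat[c]++;                      /* full jiffy to whoever is live now */
}

int main(void)
{
        unsigned long stat[CTX_MAX] = { 0 };
        /* made-up trace of what the tick happened to catch over 10 jiffies */
        enum ctx seen[] = { CTX_SYSTEM, CTX_IDLE, CTX_SOFTIRQ, CTX_IDLE,
                            CTX_SYSTEM, CTX_IDLE, CTX_SOFTIRQ, CTX_IDLE,
                            CTX_SYSTEM, CTX_IDLE };
        int i, n = sizeof(seen) / sizeof(seen[0]);

        for (i = 0; i < n; i++)
                account_tick(stat, seen[i]);

        for (i = 0; i < CTX_MAX; i++)
                printf("%s %5.1f%%  ", name[i], 100.0 * stat[i] / n);
        printf("\n");
        return 0;
}

Whatever runs between ticks without being caught at a tick edge never
shows up at all; whoever is caught at the edge gets the whole jiffy.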

There are lies, there are damn lies.. and there are statistics.

> 1. run sudo taskset -c 0 netserver
> 2. run taskset -c 1 netperf -H localhost -l 3600 -t TCP_RR & (start
> netperf pinned to cpu1)
> 3. run top, press 1 for multiple CPUs to be separated

CONFIG_TICK_CPU_ACCOUNTING cpu[23] isolated

cgexec -g cpuset:rtcpus netperf.sh 999&sleep 300 && killall -9 top

%Cpu2 : 6.8 us, 42.0 sy, 0.0 ni, 42.0 id, 0.0 wa, 0.0 hi, 9.1 si, 0.0 st
%Cpu3 : 5.6 us, 43.3 sy, 0.0 ni, 40.0 id, 0.0 wa, 0.0 hi, 11.1 si, 0.0 st
^^^^
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
7226 root 20 0 8828 336 192 S 57.6 0.0 2:49.40 3 netserver 100*(2*60+49.4)/300 = 56.4
7225 root 20 0 8824 648 504 R 55.6 0.0 2:46.55 2 netperf 100*(2*60+46.55)/300 = 55.5

Ok, accumulated time ~agrees with %CPU snapshots.

cgexec -g cpuset:rtcpus taskset -c 3 schedctl -I pert 5

(pert is a self-calibrating TSC tight-loop perturbation-measurement
proggy; it enters the kernel once per 5s period for a write. It doesn't
account for post-period stats processing/output time, but since it runs
SCHED_IDLE it gets VERY little CPU when competing, so it runs more or
less only when netserver is idle. Plenty good enough proxy for idle.)
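
(For anyone who wants to play along without the real thing, a rough
stand-in for that kind of perturbation loop, using clock_gettime()
instead of a calibrated raw TSC read and without the SCHED_IDLE setup;
100% minus the reported overhead approximates true idle:)

/* pert_stand_in.c - rough stand-in for a perturbation measurement loop,
 * not the real pert source.  Spin in userspace, timestamp every
 * iteration, and count any gap bigger than a small threshold as time
 * stolen by someone else (scheduler, irq, softirq).  The per-period
 * "overhead" is then an upper bound on how busy the CPU really was
 * with other work.
 */
#include <stdio.h>
#include <time.h>

static double now(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        const double thresh = 5e-6;     /* ignore the loop's own cost */
        double start = now(), prev = start, stolen = 0.0;

        for (;;) {
                double t = now();

                if (t - prev > thresh)
                        stolen += t - prev;
                prev = t;

                if (t - start >= 5.0) { /* 5s reporting period */
                        printf("overhead: %5.2f%%\n",
                               100.0 * stolen / (t - start));
                        stolen = 0.0;
                        start = t;
                }
        }
}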
...
cgexec -g cpuset:rtcpus netperf.sh 9999
...
pert/s: 81249 >17.94us: 24 min: 0.08 max: 33.89 avg: 8.24 sum/s:669515us overhead:66.95%
pert/s: 81151 >18.43us: 25 min: 0.14 max: 37.53 avg: 8.25 sum/s:669505us overhead:66.95%
^^^^^^^^^^^^^^^^^^^^^^^
pert's userspace TSC loop gets ~32% (~= an upper bound on idle), while
reported idle is ~40%, a disparity of ~8%.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
23067 root 20 0 8828 340 196 R 57.5 0.0 0:19.15 3 netserver
23040 root 20 0 8208 396 304 R 42.7 0.0 0:35.61 3 pert
^^^^ ~10% disparity.

perf record -e irq:softirq* -a -C 3 -- sleep 00
perf report --sort=comm

99.80% netserver
0.20% pert

pert does ~zip softirq processing (timer+rcu) and spends ~zip time in
the kernel.

Repeat.

cgexec -g cpuset:rtcpus netperf.sh 3600
pert/s: 80860 >474.34us: 0 min: 0.06 max: 35.26 avg: 8.28 sum/s:669197us overhead:66.92%
pert/s: 80897 >429.20us: 0 min: 0.14 max: 37.61 avg: 8.27 sum/s:668673us overhead:66.87%
pert/s: 80800 >388.26us: 0 min: 0.14 max: 31.33 avg: 8.26 sum/s:667277us overhead:66.73%

%Cpu3 : 36.3 us, 51.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 12.1 si, 0.0 st
^^^^ ~agrees with pert
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
23569 root 20 0 8828 340 196 R 57.2 0.0 0:21.97 3 netserver
23040 root 20 0 8208 396 304 R 42.9 0.0 6:46.20 3 pert
^^^^ pert is VERY nearly 100% userspace
one of those numbers is a.. statistic
Kills pert...

%Cpu3 : 3.4 us, 42.5 sy, 0.0 ni, 41.4 id, 0.1 wa, 0.0 hi, 12.5 si, 0.0 st
^^^ ~agrees that pert's us claim did go away, but wth is up
with sy? It dropped ~9% after killing an ~100% us proggy. nak
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
23569 root 20 0 8828 340 196 R 56.6 0.0 2:50.80 3 netserver

Yup, adding softirq load turns utilization numbers into.. statistics.
Pure cpu load idle numbers look fine.

-Mike

2013-06-25 17:49:24

by Mike Galbraith

Subject: Re: Scheduler accounting inflated for io bound processes.

On Tue, 2013-06-25 at 18:01 +0200, Mike Galbraith wrote:
> On Thu, 2013-06-20 at 14:46 -0500, Dave Chiluk wrote:
> > Running the below testcase shows each process consuming 41-43% of its
> > respective cpu while per core idle numbers show 63-65%, a disparity of
> > roughly 4-8%. Is this a bug, known behaviour, or consequence of the
> > process being io bound?
>
> All three I suppose.

P.S.

perf top --sort=comm -C 3 -d 5 -F 250 (my tick freq)
56.65% netserver
43.35% pert

perf top --sort=comm -C 3 -d 5
67.16% netserver
32.84% pert

If you sample a high freq signal (netperf TCP_RR) at low freq (tick),
then try to reproduce the original signal, (very familiar) distortion
results. Perf doesn't even care about softirq yada yada, so seems it's
a pure sample rate thing.
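
To see how bad pure fixed-rate sampling of a phase-correlated load can
get, a toy model (assuming, unrealistically, that the busy bursts are
exactly periodic and an exact multiple of the tick rate, which the real
netperf run of course isn't):

/* alias_toy.c - toy model only, not the netperf workload: busy in
 * 0.4 ms bursts at exactly 1000 Hz (40% true utilization), sampled by
 * a fixed 250 Hz tick.  Because the burst rate is an exact multiple of
 * the tick rate, every tick lands on the same phase of the burst and
 * the estimate collapses to 0% or 100% depending on that phase.
 */
#include <stdio.h>
#include <math.h>

static int busy(double t)
{
        return fmod(t * 1000.0, 1.0) < 0.4;     /* 1000 Hz, 40% duty */
}

int main(void)
{
        const double tick = 1.0 / 250.0;        /* 250 Hz tick */
        const int nticks = 250 * 300;           /* 5 minutes */
        long hits = 0;
        int k;

        for (k = 1; k <= nticks; k++)
                hits += busy(k * tick + 1e-4);  /* fixed phase offset */

        printf("true utilization 40.0%%, tick estimate %.1f%%\n",
               100.0 * hits / nticks);          /* prints 100.0% here */
        return 0;
}

Shift the phase offset past the 0.4 ms burst and the very same run
reports 0% instead.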

-Mike

2013-06-26 09:37:23

by Ingo Molnar

Subject: Re: Scheduler accounting inflated for io bound processes.


* Mike Galbraith <[email protected]> wrote:

> On Tue, 2013-06-25 at 18:01 +0200, Mike Galbraith wrote:
> > On Thu, 2013-06-20 at 14:46 -0500, Dave Chiluk wrote:
> > > Running the below testcase shows each process consuming 41-43% of its
> > > respective cpu while per core idle numbers show 63-65%, a disparity of
> > > roughly 4-8%. Is this a bug, known behaviour, or consequence of the
> > > process being io bound?
> >
> > All three I suppose.
>
> P.S.
>
> perf top --sort=comm -C 3 -d 5 -F 250 (my tick freq)
> 56.65% netserver
> 43.35% pert
>
> perf top --sort=comm -C 3 -d 5
> 67.16% netserver
> 32.84% pert
>
> If you sample a high freq signal (netperf TCP_RR) at low freq (tick),
> then try to reproduce the original signal, (very familiar) distortion
> results. Perf doesn't even care about softirq yada yada, so seems it's
> a pure sample rate thing.

Would be very nice to randomize the sampling rate, by randomizing the
intervals within a 1% range or so - perf tooling will probably recognize
the different weights.
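
For illustration (a toy, not a patch against perf or the tick code):
take the phase-locked toy workload from earlier in the thread and
stretch or shrink each sampling interval by a uniformly random amount
within 1%; the accumulated phase drift decorrelates the samples from
the bursts and the estimate comes back to earth.

/* jitter_toy.c - same made-up workload as the earlier toy (1000 Hz
 * bursts, 40% duty), but each sampling interval is perturbed by a
 * uniformly random amount within +/-1%.  Illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int busy(double t)
{
        return fmod(t * 1000.0, 1.0) < 0.4;     /* 1000 Hz, 40% duty */
}

int main(void)
{
        const double tick = 1.0 / 250.0;        /* nominal 250 Hz */
        const int nticks = 250 * 3600;          /* one hour */
        double t = 1e-4;
        long hits = 0;
        int k;

        srand(1);
        for (k = 0; k < nticks; k++) {
                double jitter = ((double)rand() / RAND_MAX - 0.5) * 0.02;

                t += tick * (1.0 + jitter);     /* +/-1% random interval */
                hits += busy(t);
        }
        printf("true utilization 40.0%%, jittered estimate %.1f%%\n",
               100.0 * hits / nticks);          /* ~40%, not 100% or 0% */
        return 0;
}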

Thanks,

Ingo

2013-06-26 15:29:58

by Peter Zijlstra

Subject: Re: Scheduler accounting inflated for io bound processes.

On Wed, Jun 26, 2013 at 11:37:13AM +0200, Ingo Molnar wrote:
> Would be very nice to randomize the sampling rate, by randomizing the
> intervals within a 1% range or so - perf tooling will probably recognize
> the different weights.

You're suggesting adding noise to the regular kernel tick?

2013-06-26 15:50:54

by Ingo Molnar

Subject: Re: Scheduler accounting inflated for io bound processes.


* Peter Zijlstra <[email protected]> wrote:

> On Wed, Jun 26, 2013 at 11:37:13AM +0200, Ingo Molnar wrote:
> > Would be very nice to randomize the sampling rate, by randomizing the
> > intervals within a 1% range or so - perf tooling will probably recognize
> > the different weights.
>
> You're suggesting adding noise to the regular kernel tick?

No, to the perf interval (which I assumed Mike was using to profile this?)
- although slightly randomizing the kernel tick might make sense as well,
especially if it's hrtimer driven and reprogrammed anyway.

I might have gotten it all wrong though ...

Thanks,

Ingo

2013-06-26 16:01:53

by Mike Galbraith

Subject: Re: Scheduler accounting inflated for io bound processes.

On Wed, 2013-06-26 at 17:50 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, Jun 26, 2013 at 11:37:13AM +0200, Ingo Molnar wrote:
> > > Would be very nice to randomize the sampling rate, by randomizing the
> > > intervals within a 1% range or so - perf tooling will probably recognize
> > > the different weights.
> >
> > You're suggesting adding noise to the regular kernel tick?
>
> No, to the perf interval (which I assumed Mike was using to profile this?)

Yeah, perf top -F 250 exhibits the same inaccuracy as 250 Hz tick cpu
accounting. (sufficient sample jitter should cure it, but I think I'd
prefer to just live with it)

-Mike

2013-06-26 16:04:43

by David Ahern

Subject: Re: Scheduler accounting inflated for io bound processes.

On 6/26/13 9:50 AM, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Wed, Jun 26, 2013 at 11:37:13AM +0200, Ingo Molnar wrote:
>>> Would be very nice to randomize the sampling rate, by randomizing the
>>> intervals within a 1% range or so - perf tooling will probably recognize
>>> the different weights.
>>
>> You're suggesting adding noise to the regular kernel tick?
>
> No, to the perf interval (which I assumed Mike was using to profile this?)
> - although slightly randomizing the kernel tick might make sense as well,
> especially if it's hrtimer driven and reprogrammed anyway.
>
> I might have gotten it all wrong though ...

Sampled S/W events like cpu-clock have a fixed rate
(perf_swevent_init_hrtimer converts freq to sample_period).

Sampled H/W events have an adaptive period that converges to the desired
sampling rate. The first few samples come in 10 usecs or so apart and
the time period expands to the desired rate. As I recall that adaptive
algorithm starts over every time the event is scheduled in.
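
For illustration, a toy model of that frequency-to-period adaptation
(loosely in the spirit of perf_adjust_period(); the smoothing factor and
the numbers are invented): the period starts small, so the first samples
arrive microseconds apart, and it grows toward the period that matches
the requested rate as observations accumulate.

/* freq_toy.c - toy model of frequency based sampling: the requested
 * rate is N samples/sec for an event whose rate is unknown up front,
 * so the sample period keeps being recomputed from the rate observed
 * so far.  Not the kernel's actual update rule.
 */
#include <stdio.h>

int main(void)
{
        const double target_hz = 1000.0;   /* requested sampling rate */
        const double event_hz = 2.5e9;     /* true event rate, e.g. cycles */
        double rate_est = 0.0;             /* event rate seen so far */
        double period = 25000.0;           /* small start: ~10us between samples */
        int step;

        for (step = 0; step < 15; step++) {
                printf("sample %2d: period %12.0f events, interval %8.1f us\n",
                       step, period, period / event_hz * 1e6);

                /* fold the latest observation into the estimate, then
                 * pick the period that would hit the target rate      */
                rate_est = 0.75 * rate_est + 0.25 * event_hz;
                period = rate_est / target_hz;
        }
        return 0;
}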

David

2013-06-26 16:10:57

by Ingo Molnar

Subject: Re: Scheduler accounting inflated for io bound processes.


* David Ahern <[email protected]> wrote:

> On 6/26/13 9:50 AM, Ingo Molnar wrote:
> >
> >* Peter Zijlstra <[email protected]> wrote:
> >
> >>On Wed, Jun 26, 2013 at 11:37:13AM +0200, Ingo Molnar wrote:
> >>>Would be very nice to randomize the sampling rate, by randomizing the
> >>>intervals within a 1% range or so - perf tooling will probably recognize
> >>>the different weights.
> >>
> >>You're suggesting adding noise to the regular kernel tick?
> >
> >No, to the perf interval (which I assumed Mike was using to profile this?)
> >- although slightly randomizing the kernel tick might make sense as well,
> >especially if it's hrtimer driven and reprogrammed anyway.
> >
> >I might have gotten it all wrong though ...
>
> Sampled S/W events like cpu-clock have a fixed rate
> (perf_swevent_init_hrtimer converts freq to sample_period).
>
> Sampled H/W events have an adaptive period that converges to the desired
> sampling rate. The first few samples come in 10 usecs or so apart and
> the time period expands to the desired rate. As I recall that adaptive
> algorithm starts over every time the event is scheduled in.

Yes, but last I checked it (2 years ago? :-) the auto-freq code was
converging pretty well to the time clock, with little jitter - in essence
turning it into a fixed-period, fixed-frequency sampling method. That
would explain Mike's results.

Thanks,

Ingo

2013-06-26 16:13:29

by David Ahern

Subject: Re: Scheduler accounting inflated for io bound processes.

On 6/26/13 10:10 AM, Ingo Molnar wrote:
>> Sampled H/W events have an adaptive period that converges to the desired
>> sampling rate. The first few samples come in 10 usecs or so apart and
>> the time period expands to the desired rate. As I recall that adaptive
>> algorithm starts over every time the event is scheduled in.
>
> Yes, but last I checked it (2 years ago? :-) the auto-freq code was
> converging pretty well to the time clock, with little jitter - in essence
> turning it into a fixed-period, fixed-frequency sampling method. That
> would explain Mike's results.

It does converge quickly and stay there for CPU-based events. My point
was more along the lines that the code is there. Perhaps a tweak to add
jitter to the period would address fixed-period sampling effects.

David