2014-04-02 19:36:15

by Denys Vlasenko

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Mon, Mar 31, 2014 at 4:08 AM, Hidetoshi Seto
<[email protected]> wrote:
> There are 2 problems:
>
> [PROBLEM 1]: there is no exclusive control.
>
> It is easy to see that there are 2 different cpus involved - an
> observing cpu, running a program that reads an idle cpu's stats,
> and an observed cpu, which is performing idle. A race can happen
> if the observed cpu wakes up while it is being observed. The
> observer can accidentally add a calculated duration (say delta)
> to time stats that were just updated by the woken cpu. The
> correct idle time is returned on the next read, so the value
> appears to go backward. Therefore readers must be excluded by
> the writer.
>
> Then the time stats are updated by the woken cpu, so there is
> only one writer, right? No, unfortunately not. I'm not sure why,
> but for some reason the function get_cpu_{idle,iowait}_time_us()
> is able to update a sleeping cpu's time stats from the observing
> cpu. From a grep of today's kernel tree, this feature is used by
> the cpufreq module. Calling this function periodically works
> like emulating the tick for the sleeping cpu.

Frederic's patches started by moving all updates
to tick_nohz_stop_idle(), making the above problem easier -
get_cpu_{idle,iowait}_time_us() become pure readers.

The patches are here:

https://lkml.org/lkml/2013/10/19/86

> [PROBLEM 2]: broken iowait accounting.
>
> Historically, a cpu's idle time was accounted as either idle
> or iowait depending on the presence of tasks blocked on I/O.
> No one complained about it for a long time. However:
>
> > Still trying to wrap my head around it, but conceptually
> > get_cpu_iowait_time_us() doesn't make any kind of sense.
> > iowait isn't per cpu since effectively tasks that aren't
> > running aren't assigned a cpu (as Oleg already pointed out).
> -- Peter Zijlstra
>
> Now some kernel folks have realized that accounting iowait
> per-cpu does not make sense in the SMP world. In the traditional
> UP era, a cpu was considered to be waiting on I/O if it was idle
> while nr_iowait > 0. But these days, with SMP systems, tasks
> easily migrate from the cpu where they issued an I/O to another
> cpu, where they are queued after I/O completion.

However, if we put ourselves in the admin's seat, iowait
immediately starts to make sense: for an admin, the system state
where a lot of CPU time is genuinely idle is qualitatively different
from the state where a lot of CPU time is "idle" because
we are I/O bound.

Admins probably wouldn't insist that iowait accounting must be
very accurate. I would hazard a guess that admins would settle
for the following rules:

* (idle + iowait) should accurately represent the amount of time
CPUs were idle.
* both idle and iowait should never go backwards
* when the system is truly idle, only idle should increase
* when the system is truly I/O bound on all CPUs, only iowait should increase
* when the situation is between the above two cases,
both the iowait and idle counters should grow. It's ok if they
represent the idle/IO-bound ratio only approximately

> Back to the NO_HZ mechanism. The really terrible thing here is
> that the observer needs to handle the duration "delta" without
> knowing that nr_iowait of the sleeping cpu can easily be changed
> by migration even while the cpu is sleeping.

How about the following: when a CPU enters idle, it remembers
in struct tick_sched->idle_active whether it was "truly" idle
or I/O bound: something like

ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;

Then, when we exit idle, we account the entire idle period as
"true" idle or as iowait depending on the ts->idle_active value,
regardless of what happened to the I/O bound task (whether
it migrated or not).


> [WHAT THIS PATCH PROPOSED]: fix problem 1 first.
>
> To fix problem 1, this patch adds a seqlock for NO_HZ idle
> accounting to avoid possible races between multiple readers/writers.

That's what Frederic proposed too.
However, he thought it adds too much overhead: it adds
a new field, two memory barriers and an RMW cycle
in the updater (tick_nohz_stop_idle),
and similarly two memory barriers in the readers.

How about a slightly different approach.
We take his first two patches:

"nohz: Convert a few places to use local per cpu accesses"
"nohz: Only update sleeptime stats locally"

then on top of that we fix the racy access by readers as follows:

the updater does not need the ts->idle_entrytime field
after it is done, so we can reuse it as an "updater in progress"
flag. We set it to a sentinel value (say, zero),
then increase idle or iowait, then clear ts->idle_active,
with two memory barriers to ensure other CPUs see the
updates in exactly that order:

tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now):
...
/* Updates the per cpu time idle statistics counters */
delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
+ ts->idle_entrytime.tv64 = 0;
+ smp_wmb();
+ if (ts->idle_active == 2)
ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
else
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+ smp_wmb();
ts->idle_active = 0;

Compared to the seqlock approach, we don't need a new field (the seqlock
counter) and don't need to increment it twice (two RMW cycles). We have
the same number of wmb's as the seqlock approach has - two.

The iowait reader reads these three fields in the reverse order,
with matching read barriers:

get_cpu_iowait_time_us(int cpu, u64 *last_update_time):
...
+ again:
+ active = ACCESS_ONCE(ts->idle_active);
+ smp_rmb();
+ count = ACCESS_ONCE(ts->iowait_sleeptime);
+ if (active == 2) { // 2 means "idle was entered with pending I/O"
+ smp_rmb();
+ start = ACCESS_ONCE(ts->idle_entrytime);
....
+ }
+ return count;

Compared to the seqlock approach, we can avoid the second rmb (as shown
above) if we see that idle_active was not set: we know that in this case
we already have the result in "count", and there is no need to correct it.

Now, if idle_active was set (for the iowait case, to 2), we check for the
sentinel in idle_entrytime. If it is there, we are racing with the updater;
in this case, we loop back:

+ if (active == 2) {
+ smp_rmb();
+ start = ACCESS_ONCE(ts->idle_entrytime);
+ if (start.tv64 == 0)
+ /* Other CPU is updating the count.
+ * We don't know whether fetched count is valid.
+ */
+ goto again;
+ delta = ktime_sub(now, start);
+ count = ktime_add(count, delta);
}

If we don't see the sentinel, we adjust "count" and we are done.

I will send the full patches shortly.

Thoughts?


2014-04-03 07:03:42

by Hidetoshi Seto

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

(2014/04/03 4:35), Denys Vlasenko wrote:
> On Mon, Mar 31, 2014 at 4:08 AM, Hidetoshi Seto
> <[email protected]> wrote:
>> There are 2 problems:
>>
>> [PROBLEM 1]: there is no exclusive control.
>>
>> It is easy to understand that there are 2 different cpu - an
>> observing cpu where running a program observing idle cpu's stat
>> and an observed cpu where performing idle. It means race can
>> happen if observed cpu wakes up while it is observed. Observer
>> can accidentally add calculated duration time (say delta) to
>> time stats which is just updated by woken cpu. Soon correct
>> idle time is returned in next turn, so it will result in
>> backward time. Therefore readers must be excluded by writer.
>>
>> Then time stats are updated by woken cpu so there are only one
>> writer, right? No, unfortunately no. I'm not sure why are they,
>> but in some reason the function get_cpu_{idle,iowait}_time_us()
>> has ability to update sleeping cpu's time stats from observing
>> cpu. From grep of today's kernel tree, this feature is used by
>> cpufreq module. Calling this function with this feature in
>> periodically manner works like emulating tick for sleeping cpu.
>
> Frederic's patches started by moving all updates
> to tick_nohz_stop_idle(), making the above problem easier -
> get_cpu_{idle,iowait}_time_us() are pure readers.
>
> The patches are here:
>
> https://lkml.org/lkml/2013/10/19/86

My concern here was that get_cpu_{idle,iowait}_time_us() are
exported functions, so removing the update functionality from
them affects kernel ABI compatibility. Even though I guess
there are no users other than the known cpufreq and its kin, I
also thought it looked like a shotgun approach and rough stuff.

However, I now realize that this is an old mindset from when the
kernel had Documentation/feature-removal.txt ... so I'm OK with
removing the updates from these functions.

Still, I'd prefer to see this change in a separate patch that
modifies get_cpu_{idle,iowait}_time_us() only. It should
be CCed to and acked by the cpufreq people.

>> [PROBLEM 2]: broken iowait accounting.
>>
>> As historical nature, cpu's idle time was accounted as either
>> idle or iowait depending on the presence of tasks blocked by
>> I/O. No one complain about it for a long time. However:
>>
>> > Still trying to wrap my head around it, but conceptually
>> > get_cpu_iowait_time_us() doesn't make any kind of sense.
>> > iowait isn't per cpu since effectively tasks that aren't
>> > running aren't assigned a cpu (as Oleg already pointed out).
>> -- Peter Zijlstra
>>
>> Now some kernel folks realized that accounting iowait as per-cpu
>> does not make sense in SMP world. When we were in traditional
>> UP era, cpu is considered to be waiting I/O if it is idle while
>> nr_iowait > 0. But in these days with SMP systems, tasks easily
>> migrate from a cpu where they issued an I/O to another cpu where
>> they are queued after I/O completion.
>
> However, if we would put ourselves into admin's seat, iowait
> immediately starts to make sense: for admin, the system state
> where a lot of CPU time is genuinely idle is qualitatively different
> from the state where a lot of CPU time is "idle" because
> we are I/O bound.
>
> Admins probably wouldn't insist that iowait accounting must be
> very accurate. I would hazard to guess that admins would settle
> for the following rules:
>
> * (idle + iowait) should accurately represent amount of time
> CPUs were idle.
> * both idle and iowait should never go backwards
> * when system is truly idle, only idle should increase
> * when system is truly I/O bound on all CPUs, only iowait should increase
> * when the situation is in between of the above two cases,
> both iowait and idle counters should grow. It's ok if they
> represent idle/IO-bound ratio only approximately

Yep. Admins are at the mercy of the iowait value, though they know
it is not accurate.

Assume there are tasks X, Y, Z (X issues I/O, Y sleeps moderately,
and Z has low priority):

Case 1:
cpu A: <--run X--><--iowait--><--run X--><--iowait--><--run X ...
cpu B: <---run Y--><--run Z--><--run Y--><--run Z--><--run Y ...
io: <-- io X --> <-- io X -->

Case 2:
cpu A: <--run X--><--run Z---><--run X--><--run Z---><--run X ...
cpu B: <---run Y---><--idle--><---run Y---><--idle--><--run Y ...
io: <-- io X --> <-- io X -->

So case 1 tends toward iowait while case 2 is idle, despite
almost identical workloads. Then what should admins do...?
(How silly! :-p)

>> Back to NO_HZ mechanism. Totally terrible thing here is that
>> observer need to handle duration "delta" without knowing that
>> nr_iowait of sleeping cpu can be changed easily by migration
>> even if cpu is sleeping.
>
> How about the following: when CPU enters idle, it remembers
> in struct tick_sched->idle_active whether it was "truly" idle
> or I/O bound: something like
>
> ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
>
> Then, when we exit idle, we account entire idle period as
> "true" idle or as iowait depending on ts->idle_active value,
> regardless of what happened to I/O bound task (whether
> it migrated or not).

That will not be acceptable. A CPU can sleep for a significantly
long time after all I/O bound tasks have migrated away. e.g.:

cpu A: <-run X-><-- iowait ---... (few days) ...--><-run Z ..
cpu B: <----run Y------><-run X->..
io: <-io X->

Since NO_HZ fits well on systems that want to keep CPUs
sleeping to save battery/power, situations like the above are
likely common. You'll see amazing iowait on an idle system,
and also see iowait increasing despite no tasks waiting on I/O...

>> [WHAT THIS PATCH PROPOSED]: fix problem 1 first.
>>
>> To fix problem 1, this patch adds seqlock for NO_HZ idle
>> accounting to avoid possible races between multiple reader/writer.
>
> That's what Frederic proposed too.
> However, he thought it adds too much overhead. It adds
> a new field, two memory barriers and a RMW cycle
> in the updater (tick_nohz_stop_idle),
> and similarly two memory barriers in readers too.
>
> How about a slightly different approach.
> We take his first two patches:
>
> "nohz: Convert a few places to use local per cpu accesses"
> "nohz: Only update sleeptime stats locally"
>
> then on top of that we fix racy access by readers as follows:
>
> updater does not need ts->idle_entrytime field
> after it is done. We can reuse it as "updater in progress" flag.
> We set it to a sentinel value (say, zero),
> then increase idle or iowait, then clear ts->idle_active.
> With two memory barriers to ensure other CPUs see
> updates exactly in that order:
>
> tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now):
> ...
> /* Updates the per cpu time idle statistics counters */
> delta = ktime_sub(now, ts->idle_entrytime);
> - if (nr_iowait_cpu(smp_processor_id()) > 0)
> + ts->idle_entrytime.tv64 = 0;
> + smp_wmb();
> + if (ts->idle_active == 2)
> ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> else
> ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> + smp_wmb();
> ts->idle_active = 0;
>

I'll just note here that "delta" might be big - such as days.

> Compared to seqlock approach, we don't need a new field (seqlock counter)
> and don't need to increment it twice (two RMW cycles). We have the same
> number of wmb's as seqlock approach has - two.
>
> The iowait reader reads these three fields in reverse order,
> with correct read barriers:
>
> get_cpu_iowait_time_us(int cpu, u64 *last_update_time):
> ...
> + again:
> + active = ACCESS_ONCE(ts->idle_active);
> + smp_rmb();
> + count = ACCESS_ONCE(ts->iowait_sleeptime);
> + if (active == 2) { // 2 means "idle was entered with pending I/O"
> + smp_rmb();
> + start = ACCESS_ONCE(ts->idle_entrytime);
> ....
> + }
> + return count;
>
> Compared to seqlock approach, we can avoid second rmb (as shown above)
> if we see that idle_active was not set: we know that in this case,
> we already have the result in "count", no need to correct it.
>
> Now, if idle_active was set (for iowait case, to 2). We can check the
> sentinel in idle_entrytime. If it is there, we are racing with updater.
> in this case, we loop back:
>
> + if (active == 2) {
> + smp_rmb();
> + start = ACCESS_ONCE(ts->idle_entrytime);
> + if (start.tv64 == 0)
> + /* Other CPU is updating the count.
> + * We don't know whether fetched count is valid.
> + */
> + goto again;
> + delta = ktime_sub(now, start);
> + count = ktime_add(count, delta);
> }
>
> If we don't see sentinel, we adjust "count" and we are done.
>
> I will send the full patches shortly.
>
> Thoughts?

I agree with removing get_cpu_{idle,iowait}_time_us() (or marking
them as obsolete), with some conditions.

However, your approach to making them "pure readers" fails for
the reason above. Thank you for providing a good counter-design!

Thanks,
H.Seto

2014-04-03 09:51:55

by Denys Vlasenko

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Thu, Apr 3, 2014 at 9:02 AM, Hidetoshi Seto
<[email protected]> wrote:
>>> [PROBLEM 2]: broken iowait accounting.
>>>
>>> As historical nature, cpu's idle time was accounted as either
>>> idle or iowait depending on the presence of tasks blocked by
>>> I/O. No one complain about it for a long time. However:
>>>
>>> > Still trying to wrap my head around it, but conceptually
>>> > get_cpu_iowait_time_us() doesn't make any kind of sense.
>>> > iowait isn't per cpu since effectively tasks that aren't
>>> > running aren't assigned a cpu (as Oleg already pointed out).
>>> -- Peter Zijlstra
>>>
>>> Now some kernel folks realized that accounting iowait as per-cpu
>>> does not make sense in SMP world. When we were in traditional
>>> UP era, cpu is considered to be waiting I/O if it is idle while
>>> nr_iowait > 0. But in these days with SMP systems, tasks easily
>>> migrate from a cpu where they issued an I/O to another cpu where
>>> they are queued after I/O completion.
>>
>> However, if we would put ourselves into admin's seat, iowait
>> immediately starts to make sense: for admin, the system state
>> where a lot of CPU time is genuinely idle is qualitatively different
>> from the state where a lot of CPU time is "idle" because
>> we are I/O bound.
>>
>> Admins probably wouldn't insist that iowait accounting must be
>> very accurate. I would hazard to guess that admins would settle
>> for the following rules:
>>
>> * (idle + iowait) should accurately represent amount of time
>> CPUs were idle.
>> * both idle and iowait should never go backwards
>> * when system is truly idle, only idle should increase
>> * when system is truly I/O bound on all CPUs, only iowait should increase
>> * when the situation is in between of the above two cases,
>> both iowait and idle counters should grow. It's ok if they
>> represent idle/IO-bound ratio only approximately
>
> Yep. Admins are at the mercy of iowait value, though they know it
> is not accurate.
>
> Assume there are task X,Y,Z (X issues io, Y sleeps moderately,
> and Z has low priority):
>
> Case 1:
> cpu A: <--run X--><--iowait--><--run X--><--iowait--><--run X ...
> cpu B: <---run Y--><--run Z--><--run Y--><--run Z--><--run Y ...
> io: <-- io X --> <-- io X -->
>
> Case 2:
> cpu A: <--run X--><--run Z---><--run X--><--run Z---><--run X ...
> cpu B: <---run Y---><--idle--><---run Y---><--idle--><--run Y ...
> io: <-- io X --> <-- io X -->
>
> So case 1 tend to be iowait while case 2 is idle, despite
> almost same workloads. Then what should admins do...?

This happens with current code too, right?
No regression then.

>>> Back to NO_HZ mechanism. Totally terrible thing here is that
>>> observer need to handle duration "delta" without knowing that
>>> nr_iowait of sleeping cpu can be changed easily by migration
>>> even if cpu is sleeping.
>>
>> How about the following: when CPU enters idle, it remembers
>> in struct tick_sched->idle_active whether it was "truly" idle
>> or I/O bound: something like
>>
>> ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
>>
>> Then, when we exit idle, we account entire idle period as
>> "true" idle or as iowait depending on ts->idle_active value,
>> regardless of what happened to I/O bound task (whether
>> it migrated or not).
>
> It will not be acceptable. CPU can sleep significantly long
> time after all I/O bound tasks are migrated. e.g.:
>
> cpu A: <-run X-><-- iowait ---... (few days) ...--><-run Z ..
> cpu B: <----run Y------><-run X->..
> io: <-io X->

Do tasks migrate from an *idle* CPU? If yes, why?
Since the task's CPU is idle (i.e. immediately available
for it to be scheduled on),
I would imagine an IO-blocked task normally stays
on its CPU's rq if that CPU is idle.

> I agree on removing get_cpu_{idle,iowait}_time_us() (or marking
> it as obsolete) with some conditions.

Er?
My proposal does not eliminate or change
get_cpu_{idle,iowait}_time_us() API.

2014-04-04 02:48:19

by Hidetoshi Seto

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

(2014/04/03 18:51), Denys Vlasenko wrote:
> On Thu, Apr 3, 2014 at 9:02 AM, Hidetoshi Seto
> <[email protected]> wrote:
>>>> [PROBLEM 2]: broken iowait accounting.
>>>>
>>>> As historical nature, cpu's idle time was accounted as either
>>>> idle or iowait depending on the presence of tasks blocked by
>>>> I/O. No one complain about it for a long time. However:
>>>>
>>>> > Still trying to wrap my head around it, but conceptually
>>>> > get_cpu_iowait_time_us() doesn't make any kind of sense.
>>>> > iowait isn't per cpu since effectively tasks that aren't
>>>> > running aren't assigned a cpu (as Oleg already pointed out).
>>>> -- Peter Zijlstra
>>>>
>>>> Now some kernel folks realized that accounting iowait as per-cpu
>>>> does not make sense in SMP world. When we were in traditional
>>>> UP era, cpu is considered to be waiting I/O if it is idle while
>>>> nr_iowait > 0. But in these days with SMP systems, tasks easily
>>>> migrate from a cpu where they issued an I/O to another cpu where
>>>> they are queued after I/O completion.
>>>
>>> However, if we would put ourselves into admin's seat, iowait
>>> immediately starts to make sense: for admin, the system state
>>> where a lot of CPU time is genuinely idle is qualitatively different
>>> from the state where a lot of CPU time is "idle" because
>>> we are I/O bound.
>>>
>>> Admins probably wouldn't insist that iowait accounting must be
>>> very accurate. I would hazard to guess that admins would settle
>>> for the following rules:
>>>
>>> * (idle + iowait) should accurately represent amount of time
>>> CPUs were idle.
>>> * both idle and iowait should never go backwards
>>> * when system is truly idle, only idle should increase
>>> * when system is truly I/O bound on all CPUs, only iowait should increase
>>> * when the situation is in between of the above two cases,
>>> both iowait and idle counters should grow. It's ok if they
>>> represent idle/IO-bound ratio only approximately
>>
>> Yep. Admins are at the mercy of iowait value, though they know it
>> is not accurate.
>>
>> Assume there are task X,Y,Z (X issues io, Y sleeps moderately,
>> and Z has low priority):
>>
>> Case 1:
>> cpu A: <--run X--><--iowait--><--run X--><--iowait--><--run X ...
>> cpu B: <---run Y--><--run Z--><--run Y--><--run Z--><--run Y ...
>> io: <-- io X --> <-- io X -->
>>
>> Case 2:
>> cpu A: <--run X--><--run Z---><--run X--><--run Z---><--run X ...
>> cpu B: <---run Y---><--idle--><---run Y---><--idle--><--run Y ...
>> io: <-- io X --> <-- io X -->
>>
>> So case 1 tend to be iowait while case 2 is idle, despite
>> almost same workloads. Then what should admins do...?
>
> This happens with current code too, right?
> No regression then.

Yes, problem 2 is not a regression. As I stated in the first
place, it is a fundamental problem of the current iowait
accounting. And my patch set does not aim at problem 2.

>>>> Back to NO_HZ mechanism. Totally terrible thing here is that
>>>> observer need to handle duration "delta" without knowing that
>>>> nr_iowait of sleeping cpu can be changed easily by migration
>>>> even if cpu is sleeping.
>>>
>>> How about the following: when CPU enters idle, it remembers
>>> in struct tick_sched->idle_active whether it was "truly" idle
>>> or I/O bound: something like
>>>
>>> ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
>>>
>>> Then, when we exit idle, we account entire idle period as
>>> "true" idle or as iowait depending on ts->idle_active value,
>>> regardless of what happened to I/O bound task (whether
>>> it migrated or not).
>>
>> It will not be acceptable. CPU can sleep significantly long
>> time after all I/O bound tasks are migrated. e.g.:
>>
>> cpu A: <-run X-><-- iowait ---... (few days) ...--><-run Z ..
>> cpu B: <----run Y------><-run X->..
>> io: <-io X->
>
> Does task migrate from an *idle* CPU? If yes, why?
> Since its CPU is idle (i.e. immediately available
> for it to be scheduled on),
> I would imagine normally IO-blocked task stays
> on its CPU's rq if it is idle.

I found an answer from Peter Zijlstra in the following thread:
[PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
https://lkml.org/lkml/2013/8/16/274

(Sorry, I could not reach lkml.org today due to some network
error, so I could not get a direct link to the reply. I hope
you can find it in the thread starting from the link above.
I quote the important part instead.)

<quote>
> Option B:
>
>> Or we can live with that and still account the whole idle time slept until
>> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
>> All we need is just a new field in ts-> that records on which state we entered
>> idle.
>>
>> What do you think?
>
> I think option B is unworkable. Afaict it could basically caused
> unlimited iowait time. Suppose we have a load-balancer that tries it
> bestestest to sort-left (ie. run a task on the lowest 'free' cpu
> possible) -- the power aware folks are pondering such schemes.
</quote>

Another answer: we cannot stop users from applying cpusets
(= forcing migration by hand) to a task which is waiting on io.

>> I agree on removing get_cpu_{idle,iowait}_time_us() (or marking
>> it as obsolete) with some conditions.
>
> Er?
> My proposal does not eliminate or change
> get_cpu_{idle,iowait}_time_us() API.

Sorry for the confusion.
Let me restate my previous comment in better words.

First, it was my fault to call your proposal an "API change".
It does not change the number/type of the functions' arguments,
etc. I should have said "semantics change" for "removing the
update functionality".

<source kernel/time/tick-sched.c>
> /**
> * get_cpu_idle_time_us - get the total idle time of a cpu
> * @cpu: CPU number to query
> * @last_update_time: variable to store update time in. Do not update
> * counters if NULL.
</source>

Second, removing these functions completely was only my opinion.
You did not mention it.

So the revised comment would be:
- I agree with removing the update functionality from
get_cpu_{idle,iowait}_time_us() if it is an acceptable
semantics change for the cpufreq people.
- By the way, IMHO we could remove these functions
completely. (Or, if required, mark them as obsolete for
a certain period.)
- Anyway, such a change could be a single patch separated
from the current patch set.

Thanks,
H.Seto

2014-04-04 16:03:38

by Frederic Weisbecker

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

Hi Guys,

You and Hidetoshi have sent a few patches with very detailed
changelogs, and it's going to be hard to synthesize them. So my
reviews are going to be a bit chaotic; sorry for that in advance.

On Wed, Apr 02, 2014 at 09:35:47PM +0200, Denys Vlasenko wrote:
> On Mon, Mar 31, 2014 at 4:08 AM, Hidetoshi Seto
> <[email protected]> wrote:
> > There are 2 problems:
> >
> > [PROBLEM 1]: there is no exclusive control.
> >
> > It is easy to understand that there are 2 different cpu - an
> > observing cpu where running a program observing idle cpu's stat
> > and an observed cpu where performing idle. It means race can
> > happen if observed cpu wakes up while it is observed. Observer
> > can accidentally add calculated duration time (say delta) to
> > time stats which is just updated by woken cpu. Soon correct
> > idle time is returned in next turn, so it will result in
> > backward time. Therefore readers must be excluded by writer.
> >
> > Then time stats are updated by woken cpu so there are only one
> > writer, right? No, unfortunately no. I'm not sure why are they,
> > but in some reason the function get_cpu_{idle,iowait}_time_us()
> > has ability to update sleeping cpu's time stats from observing
> > cpu. From grep of today's kernel tree, this feature is used by
> > cpufreq module. Calling this function with this feature in
> > periodically manner works like emulating tick for sleeping cpu.
>
> Frederic's patches started by moving all updates
> > to tick_nohz_stop_idle(), making the above problem easier -
> get_cpu_{idle,iowait}_time_us() are pure readers.
>
> The patches are here:
>
> https://lkml.org/lkml/2013/10/19/86

Yeah, definitely we should limit the updates to local updates
from idle when possible.

get_cpu_{idle,iowait}_time_us() should then read with a seqcount
to deduce the right delta since the last update.

>
> > [PROBLEM 2]: broken iowait accounting.
> >
> > As historical nature, cpu's idle time was accounted as either
> > idle or iowait depending on the presence of tasks blocked by
> > I/O. No one complain about it for a long time. However:
> >
> > > Still trying to wrap my head around it, but conceptually
> > > get_cpu_iowait_time_us() doesn't make any kind of sense.
> > > iowait isn't per cpu since effectively tasks that aren't
> > > running aren't assigned a cpu (as Oleg already pointed out).
> > -- Peter Zijlstra
> >
> > Now some kernel folks realized that accounting iowait as per-cpu
> > does not make sense in SMP world. When we were in traditional
> > UP era, cpu is considered to be waiting I/O if it is idle while
> > nr_iowait > 0. But in these days with SMP systems, tasks easily
> > migrate from a cpu where they issued an I/O to another cpu where
> > they are queued after I/O completion.
>
> However, if we would put ourselves into admin's seat, iowait
> immediately starts to make sense: for admin, the system state
> where a lot of CPU time is genuinely idle is qualitatively different
> from the state where a lot of CPU time is "idle" because
> we are I/O bound.
>
> Admins probably wouldn't insist that iowait accounting must be
> very accurate. I would hazard to guess that admins would settle
> for the following rules:

Iowait makes sense, but not per cpu. Ultimately it's a global
stat. Or a per-task one.

I've scratched my head a lot on this. And I think we can't continue
with the current semantics. If we had to keep the current semantics
and enforce correctness at the same time, we would run into
big scalability and performance issues. It can't be done without
locking updates to nr_iowait() with a seqlock:

* when a task blocks on IO and the CPU goes idle: lock some per cpu
iowait_seq, increase nr_iowait, save the current CPU number, save the time.

* when the task's IO completes and it gets enqueued on another CPU:
retrieve the old CPU, lock its iowait_seq, decrease nr_iowait, flush the
iowait delta.

And all that just to maintain stats whose semantics are wrong; this
would be pure madness.

OTOH we must stay compatible with the user ABI in /proc/stat (the one in
/proc/timers_list matters less). But if we make it a per-task stat, we
are free to account it on whichever CPU we want.

So what we can do, for example, is account it per task and update the
stats on the CPU where the blocking task wakes up. This way we guarantee
that we only account locally, which is good for scalability.

This is going to be an ABI change in the semantics of a /proc/stat field.
We usually cannot do that, as it can break userspace. But I think
we have a reasonable exception here:

1) From a performance POV we don't have the choice.

2) It has always been a buggy stat on SMP. Given the possibly fast iowait
update rate, I doubt it has ever dumped correct stats. So I guess few
user apps have ever correctly relied on it.

It also decouples iowait from idle time: running time gets accounted
as iowait too. But then again, since an iowait task is not attached to
any CPU, accounting iowait time only when a specific CPU is idle doesn't
make sense.

Oh, and the kernel API is not a problem either. Only cpufreq uses it,
IIRC. And I've been told it's only for a few old Intel CPUs. Given how
buggy the stat is, I doubt it ever worked correctly.

Anyway, the first step will be to remove the cpufreq use of this API.

Thanks.

2014-04-04 17:03:06

by Denys Vlasenko

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Fri, Apr 4, 2014 at 6:03 PM, Frederic Weisbecker <[email protected]> wrote:
>> However, if we would put ourselves into admin's seat, iowait
>> immediately starts to make sense: for admin, the system state
>> where a lot of CPU time is genuinely idle is qualitatively different
>> from the state where a lot of CPU time is "idle" because
>> we are I/O bound.
>>
>> Admins probably wouldn't insist that iowait accounting must be
>> very accurate. I would hazard to guess that admins would settle
>> for the following rules:
>
> Iowait makes sense but not per cpu. Eventually it's a global
> stat. Or per task.

There are a lot of situations where admins want to know
how much, on average, their CPUs are idle because
they are waiting for IO.

If you are running, say, a Web cache,
you need to know that stat in order to be able to
conjecture "looks like I'm IO bound, perhaps caching
some data in RAM will speed it up".

A global stat will give such data to the admin. A per-task stat won't:
there can be an efficient Web cache design which uses
many parallel tasks to hide IO latency. Such a
Web cache can be nearly optimal despite each of its tasks,
individually, having a significant iowait count.

> I've scratched my head a lot on this. And I think we can't continue
> with the current semantics. If we had to keep the current semantics
> and enforce correctness at the same time, we are going to run into
> big scalability and performance issues. This can't be done without
> locking updates to nr_iowait() with seqlock:
>
> * when a task blocks on IO and goes idle, lock some per cpu iowait_seq,
> increase nr_iowait, save curr CPU number, save time.
>
> * when a task io completes and it gets enqueued on another CPU: retrieve
> old CPU, lock its iowait_seq, decrease nr_iowait, flush delta iowait .
>
> And all that just to maintain stats which semantics are wrong, this
> would be pure madness.
>
> OTOH we must stay compatible with user ABI in /proc/stat (the one in /proc/timers_list
> matters less). But if we make it a per task stat, we are free to account it
> on the CPU we want.
>
> So what we can do for example is to account it per task and update stats
> on the CPU where the blocking task wakes up. This way we guarantee
> that we only account locally, which is good for scalability.

When an IO-bound task wakes on some CPU,
how exactly do you propose to update the counters -
add the total time this task waited to that CPU's counter?

Is such a counter meaningful for the admin?

> This is going to be an ABI change on a /proc/stat field semantic.
> We usually can not do that as it can break userspace. But I think
> we have a reasonable exception here:
>
> 1) On a performance POV we don't have the choice.
>
> 2) It has always been a buggy stat on SMP. Given the possible fast iowait update
> rate, I doubt it has ever dumped correct stats. So I guess that few user apps
> have ever correctly relied on it.

In busybox project, the following tools use iowait counter:

top,mpstat: in order to show "%iowait"

nmeter: to show "waiting for disk" part of CPU bar.
Example:
$ nmeter '%t %70c'
18:57:33 SUU...................................................................
18:57:34 SUUUUUUUUUUI..........................................................
18:57:35 SUUUII................................................................
18:57:36 SUUU..................................................................
18:57:37 SSUDDD................................................................
(^^^^^^ IO-intensive task starts)
18:57:38 SSSSSSUDDDDDDDDDDDDIi.................................................
18:57:39 SSSSSSSSUDDDDDDDDDi...................................................
18:57:40 SSSSSUUDDDDDDDDDDDDi..................................................
18:57:41 SSSSSUUUUUDDDDDDDDDDDDi...............................................
18:57:42 SSSSSUDDDDDDDDDDDDDIi.................................................
18:57:43 SSSUUDDDDDDDi.........................................................
(^^^^^^ IO-intensive task ends)
18:57:44 SUUUI.................................................................
18:57:45 SUUU..................................................................
18:57:46 UU....................................................................
18:57:47 U.....................................................................

This doesn't look bogus to me.
It does give me information I need to know.

> Also it decouples iowait from idle time. Running time is also accounted
> as iowait.

CPUs being busy while there is IO-wait
is usually not a sign of badly tuned software/system.

Only CPUs being idle while there is IO-wait is.

That's how it looks from userspace.

2014-04-05 10:08:22

by Frederic Weisbecker

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Fri, Apr 04, 2014 at 07:02:43PM +0200, Denys Vlasenko wrote:
> On Fri, Apr 4, 2014 at 6:03 PM, Frederic Weisbecker <[email protected]> wrote:
> >> However, if we would put ourselves into admin's seat, iowait
> >> immediately starts to make sense: for admin, the system state
> >> where a lot of CPU time is genuinely idle is qualitatively different
> >> form the state where a lot of CPU time is "idle" because
> >> we are I/O bound.
> >>
> >> Admins probably wouldn't insist that iowait accounting must be
> >> very accurate. I would hazard to guess that admins would settle
> >> for the following rules:
> >
> > Iowait makes sense but not per cpu. Eventually it's a global
> > stat. Or per task.
>
> There a lot of situations where admins want to know
> how much, on average, their CPUs are idle because
> they wait for IO.
>
> If you are running, say, a Web cache,
> you need to know that stat in order to be able to
> conjecture "looks like I'm IO bound, perhaps caching
> some data in RAM will speed it up".

But then accounting the iowait time waited until completion on the CPU
where the task wakes up should work for that.

Doesn't it?

>
> Global stat will give such data to admin. Per-task won't -
> there can be an efficient Web cache design which uses
> many parallel tasks to hide IO latency. Thus, such
> Web cache can be nearly optimal despite its tasks,
> individually, having significant iowait counts each.

I don't see how that's an issue. Just add up the cumulated iowait among
all these tasks and you get the global iowait time.

Anyway, that's not a problem: the goal is still to display it per CPU.

>
> > I've scratched my head a lot on this. And I think we can't continue
> > with the current semantics. If we had to keep the current semantics
> > and enforce correctness at the same time, we are going to run into
> > big scalability and performance issues. This can't be done without
> > locking updates to nr_iowait() with seqlock:
> >
> > * when a task blocks on IO and goes idle, lock some per cpu iowait_seq,
> > increase nr_iowait, save curr CPU number, save time.
> >
> > * when a task io completes and it gets enqueued on another CPU: retrieve
> > old CPU, lock its iowait_seq, decrease nr_iowait, flush delta iowait .
> >
> > And all that just to maintain stats which semantics are wrong, this
> > would be pure madness.
> >
> > OTOH we must stay compatible with user ABI in /proc/stat (the one in /proc/timers_list
> > matters less). But if we make it a per task stat, we are free to account it
> > on the CPU we want.
> >
> > So what we can do for example is to account it per task and update stats
> > on the CPU where the blocking task wakes up. This way we guarantee
> > that we only account locally, which is good for scalability.
>
> When IO-bound task wakes on some CPU,
> how exactly do you propose to update counters -
> add total waited time of this task to this CPU's counter?

So we save, per task, the time when the task went to sleep. And when it wakes up
we just flush the pending time since that sleep time to the waking CPU:
iowait[$CPU] += NOW() - tsk->sleeptime;

> Is such counter meaningful for the admin?

Well, you get the iowait time accounting.

>
> > This is going to be an ABI change on a /proc/stat field semantic.
> > We usually can not do that as it can break userspace. But I think
> > we have a reasonable exception here:
> >
> > 1) On a performance POV we don't have the choice.
> >
> > 2) It has always been a buggy stat on SMP. Given the possible fast iowait update
> > rate, I doubt it has ever dumped correct stats. So I guess that few user apps
> > have ever correctly relied on it.
>
> In busybox project, the following tools use iowait counter:
>
> top,mpstat: in order to show "%iowait"
>
> nmeter: to show "waiting for disk" part of CPU bar.
> Example:
> $ nmeter '%t %70c'
> 18:57:33 SUU...................................................................
> 18:57:34 SUUUUUUUUUUI..........................................................
> 18:57:35 SUUUII................................................................
> 18:57:36 SUUU..................................................................
> 18:57:37 SSUDDD................................................................
> (^^^^^^ IO-intensive task starts)
> 18:57:38 SSSSSSUDDDDDDDDDDDDIi.................................................
> 18:57:39 SSSSSSSSUDDDDDDDDDi...................................................
> 18:57:40 SSSSSUUDDDDDDDDDDDDi..................................................
> 18:57:41 SSSSSUUUUUDDDDDDDDDDDDi...............................................
> 18:57:42 SSSSSUDDDDDDDDDDDDDIi.................................................
> 18:57:43 SSSUUDDDDDDDi.........................................................
> (^^^^^^ IO-intensive task ends)
> 18:57:44 SUUUI.................................................................
> 18:57:45 SUUU..................................................................
> 18:57:46 UU....................................................................
> 18:57:47 U.....................................................................
>
> This doesn't look bogus to me.
> It does give me information I need to know.

Hmm, but the iowait and idle stats are so racy that I would personally not trust such a
report.

The races are just too likely, and the resulting errors can accumulate
at high frequency. Consider this simple one:

Imagine IO task A sleeps on CPU 0. Then it wakes up elsewhere, runs for a while,
and sleeps again, but not on IO, for, say, 1 minute.
During this time CPU 0 still sleeps.
Task A then does a short IO again and wakes up on CPU 0.
CPU 0 just accounted more than 1 minute of iowait spuriously.

But I also realize that userspace code like this must rely on the fact that idle
time and iowait time are mutually exclusive. So perhaps we can't break the ABI
as easily as I thought.

Also the !NO_HZ behaviour still has the old semantics.

Haha, this interface is such a well-designed nightmare :-(

>
> > Also it decouples iowait from idle time. Running time is also accounted
> > as iowait.
>
> The time when CPUs are busy while there is IO-wait
> are usually not a sign of badly tuned software/system.
>
> Only when CPUs are idle and there is IO-wait is.
>
> That's how it looks from userspace.

That's actually an artificial view that we made for userspace
because its implementation was convenient in !CONFIG_NO_HZ times.

In fact when I look at the history:

* Introduction of the buggy accounting 0224cf4c5ee0d7faec83956b8e21f7d89e3df3bd
* Use it for /proc/stat 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f

We tried to mimic the way we account iowait time in !CONFIG_NO_HZ
configurations where it's easy to distinguish idle/iowait time and to account
the iowait time per blocking CPU source using the periodic tick. Because it's
always there anyway. So we can tick on the blocking CPU source and poll on the
number of tasks that blocked there for the last time until they are
woken up somewhere else. Updates on iowait stats are then always local.

So we did it that way because it was handy to do so. Somehow the implementation
ease influenced the view.

Now with CONFIG_NO_HZ this artificial view doesn't scale anymore.
If we want to mimic the !CONFIG_NO_HZ behaviour and do it correctly, we must
use remote updates instead of local updates helped by polling tick.

Thoughts?

2014-04-05 14:57:17

by Denys Vlasenko

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Sat, Apr 5, 2014 at 12:08 PM, Frederic Weisbecker <[email protected]> wrote:
>> > Iowait makes sense but not per cpu. Eventually it's a global
>> > stat. Or per task.
>>
>> There a lot of situations where admins want to know
>> how much, on average, their CPUs are idle because
>> they wait for IO.
>>
>> If you are running, say, a Web cache,
>> you need to know that stat in order to be able to
>> conjecture "looks like I'm IO bound, perhaps caching
>> some data in RAM will speed it up".
>
> But then accounting iowait time waited until completion on the CPU
> that the task wakes up should work for that.
>
> Doesn't it?

It can easily make iowait count higher than idle count,
or even higher than idle+sys+user+nice count.

IOW, it can show that the system is way more
than 100% IO bound, which doesn't make sense.


> So we save, per task, the time when the task went to sleep. And when it wakes up
> we just flush the pending time since that sleep time to the waking CPU:
> iowait[$CPU] += NOW() - tsk->sleeptime;
>
>> Is such counter meaningful for the admin?
>
> Well, you get the iowait time accounting.

The admin wants to know "how often are my CPUs idle
because they have nothing to do until IO is complete?"

Merely knowing how long tasks waited for IO
doesn't answer that question.

2014-04-07 18:17:48

by Frederic Weisbecker

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Sat, Apr 05, 2014 at 04:56:54PM +0200, Denys Vlasenko wrote:
> On Sat, Apr 5, 2014 at 12:08 PM, Frederic Weisbecker <[email protected]> wrote:
> >> > Iowait makes sense but not per cpu. Eventually it's a global
> >> > stat. Or per task.
> >>
> >> There a lot of situations where admins want to know
> >> how much, on average, their CPUs are idle because
> >> they wait for IO.
> >>
> >> If you are running, say, a Web cache,
> >> you need to know that stat in order to be able to
> >> conjecture "looks like I'm IO bound, perhaps caching
> >> some data in RAM will speed it up".
> >
> > But then accounting iowait time waited until completion on the CPU
> > that the task wakes up should work for that.
> >
> > Doesn't it?
>
> It can easily make iowait count higher than idle count,
> or even higher than idle+sys+user+nice count.
>
> IOW, it can show that the system is way more
> than 100% IO bound, which doesn't make sense.

I'm talking about a semantic change, not just an implementation change.
Of course if we change iowait to "account _all_ iowait time" while it's still
interpreted with the old semantics that only "account iowait & idle time from the
CPU source", we have a problem.

Well, this breakage is inevitable actually. It's just mitigated by the fact
that CONFIG_NO_HZ actually never accounted this iowait correctly.

So, since it's already broken, I think we meet the conditions for an ABI change.


>
>
> > So we save, per task, the time when the task went to sleep. And when it wakes up
> > we just flush the pending time since that sleep time to the waking CPU:
> > iowait[$CPU] += NOW() - tsk->sleeptime;
> >
> >> Is such counter meaningful for the admin?
> >
> > Well, you get the iowait time accounting.
>
> Admin wants to know "how often do I have CPU idled
> because they have nothing to do until IO is complete?"
>
> Merely knowing how much tasks waited for IO
> doesn't answer that question.

If the admin asks this question on a per-CPU basis, it's simply the
wrong question. Because a task waiting on an IO to complete is a
sleeping task. And a sleeping task doesn't belong to any CPU.

It's a purely per-task property that can't be mapped onto the lower
CPU level.

OTOH it can make sense to ask how much time did we wait on IO
and account this on the CPU which either initiated the IO or
ran the task on IO completion. Why not, it can give you an overview
of where these IO things happen most.

But the accounting we are doing today does none of that.
Instead, it accounts the time spent on the CPU which initiated
the IO, and only while that CPU is idle. It is a terrible misconception
that is completely subject to scheduler randomness:

The following example displays all the nonsense of that stat:

CPU 0 CPU 1

task A block on IO ...
task B runs for 1 min ...
task A completes IO

So in the above we've been waiting on IO for 1 minute. But none of that
has been accounted. OTOH if task B were to run on CPU 1 (it could have,
really; here this is about scheduler load-balancing internals, hence pure
randomness for the user), the iowait time would have been accounted.

I doubt that users are interested in such random accounting. They want
to know either:

1) how much time was spent waiting on IO by the whole system
2) how much time was spent waiting on IO per task
3) how much time was spent waiting on IO per CPU that initiated
IOs, or per CPU which ran task completing IOs. In order to have
an overview on where these mostly happened.

Given my practical example above, the current accounting is unable
to reliably report any of this information. Because it is misdesigned
and doesn't account the right thing.

Now, I'm repeating myself: why are we accounting iowait that way, then?
Because it was very convenient from an implementation POV on periodic-tick
systems. But it's certainly not convenient for the users given
this accounting randomness.

Now we run dynticks kernels, mostly. And this accounting is just not
possible anymore. Not if we want to keep our kernel scalable.

Hence I think we must change the semantics of our iowait accounting.
This stat has been broken since the very beginning and is a strong candidate
for a semantic ABI change.

You guys can argue as much as you want. I'll maintain my position.

Unless you find a way to maintain the current accounting semantics while
keeping it scalable and correct, good luck!

2014-04-09 12:50:20

by Denys Vlasenko

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Mon, Apr 7, 2014 at 8:17 PM, Frederic Weisbecker <[email protected]> wrote:
> The following example displays all the nonsense of that stat:
>
> CPU 0 CPU 1
>
> task A block on IO ...
> task B runs for 1 min ...
> task A completes IO
>
> So in the above we've been waiting on IO for 1 minute. But none of that
> have been accounted.

If there is a task B which can put the CPU to use while task A
waits for IO, then *system performance is not IO bound*.


> OTOH if task B were to run on CPU 1 (it could have,
> really here this is about scheduler load balancing internals, hence pure
> randomness for the user), the iowait time would have been accounted.

Case A: overall system stats are: 50% busy, 50% idle
Case B: overall system stats are: 50% busy, 50% iowait

You are right, this does not look correct.

Lets step back and look at the situation from a high-level POV.
I believe I have a solution for this problem.

Let's say we have a heavily loaded file server machine where CPUs
are busy only 5% of the time. It makes sense to say that the machine
as a whole is "95% waiting for IO".

Our existing accounting did exactly that for single-CPU machines.

But for, say, a 2-CPU machine it can show 5% busy, 45% iowait, 50% idle
if there is only one task reading files, or 5% busy, 95% iowait
if there is more than one task.

But that's wrong! NONE of the CPUs are "idle" as long as there is even
one task blocked on IO. The machine is still IO-bound, not idling.
In my example, it should not matter whether the machine has one
CPU or 64; it should show a 5% busy, 95% iowait overall state
in both cases.

Does the above make sense to you?

My proposal is to count each CPU's time towards iowait
if there are task(s) blocked on IO, *regardless of which
runqueue they are on*. Only if there are none is the time
counted towards idle.


> I doubt that users are interested in such random accounting. They want
> to know either:
>
> 1) how much time was spent waiting on IO by the whole system

Hmm, I think I just said the same thing :)

> 2) how much time was spent waiting on IO per task
> 3) how much time was spent waiting on IO per CPU that initiated
> IOs, or per CPU which ran task completing IOs. In order to have
> an overview on where these mostly happened.

Some people may want to know these things, and I am not objecting
to adding whatever counters to help with that.

--
vda

2014-04-09 13:42:38

by Peter Zijlstra

Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Wed, Apr 09, 2014 at 02:49:55PM +0200, Denys Vlasenko wrote:
> > I doubt that users are interested in such random accounting. They want
> > to know either:
> >
> > 1) how much time was spent waiting on IO by the whole system
>
> Hmm, I think I just said the same thing :)
>
> > 2) how much time was spent waiting on IO per task

I think we already have enough cruft to measure that. We have delayacct
measuring this and schedstat also seems to measure this.

> > 3) how much time was spent waiting on IO per CPU that initiated
> > IOs, or per CPU which ran task completing IOs. In order to have
> > an overview on where these mostly happened.

I'm not entirely sure this makes sense in general. It only makes sense
in the specific case where tasks are affine to a single CPU and that CPU
is not part of any balance domain.

I think this is a specific enough case (exclusive cpusets in general)
that we can ignore it. After all, we do not have per cpuset load numbers
at all, so why bother with this one.

At which point I think Denys makes a fair argument.