2014-04-10 09:08:33

by Hidetoshi Seto

Subject: [PATCH v3 0/2] nohz: fix idle accounting in NO_HZ kernels

Hi all,

This patch set (rebased on v3.14) is my 3rd try at fixing an issue
where the idle/iowait values in /proc/stat can go backward. It was
originally reported by Tetsuo and Fernando in March 2013, last year.

This v3 takes a new approach to the problem: it now uses a seqcount,
while v1 and v2 used a seqlock.

Of course, reviews are still welcome!

Thanks,
H.Seto

Hidetoshi Seto (2):
nohz: stop updating sleep stats from get_cpu_{idle,iowait}_time_us()
nohz: use delayed iowait accounting to avoid race on idle time stats

include/linux/tick.h | 6 ++-
kernel/time/tick-sched.c | 116 ++++++++++++++++++++++++++++------------------
2 files changed, 74 insertions(+), 48 deletions(-)


2014-04-10 09:11:34

by Hidetoshi Seto

Subject: [PATCH 1/2] nohz: stop updating sleep stats from get_cpu_{idle,iowait}_time_us()

These functions can easily race because multiple callers can write the
data without any exclusive locking. To keep the updates local, remove
the update functionality from these functions.

Now that there is no other way to reach update_ts_time_stats(), fold
this static routine into tick_nohz_stop_idle().

Signed-off-by: Hidetoshi Seto <[email protected]>
Cc: Fernando Luis Vazquez Cao <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Preeti U Murthy <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: <[email protected]>
---
include/linux/tick.h | 4 +-
kernel/time/tick-sched.c | 76 +++++++++++++++++-----------------------------
2 files changed, 30 insertions(+), 50 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..70a69d7 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -137,8 +137,8 @@ extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
extern ktime_t tick_nohz_get_sleep_length(void);
-extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
-extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_idle_time_us(int cpu, u64 *wall);
+extern u64 get_cpu_iowait_time_us(int cpu, u64 *wall);

# else /* !CONFIG_NO_HZ_COMMON */
static inline int tick_nohz_tick_stopped(void)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..3887a05 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -403,31 +403,17 @@ static void tick_nohz_update_jiffies(ktime_t now)
touch_softlockup_watchdog();
}

-/*
- * Updates the per cpu time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
ktime_t delta;

- if (ts->idle_active) {
- delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(cpu) > 0)
- ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
- else
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
- ts->idle_entrytime = now;
- }
-
- if (last_update_time)
- *last_update_time = ktime_to_us(now);
-
-}
-
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
- update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+ /* Updates the per cpu time idle statistics counters */
+ delta = ktime_sub(now, ts->idle_entrytime);
+ if (nr_iowait_cpu(smp_processor_id()) > 0)
+ ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+ else
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+ ts->idle_entrytime = now;
ts->idle_active = 0;

sched_clock_idle_wakeup_event(0);
@@ -446,8 +432,7 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
/**
* get_cpu_idle_time_us - get the total idle time of a cpu
* @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
+ * @wall: variable to store current wall time in.
*
* Return the cummulative idle time (since boot) for a given
* CPU, in microseconds.
@@ -457,7 +442,7 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
*
* This function returns -1 if NOHZ is not enabled.
*/
-u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
+u64 get_cpu_idle_time_us(int cpu, u64 *wall)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, idle;
@@ -466,17 +451,15 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
return -1;

now = ktime_get();
- if (last_update_time) {
- update_ts_time_stats(cpu, ts, now, last_update_time);
- idle = ts->idle_sleeptime;
- } else {
- if (ts->idle_active && !nr_iowait_cpu(cpu)) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ if (wall)
+ *wall = ktime_to_us(now);

- idle = ktime_add(ts->idle_sleeptime, delta);
- } else {
- idle = ts->idle_sleeptime;
- }
+ if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+
+ idle = ktime_add(ts->idle_sleeptime, delta);
+ } else {
+ idle = ts->idle_sleeptime;
}

return ktime_to_us(idle);
@@ -487,8 +470,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
/**
* get_cpu_iowait_time_us - get the total iowait time of a cpu
* @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
+ * @wall: variable to store current wall time in.
*
* Return the cummulative iowait time (since boot) for a given
* CPU, in microseconds.
@@ -498,7 +480,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
*
* This function returns -1 if NOHZ is not enabled.
*/
-u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
+u64 get_cpu_iowait_time_us(int cpu, u64 *wall)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, iowait;
@@ -507,17 +489,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
return -1;

now = ktime_get();
- if (last_update_time) {
- update_ts_time_stats(cpu, ts, now, last_update_time);
- iowait = ts->iowait_sleeptime;
- } else {
- if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ if (wall)
+ *wall = ktime_to_us(now);

- iowait = ktime_add(ts->iowait_sleeptime, delta);
- } else {
- iowait = ts->iowait_sleeptime;
- }
+ if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+
+ iowait = ktime_add(ts->iowait_sleeptime, delta);
+ } else {
+ iowait = ts->iowait_sleeptime;
}

return ktime_to_us(iowait);
--
1.7.1

2014-04-10 09:14:32

by Hidetoshi Seto

Subject: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

This patch is v3 of the patch set to fix an issue where the idle/iowait
values in /proc/stat can go backward. It was originally reported by
Tetsuo and Fernando in March 2013, last year.

[BACKGROUNDS]: idle accounting on NO_HZ

If NO_HZ is enabled, a cpu stops its tick interrupt before going to
sleep to be idle. This means that the time stats of the sleeping cpu
will not be updated by the tick interrupt. Instead, when the cpu wakes
up, it updates its time stats by calculating the idle duration from the
timestamp taken at idle entry and the current time taken at idle exit.

OTOH, there can be observers who want to know how long the sleeping
cpu has been idle. Imagine that userland runs the top command or an
application that reads /proc/stat. For them, the kernel provides the
get_cpu_{idle,iowait}_time_us() functions to obtain the current idle
time stats of such a sleeping cpu. These functions read the time stats
and the timestamp taken at idle entry, and then return the current
idle time by adding the duration calculated from that timestamp and
the current time.
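
In rough pseudo-C, the observer side looks like this (a simplified
sketch of that logic; get_idle_time_us is a placeholder name standing
in for get_cpu_{idle,iowait}_time_us(), and the iowait variant is
omitted):

	u64 get_idle_time_us(int cpu)
	{
		struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
		ktime_t now = ktime_get();
		ktime_t idle = ts->idle_sleeptime;

		/* cpu is sleeping: extrapolate up to the current time */
		if (ts->idle_active)
			idle = ktime_add(idle,
					 ktime_sub(now, ts->idle_entrytime));

		return ktime_to_us(idle);
	}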

There are 2 problems:

[PROBLEM 1]: there is no exclusive control.

It is easy to see that 2 different cpus are involved - an observing
cpu, running a program that observes the idle cpu's stats, and an
observed cpu, which is performing idle. A race can happen if the
observed cpu wakes up while it is being observed: the observer can
accidentally add its calculated duration (say delta) on top of time
stats that have just been updated by the woken cpu. The correct idle
time is then returned on the next read, so the value appears to go
backward. Therefore readers must be excluded by the writer.
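
For example (made-up numbers; one possible interleaving with the
code before this series, where readers write the stats back too):

given:
idle time stats: idle=1000
stamp at idle entry: entry=50
cpu is sleeping (idle_active=1)

reader's query @ now = 60, racing with the cpu waking up @ 58:
reader: reads entry=50, writes and returns idle=1010 (=1000+(60-50))
woken cpu: had already read idle=1000 and entry=50 before the
reader's write landed, so it then writes idle=1008 (=1000+(58-50))

next reader's query (cpu now awake):
idle=1008

The reported idle time went backward from 1010 to 1008.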

The former patch changes get_cpu_{idle,iowait}_time_us() so that
they no longer update a sleeping cpu's time stats from the observing
cpu. Time stats are now updated by the woken cpu only, so there is
only one writer now!

In summary, there are races between one writer and multiple readers,
but no exclusive control over this idle time stats dataset.

[PROBLEM 2]: broken iowait accounting.

Historically, a cpu's idle time has been accounted as either idle or
iowait depending on the presence of tasks blocked on I/O. No one
complained about this for a long time. However:

> Still trying to wrap my head around it, but conceptually
> get_cpu_iowait_time_us() doesn't make any kind of sense.
> iowait isn't per cpu since effectively tasks that aren't
> running aren't assigned a cpu (as Oleg already pointed out).
-- Peter Zijlstra

Some kernel folks have now realized that accounting iowait per cpu
does not make sense in the SMP world. In the traditional UP era, a
cpu was considered to be waiting on I/O if it was idle while
nr_iowait > 0. But on today's SMP systems, tasks easily migrate from
the cpu where they issued an I/O to another cpu where they are queued
after I/O completion.

Back to the NO_HZ mechanism. The really terrible thing here is that
the observer has to handle the duration "delta" without knowing that
the sleeping cpu's nr_iowait can easily be changed by migration even
while the cpu is sleeping. So the following can happen:

given:
idle time stats: idle=1000, iowait=100
stamp at idle entry: entry=50
nr tasks waiting io: nr_iowait=1

the observer temporarily assigns delta to iowait at first
(but does not account delta to the time stats):
1st reader's query @ now = 60:
idle=1000
iowait=110 (=100+(60-50))

then all the blocked tasks are migrated away:
nr_iowait=0

and finally, on the 2nd read, the observer assigns delta to idle:
2nd reader's query @ now = 70:
idle=1020 (=1000+(70-50))
iowait=100

You can see that iowait decreased from 110 to 100.

In summary, iowait accounting has a fundamental problem and needs to
be reworked properly. That implies some user interfaces might be
replaced completely. It will take a long time to solve, so a
workaround for compatibility would be appreciated.

[WHAT THIS PATCH PROPOSED]:

To fix problem 1, this patch adds a seqcount to the NO_HZ idle
accounting to avoid possible races between readers and the writer.

And to cope with problem 2, it introduces delayed iowait accounting
to get an approximate value without turning observers into writers.
Refer to the comment in the patch for details.

References:
First report from Fernando:
[RFC] iowait/idle time accounting hiccups in NOHZ kernels
https://lkml.org/lkml/2013/3/18/962
Steps to reproduce guided by Tetsuo:
https://lkml.org/lkml/2013/4/1/201

1st patchset from Frederic:
[PATCH 0/4] nohz: Fix racy sleeptime stats
https://lkml.org/lkml/2013/8/8/638
[PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
https://lkml.org/lkml/2013/8/16/274

2nd patchset from Frederic:
[RFC PATCH 0/5] nohz: Fix racy sleeptime stats v2
https://lkml.org/lkml/2013/10/19/86

My previous patch set:
[PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels
https://lkml.org/lkml/2014/3/23/256
[PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels
https://lkml.org/lkml/2014/3/30/315

v3: use seqcount instead of seqlock
(achieved by splitting the cleanup out as the former patch)
plus introduce delayed iowait accounting

v2: update comments and description about problem 2.
include a fix for a minor typo

Signed-off-by: Hidetoshi Seto <[email protected]>
Reported-by: Fernando Luis Vazquez Cao <[email protected]>
Reported-by: Tetsuo Handa <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Preeti U Murthy <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: <[email protected]>
---
include/linux/tick.h | 2 +
kernel/time/tick-sched.c | 82 +++++++++++++++++++++++++++++++++++-----------
2 files changed, 65 insertions(+), 19 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 70a69d7..cec32e4 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -62,11 +62,13 @@ struct tick_sched {
unsigned long idle_calls;
unsigned long idle_sleeps;
int idle_active;
+ seqcount_t idle_sleeptime_seq;
ktime_t idle_entrytime;
ktime_t idle_waketime;
ktime_t idle_exittime;
ktime_t idle_sleeptime;
ktime_t iowait_sleeptime;
+ ktime_t iowait_pending;
ktime_t sleep_length;
unsigned long last_jiffies;
unsigned long next_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3887a05..03bc1c0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -407,15 +407,42 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
ktime_t delta;

+ write_seqcount_begin(&ts->idle_sleeptime_seq);
+
/* Updates the per cpu time idle statistics counters */
delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
+
+ /*
+ * Perform delayed iowait accounting:
+ *
+ * We account sleep time as iowait when the cpu's nr_iowait indicates
+ * there are tasks blocked by io at the end of idle (=here).
+ * It means we cannot determine whether the sleep time will be idle
+ * or iowait on the fly.
+ * Therefore introduce a new rule:
+ * - basically, observers assign delta to idle
+ * - if the cpu finds nr_iowait>0 at idle exit, accumulate delta as
+ * missed iowait, and account it in the next turn of sleep instead.
+ * - if an observer finds accumulated iowait while the cpu is asleep,
+ * it can calculate the proper value to be accounted.
+ */
+ if (ktime_compare(ts->iowait_pending, delta) > 0) {
ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
- else
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
- ts->idle_entrytime = now;
+ ts->iowait_pending = ktime_sub(ts->iowait_pending, delta);
+ } else {
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime,
+ ktime_sub(delta, ts->iowait_pending));
+ ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime,
+ ts->iowait_pending);
+ ts->iowait_pending = ktime_set(0, 0);
+ }
+ if (nr_iowait_cpu(smp_processor_id()) > 0)
+ ts->iowait_pending = ktime_add(ts->iowait_pending, delta);
+
ts->idle_active = 0;

+ write_seqcount_end(&ts->idle_sleeptime_seq);
+
sched_clock_idle_wakeup_event(0);
}

@@ -423,8 +450,11 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
{
ktime_t now = ktime_get();

+ write_seqcount_begin(&ts->idle_sleeptime_seq);
ts->idle_entrytime = now;
ts->idle_active = 1;
+ write_seqcount_end(&ts->idle_sleeptime_seq);
+
sched_clock_idle_sleep_event();
return now;
}
@@ -445,7 +475,8 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
u64 get_cpu_idle_time_us(int cpu, u64 *wall)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t now, idle;
+ ktime_t now, idle, delta;
+ unsigned int seq;

if (!tick_nohz_active)
return -1;
@@ -454,16 +485,21 @@ u64 get_cpu_idle_time_us(int cpu, u64 *wall)
if (wall)
*wall = ktime_to_us(now);

- if (ts->idle_active && !nr_iowait_cpu(cpu)) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-
- idle = ktime_add(ts->idle_sleeptime, delta);
- } else {
+ do {
+ seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
idle = ts->idle_sleeptime;
- }

- return ktime_to_us(idle);
+ if (ts->idle_active) {
+ /* delta is sum of unaccounted idle/iowait */
+ delta = ktime_sub(now, ts->idle_entrytime);
+ if (ktime_compare(delta, ts->iowait_pending) > 0) {
+ delta = ktime_sub(delta, ts->iowait_pending);
+ idle = ktime_add(idle, delta);
+ }
+ }
+ } while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));

+ return ktime_to_us(idle);
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

@@ -483,7 +519,8 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
u64 get_cpu_iowait_time_us(int cpu, u64 *wall)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t now, iowait;
+ ktime_t now, iowait, delta;
+ unsigned int seq;

if (!tick_nohz_active)
return -1;
@@ -492,13 +529,20 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *wall)
if (wall)
*wall = ktime_to_us(now);

- if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-
- iowait = ktime_add(ts->iowait_sleeptime, delta);
- } else {
+ do {
+ seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
iowait = ts->iowait_sleeptime;
- }
+
+ if (ts->idle_active) {
+ /* delta is sum of unaccounted idle/iowait */
+ delta = ktime_sub(now, ts->idle_entrytime);
+ if (ktime_compare(delta, ts->iowait_pending) < 0) {
+ iowait = ktime_add(iowait, delta);
+ } else {
+ iowait = ktime_add(iowait, ts->iowait_pending);
+ }
+ }
+ } while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));

return ktime_to_us(iowait);
}
--
1.7.1

2014-04-15 03:15:24

by Hidetoshi Seto

Subject: Re: [PATCH v3 0/2] nohz: fix idle accounting in NO_HZ kernels

Ping?

(2014/04/10 18:07), Hidetoshi Seto wrote:
> Hi all,
>
> This patch set (rebased on v3.14) is my 3rd try at fixing an issue
> where the idle/iowait values in /proc/stat can go backward. It was
> originally reported by Tetsuo and Fernando in March 2013, last year.
>
> This v3 takes a new approach to the problem: it now uses a seqcount,
> while v1 and v2 used a seqlock.
>
> Of course, reviews are still welcome!
>
> Thanks,
> H.Seto
>
> Hidetoshi Seto (2):
> nohz: stop updating sleep stats from get_cpu_{idle,iowait}_time_us()
> nohz: use delayed iowait accounting to avoid race on idle time stats
>
> include/linux/tick.h | 6 ++-
> kernel/time/tick-sched.c | 116 ++++++++++++++++++++++++++++------------------
> 2 files changed, 74 insertions(+), 48 deletions(-)

2014-04-15 08:49:24

by Peter Zijlstra

Subject: Re: [PATCH 1/2] nohz: stop updating sleep stats from get_cpu_{idle,iowait}_time_us()

On Thu, Apr 10, 2014 at 06:11:03PM +0900, Hidetoshi Seto wrote:
> - if (ts->idle_active) {
> - delta = ktime_sub(now, ts->idle_entrytime);
> - if (nr_iowait_cpu(cpu) > 0)
> - ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> - else
> - ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> - ts->idle_entrytime = now;
> - }
> -
> - if (last_update_time)
> - *last_update_time = ktime_to_us(now);
> -
> -}
> -
> -static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
> -{
> - update_ts_time_stats(smp_processor_id(), ts, now, NULL);
> + /* Updates the per cpu time idle statistics counters */
> + delta = ktime_sub(now, ts->idle_entrytime);
> + if (nr_iowait_cpu(smp_processor_id()) > 0)

Now I know the old code also uses nr_iowait_cpu(), but that function is
crackbrained. This cannot possibly be right.

Anything using nr_iowait_cpu() is broken beyond repair.

2014-04-15 08:50:03

by Peter Zijlstra

Subject: Re: [PATCH 1/2] nohz: stop updating sleep stats from get_cpu_{idle,iowait}_time_us()

On Tue, Apr 15, 2014 at 10:48:54AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 10, 2014 at 06:11:03PM +0900, Hidetoshi Seto wrote:
> > - if (ts->idle_active) {
> > - delta = ktime_sub(now, ts->idle_entrytime);
> > - if (nr_iowait_cpu(cpu) > 0)
> > - ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> > - else
> > - ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> > - ts->idle_entrytime = now;
> > - }
> > -
> > - if (last_update_time)
> > - *last_update_time = ktime_to_us(now);
> > -
> > -}
> > -
> > -static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
> > -{
> > - update_ts_time_stats(smp_processor_id(), ts, now, NULL);
> > + /* Updates the per cpu time idle statistics counters */
> > + delta = ktime_sub(now, ts->idle_entrytime);
> > + if (nr_iowait_cpu(smp_processor_id()) > 0)
>
> Now I know the old code also uses nr_iowait_cpu(), but that function is
> crackbrained. This cannot possibly be right.
>
> Anything using nr_iowait_cpu() is broken beyond repair.

Argh, I still need to wake up... so it uses the old rq to decrement
against.

Never mind, I'll go stare at it some more.

2014-04-15 10:05:18

by Peter Zijlstra

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

On Thu, Apr 10, 2014 at 06:13:54PM +0900, Hidetoshi Seto wrote:
> This patch is v3 of the patch set to fix an issue where the idle/iowait
> values in /proc/stat can go backward. It was originally reported by
> Tetsuo and Fernando in March 2013, last year.
>
> [BACKGROUNDS]: idle accounting on NO_HZ
>
> If NO_HZ is enabled, a cpu stops its tick interrupt before going to
> sleep to be idle. This means that the time stats of the sleeping cpu
> will not be updated by the tick interrupt. Instead, when the cpu wakes
> up, it updates its time stats by calculating the idle duration from the
> timestamp taken at idle entry and the current time taken at idle exit.
>
> OTOH, there can be observers who want to know how long the sleeping
> cpu has been idle. Imagine that userland runs the top command or an
> application that reads /proc/stat. For them, the kernel provides the
> get_cpu_{idle,iowait}_time_us() functions to obtain the current idle
> time stats of such a sleeping cpu. These functions read the time stats
> and the timestamp taken at idle entry, and then return the current
> idle time by adding the duration calculated from that timestamp and
> the current time.
>
> There are 2 problems:
>
> [PROBLEM 1]: there is no exclusive control.
>
> It is easy to see that 2 different cpus are involved - an observing
> cpu, running a program that observes the idle cpu's stats, and an
> observed cpu, which is performing idle. A race can happen if the
> observed cpu wakes up while it is being observed: the observer can
> accidentally add its calculated duration (say delta) on top of time
> stats that have just been updated by the woken cpu. The correct idle
> time is then returned on the next read, so the value appears to go
> backward. Therefore readers must be excluded by the writer.
>
> The former patch changes get_cpu_{idle,iowait}_time_us() so that
> they no longer update a sleeping cpu's time stats from the observing
> cpu. Time stats are now updated by the woken cpu only, so there is
> only one writer now!
>
> In summary, there are races between one writer and multiple readers,
> but no exclusive control over this idle time stats dataset.

This should've gone into the previous patch, as it describes what
that one does.

>
> [PROBLEM 2]: broken iowait accounting.
>
> Historically, a cpu's idle time has been accounted as either idle or
> iowait depending on the presence of tasks blocked on I/O. No one
> complained about this for a long time. However:
>
> > Still trying to wrap my head around it, but conceptually
> > get_cpu_iowait_time_us() doesn't make any kind of sense.
> > iowait isn't per cpu since effectively tasks that aren't
> > running aren't assigned a cpu (as Oleg already pointed out).
> -- Peter Zijlstra
>
> Some kernel folks have now realized that accounting iowait per cpu
> does not make sense in the SMP world. In the traditional UP era, a
> cpu was considered to be waiting on I/O if it was idle while
> nr_iowait > 0. But on today's SMP systems, tasks easily migrate from
> the cpu where they issued an I/O to another cpu where they are queued
> after I/O completion.
>
> Back to the NO_HZ mechanism. The really terrible thing here is that
> the observer has to handle the duration "delta" without knowing that
> the sleeping cpu's nr_iowait can easily be changed by migration even
> while the cpu is sleeping. So the following can happen:
>
> given:
> idle time stats: idle=1000, iowait=100
> stamp at idle entry: entry=50
> nr tasks waiting io: nr_iowait=1
>
> the observer temporarily assigns delta to iowait at first
> (but does not account delta to the time stats):
> 1st reader's query @ now = 60:
> idle=1000
> iowait=110 (=100+(60-50))
>
> then all the blocked tasks are migrated away:
> nr_iowait=0
>
> and finally, on the 2nd read, the observer assigns delta to idle:
> 2nd reader's query @ now = 70:
> idle=1020 (=1000+(70-50))
> iowait=100
>
> You can see that iowait decreased from 110 to 100.
>
> In summary, iowait accounting has a fundamental problem and needs to
> be reworked properly. That implies some user interfaces might be
> replaced completely. It will take a long time to solve, so a
> workaround for compatibility would be appreciated.

This isn't actually true. The way the current ->nr_iowait accounting
works is that we inc/dec against the cpu the task went to sleep on. So
task migration won't actually affect nr_iowait. The only way to
decrement nr_iowait is to wake a task.

That's of course still complete crap, imagine a task going into iowait
sleep on CPU1 at t0. At t1 it wakes on CPU0, but CPU1 stays in nohz
sleep. Then at t2 CPU1 wakes and updates its sleep times.

Between t0 and t1 a reader will observe iowait on CPU1 and report iowait
+ x; x < t1 - t0. However a reader at >=t1 will observe no iowait and
report the old iowait time again but an increased idle time. Furthermore
when at t2 CPU1 wakes it will observe no iowait and account the entire
duration as idle, the iowait never happened.
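
As a timeline:

t0: task blocks on io and goes to sleep on CPU1 -> CPU1 nr_iowait=1
t1: the task wakes on CPU0, CPU1 stays in nohz sleep -> CPU1 nr_iowait=0
t2: CPU1 wakes, sees nr_iowait==0 and accounts all of [t0,t2] as idle;
the iowait in [t0,t1] is never accounted anywhere.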

2014-04-15 10:19:29

by Peter Zijlstra

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

On Thu, Apr 10, 2014 at 06:13:54PM +0900, Hidetoshi Seto wrote:
> [WHAT THIS PATCH PROPOSED]:
>
> To fix problem 1, this patch adds a seqcount to the NO_HZ idle
> accounting to avoid possible races between readers and the writer.
>
> And to cope with problem 2, it introduces delayed iowait accounting
> to get an approximate value without turning observers into writers.
> Refer to the comment in the patch for details.

> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -407,15 +407,42 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
> {
> ktime_t delta;
>
> + write_seqcount_begin(&ts->idle_sleeptime_seq);
> +
> /* Updates the per cpu time idle statistics counters */
> delta = ktime_sub(now, ts->idle_entrytime);
> +
> + /*
> + * Perform delayed iowait accounting:
> + *
> + * We account sleep time as iowait when the cpu's nr_iowait indicates
> + * there are tasks blocked by io at the end of idle (=here).
> + * It means we cannot determine whether the sleep time will be idle
> + * or iowait on the fly.
> + * Therefore introduce a new rule:
> + * - basically, observers assign delta to idle
> + * - if the cpu finds nr_iowait>0 at idle exit, accumulate delta as
> + * missed iowait, and account it in the next turn of sleep instead.
> + * - if an observer finds accumulated iowait while the cpu is asleep,
> + * it can calculate the proper value to be accounted.
> + */
> + if (ktime_compare(ts->iowait_pending, delta) > 0) {
> ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> + ts->iowait_pending = ktime_sub(ts->iowait_pending, delta);
> + } else {
> + ts->idle_sleeptime = ktime_add(ts->idle_sleeptime,
> + ktime_sub(delta, ts->iowait_pending));
> + ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime,
> + ts->iowait_pending);
> + ts->iowait_pending = ktime_set(0, 0);
> + }
> + if (nr_iowait_cpu(smp_processor_id()) > 0)
> + ts->iowait_pending = ktime_add(ts->iowait_pending, delta);
> +
> ts->idle_active = 0;
>
> + write_seqcount_end(&ts->idle_sleeptime_seq);
> +
> sched_clock_idle_wakeup_event(0);
> }

Why!? Both the changelog and the comment are silent on this. This
doesn't appear to make any sense, nor does it really solve anything.

2014-04-16 06:30:59

by Hidetoshi Seto

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

(2014/04/15 19:04), Peter Zijlstra wrote:
> On Thu, Apr 10, 2014 at 06:13:54PM +0900, Hidetoshi Seto wrote:
>> This patch is v3 of the patch set to fix an issue where the idle/iowait
>> values in /proc/stat can go backward. It was originally reported by
>> Tetsuo and Fernando in March 2013, last year.
>>
>> [BACKGROUNDS]: idle accounting on NO_HZ
>>
>> If NO_HZ is enabled, a cpu stops its tick interrupt before going to
>> sleep to be idle. This means that the time stats of the sleeping cpu
>> will not be updated by the tick interrupt. Instead, when the cpu wakes
>> up, it updates its time stats by calculating the idle duration from the
>> timestamp taken at idle entry and the current time taken at idle exit.
>>
>> OTOH, there can be observers who want to know how long the sleeping
>> cpu has been idle. Imagine that userland runs the top command or an
>> application that reads /proc/stat. For them, the kernel provides the
>> get_cpu_{idle,iowait}_time_us() functions to obtain the current idle
>> time stats of such a sleeping cpu. These functions read the time stats
>> and the timestamp taken at idle entry, and then return the current
>> idle time by adding the duration calculated from that timestamp and
>> the current time.
>>
>> There are 2 problems:
>>
>> [PROBLEM 1]: there is no exclusive control.
>>
>> It is easy to see that 2 different cpus are involved - an observing
>> cpu, running a program that observes the idle cpu's stats, and an
>> observed cpu, which is performing idle. A race can happen if the
>> observed cpu wakes up while it is being observed: the observer can
>> accidentally add its calculated duration (say delta) on top of time
>> stats that have just been updated by the woken cpu. The correct idle
>> time is then returned on the next read, so the value appears to go
>> backward. Therefore readers must be excluded by the writer.
>>
>> The former patch changes get_cpu_{idle,iowait}_time_us() so that
>> they no longer update a sleeping cpu's time stats from the observing
>> cpu. Time stats are now updated by the woken cpu only, so there is
>> only one writer now!
>>
>> In summary, there are races between one writer and multiple readers,
>> but no exclusive control over this idle time stats dataset.
>
> This should've gone in the previous patch; as it describes what that
> does.

Yes, I've been lazy... I'll fix it.

>>
>> [PROBLEM 2]: broken iowait accounting.
>>
>> Historically, a cpu's idle time has been accounted as either idle or
>> iowait depending on the presence of tasks blocked on I/O. No one
>> complained about this for a long time. However:
>>
>> > Still trying to wrap my head around it, but conceptually
>> > get_cpu_iowait_time_us() doesn't make any kind of sense.
>> > iowait isn't per cpu since effectively tasks that aren't
>> > running aren't assigned a cpu (as Oleg already pointed out).
>> -- Peter Zijlstra
>>
>> Some kernel folks have now realized that accounting iowait per cpu
>> does not make sense in the SMP world. In the traditional UP era, a
>> cpu was considered to be waiting on I/O if it was idle while
>> nr_iowait > 0. But on today's SMP systems, tasks easily migrate from
>> the cpu where they issued an I/O to another cpu where they are queued
>> after I/O completion.
>>
>> Back to the NO_HZ mechanism. The really terrible thing here is that
>> the observer has to handle the duration "delta" without knowing that
>> the sleeping cpu's nr_iowait can easily be changed by migration even
>> while the cpu is sleeping. So the following can happen:
>>
>> given:
>> idle time stats: idle=1000, iowait=100
>> stamp at idle entry: entry=50
>> nr tasks waiting io: nr_iowait=1
>>
>> the observer temporarily assigns delta to iowait at first
>> (but does not account delta to the time stats):
>> 1st reader's query @ now = 60:
>> idle=1000
>> iowait=110 (=100+(60-50))
>>
>> then all the blocked tasks are migrated away:
>> nr_iowait=0
>>
>> and finally, on the 2nd read, the observer assigns delta to idle:
>> 2nd reader's query @ now = 70:
>> idle=1020 (=1000+(70-50))
>> iowait=100
>>
>> You can see that iowait decreased from 110 to 100.
>>
>> In summary, iowait accounting has a fundamental problem and needs to
>> be reworked properly. That implies some user interfaces might be
>> replaced completely. It will take a long time to solve, so a
>> workaround for compatibility would be appreciated.
>
> This isn't actually true. The way the current ->nr_iowait accounting
> works is that we inc/dec against the cpu the task went to sleep on. So
> task migration won't actually affect nr_iowait. The only way to
> decrement nr_iowait is to wake a task.

You are right. I used "migration" because I still had the old notion
that "a sleeping task belongs to the cpu where it went to sleep."
I'll update the description.

> That's of course still complete crap, imagine a task going into iowait
> sleep on CPU1 at t0. At t1 it wakes on CPU0, but CPU1 stays in nohz
> sleep. Then at t2 CPU1 wakes and updates its sleep times.
>
> Between t0 and t1 a reader will observe iowait on CPU1 and report iowait
> + x; x < t1 - t0. However a reader at >=t1 will observe no iowait and
> report the old iowait time again but an increased idle time. Furthermore
> when at t2 CPU1 wakes it will observe no iowait and account the entire
> duration as idle, the iowait never happened.

I don't doubt that "per-cpu iowait is complete crap."

However, my concern is that there is still no solution even though
about a year has already passed since this "counter goes backward"
problem was reported as a regression.

[continued in my reply to your next mail]

Thanks,
H.Seto



2014-04-16 06:33:32

by Hidetoshi Seto

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

(2014/04/15 19:19), Peter Zijlstra wrote:
> On Thu, Apr 10, 2014 at 06:13:54PM +0900, Hidetoshi Seto wrote:
>> [WHAT THIS PATCH PROPOSED]:
>>
>> To fix problem 1, this patch adds a seqcount to the NO_HZ idle
>> accounting to avoid possible races between readers and the writer.
>>
>> And to cope with problem 2, it introduces delayed iowait accounting
>> to get an approximate value without turning observers into writers.
>> Refer to the comment in the patch for details.
>
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -407,15 +407,42 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
>> {
>> ktime_t delta;
>>
>> + write_seqcount_begin(&ts->idle_sleeptime_seq);
>> +
>> /* Updates the per cpu time idle statistics counters */
>> delta = ktime_sub(now, ts->idle_entrytime);
>> +
>> + /*
>> + * Perform delayed iowait accounting:
>> + *
>> + * We account sleep time as iowait when the cpu's nr_iowait indicates
>> + * there are tasks blocked by io at the end of idle (=here).
>> + * It means we cannot determine whether the sleep time will be idle
>> + * or iowait on the fly.
>> + * Therefore introduce a new rule:
>> + * - basically, observers assign delta to idle
>> + * - if the cpu finds nr_iowait>0 at idle exit, accumulate delta as
>> + * missed iowait, and account it in the next turn of sleep instead.
>> + * - if an observer finds accumulated iowait while the cpu is asleep,
>> + * it can calculate the proper value to be accounted.
>> + */
>> + if (ktime_compare(ts->iowait_pending, delta) > 0) {
>> ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
>> + ts->iowait_pending = ktime_sub(ts->iowait_pending, delta);
>> + } else {
>> + ts->idle_sleeptime = ktime_add(ts->idle_sleeptime,
>> + ktime_sub(delta, ts->iowait_pending));
>> + ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime,
>> + ts->iowait_pending);
>> + ts->iowait_pending = ktime_set(0, 0);
>> + }
>> + if (nr_iowait_cpu(smp_processor_id()) > 0)
>> + ts->iowait_pending = ktime_add(ts->iowait_pending, delta);
>> +
>> ts->idle_active = 0;
>>
>> + write_seqcount_end(&ts->idle_sleeptime_seq);
>> +
>> sched_clock_idle_wakeup_event(0);
>> }
>
> Why!? Both the changelog and the comment are silent on this. This
> doesn't appear to make any sense, nor does it really solve anything.

Sorry about my poor description.
I hope I can clarify my idea and thoughts in the following sentences...


[1] : should we make a change to a /proc/stat field's semantics?

As Frederic stated in a previous mail:
<quote>
> So what we can do for example is to account it per task and update stats
> on the CPU where the blocking task wakes up. This way we guarantee
> that we only account locally, which is good for scalability.
>
> This is going to be an ABI change on a /proc/stat field semantic.
> We usually can not do that as it can break userspace. But I think
> we have a reasonable exception here:
>
> 1) On a performance POV we don't have the choice.
>
> 2) It has always been a buggy stat on SMP. Given the possible fast iowait update
> rate, I doubt it has ever dumped correct stats. So I guess that few user apps
> have ever correctly relied on it.
</quote>

Basically I agree with this idea if we only maintain the latest
upstream in development. But if the target kernel is in the stable
family or is some distributor's kernel, I suppose this idea is not
acceptable, because keeping semantics stable is very important in
such environments.

For example, you might propose: "hey, per-cpu iowait is complete
crap! so how about making this field in /proc/stat stick at 0?"
That would be OK for the latest upstream, as an interim measure until
a new iowait accounting mechanism is invented. But for stable kernels
it would bring new regression reports, so it would not be any help.

So we need 2 operations:
a) remove the regression
b) implement a new iowait accounting mechanism

What Frederic suggested is that we won't need a) once we invent
the solution for b). But I doubt that, because a) is still required
for stable environments, including some distributors' kernels.
It is clear that patches for b) will not be backportable.

Still, b) is a disease with no known cure. There is no reason
to wait for work on b) before starting work on a).

If we need a semantics change, we can do it in the latest upstream.
But OTOH, regressions in stable kernels must be fixed.

That's why I wrote this patch set.


[2] : minimum locking for performance

To keep the semantics and emulate the behavior of tick-based
accounting, my v2 patch set used a seqlock to allow an observer to
update the time stats of a sleeping cpu. I believe v2 was the very
minimum implementation.

However, there were concerns about the performance impact of the
new seqlock. So I started v3 as a rework using minimal locking.

What we require here is:

- Monotonicity:
Required, and it is what the complaints were about.

- Accuracy:
Not expected; it has really been inaccurate from the start.
What is expected is "roughly the same as traditional
tick-based accounting."

- Exclusivity:
Expected. A sleep duration must be accounted as either
idle or iowait, not both.

- Performance:
Even if per-cpu iowait goes away, idle time needs to be
calculated on the observing cpu from delta plus the
accumulated value, besides the update performed locally
at idle exit. Therefore the minimum locking here is a
seqcount (as Frederic already expected).

Therefore v3 is designed to use a seqcount instead of a seqlock,
sacrificing accuracy (or... reasonableness?) to some degree.
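
That is, the writer and the readers pair up like this (a minimal
sketch of the pattern used in patch 2, with the fields from struct
tick_sched there):

	/* writer: the woken cpu only, at idle exit */
	write_seqcount_begin(&ts->idle_sleeptime_seq);
	/* ... account delta to idle_sleeptime / iowait_sleeptime ... */
	write_seqcount_end(&ts->idle_sleeptime_seq);

	/* reader: any cpu; retries if it raced with the writer */
	do {
		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
		idle = ts->idle_sleeptime;
		/* ... extrapolate from idle_entrytime if idle_active ... */
	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));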

[3] : new tricks

To use a seqcount, observers must be readers and never writers.
It means that:

- The observed cpu's time stats are fixed at idle entry, and
unchanged while sleeping (otherwise the results of readers will
not be coherent).

- Readers must not refer to the sleeping cpu's nr_iowait because
it can be changed by a task woken up on another cpu.

At this point:

- As already pointed out, stating "I'll sleep as iowait"
at idle entry would result in infinite iowait.
=> Then how about stating:
"I'll sleep for <e.g. a few nsec> as iowait
and the rest as idle"?
=> but how to determine a reasonable <few nsec>?

- The original code accounts iowait only when nr_iowait is >0
at idle exit. It means we cannot determine on the fly
whether the sleep time will be idle or iowait.
=> we cannot determine <few nsec> at idle entry

- No matter how long nr_iowait>0 continues (e.g. a 100ms sleep
with nr_iowait>0 until 99ms), we ignore iowait in the sleep
time if nr_iowait=0 at the end.
=> if we assigned a temporary, arbitrary <few nsec> at
every idle entry, it would bump up iowait
unreasonably.

- It is complete crap, but in fact no one complains about
the inaccuracy.
=> No one would notice if the accounted iowait value
had <few nsec> gaps, even <few msec> ones.

So the new strategy is:

>> + * Therefore introduce a new rule:
>> + * - basically, observers assign delta to idle
>> + * - if the cpu finds nr_iowait>0 at idle exit, accumulate delta as
>> + * missed iowait, and account it in the next turn of sleep instead.
>> + * - if an observer finds accumulated iowait while the cpu is asleep,
>> + * it can calculate the proper value to be accounted.
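
A worked example with made-up numbers (following the code in patch 2):

sleep 1: delta=10, iowait_pending=0, nr_iowait>0 at idle exit:
=> idle_sleeptime += 10, then iowait_pending = 10

sleep 2: delta=15, iowait_pending=10, nr_iowait=0 at idle exit:
=> iowait_sleeptime += 10, idle_sleeptime += 5 (=15-10),
iowait_pending = 0

The 10 units of iowait missed in sleep 1 are paid back out of
sleep 2, and both counters only ever move forward.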

... I think I wrote a lot, but I might have missed some points.
Questions?

I'll write v4 while waiting for new replies.


Thanks,
H.Seto

2014-04-16 09:37:05

by Peter Zijlstra

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

On Wed, Apr 16, 2014 at 03:33:06PM +0900, Hidetoshi Seto wrote:
> I hope I can clarify my idea and thoughts in the following sentences...
>
>
> [1] : should we make a change to a /proc/stat field's semantics?
>
> As Frederic stated in a previous mail:
> <quote>
> > So what we can do for example is to account it per task and update stats
> > on the CPU where the blocking task wakes up. This way we guarantee
> > that we only account locally, which is good for scalability.
> >
> > This is going to be an ABI change on a /proc/stat field semantic.
> > We usually can not do that as it can break userspace. But I think
> > we have a reasonable exception here:
> >
> > 1) On a performance POV we don't have the choice.
> >
> > 2) It has always been a buggy stat on SMP. Given the possible fast iowait update
> > rate, I doubt it has ever dumped correct stats. So I guess that few user apps
> > have ever correctly relied on it.
> </quote>
>
> Basically I agree with this idea if we only maintain the latest
> upstream in development. But if the target kernel is in the stable
> family or is some distributor's kernel, I suppose this idea is not
> acceptable, because keeping semantics stable is very important in
> such environments.
>
> For example, you might propose: "hey, per-cpu iowait is complete
> crap! so how about making this field in /proc/stat stick at 0?"
> That would be OK for the latest upstream, as an interim measure until
> a new iowait accounting mechanism is invented. But for stable kernels
> it would bring new regression reports, so it would not be any help.
>
> So we need 2 operations:
> a) remove the regression

What regression? There's never been talk of a regression, just a bug
that was found. AFAICT this 'regression' has been there ever since we
introduced NOHZ or somesuch, which is very long ago indeed.

And since it's basically been broken forever, there's no rush
whatsoever.

> b) implement a new iowait accounting mechanism
>
> What Frederic suggested is that we won't need a) once we invent
> the solution for b). But I doubt that, because a) is still required
> for stable environments, including some distributors' kernels.
> It is clear that patches for b) will not be backportable.
>
> Still, b) is a disease with no known cure. There is no reason
> to wait for work on b) before starting work on a).

As stated, there is no a). It's been broken forever. There is no urgency.

2014-04-17 00:42:53

by Hidetoshi Seto

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

(2014/04/16 18:36), Peter Zijlstra wrote:
> On Wed, Apr 16, 2014 at 03:33:06PM +0900, Hidetoshi Seto wrote:
>> So we need 2 operations:
>> a) remove the regression
>
> What regression? There's never been talk of a regression, just a bug
> that was found. AFAICT this 'regression' has been there ever since we
> introduced NOHZ or somesuch, which is very long ago indeed.
>
> And since it's basically been broken forever, there's no rush
> whatsoever.

Well, from a customer's point of view, when he upgrades his foobar
enterprise linux from version 5 to 6 (for example), he will say "it's
a regression" if something that worked well in the previous version
is broken in the new version without any documentation and/or
technical notes.

That's why I used the word "regression" for this bug.

>> b) implement a new iowait accounting mechanism
>>
>> What Frederic suggested is that we won't need a) once we invent
>> the solution for b). But I doubt that, because a) is still required
>> for stable environments, including some distributors' kernels.
>> It is clear that patches for b) will not be backportable.
>>
>> Still, b) is a disease with no known cure. There is no reason
>> to wait for work on b) before starting work on a).
>
> As stated, there is no a). It's been broken forever. There is no urgency.

I just wrote my patches for customers like the above, and for my salary ;-)

Thank you for your comments!
I'll post my v4 patch set soon.


Thanks,
H.Seto


2014-04-17 10:06:01

by Peter Zijlstra

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

On Wed, Apr 16, 2014 at 03:33:06PM +0900, Hidetoshi Seto wrote:
> [3] : new tricks
>
> To use a seqcount, observers must be readers and never writers.
> It means that:
>
> - The observed cpu's time stats are fixed at idle entry, and
> unchanged while sleeping (otherwise the results of readers will
> not be coherent).
>
> - Readers must not refer to the sleeping cpu's nr_iowait because
> it can be changed by a task woken up on another cpu.
>
> At this point:
>
> - As already pointed out, stating "I'll sleep as iowait"
> at idle entry would result in infinite iowait.
> => Then how about stating:
> "I'll sleep for <e.g. a few nsec> as iowait
> and the rest as idle"?
> => but how to determine a reasonable <few nsec>?

Well, we actually _know_ when that counter drops to 0. We've got the
actual event there, we don't need to guess about any of this.

> - The original code accounts iowait only when nr_iowait is >0
> at idle exit. It means we cannot determine on the fly
> whether the sleep time will be idle or iowait.
> => we cannot determine <few nsec> at idle entry

Intel really should give us this crystal ball instruction already ;-)


Anyway, if you want to preserve the same broken ass crap we had pre
NOHZ, something like the below should do that.

I'm not really thrilled with iowait_{start,stop}() but I think they
should have the same general cost as the atomic ops we already had. In
particular on x86 an uncontended lock+unlock is a single atomic.

This is on top of the first patch from Frederic that both you and
Denys carried.

That said, I really hate duct-taping this together, for the generated
numbers are still useless.

--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -58,6 +58,8 @@ union ktime {

typedef union ktime ktime_t; /* Kill this */

+#define ktime_zero ((ktime_t){ .tv64 = 0 })
+
/*
* ktime_t definitions when using the 64-bit scalar representation:
*/
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2297,15 +2297,29 @@ unsigned long nr_iowait(void)
unsigned long i, sum = 0;

for_each_possible_cpu(i)
- sum += atomic_read(&cpu_rq(i)->nr_iowait);
+ sum += cpu_rq(i)->nr_iowait;

return sum;
}

unsigned long nr_iowait_cpu(int cpu)
{
- struct rq *this = cpu_rq(cpu);
- return atomic_read(&this->nr_iowait);
+ return cpu_rq(cpu)->nr_iowait;
+}
+
+void nr_iowait_deltas(ktime_t start, ktime_t now,
+ ktime_t *iowait_delta, ktime_t *idle_delta)
+{
+ struct rq *rq = this_rq();
+
+ raw_spin_lock(&rq->iowait_lock);
+ if (rq->nr_iowait) {
+ *iowait_delta = ktime_sub(now, start);
+ } else {
+ *iowait_delta = ktime_sub(rq->last_iowait, start);
+ *idle_delta = ktime_sub(now, rq->last_iowait);
+ }
+ raw_spin_unlock(&rq->iowait_lock);
}

#ifdef CONFIG_SMP
@@ -4201,6 +4215,24 @@ bool __sched yield_to(struct task_struct
}
EXPORT_SYMBOL_GPL(yield_to);

+static inline void iowait_start(struct rq *rq)
+{
+ raw_spin_lock(&rq->iowait_lock);
+ rq->nr_iowait++;
+ raw_spin_unlock(&rq->iowait_lock);
+ current->in_iowait = 1;
+}
+
+static inline void iowait_stop(struct rq *rq)
+{
+ current->in_iowait = 0;
+ raw_spin_lock(&rq->iowait_lock);
+ rq->nr_iowait--;
+ if (!rq->nr_iowait && rq != this_rq())
+ rq->last_iowait = ktime_get();
+ raw_spin_unlock(&rq->iowait_lock);
+}
+
/*
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
* that process accounting knows that this is a task in IO wait state.
@@ -4210,12 +4242,10 @@ void __sched io_schedule(void)
struct rq *rq = raw_rq();

delayacct_blkio_start();
- atomic_inc(&rq->nr_iowait);
+ iowait_start(rq);
blk_flush_plug(current);
- current->in_iowait = 1;
schedule();
- current->in_iowait = 0;
- atomic_dec(&rq->nr_iowait);
+ iowait_stop(rq);
delayacct_blkio_end();
}
EXPORT_SYMBOL(io_schedule);
@@ -4226,12 +4256,10 @@ long __sched io_schedule_timeout(long ti
long ret;

delayacct_blkio_start();
- atomic_inc(&rq->nr_iowait);
+ iowait_start(rq);
blk_flush_plug(current);
- current->in_iowait = 1;
ret = schedule_timeout(timeout);
- current->in_iowait = 0;
- atomic_dec(&rq->nr_iowait);
+ iowait_stop(rq);
delayacct_blkio_end();
return ret;
}
@@ -6880,7 +6908,10 @@ void __init sched_init(void)
#endif
#endif
init_rq_hrtick(rq);
- atomic_set(&rq->nr_iowait, 0);
+
+ raw_spin_lock_init(&rq->iowait_lock);
+ rq->nr_iowait = 0;
+ rq->last_iowait = ktime_get();
}

set_load_weight(&init_task);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -571,7 +571,9 @@ struct rq {
u64 clock;
u64 clock_task;

- atomic_t nr_iowait;
+ raw_spinlock_t iowait_lock ____cacheline_aligned;
+ unsigned int nr_iowait;
+ ktime_t last_iowait;

#ifdef CONFIG_SMP
struct root_domain *rd;
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -429,14 +429,16 @@ static void tick_nohz_update_jiffies(kti

static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
- ktime_t delta;
+ ktime_t iowait_delta = ktime_zero, idle_delta = ktime_zero;
+
+ if (ts->idle_active == 2) {
+ nr_iowait_deltas(ts->idle_entrytime, now, &iowait_delta, &idle_delta);
+ ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, iowait_delta);
+ } else {
+ idle_delta = ktime_sub(now, ts->idle_entrytime);
+ }
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, idle_delta);

- /* Updates the per cpu time idle statistics counters */
- delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
- ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
- else
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
ts->idle_active = 0;

sched_clock_idle_wakeup_event(0);
@@ -447,7 +449,7 @@ static ktime_t tick_nohz_start_idle(stru
ktime_t now = ktime_get();

ts->idle_entrytime = now;
- ts->idle_active = 1;
+ ts->idle_active = 1 + !!nr_iowait_cpu(smp_processor_id());
sched_clock_idle_sleep_event();
return now;
}

2014-04-17 10:09:24

by Peter Zijlstra

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

On Thu, Apr 17, 2014 at 12:05:19PM +0200, Peter Zijlstra wrote:
> +static inline void iowait_start(struct rq *rq)
> +{
> + raw_spin_lock(&rq->iowait_lock);
> + rq->nr_iowait++;
> + raw_spin_unlock(&rq->iowait_lock);
> + current->in_iowait = 1;
> +}
> +
> +static inline void iowait_stop(struct rq *rq)
> +{
> + current->in_iowait = 0;
> + raw_spin_lock(&rq->iowait_lock);
> + rq->nr_iowait--;
> + if (!rq->nr_iowait && rq != this_rq())
> + rq->last_iowait = ktime_get();
> + raw_spin_unlock(&rq->iowait_lock);
> +}

We could actually use lockref here...

2014-04-18 05:52:53

by Hidetoshi Seto

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

(2014/04/17 19:05), Peter Zijlstra wrote:
> Anyway, if you want to preserve the same broken ass crap we had pre
> NOHZ, something like the below should do that.
>
> I'm not really thrilled with iowait_{start,stop}() but I think they
> should have the same general cost as the atomic ops we already had. In
> particular on x86 an uncontended lock+unlock is a single atomic.
>
> This is on top the first patch from Frederic that both you and Denys
> carried.
>
> That said; I really hate duckt taping this together, for the generated
> numbers are still useless.
>
> --- a/include/linux/ktime.h
> +++ b/include/linux/ktime.h
> @@ -58,6 +58,8 @@ union ktime {
>
> typedef union ktime ktime_t; /* Kill this */
>
> +#define ktime_zero ((ktime_t){ .tv64 = 0 })
> +
> /*
> * ktime_t definitions when using the 64-bit scalar representation:
> */
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2297,15 +2297,29 @@ unsigned long nr_iowait(void)
> unsigned long i, sum = 0;
>
> for_each_possible_cpu(i)
> - sum += atomic_read(&cpu_rq(i)->nr_iowait);
> + sum += cpu_rq(i)->nr_iowait;
>
> return sum;
> }
>
> unsigned long nr_iowait_cpu(int cpu)
> {
> - struct rq *this = cpu_rq(cpu);
> - return atomic_read(&this->nr_iowait);
> + return cpu_rq(cpu)->nr_iowait;
> +}
> +
> +void nr_iowait_deltas(ktime_t start, ktime_t now,
> + ktime_t *iowait_delta, ktime_t *idle_delta)
> +{
> + struct rq *rq = this_rq();
> +
> + raw_spin_lock(&rq->iowait_lock);
> + if (rq->nr_iowait) {
> + *iowait_delta = ktime_sub(now, start);
> + } else {
> + *iowait_delta = ktime_sub(rq->last_iowait, start);
> + *idle_delta = ktime_sub(now, rq->last_iowait);
> + }
> + raw_spin_unlock(&rq->iowait_lock);
> }
>
> #ifdef CONFIG_SMP
> @@ -4201,6 +4215,24 @@ bool __sched yield_to(struct task_struct
> }
> EXPORT_SYMBOL_GPL(yield_to);
>
> +static inline void iowait_start(struct rq *rq)
> +{
> + raw_spin_lock(&rq->iowait_lock);
> + rq->nr_iowait++;
> + raw_spin_unlock(&rq->iowait_lock);
> + current->in_iowait = 1;
> +}
> +
> +static inline void iowait_stop(struct rq *rq)
> +{
> + current->in_iowait = 0;
> + raw_spin_lock(&rq->iowait_lock);
> + rq->nr_iowait--;
> + if (!rq->nr_iowait && rq != this_rq())
> + rq->last_iowait = ktime_get();
> + raw_spin_unlock(&rq->iowait_lock);
> +}
> +
> /*
> * This task is about to go to sleep on IO. Increment rq->nr_iowait so
> * that process accounting knows that this is a task in IO wait state.
> @@ -4210,12 +4242,10 @@ void __sched io_schedule(void)
> struct rq *rq = raw_rq();
>
> delayacct_blkio_start();
> - atomic_inc(&rq->nr_iowait);
> + iowait_start(rq);
> blk_flush_plug(current);
> - current->in_iowait = 1;
> schedule();
> - current->in_iowait = 0;
> - atomic_dec(&rq->nr_iowait);
> + iowait_stop(rq);
> delayacct_blkio_end();
> }
> EXPORT_SYMBOL(io_schedule);
> @@ -4226,12 +4256,10 @@ long __sched io_schedule_timeout(long ti
> long ret;
>
> delayacct_blkio_start();
> - atomic_inc(&rq->nr_iowait);
> + iowait_start(rq);
> blk_flush_plug(current);
> - current->in_iowait = 1;
> ret = schedule_timeout(timeout);
> - current->in_iowait = 0;
> - atomic_dec(&rq->nr_iowait);
> + iowait_stop(rq);
> delayacct_blkio_end();
> return ret;
> }
> @@ -6880,7 +6908,10 @@ void __init sched_init(void)
> #endif
> #endif
> init_rq_hrtick(rq);
> - atomic_set(&rq->nr_iowait, 0);
> +
> + raw_spin_lock_init(&rq->iowait_lock);
> + rq->nr_iowait = 0;
> + rq->last_iowait = ktime_get();
> }
>
> set_load_weight(&init_task);

I think it also works... but I have some concerns here:

- it changes the golden path in the scheduler core.
The performance impact is questionable.

- it forces managing last_iowait even when the system is busy.
I guess it will drop the system's peak performance, while my
proposed fix only touches the idle procedure with nohz.

By the way, I have posted my v4 patch set:
https://lkml.org/lkml/2014/4/17/120

I'd be happy if you could give your comments on it too!


Thanks,
H.Seto

2014-04-18 08:44:27

by Peter Zijlstra

Subject: Re: [PATCH 2/2] nohz: use delayed iowait accounting to avoid race on idle time stats

On Fri, Apr 18, 2014 at 02:52:05PM +0900, Hidetoshi Seto wrote:
> I think it also works... but I have some concerns here:
>
> - it changes the golden path in the scheduler core.
> The performance impact is questionable.

We should be able to measure that, but like I said, it shouldn't
change the number of atomic ops, and it makes sure it's all in the
same cacheline, so the additional code is all cheap.

> - it forces managing last_iowait even when the system is busy.
> I guess it will drop the system's peak performance, while my
> proposed fix only touches the idle procedure with nohz.

That could be fixed by looking to see if the remote rq is idle.

> By the way, I have posted my v4 patch set:
> https://lkml.org/lkml/2014/4/17/120
>
> I'll happy if you could give your comments on it too!

Yeah, I saw that; I haven't had time yet, but will hopefully get to it
soon :-)