2013-10-19 15:17:27

by Frederic Weisbecker

Subject: [RFC PATCH 0/5] nohz: Fix racy sleeptime stats v2

Hi,

So this is a new iteration with some more changes on top of the latest discussions
we had.

Unfortunately it's not yet a truly viable solution. Check out patch 4/5 for
details; I think it adds too much overhead. It also changes the semantics
enough to break the idle sleeptime accounting along the way. This is because
moving the accounting to io_schedule() makes it account _all_ IO sleeptime,
even when the CPU is still busy, not just when it is idle as before. As a
result, the idle sleeptime, which relies on a diff against the io sleeptime,
is broken.

Anyway, my hope with this new take is mostly to relaunch the discussion from
a fresher base and see if we can find better ideas.

Thanks!

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
sched/iowait

Thanks,
Frederic
---

Frederic Weisbecker (5):
nohz: Convert a few places to use local per cpu accesses
nohz: Only update sleeptime stats locally
timer: Change idle/iowait accounting semantics
sched: Refactor iowait accounting
nohz: Synchronize sleep time stats with seqlock


include/linux/ktime.h | 7 +++
include/linux/sched.h | 1 +
include/linux/tick.h | 11 ++--
kernel/sched/core.c | 68 ++++++++++++++++++---
kernel/sched/cputime.c | 2 +-
kernel/sched/sched.h | 5 +-
kernel/softirq.c | 4 +-
kernel/time/tick-broadcast.c | 6 +-
kernel/time/tick-internal.h | 4 +-
kernel/time/tick-sched.c | 138 ++++++++++++++-----------------------------
kernel/time/timer_list.c | 3 +-
11 files changed, 130 insertions(+), 119 deletions(-)


2013-10-19 15:17:39

by Frederic Weisbecker

Subject: [PATCH 1/5] nohz: Convert a few places to use local per cpu accesses

A few functions use remote per-CPU access APIs when they
deal with local values.

Just do the right conversion to improve performance, code
readability and debug checks.
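
For example, a read of the local tick_sched that was open coded as:

	struct tick_sched *ts = &per_cpu(tick_cpu_sched, smp_processor_id());

becomes:

	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);

as in the tick_nohz_idle_exit() hunk below; the local accessor avoids the
per-CPU offset computation against an arbitrary CPU number and benefits from
the preemption debug checks of the local accessors.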

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Fernando Luis Vazquez Cao <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
include/linux/tick.h | 6 +++---
kernel/softirq.c | 4 +---
kernel/time/tick-broadcast.c | 6 +++---
kernel/time/tick-internal.h | 4 ++--
kernel/time/tick-sched.c | 37 +++++++++++++++----------------------
5 files changed, 24 insertions(+), 33 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 5128d33..a004f66 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -104,7 +104,7 @@ extern struct cpumask *tick_get_broadcast_oneshot_mask(void);
extern void tick_clock_notify(void);
extern int tick_check_oneshot_change(int allow_nohz);
extern struct tick_sched *tick_get_tick_sched(int cpu);
-extern void tick_check_idle(int cpu);
+extern void tick_check_idle(void);
extern int tick_oneshot_mode_active(void);
# ifndef arch_needs_cpu
# define arch_needs_cpu(cpu) (0)
@@ -112,7 +112,7 @@ extern int tick_oneshot_mode_active(void);
# else
static inline void tick_clock_notify(void) { }
static inline int tick_check_oneshot_change(int allow_nohz) { return 0; }
-static inline void tick_check_idle(int cpu) { }
+static inline void tick_check_idle(void) { }
static inline int tick_oneshot_mode_active(void) { return 0; }
# endif

@@ -121,7 +121,7 @@ static inline void tick_init(void) { }
static inline void tick_cancel_sched_timer(int cpu) { }
static inline void tick_clock_notify(void) { }
static inline int tick_check_oneshot_change(int allow_nohz) { return 0; }
-static inline void tick_check_idle(int cpu) { }
+static inline void tick_check_idle(void) { }
static inline int tick_oneshot_mode_active(void) { return 0; }
#endif /* !CONFIG_GENERIC_CLOCKEVENTS */

diff --git a/kernel/softirq.c b/kernel/softirq.c
index dcab1d3..d504e6c 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -310,8 +310,6 @@ asmlinkage void do_softirq(void)
*/
void irq_enter(void)
{
- int cpu = smp_processor_id();
-
rcu_irq_enter();
if (is_idle_task(current) && !in_interrupt()) {
/*
@@ -319,7 +317,7 @@ void irq_enter(void)
* here, as softirq will be serviced on return from interrupt.
*/
local_bh_disable();
- tick_check_idle(cpu);
+ tick_check_idle();
_local_bh_enable();
}

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 218bcb5..bb55743 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -537,10 +537,10 @@ int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
* Called from irq_enter() when idle was interrupted to reenable the
* per cpu device.
*/
-void tick_check_oneshot_broadcast(int cpu)
+void tick_check_oneshot_broadcast(void)
{
- if (cpumask_test_cpu(cpu, tick_broadcast_oneshot_mask)) {
- struct tick_device *td = &per_cpu(tick_cpu_device, cpu);
+ if (cpumask_test_cpu(smp_processor_id(), tick_broadcast_oneshot_mask)) {
+ struct tick_device *td = &__get_cpu_var(tick_cpu_device);

/*
* We might be in the middle of switching over from
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index bc906ca..6d819b4 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -51,7 +51,7 @@ extern void tick_broadcast_switch_to_oneshot(void);
extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
extern int tick_broadcast_oneshot_active(void);
-extern void tick_check_oneshot_broadcast(int cpu);
+extern void tick_check_oneshot_broadcast(void);
bool tick_broadcast_oneshot_available(void);
# else /* BROADCAST */
static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
@@ -62,7 +62,7 @@ static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
static inline void tick_broadcast_switch_to_oneshot(void) { }
static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
static inline int tick_broadcast_oneshot_active(void) { return 0; }
-static inline void tick_check_oneshot_broadcast(int cpu) { }
+static inline void tick_check_oneshot_broadcast(void) { }
static inline bool tick_broadcast_oneshot_available(void) { return true; }
# endif /* !BROADCAST */

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3612fc7..b93b5b9 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -391,11 +391,9 @@ __setup("nohz=", setup_tick_nohz);
*/
static void tick_nohz_update_jiffies(ktime_t now)
{
- int cpu = smp_processor_id();
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
unsigned long flags;

- ts->idle_waketime = now;
+ __this_cpu_write(tick_cpu_sched.idle_waketime, now);

local_irq_save(flags);
tick_do_update_jiffies64(now);
@@ -426,17 +424,15 @@ update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_upda

}

-static void tick_nohz_stop_idle(int cpu, ktime_t now)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-
- update_ts_time_stats(cpu, ts, now, NULL);
+ update_ts_time_stats(smp_processor_id(), ts, now, NULL);
ts->idle_active = 0;

sched_clock_idle_wakeup_event(0);
}

-static ktime_t tick_nohz_start_idle(int cpu, struct tick_sched *ts)
+static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
{
ktime_t now = ktime_get();

@@ -752,7 +748,7 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
ktime_t now, expires;
int cpu = smp_processor_id();

- now = tick_nohz_start_idle(cpu, ts);
+ now = tick_nohz_start_idle(ts);

if (can_stop_idle_tick(cpu, ts)) {
int was_stopped = ts->tick_stopped;
@@ -914,8 +910,7 @@ static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
*/
void tick_nohz_idle_exit(void)
{
- int cpu = smp_processor_id();
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
ktime_t now;

local_irq_disable();
@@ -928,7 +923,7 @@ void tick_nohz_idle_exit(void)
now = ktime_get();

if (ts->idle_active)
- tick_nohz_stop_idle(cpu, now);
+ tick_nohz_stop_idle(ts, now);

if (ts->tick_stopped) {
tick_nohz_restart_sched_tick(ts, now);
@@ -1012,12 +1007,10 @@ static void tick_nohz_switch_to_nohz(void)
* timer and do not touch the other magic bits which need to be done
* when idle is left.
*/
-static void tick_nohz_kick_tick(int cpu, ktime_t now)
+static void tick_nohz_kick_tick(struct tick_sched *ts, ktime_t now)
{
#if 0
/* Switch back to 2.6.27 behaviour */
-
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t delta;

/*
@@ -1032,19 +1025,19 @@ static void tick_nohz_kick_tick(int cpu, ktime_t now)
#endif
}

-static inline void tick_check_nohz(int cpu)
+static inline void tick_check_nohz(void)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
ktime_t now;

if (!ts->idle_active && !ts->tick_stopped)
return;
now = ktime_get();
if (ts->idle_active)
- tick_nohz_stop_idle(cpu, now);
+ tick_nohz_stop_idle(ts, now);
if (ts->tick_stopped) {
tick_nohz_update_jiffies(now);
- tick_nohz_kick_tick(cpu, now);
+ tick_nohz_kick_tick(ts, now);
}
}

@@ -1058,10 +1051,10 @@ static inline void tick_check_nohz(int cpu) { }
/*
* Called from irq_enter to notify about the possible interruption of idle()
*/
-void tick_check_idle(int cpu)
+void tick_check_idle(void)
{
- tick_check_oneshot_broadcast(cpu);
- tick_check_nohz(cpu);
+ tick_check_oneshot_broadcast();
+ tick_check_nohz();
}

/*
--
1.8.3.1

2013-10-19 15:18:08

by Frederic Weisbecker

Subject: [PATCH 5/5] nohz: Synchronize sleep time stats with seqlock

When a call site uses get_cpu_*_time_us() to read a sleeptime
stat, it derives the total sleeptime by adding the pending time
to the last sleeptime snapshot if the target CPU is idle.

Namely this sums up to:

	sleeptime = ts($CPU)->idle_sleeptime;
	if (ts($CPU)->idle_active)
		sleeptime += NOW() - ts($CPU)->idle_entrytime;

But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated under some disciplined ordering.

Let's consider the following scenario:

         CPU 0                                            CPU 1

(seq 1)  ts(CPU 0)->idle_active = 1
         ts(CPU 0)->idle_entrytime = NOW()

(seq 2)  sleeptime = NOW() - ts(CPU 0)->idle_entrytime
         ts(CPU 0)->idle_sleeptime += sleeptime          sleeptime = ts(CPU 0)->idle_sleeptime;
                                                         if (ts(CPU 0)->idle_active)
         ts(CPU 0)->idle_entrytime = NOW()                   sleeptime += NOW() - ts(CPU 0)->idle_entrytime

The resulting value of sleeptime on CPU 1 can vary depending on the
ordering:

* If it sees the value of idle_entrytime after seq 1 and the value of idle_sleeptime
after seq 2, the value of sleeptime will be buggy because it accounts the delta twice,
so it will be too high.

* If it sees the value of idle_entrytime after seq 2 and the value of idle_sleeptime
after seq 1, the value of sleeptime will be buggy because it misses the delta, so it
will be too low.

* If it sees the values of idle_entrytime and idle_sleeptime both as they were
after seq 1, or both as they were after seq 2, the value will be correct.

Trickier scenarios can also happen if the idle_active value is read from a
former sequence.

Hence we must honour the following constraints:

- idle_sleeptime, idle_active and idle_entrytime must be updated and read
under some correctly enforced SMP ordering

- The three values as read by CPU 1 must belong to the same update
sequence from CPU 0. The update sequences must be delimited such that the
resulting three values after a sequence completion produce a coherent result
together when read from CPU 1.

- We must prevent fetching middle-of-sequence values.

The ideal solution to implement this synchronization is to use a seqcount.
Let's use one here around these three values to enforce sequence
synchronization between updates and reads.
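
Condensed from the hunks below, the update and read sides then pair up like
this (the seqcount lives in struct tick_sched as idle_sleeptime_seq):

	/* writer: runs locally on the idle CPU */
	write_seqcount_begin(&ts->idle_sleeptime_seq);
	ts->idle_sleeptime = ktime_add(ts->idle_sleeptime,
				       ktime_sub(now, ts->idle_entrytime));
	ts->idle_active = 0;
	write_seqcount_end(&ts->idle_sleeptime_seq);

	/* reader: any CPU retries until it sees a consistent triplet */
	unsigned int seq;
	ktime_t idle;

	do {
		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
		idle = ts->idle_sleeptime;
		if (ts->idle_active)
			idle = ktime_add(idle,
					 ktime_sub(now, ts->idle_entrytime));
	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));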

This fixes a reported bug where non-monotonic sleeptime stats are returned
by /proc/stat when it is read frequently, as well as potential cpufreq
governor bugs.

Reported-by: Fernando Luis Vazquez Cao <[email protected]>
Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Fernando Luis Vazquez Cao <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
include/linux/tick.h | 1 +
kernel/time/tick-sched.c | 22 ++++++++++++++++------
2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7a66bf6..4d6131e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -65,6 +65,7 @@ struct tick_sched {
ktime_t idle_waketime;
ktime_t idle_exittime;
ktime_t idle_sleeptime;
+ seqcount_t idle_sleeptime_seq;
ktime_t sleep_length;
unsigned long last_jiffies;
unsigned long next_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a038d4e..da53cbf 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -407,9 +407,11 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
ktime_t delta;

/* Updates the per cpu time idle statistics counters */
+ write_seqcount_begin(&ts->idle_sleeptime_seq);
delta = ktime_sub(now, ts->idle_entrytime);
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
ts->idle_active = 0;
+ write_seqcount_end(&ts->idle_sleeptime_seq);

sched_clock_idle_wakeup_event(0);
}
@@ -418,9 +420,13 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
{
ktime_t now = ktime_get();

+ write_seqcount_begin(&ts->idle_sleeptime_seq);
ts->idle_entrytime = now;
ts->idle_active = 1;
+ write_seqcount_end(&ts->idle_sleeptime_seq);
+
sched_clock_idle_sleep_event();
+
return now;
}

@@ -442,6 +448,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, idle;
+ unsigned int seq;
u64 iowait;

if (!tick_nohz_enabled)
@@ -451,12 +458,16 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
if (last_update_time)
*last_update_time = ktime_to_us(now);

- if (ts->idle_active) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
- idle = ktime_add(ts->idle_sleeptime, delta);
- } else {
+ do {
+ ktime_t delta;
+
+ seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
idle = ts->idle_sleeptime;
- }
+ if (ts->idle_active) {
+ delta = ktime_sub(now, ts->idle_entrytime);
+ idle = ktime_add(idle, delta);
+ }
+ } while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));

iowait = get_cpu_iowait_time_us(cpu, NULL);

@@ -468,7 +479,6 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
idle = ktime_sub_us(idle, iowait);

return ktime_to_us(idle);
-
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

--
1.8.3.1

2013-10-19 15:18:07

by Frederic Weisbecker

Subject: [PATCH 4/5] sched: Refactor iowait accounting

The iowait time accounting has awkward semantics. It is performed per
CPU. The iowait time of a CPU starts when a task sleeps on IO on
this CPU and stops when that task later wakes up on any CPU.

This assumes that a sleeping task is still assigned to the CPU
it was running on last until it wakes up. As such the iowait time stays
affine to that CPU until the io completes and the task runs again.

These assumptions are buggy given that a sleeping task is actually not
assigned to any CPU: it can be woken up to run on any target, not just
the CPU that dequeued it last. Therefore iowait time should be accounted
globally instead, or per the set of CPUs a task is allowed to run on.

Although it's tempting to remove such a buggy stat, it's important to
note that some subsystems rely on this accounting: the cpufreq subsystem,
and also procfs, which displays this per-CPU value through the /proc/stat
file. Even though we could work around the cpufreq case, where it appears
that only a few residual users rely on the iowait accounting, the procfs
file has turned it into a user ABI by now. So it's not going to be
removed easily.

So we probably need to keep it in the short term. Now the way it is
implemented around the idle loop is racy. For example if CPU 0 sleeps
with pending IO for a while, then the sleeping task wakes up on CPU 1
while CPU 0 also exits idle to schedule a task, the iowait time may end
up being accounted twice, or worse.

Since it can be updated concurrently, we need some sort of synchronization
there. This patch is inspired by a proposal from Peter Zijlstra.

We move the iowait accounting to the io schedule code and synchronize
the writers and readers with seqlocks.

Note that this changes the iowait accounting semantics, because iowait
time was previously only accounted while the CPU was idle. Now it's
accounted as long as there is a task sleeping on IO.

Two issues with this:

* This breaks the idle_sleeptime accounting that computes a difference
against iowait time.

* Also the seqlock + local_clock is probably too much overhead for the
io schedule path.

So this patch is just a fresher base to debate a better solution.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Fernando Luis Vazquez Cao <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
include/linux/sched.h | 1 +
include/linux/tick.h | 4 ---
kernel/sched/core.c | 68 +++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/cputime.c | 2 +-
kernel/sched/sched.h | 5 +++-
kernel/time/tick-sched.c | 44 +++----------------------------
6 files changed, 71 insertions(+), 53 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 833eed5..a542fa9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -102,6 +102,7 @@ extern int nr_processes(void);
extern unsigned long nr_running(void);
extern unsigned long nr_iowait(void);
extern unsigned long nr_iowait_cpu(int cpu);
+extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
extern unsigned long this_cpu_load(void);


diff --git a/include/linux/tick.h b/include/linux/tick.h
index a004f66..7a66bf6 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -47,7 +47,6 @@ enum tick_nohz_mode {
* @idle_waketime: Time when the idle was interrupted
* @idle_exittime: Time when the idle state was left
* @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
- * @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
* @sleep_length: Duration of the current idle sleep
* @do_timer_lst: CPU was the last one doing do_timer before going idle
*/
@@ -66,7 +65,6 @@ struct tick_sched {
ktime_t idle_waketime;
ktime_t idle_exittime;
ktime_t idle_sleeptime;
- ktime_t iowait_sleeptime;
ktime_t sleep_length;
unsigned long last_jiffies;
unsigned long next_jiffies;
@@ -138,7 +136,6 @@ extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
-extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);

# else /* !CONFIG_NO_HZ_COMMON */
static inline int tick_nohz_tick_stopped(void)
@@ -156,7 +153,6 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
return len;
}
static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
-static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
# endif /* !CONFIG_NO_HZ_COMMON */

#ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c3feeb..b3fa17a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2157,7 +2157,7 @@ unsigned long nr_iowait(void)
unsigned long i, sum = 0;

for_each_possible_cpu(i)
- sum += atomic_read(&cpu_rq(i)->nr_iowait);
+ sum += cpu_rq(i)->nr_iowait;

return sum;
}
@@ -2165,7 +2165,7 @@ unsigned long nr_iowait(void)
unsigned long nr_iowait_cpu(int cpu)
{
struct rq *this = cpu_rq(cpu);
- return atomic_read(&this->nr_iowait);
+ return this->nr_iowait;
}

#ifdef CONFIG_SMP
@@ -4064,6 +4064,59 @@ out_irq:
}
EXPORT_SYMBOL_GPL(yield_to);

+/**
+ * get_cpu_iowait_time_us - get the total iowait time of a cpu
+ * @cpu: CPU number to query
+ * @last_update_time: variable to store update time in. Do not update
+ * counters if NULL.
+ *
+ * Return the cummulative iowait time (since boot) for a given
+ * CPU, in microseconds.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ */
+u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
+{
+ ktime_t iowait, delta = { .tv64 = 0 };
+ struct rq *rq = cpu_rq(cpu);
+ ktime_t now = ktime_get();
+ unsigned int seq;
+
+ do {
+ seq = read_seqbegin(&rq->iowait_lock);
+ if (rq->nr_iowait)
+ delta = ktime_sub(now, rq->iowait_start);
+ iowait = ktime_add(rq->iowait_time, delta);
+ } while (read_seqretry(&rq->iowait_lock, seq));
+
+ if (last_update_time)
+ *last_update_time = ktime_to_us(now);
+
+ return ktime_to_us(iowait);
+}
+EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
+
+static void cpu_iowait_start(struct rq *rq)
+{
+ write_seqlock(&rq->iowait_lock);
+ if (!rq->nr_iowait++)
+ rq->iowait_start = ktime_get();
+ write_sequnlock(&rq->iowait_lock);
+}
+
+static void cpu_iowait_end(struct rq *rq)
+{
+ ktime_t delta;
+ write_seqlock(&rq->iowait_lock);
+ if (!--rq->nr_iowait) {
+ delta = ktime_sub(ktime_get(), rq->iowait_start);
+ rq->iowait_time = ktime_add(rq->iowait_time, delta);
+ }
+ write_sequnlock(&rq->iowait_lock);
+}
+
/*
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
* that process accounting knows that this is a task in IO wait state.
@@ -4073,12 +4126,12 @@ void __sched io_schedule(void)
struct rq *rq = raw_rq();

delayacct_blkio_start();
- atomic_inc(&rq->nr_iowait);
+ cpu_iowait_start(rq);
blk_flush_plug(current);
current->in_iowait = 1;
schedule();
current->in_iowait = 0;
- atomic_dec(&rq->nr_iowait);
+ cpu_iowait_end(rq);
delayacct_blkio_end();
}
EXPORT_SYMBOL(io_schedule);
@@ -4089,12 +4142,12 @@ long __sched io_schedule_timeout(long timeout)
long ret;

delayacct_blkio_start();
- atomic_inc(&rq->nr_iowait);
+ cpu_iowait_start(rq);
blk_flush_plug(current);
current->in_iowait = 1;
ret = schedule_timeout(timeout);
current->in_iowait = 0;
- atomic_dec(&rq->nr_iowait);
+ cpu_iowait_end(rq);
delayacct_blkio_end();
return ret;
}
@@ -6673,7 +6726,8 @@ void __init sched_init(void)
#endif
#endif
init_rq_hrtick(rq);
- atomic_set(&rq->nr_iowait, 0);
+ rq->nr_iowait = 0;
+ seqlock_init(&rq->iowait_lock);
}

set_load_weight(&init_task);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9994791..4ba6cd2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -248,7 +248,7 @@ void account_idle_time(cputime_t cputime)
u64 *cpustat = kcpustat_this_cpu->cpustat;
struct rq *rq = this_rq();

- if (atomic_read(&rq->nr_iowait) > 0)
+ if (rq->nr_iowait > 0)
cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
else
cpustat[CPUTIME_IDLE] += (__force u64) cputime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d69cb32..a385bae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -457,7 +457,10 @@ struct rq {
u64 clock;
u64 clock_task;

- atomic_t nr_iowait;
+ unsigned int nr_iowait;
+ ktime_t iowait_start;
+ ktime_t iowait_time;
+ seqlock_t iowait_lock;

#ifdef CONFIG_SMP
struct root_domain *rd;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index fbcb249..a038d4e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -408,9 +408,6 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)

/* Updates the per cpu time idle statistics counters */
delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
- ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
ts->idle_active = 0;

@@ -463,6 +460,10 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)

iowait = get_cpu_iowait_time_us(cpu, NULL);

+ /*
+ * FIXME: Now that iowait time is accounted also when the CPU is non-idle,
+ * this difference is broken.
+ */
if (ktime_compare(idle, us_to_ktime(iowait)) > 0)
idle = ktime_sub_us(idle, iowait);

@@ -471,43 +472,6 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

-/**
- * get_cpu_iowait_time_us - get the total iowait time of a cpu
- * @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
- *
- * Return the cummulative iowait time (since boot) for a given
- * CPU, in microseconds.
- *
- * This time is measured via accounting rather than sampling,
- * and is as accurate as ktime_get() is.
- *
- * This function returns -1 if NOHZ is not enabled.
- */
-u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
-{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t now, iowait;
-
- if (!tick_nohz_enabled)
- return -1;
-
- now = ktime_get();
- if (last_update_time)
- *last_update_time = ktime_to_us(now);
-
- if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
- iowait = ktime_add(ts->iowait_sleeptime, delta);
- } else {
- iowait = ts->iowait_sleeptime;
- }
-
- return ktime_to_us(iowait);
-}
-EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
-
static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
ktime_t now, int cpu)
{
--
1.8.3.1

2013-10-19 15:18:04

by Frederic Weisbecker

Subject: [PATCH 3/5] timer: Change idle/iowait accounting semantics

To prepare for fixing a race between iowait and idle time stats,
this patch changes the following semantics:

* iowait time is going to be accounted by the scheduler rather than
the dynticks idle code, so let's remove it from the /proc/timer_list dump.

* idle sleeptime now also includes the iowait time for simplicity.
Its accounting was relying on nr_iowait_cpu(), which made the whole
split of accounted time between idle and iowait very racy. So
accounting the whole sleeptime in ts->idle_sleeptime makes it
simpler and improves its correctness.

Given the semantic change of ts->idle_sleeptime, it results in another
ABI change on /proc/timer_list for this field.

/proc/stat and the other callers of get_cpu_idle_time_us() and
get_cpu_iowait_time_us() are not affected though, because we maintain
the old ABI by subtracting the iowait time from the idle time.
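
Concretely, get_cpu_idle_time_us() now finishes with something like this
(condensed from the hunk below; the compare presumably guards against
transient cases where the iowait value, read racily relative to the idle
snapshot, exceeds it):

	iowait = get_cpu_iowait_time_us(cpu, NULL);

	if (ktime_compare(idle, us_to_ktime(iowait)) > 0)
		idle = ktime_sub_us(idle, iowait);

	return ktime_to_us(idle);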

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Fernando Luis Vazquez Cao <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
include/linux/ktime.h | 7 +++++++
kernel/time/tick-sched.c | 12 +++++++++---
kernel/time/timer_list.c | 3 +--
3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index 31c0cd1..7a98f6a 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -379,6 +379,13 @@ static inline ktime_t ns_to_ktime(u64 ns)
return ktime_add_ns(ktime_zero, ns);
}

+static inline ktime_t us_to_ktime(u64 us)
+{
+ static const ktime_t ktime_zero = { .tv64 = 0 };
+
+ return ktime_add_us(ktime_zero, us);
+}
+
static inline ktime_t ms_to_ktime(u64 ms)
{
static const ktime_t ktime_zero = { .tv64 = 0 };
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7f0fb78..fbcb249 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -410,8 +410,8 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
delta = ktime_sub(now, ts->idle_entrytime);
if (nr_iowait_cpu(smp_processor_id()) > 0)
ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
- else
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
ts->idle_active = 0;

sched_clock_idle_wakeup_event(0);
@@ -445,6 +445,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, idle;
+ u64 iowait;

if (!tick_nohz_enabled)
return -1;
@@ -453,13 +454,18 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
if (last_update_time)
*last_update_time = ktime_to_us(now);

- if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+ if (ts->idle_active) {
ktime_t delta = ktime_sub(now, ts->idle_entrytime);
idle = ktime_add(ts->idle_sleeptime, delta);
} else {
idle = ts->idle_sleeptime;
}

+ iowait = get_cpu_iowait_time_us(cpu, NULL);
+
+ if (ktime_compare(idle, us_to_ktime(iowait)) > 0)
+ idle = ktime_sub_us(idle, iowait);
+
return ktime_to_us(idle);

}
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 61ed862..9a3f8e2 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -182,7 +182,6 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
P_ns(idle_waketime);
P_ns(idle_exittime);
P_ns(idle_sleeptime);
- P_ns(iowait_sleeptime);
P(last_jiffies);
P(next_jiffies);
P_ns(idle_expires);
@@ -256,7 +255,7 @@ static void timer_list_show_tickdevices_header(struct seq_file *m)

static inline void timer_list_header(struct seq_file *m, u64 now)
{
- SEQ_printf(m, "Timer List Version: v0.7\n");
+ SEQ_printf(m, "Timer List Version: v0.8\n");
SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
SEQ_printf(m, "\n");
--
1.8.3.1

2013-10-19 15:18:01

by Frederic Weisbecker

Subject: [PATCH 2/5] nohz: Only update sleeptime stats locally

The idle and io sleeptime stats can be updated concurrently from callers
of get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
tick_nohz_stop_idle().

Updaters can easily race and mess up the internal data coherency,
for example when a governor calls a get_cpu_*_time_us() API and the
target CPU exits idle at the same time, because no locking whatsoever
is there to protect against this concurrency.

To fix this, let's only update the sleeptime stats locally when the CPU
exits idle. This is the only place where a real update is truly
needed. The callers of get_cpu_*_time_us() can simply add the pending
sleep time delta to the last sleeptime snapshot in order to get a coherent
result. There is no need for them to also update the stats.

Reported-by: Fernando Luis Vazquez Cao <[email protected]>
Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Fernando Luis Vazquez Cao <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
kernel/time/tick-sched.c | 63 ++++++++++++++++--------------------------------
1 file changed, 21 insertions(+), 42 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b93b5b9..7f0fb78 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -402,31 +402,16 @@ static void tick_nohz_update_jiffies(ktime_t now)
touch_softlockup_watchdog();
}

-/*
- * Updates the per cpu time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
ktime_t delta;

- if (ts->idle_active) {
- delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(cpu) > 0)
- ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
- else
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
- ts->idle_entrytime = now;
- }
-
- if (last_update_time)
- *last_update_time = ktime_to_us(now);
-
-}
-
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
- update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+ /* Updates the per cpu time idle statistics counters */
+ delta = ktime_sub(now, ts->idle_entrytime);
+ if (nr_iowait_cpu(smp_processor_id()) > 0)
+ ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+ else
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
ts->idle_active = 0;

sched_clock_idle_wakeup_event(0);
@@ -465,17 +450,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
return -1;

now = ktime_get();
- if (last_update_time) {
- update_ts_time_stats(cpu, ts, now, last_update_time);
- idle = ts->idle_sleeptime;
- } else {
- if (ts->idle_active && !nr_iowait_cpu(cpu)) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ if (last_update_time)
+ *last_update_time = ktime_to_us(now);

- idle = ktime_add(ts->idle_sleeptime, delta);
- } else {
- idle = ts->idle_sleeptime;
- }
+ if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ idle = ktime_add(ts->idle_sleeptime, delta);
+ } else {
+ idle = ts->idle_sleeptime;
}

return ktime_to_us(idle);
@@ -506,17 +488,14 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
return -1;

now = ktime_get();
- if (last_update_time) {
- update_ts_time_stats(cpu, ts, now, last_update_time);
- iowait = ts->iowait_sleeptime;
- } else {
- if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ if (last_update_time)
+ *last_update_time = ktime_to_us(now);

- iowait = ktime_add(ts->iowait_sleeptime, delta);
- } else {
- iowait = ts->iowait_sleeptime;
- }
+ if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ iowait = ktime_add(ts->iowait_sleeptime, delta);
+ } else {
+ iowait = ts->iowait_sleeptime;
}

return ktime_to_us(iowait);
--
1.8.3.1

2013-10-19 15:35:54

by Peter Zijlstra

Subject: Re: [PATCH 4/5] sched: Refactor iowait accounting

On Sat, Oct 19, 2013 at 05:17:20PM +0200, Frederic Weisbecker wrote:
> +u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
> +{
> + ktime_t iowait, delta = { .tv64 = 0 };
> + struct rq *rq = cpu_rq(cpu);
> + ktime_t now = ktime_get();
> + unsigned int seq;
> +
> + do {
> + seq = read_seqbegin(&rq->iowait_lock);
> + if (rq->nr_iowait)
> + delta = ktime_sub(now, rq->iowait_start);
> + iowait = ktime_add(rq->iowait_time, delta);
> + } while (read_seqretry(&rq->iowait_lock, seq));
> +
> + if (last_update_time)
> + *last_update_time = ktime_to_us(now);
> +
> + return ktime_to_us(iowait);
> +}
> +EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
> +
> +static void cpu_iowait_start(struct rq *rq)
> +{
> + write_seqlock(&rq->iowait_lock);
> + if (!rq->nr_iowait++)
> + rq->iowait_start = ktime_get();
> + write_sequnlock(&rq->iowait_lock);
> +}
> +
> +static void cpu_iowait_end(struct rq *rq)
> +{
> + ktime_t delta;
> + write_seqlock(&rq->iowait_lock);
> + if (!--rq->nr_iowait) {
> + delta = ktime_sub(ktime_get(), rq->iowait_start);
> + rq->iowait_time = ktime_add(rq->iowait_time, delta);
> + }
> + write_sequnlock(&rq->iowait_lock);
> +}

Yeah, so using ktime_get() for this is completely insane ;-)

I just had a look at delayacct; wth wrote that crap; that too uses
gtod.

2013-10-19 16:02:42

by Frederic Weisbecker

Subject: Re: [PATCH 4/5] sched: Refactor iowait accounting

On Sat, Oct 19, 2013 at 05:35:35PM +0200, Peter Zijlstra wrote:
> On Sat, Oct 19, 2013 at 05:17:20PM +0200, Frederic Weisbecker wrote:
> > +u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
> > +{
> > + ktime_t iowait, delta = { .tv64 = 0 };
> > + struct rq *rq = cpu_rq(cpu);
> > + ktime_t now = ktime_get();
> > + unsigned int seq;
> > +
> > + do {
> > + seq = read_seqbegin(&rq->iowait_lock);
> > + if (rq->nr_iowait)
> > + delta = ktime_sub(now, rq->iowait_start);
> > + iowait = ktime_add(rq->iowait_time, delta);
> > + } while (read_seqretry(&rq->iowait_lock, seq));
> > +
> > + if (last_update_time)
> > + *last_update_time = ktime_to_us(now);
> > +
> > + return ktime_to_us(iowait);
> > +}
> > +EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
> > +
> > +static void cpu_iowait_start(struct rq *rq)
> > +{
> > + write_seqlock(&rq->iowait_lock);
> > + if (!rq->nr_iowait++)
> > + rq->iowait_start = ktime_get();
> > + write_sequnlock(&rq->iowait_lock);
> > +}
> > +
> > +static void cpu_iowait_end(struct rq *rq)
> > +{
> > + ktime_t delta;
> > + write_seqlock(&rq->iowait_lock);
> > + if (!--rq->nr_iowait) {
> > + delta = ktime_sub(ktime_get(), rq->iowait_start);
> > + rq->iowait_time = ktime_add(rq->iowait_time, delta);
> > + }
> > + write_sequnlock(&rq->iowait_lock);
> > +}
>
> Yeah, so using ktime_get() for this is completely insane ;-)

Woops, yeah I clearly overlooked that while moving the code :)
local_clock() should be fine.
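
Something like this, perhaps (an untested sketch, assuming rq->iowait_start
and rq->iowait_time are turned into plain u64 nanosecond counters):

	static void cpu_iowait_start(struct rq *rq)
	{
		write_seqlock(&rq->iowait_lock);
		if (!rq->nr_iowait++)
			rq->iowait_start = local_clock();
		write_sequnlock(&rq->iowait_lock);
	}

	static void cpu_iowait_end(struct rq *rq)
	{
		write_seqlock(&rq->iowait_lock);
		if (!--rq->nr_iowait)
			rq->iowait_time += local_clock() - rq->iowait_start;
		write_sequnlock(&rq->iowait_lock);
	}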

>
> I just had a look at delayacct; wth wrote that crap; that too uses
> gtod.

I can't find where it does that. kernel/delayacct.c doesn't seem to at least.

2013-10-19 16:05:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/5] sched: Refactor iowait accounting

On Sat, Oct 19, 2013 at 06:02:37PM +0200, Frederic Weisbecker wrote:
> > I just had a look at delayacct; wth wrote that crap; that too uses
> > gtod.
>
> I can't find where it does that. kernel/delayacct.c doesn't seem to at least.

Look for do_posix_clock_monotonic_gettime() ;-)

2013-10-19 16:20:18

by Frederic Weisbecker

Subject: Re: [PATCH 4/5] sched: Refactor iowait accounting

On Sat, Oct 19, 2013 at 06:05:17PM +0200, Peter Zijlstra wrote:
> On Sat, Oct 19, 2013 at 06:02:37PM +0200, Frederic Weisbecker wrote:
> > > I just had a look at delayacct; wth wrote that crap; that too uses
> > > gtod.
> >
> > I can't find where it does that. kernel/delayacct.c doesn't seem to at least.
>
> Look for do_posix_clock_monotonic_gettime() ;-)

Oh. Well that's good news, it appears to be a nice low-hanging fruit :)

Also perhaps we could make it use the same timestamp as cpu_iowait_start/end.

2013-10-20 07:34:59

by Andreas Mohr

Subject: Re: [PATCH 3/5] timer: Change idle/iowait accounting semantics

Hi,

just wanted to report that this capricious open-coded (ok, lone-coded :)
converter:

+static inline ktime_t us_to_ktime(u64 us)
+{
+ static const ktime_t ktime_zero = { .tv64 = 0 };
+
+ return ktime_add_us(ktime_zero, us);
+}

triggered a virtual red flag in my processing.

We obviously seem to be compensating for a domain mismatch here:

+ iowait = get_cpu_iowait_time_us(cpu, NULL);
+
+ if (ktime_compare(idle, us_to_ktime(iowait)) > 0)
+ idle = ktime_sub_us(idle, iowait);
+

And sure enough:
get_cpu_iowait_time_us() does a final to-us conversion
only right before the result is returned,
and it's located right within the same build unit (tick-sched.c).
(and then we go ahead and do a from-us round trip :-P)

So, in case we want to standardize on the ktime_t domain
for these parts of purely *in-kernel* time handling,
how about adding a _unit-local_ helper that provides a ktime-only result,
and converting get_cpu_iowait_time_us() into a simple one-line to-us
wrapper around it for *external user code*?
(and include updates to all other places which would benefit from this change)

And preferably launch such a patch as a preparatory patch for this
subsequent 3/5 patch, within the series?
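
For illustration, the split might look like this (an untested sketch against
the 3/5 version in tick-sched.c; get_cpu_iowait_time() is a made-up name and
the tick_nohz_enabled check is omitted for brevity):

	/* unit-local helper, stays in the ktime_t domain */
	static ktime_t get_cpu_iowait_time(int cpu, ktime_t now)
	{
		struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

		if (ts->idle_active && nr_iowait_cpu(cpu) > 0)
			return ktime_add(ts->iowait_sleeptime,
					 ktime_sub(now, ts->idle_entrytime));
		return ts->iowait_sleeptime;
	}

	/* external users keep the microseconds ABI via a thin wrapper */
	u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
	{
		ktime_t now = ktime_get();

		if (last_update_time)
			*last_update_time = ktime_to_us(now);
		return ktime_to_us(get_cpu_iowait_time(cpu, now));
	}

ktime_t-domain callers such as get_cpu_idle_time_us() could then use the
helper directly and skip the from-us round trip entirely.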

Unless I missed some other restrictions which cause us to need to cling
to that manual from-us conversion, for now or forever...

Andreas Mohr

2013-10-20 11:10:09

by Andreas Mohr

Subject: Re: [PATCH 4/5] sched: Refactor iowait accounting

Hi,


+u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
+{
+ ktime_t iowait, delta = { .tv64 = 0 };
+ struct rq *rq = cpu_rq(cpu);
+ ktime_t now = ktime_get();
+ unsigned int seq;
+
+ do {
+ seq = read_seqbegin(&rq->iowait_lock);
+ if (rq->nr_iowait)
+ delta = ktime_sub(now, rq->iowait_start);
+ iowait = ktime_add(rq->iowait_time, delta);
+ } while (read_seqretry(&rq->iowait_lock, seq));


AFAICS that's slightly buggy, in light of:


+static void cpu_iowait_end(struct rq *rq)
+{
+ ktime_t delta;
+ write_seqlock(&rq->iowait_lock);
+ if (!--rq->nr_iowait) {
+ delta = ktime_sub(ktime_get(), rq->iowait_start);
+ rq->iowait_time = ktime_add(rq->iowait_time, delta);
+ }
+ write_sequnlock(&rq->iowait_lock);
+}

get_cpu_iowait_time_us() loops until the update is consistent,
yet its "delta" will have been assigned on a previous pass
(*and potentially not updated on retry*): a concurrent cpu_iowait_end()
provides a final consistent update by folding the current delta
[destructively] into that very iowait_time base value,
while the reader's delta value remains stale
(iff nr_iowait is now back to zero!).

IOW, the iowait_time base value has already been updated (and is re-read),
yet the *old* delta is still being added on top of this *new* base value.
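
A sketch of a possible fix: reset delta on every pass through the retry
loop, e.g.:

	do {
		seq = read_seqbegin(&rq->iowait_lock);
		/* don't carry a stale delta over from a failed pass */
		delta = ktime_set(0, 0);
		if (rq->nr_iowait)
			delta = ktime_sub(now, rq->iowait_start);
		iowait = ktime_add(rq->iowait_time, delta);
	} while (read_seqretry(&rq->iowait_lock, seq));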



Further thoughts:

Janitorial: cpu_iowait_end(): might be useful to move ktime_t delta
into the narrower if () scope.

It would be nice to move the inner handling of get_cpu_iowait_time_us()
into an rq-focused, properly named helper function (to reflect the fact that
while this data is stored in the rq, it's merely being used to derive
CPU-side status values), but it seems "ktime_t now" would then have to be
grabbed multiple times, which would be a complication.

In case of high update traffic on the parts guarded by rq->iowait_lock
(is that a relevant case?),
it might be useful to merely grab all the relevant values into helper vars
(i.e., establish a consistent "view" on things): now, start, nr_iowait etc.
This would enable us to do the ktime_sub()/ktime_add() calculations
*once*, after the fact. The drawback is that this reduces the seqlock
guard scope (it would no longer be active during the time spent on the
calculations, i.e. another update may happen during that calculation time),
but then that function's purpose merely is to provide a *consistent
one-time probe* of a continually updated value anyway, so it does not
matter if we happen to return values one update older
than what is already available. See the sketch below.
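
Something along these lines (an untested sketch of that idea):

	unsigned int seq, nr;
	ktime_t start, time;

	do {
		seq = read_seqbegin(&rq->iowait_lock);
		nr = rq->nr_iowait;
		start = rq->iowait_start;
		time = rq->iowait_time;
	} while (read_seqretry(&rq->iowait_lock, seq));

	/* compute once, after the fact, from a consistent snapshot */
	iowait = nr ? ktime_add(time, ktime_sub(now, start)) : time;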


Thank you very much for working on improving this important infrastructure code!

Andreas Mohr