2014-04-02 20:19:33

by Denys Vlasenko

Subject: [PATCH 1/2] nohz: Only update sleeptime stats locally

The idle and io sleeptime stats can be updated concurrently from callers
of get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
tick_nohz_stop_idle().

Updaters can easily race and corrupt the coherency of the internal data,
for example when a governor calls a get_cpu_*_time_us() API and the
target CPU exits idle at the same time, because there is no locking
whatsoever to protect against this concurrency.

To fix this, let's only update the sleeptime stats locally when the CPU
exits from idle. This is the only place where a real update is truly
needed. The callers of get_cpu_*_time_us() can simply add up the pending
sleep time delta to the last sleeptime snapshot in order to get a coherent
result. There is no need for them to also update the stats.
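
As an illustration only (not part of the hunks below), the reader-side
computation described above boils down to roughly the following sketch.
The helper name is invented for the example; the tick_sched fields,
nr_iowait_cpu() and the ktime helpers are the existing ones:

	/*
	 * Sketch: take the last sleeptime snapshot and, if the CPU is
	 * still in a plain idle period, add the pending delta. Nothing
	 * is written back to ts; only the local copy is adjusted.
	 */
	static u64 idle_snapshot_us(struct tick_sched *ts, int cpu, ktime_t now)
	{
		ktime_t idle = ts->idle_sleeptime;

		if (ts->idle_active && !nr_iowait_cpu(cpu))
			idle = ktime_add(idle, ktime_sub(now, ts->idle_entrytime));

		return ktime_to_us(idle);
	}

The hunks below implement exactly this in get_cpu_idle_time_us() and
its iowait counterpart.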

Patch by Frederic Weisbecker, rebased to Linus' tree by Denys Vlasenko.

Reported-by: Fernando Luis Vazquez Cao <[email protected]>
Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Hidetoshi Seto <[email protected]>
Cc: Fernando Luis Vazquez Cao <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
kernel/time/tick-sched.c | 63 ++++++++++++++++--------------------------------
1 file changed, 21 insertions(+), 42 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..73ced0c4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -403,31 +403,16 @@ static void tick_nohz_update_jiffies(ktime_t now)
touch_softlockup_watchdog();
}

-/*
- * Updates the per cpu time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
ktime_t delta;

- if (ts->idle_active) {
- delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(cpu) > 0)
- ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
- else
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
- ts->idle_entrytime = now;
- }
-
- if (last_update_time)
- *last_update_time = ktime_to_us(now);
-
-}
-
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
- update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+ /* Updates the per cpu time idle statistics counters */
+ delta = ktime_sub(now, ts->idle_entrytime);
+ if (nr_iowait_cpu(smp_processor_id()) > 0)
+ ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+ else
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
ts->idle_active = 0;

sched_clock_idle_wakeup_event(0);
@@ -466,17 +451,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
return -1;

now = ktime_get();
- if (last_update_time) {
- update_ts_time_stats(cpu, ts, now, last_update_time);
- idle = ts->idle_sleeptime;
- } else {
- if (ts->idle_active && !nr_iowait_cpu(cpu)) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ if (last_update_time)
+ *last_update_time = ktime_to_us(now);

- idle = ktime_add(ts->idle_sleeptime, delta);
- } else {
- idle = ts->idle_sleeptime;
- }
+ if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ idle = ktime_add(ts->idle_sleeptime, delta);
+ } else {
+ idle = ts->idle_sleeptime;
}

return ktime_to_us(idle);
@@ -507,17 +489,14 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
return -1;

now = ktime_get();
- if (last_update_time) {
- update_ts_time_stats(cpu, ts, now, last_update_time);
- iowait = ts->iowait_sleeptime;
- } else {
- if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ if (last_update_time)
+ *last_update_time = ktime_to_us(now);

- iowait = ktime_add(ts->iowait_sleeptime, delta);
- } else {
- iowait = ts->iowait_sleeptime;
- }
+ if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ iowait = ktime_add(ts->iowait_sleeptime, delta);
+ } else {
+ iowait = ts->iowait_sleeptime;
}

return ktime_to_us(iowait);
--
1.8.1.4


2014-04-02 20:19:04

by Denys Vlasenko

Subject: [PATCH 2/2] nohz: Synchronize sleep time stats with memory barriers

When some call site uses get_cpu_*_time_us() to read a sleeptime
stat, it deduces the total sleeptime by adding the pending time
to the last sleeptime snapshot if the target CPU is idle.

But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated in a disciplined order.

This patch changes updaters to modify idle_entrytime,
{idle,iowait}_sleeptime, and idle_active exactly in this order,
with write barriers on SMP to ensure other CPUs see them in this order too.
Readers are changed to read them in the opposite order, with read barriers.
When readers detect a race by seeing cleared idle_entrytime,
they retry the reads.

The "iowait-ness" of every idle period is decided at the moment it starts:
if this CPU's run-queue had tasks waiting on I/O, then this idle
period's duration will be added to iowait_sleeptime.
This, along with proper SMP synchronization, fixes the bug where iowait
counts could go backwards.
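
For illustration only (the real hunks follow), the update/read
discipline described above amounts to roughly this sketch. The helper
names are invented for the example; the fields, the idle_active values
(1 = plain idle, 2 = iowait idle) and the barriers are the ones used in
the patch:

	/* Writer side, on idle exit (cf. tick_nohz_stop_idle()): */
	static void sleeptime_update_sketch(struct tick_sched *ts, ktime_t delta)
	{
		ts->idle_entrytime.tv64 = 0;	/* flag: update in progress */
		smp_wmb();			/* pairs with the reader's second rmb */
		if (ts->idle_active == 2)	/* idle period started with iowait */
			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
		else
			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
		smp_wmb();			/* pairs with the reader's first rmb */
		ts->idle_active = 0;
	}

	/* Reader side (cf. get_cpu_idle_time_us()): */
	static u64 sleeptime_read_sketch(struct tick_sched *ts, ktime_t now)
	{
		ktime_t count, start;
		int active;
	again:
		active = ACCESS_ONCE(ts->idle_active);
		smp_rmb();
		count = ACCESS_ONCE(ts->idle_sleeptime);
		if (active == 1) {
			smp_rmb();
			start = ACCESS_ONCE(ts->idle_entrytime);
			if (start.tv64 == 0) {
				/* An updater is mid-flight, retry. */
				goto again;
			}
			count = ktime_add(count, ktime_sub(now, start));
		}
		return ktime_to_us(count);
	}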

Signed-off-by: Denys Vlasenko <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Hidetoshi Seto <[email protected]>
Cc: Fernando Luis Vazquez Cao <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
kernel/time/tick-sched.c | 79 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 62 insertions(+), 17 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 73ced0c4..ed0c1bd 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -409,10 +409,15 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)

/* Updates the per cpu time idle statistics counters */
delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
+ ts->idle_entrytime.tv64 = 0;
+ smp_wmb();
+
+ if (ts->idle_active == 2)
ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
else
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+
+ smp_wmb();
ts->idle_active = 0;

sched_clock_idle_wakeup_event(0);
@@ -423,7 +428,8 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
ktime_t now = ktime_get();

ts->idle_entrytime = now;
- ts->idle_active = 1;
+ smp_wmb();
+ ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
sched_clock_idle_sleep_event();
return now;
}
@@ -444,25 +450,44 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
*/
u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t now, idle;
+ struct tick_sched *ts;
+ ktime_t now, count;
+ int active;

if (!tick_nohz_active)
return -1;

+ ts = &per_cpu(tick_cpu_sched, cpu);
+
now = ktime_get();
if (last_update_time)
*last_update_time = ktime_to_us(now);

- if (ts->idle_active && !nr_iowait_cpu(cpu)) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
- idle = ktime_add(ts->idle_sleeptime, delta);
+ again:
+ active = ACCESS_ONCE(ts->idle_active);
+ smp_rmb();
+ count = ACCESS_ONCE(ts->idle_sleeptime);
+ if (active == 1) {
+ ktime_t delta, start;
+
+ smp_rmb();
+ start = ACCESS_ONCE(ts->idle_entrytime);
+ if (start.tv64 == 0)
+ /* Other CPU is updating the count.
+ * We don't know whether fetched count is valid.
+ */
+ goto again;
+
+ delta = ktime_sub(now, start);
+ count = ktime_add(count, delta);
} else {
- idle = ts->idle_sleeptime;
+ /* Possible concurrent tick_nohz_stop_idle() already
+ * cleared idle_active. We fetched count *after*
+ * we fetched idle_active, so count must be valid.
+ */
}

- return ktime_to_us(idle);
-
+ return ktime_to_us(count);
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

@@ -482,24 +507,44 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
*/
u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t now, iowait;
+ struct tick_sched *ts;
+ ktime_t now, count;
+ int active;

if (!tick_nohz_active)
return -1;

+ ts = &per_cpu(tick_cpu_sched, cpu);
+
now = ktime_get();
if (last_update_time)
*last_update_time = ktime_to_us(now);

- if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
- iowait = ktime_add(ts->iowait_sleeptime, delta);
+ again:
+ active = ACCESS_ONCE(ts->idle_active);
+ smp_rmb();
+ count = ACCESS_ONCE(ts->iowait_sleeptime);
+ if (active == 2) {
+ ktime_t delta, start;
+
+ smp_rmb();
+ start = ACCESS_ONCE(ts->idle_entrytime);
+ if (start.tv64 == 0)
+ /* Other CPU is updating the count.
+ * We don't know whether fetched count is valid.
+ */
+ goto again;
+
+ delta = ktime_sub(now, start);
+ count = ktime_add(count, delta);
} else {
- iowait = ts->iowait_sleeptime;
+ /* Possible concurrent tick_nohz_stop_idle() already
+ * cleared idle_active. We fetched count *after*
+ * we fetched idle_active, so count must be valid.
+ */
}

- return ktime_to_us(iowait);
+ return ktime_to_us(count);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);

--
1.8.1.4

2014-04-15 10:51:57

by Peter Zijlstra

Subject: Re: [PATCH 2/2] nohz: Synchronize sleep time stats with memory barriers

On Wed, Apr 02, 2014 at 10:17:52PM +0200, Denys Vlasenko wrote:
> When some call site uses get_cpu_*_time_us() to read a sleeptime
> stat, it deduces the total sleeptime by adding the pending time
> to the last sleeptime snapshot if the target CPU is idle.
>
> But this only works if idle_sleeptime, idle_entrytime and idle_active are
> read and updated in a disciplined order.
>
> This patch changes updaters to modify idle_entrytime,
> {idle,iowait}_sleeptime, and idle_active exactly in this order,
> with write barriers on SMP to ensure other CPUs see them in this order too.
> Readers are changed to read them in the opposite order, with read barriers.
> When readers detect a race by seeing cleared idle_entrytime,
> they retry the reads.
>
> The "iowait-ness" of every idle period is decided at the moment it starts:
> if this CPU's run-queue had tasks waiting on I/O, then this idle
> period's duration will be added to iowait_sleeptime.
> This, along with proper SMP synchronization, fixes the bug where iowait
> counts could go backwards.

It also makes for a near infinite source of iowait. Who is to say the
CPU that started with iowait will ever wake up? The nohz sleep time is
practically unbounded.

> ---
> kernel/time/tick-sched.c | 79 +++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 62 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 73ced0c4..ed0c1bd 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -409,10 +409,15 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
>
> /* Updates the per cpu time idle statistics counters */
> delta = ktime_sub(now, ts->idle_entrytime);
> + ts->idle_entrytime.tv64 = 0;

One must at all times describe the memory ordering and pairing barrier
in a comment when placing barriers.

> + smp_wmb();
> +
> + if (ts->idle_active == 2)
> ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> else
> ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +
> + smp_wmb();
> ts->idle_active = 0;
>
> sched_clock_idle_wakeup_event(0);
> @@ -423,7 +428,8 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
> ktime_t now = ktime_get();
>
> ts->idle_entrytime = now;
> + smp_wmb();
> + ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
> sched_clock_idle_sleep_event();
> return now;
> }
> @@ -444,25 +450,44 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
> */
> u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
> {
> + struct tick_sched *ts;
> + ktime_t now, count;
> + int active;
>
> if (!tick_nohz_active)
> return -1;
>
> + ts = &per_cpu(tick_cpu_sched, cpu);
> +
> now = ktime_get();
> if (last_update_time)
> *last_update_time = ktime_to_us(now);
>
> + again:
> + active = ACCESS_ONCE(ts->idle_active);
> + smp_rmb();
> + count = ACCESS_ONCE(ts->idle_sleeptime);
> + if (active == 1) {
> + ktime_t delta, start;
> +
> + smp_rmb();
> + start = ACCESS_ONCE(ts->idle_entrytime);
> + if (start.tv64 == 0)
> + /* Other CPU is updating the count.
> + * We don't know whether fetched count is valid.
> + */
> + goto again;

This is double wrong; any multi line stmt (even if it's a single stmt)
should have {}. Also, wrong multi line comment style.

> +
> + delta = ktime_sub(now, start);
> + count = ktime_add(count, delta);
> } else {
> + /* Possible concurrent tick_nohz_stop_idle() already
> + * cleared idle_active. We fetched count *after*
> + * we fetched idle_active, so count must be valid.
> + */

Wrong comment style again.

> }
>
> + return ktime_to_us(count);
> }
> EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
>
> @@ -482,24 +507,44 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
> */
> u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
> {
> + struct tick_sched *ts;
> + ktime_t now, count;
> + int active;
>
> if (!tick_nohz_active)
> return -1;
>
> + ts = &per_cpu(tick_cpu_sched, cpu);
> +
> now = ktime_get();
> if (last_update_time)
> *last_update_time = ktime_to_us(now);
>
> + again:
> + active = ACCESS_ONCE(ts->idle_active);
> + smp_rmb();
> + count = ACCESS_ONCE(ts->iowait_sleeptime);
> + if (active == 2) {
> + ktime_t delta, start;
> +
> + smp_rmb();
> + start = ACCESS_ONCE(ts->idle_entrytime);
> + if (start.tv64 == 0)
> + /* Other CPU is updating the count.
> + * We don't know whether fetched count is valid.
> + */
> + goto again;
> +
> + delta = ktime_sub(now, start);
> + count = ktime_add(count, delta);
> } else {
> + /* Possible concurrent tick_nohz_stop_idle() already
> + * cleared idle_active. We fetched count *after*
> + * we fetched idle_active, so count must be valid.
> + */
> }

You're not nearly lazy enough; this is a near identical copy of the
above, two nearly identical copies of 'tricky' code is a bad idea.

>
> + return ktime_to_us(count);
> }
> EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);


So the proposed ordering is something like:

[w] ->idle_entrytime = 0
WMB /* k0 <-> r1 */
[w] ->*_sleeptime += delta
WMB /* k1 <-> r0 */
[w] ->idle_active = 0

VS

[w] ->idle_entrytime = now
WMB /* s0 <-> r0 */
[w] ->idle_active = 1 + !!nr_iowait

VS

[r] active = ->idle_active
RMB /* r0 <-> s0, k1 */
[r] count = ->idle_sleeptime
RMB /* r1 <-> k0 */
[r] start = ->idle_entrytime


Which on first reading seems sound enough; but needs more words to
describe why.

That said; I think the fundamental flaw in the entire thing is
accounting the entire nohz idle period as one type, since the period is
basically unbounded.

2014-04-15 11:14:27

by Peter Zijlstra

Subject: Re: [PATCH 2/2] nohz: Synchronize sleep time stats with memory barriers

On Tue, Apr 15, 2014 at 12:51:38PM +0200, Peter Zijlstra wrote:
> That said; I think the fundamental flaw in the entire thing is
> accounting the entire nohz idle period as one type, since the period is
> basically unbounded.

So basically anything with the current per-cpu nr_iowait is bound to
yield crack induced results.

Like you said in your other email, we need to somehow intersect actual
idle time with the presence of iowait tasks. Now that's a global problem
and is unlikely to scale well. Which is of course why we have this
problem to begin with, if it were easy it'd be done right to begin with
(optimistic world view here).

2014-04-23 19:00:04

by Denys Vlasenko

Subject: Re: [PATCH 2/2] nohz: Synchronize sleep time stats with memory barriers

On Tue, Apr 15, 2014 at 12:51 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 02, 2014 at 10:17:52PM +0200, Denys Vlasenko wrote:
>> When some call site uses get_cpu_*_time_us() to read a sleeptime
>> stat, it deduces the total sleeptime by adding the pending time
>> to the last sleeptime snapshot if the target CPU is idle.
>>
>> But this only works if idle_sleeptime, idle_entrytime and idle_active are
>> read and updated in a disciplined order.
>>
>> This patch changes updaters to modify idle_entrytime,
>> {idle,iowait}_sleeptime, and idle_active exactly in this order,
>> with write barriers on SMP to ensure other CPUs see them in this order too.
>> Readers are changed to read them in the opposite order, with read barriers.
>> When readers detect a race by seeing cleared idle_entrytime,
>> they retry the reads.
>>
>> The "iowait-ness" of every idle period is decided at the moment it starts:
>> if this CPU's run-queue had tasks waiting on I/O, then this idle
>> period's duration will be added to iowait_sleeptime.
>> This, along with proper SMP synchronization, fixes the bug where iowait
>> counts could go backwards.
>
> It also makes for a near infinite source of iowait. Who is to say the
> CPU that started with iowait will ever wake up? The nohz sleep time is
> practically unbounded.

Yes, it does that.
I prepared a new patchset which fixes that.

It also addresses your other points:

>> /* Updates the per cpu time idle statistics counters */
>> delta = ktime_sub(now, ts->idle_entrytime);
>> + ts->idle_entrytime.tv64 = 0;
>
> One must at all times describe the memory ordering and pairing barrier
> in a comment when placing barriers.

Done.

>> + if (start.tv64 == 0)
>> + /* Other CPU is updating the count.
>> + * We don't know whether fetched count is valid.
>> + */
>> + goto again;
>
> This is double wrong; any multi line stmt (even if it's a single stmt)
> should have {}. Also, wrong multi line comment style.

Fixed.

> You're not nearly lazy enough; this is a near identical copy of the
> above, two nearly identical copies of 'tricky' code is a bad idea.

True. I was planning to address that, but new logic added
to avoid overcounting iowait made these two functions
not as similar as they used to be.

Please review a newer patchset, I'll send it in a minute.