2023-02-20 12:41:44

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 0/7] timers/nohz: Fixes and cleanups v2

Try to (partially) fix the issue reported in https://lore.kernel.org/lkml/[email protected]/

Changes since v1:

* Fix compute_delta left unused (thanks Hillf)
* Pass directly struct tick_sched to get_cpu_sleep_time_us()
* Add Peterz ack
* Remove selftests with wrong assertions

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/core

HEAD: 0b86fc6ed10742ec8e54cd4d495c1ce9d4c1b4e0

Thanks,
Frederic
---

Frederic Weisbecker (7):
timers/nohz: Restructure and reshuffle struct tick_sched
timers/nohz: Only ever update sleeptime from idle exit
timers/nohz: Protect idle/iowait sleep time under seqcount
timers/nohz: Add a comment about broken iowait counter update race
timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
MAINTAINERS: Remove stale email address
selftests/proc: Remove idle time monotonicity assertions


MAINTAINERS | 2 +-
kernel/time/tick-sched.c | 135 ++++++++++++-------------
kernel/time/tick-sched.h | 67 +++++++-----
tools/testing/selftests/proc/.gitignore | 2 -
tools/testing/selftests/proc/Makefile | 2 -
tools/testing/selftests/proc/proc-uptime-001.c | 45 ---------
tools/testing/selftests/proc/proc-uptime-002.c | 79 ---------------
tools/testing/selftests/proc/proc-uptime.h | 60 -----------
8 files changed, 106 insertions(+), 286 deletions(-)
---
git range-diff since v1

1: 899f01a80e5b ! 1: 0e7aede86812 timers/nohz: Restructure and reshuffle struct tick_sched
@@ Commit message
@last_jiffies and @idle_expires.

Reported-by: Thomas Gleixner <[email protected]>
+ Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
- Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>

## kernel/time/tick-sched.h ##
2: 4ea8a8e98eb0 ! 2: 8e8a18cee3d5 timers/nohz: Only ever update sleeptime from idle exit
@@ Commit message
reader VS writer races to handle. A subsequent patch will fix one.

Reported-by: Yu Liao <[email protected]>
+ Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
@@ Commit message
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
- Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>

## kernel/time/tick-sched.c ##
@@ kernel/time/tick-sched.c: static void tick_nohz_start_idle(struct tick_sched *ts
sched_clock_idle_sleep_event();
}

-+static u64 get_cpu_sleep_time_us(int cpu, ktime_t *sleeptime,
++static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
+ bool compute_delta, u64 *last_update_time)
+{
-+ struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ ktime_t now, idle;
+
+ if (!tick_nohz_active)
@@ kernel/time/tick-sched.c: static void tick_nohz_start_idle(struct tick_sched *ts
+ if (last_update_time)
+ *last_update_time = ktime_to_us(now);
+
-+ if (ts->idle_active && !nr_iowait_cpu(cpu)) {
++ if (ts->idle_active && compute_delta) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+
+ idle = ktime_add(*sleeptime, delta);
@@ kernel/time/tick-sched.c: static void tick_nohz_start_idle(struct tick_sched *ts
-
- return ktime_to_us(idle);

-+ return get_cpu_sleep_time_us(cpu, &ts->idle_sleeptime,
++ return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
+ !nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
@@ kernel/time/tick-sched.c: EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
- }

- return ktime_to_us(iowait);
-+ return get_cpu_sleep_time_us(cpu, &ts->iowait_sleeptime,
++ return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
+ nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
3: 131b09a345c6 ! 3: 61b56e1d6c33 timers/nohz: Protect idle/iowait sleep time under seqcount
@@ Commit message
can hardly be fixed.

Reported-by: Yu Liao <[email protected]>
+ Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
@@ Commit message
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
- Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>

## kernel/time/tick-sched.c ##
@@ kernel/time/tick-sched.c: static void tick_nohz_stop_idle(struct tick_sched *ts,
sched_clock_idle_sleep_event();
}

-@@ kernel/time/tick-sched.c: static u64 get_cpu_sleep_time_us(int cpu, ktime_t *sleeptime,
+@@ kernel/time/tick-sched.c: static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
+ bool compute_delta, u64 *last_update_time)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, idle;
+ unsigned int seq;

if (!tick_nohz_active)
return -1;
-@@ kernel/time/tick-sched.c: static u64 get_cpu_sleep_time_us(int cpu, ktime_t *sleeptime,
+@@ kernel/time/tick-sched.c: static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
if (last_update_time)
*last_update_time = ktime_to_us(now);

-- if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+- if (ts->idle_active && compute_delta) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ do {
+ seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
@@ kernel/time/tick-sched.c: static u64 get_cpu_sleep_time_us(int cpu, ktime_t *sle
- } else {
- idle = *sleeptime;
- }
-+ if (ts->idle_active && !nr_iowait_cpu(cpu)) {
++ if (ts->idle_active && compute_delta) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+
+ idle = ktime_add(*sleeptime, delta);
4: 4ff478886c2c ! 4: 9147cd64f3ba timers/nohz: Add a comment about broken iowait counter update race
@@ Commit message
This is unfortunately hardly fixable. Just add a comment about that
condition.

+ Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
@@ Commit message
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
- Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>

## kernel/time/tick-sched.c ##
-@@ kernel/time/tick-sched.c: static u64 get_cpu_sleep_time_us(int cpu, ktime_t *sleeptime,
+@@ kernel/time/tick-sched.c: static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
* counters if NULL.
*
* Return the cumulative idle time (since boot) for a given
5: 6eb31238e057 ! 5: 4863214a905f timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
@@ Commit message
tick_nohz_idle_stop_tick() and its implementation. Remove that
unnecessary step.

+ Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
@@ Commit message
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
- Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>

## kernel/time/tick-sched.c ##
6: d1dc3edc39a7 ! 6: 170828be3e96 MAINTAINERS: Remove stale email address
@@ Metadata
## Commit message ##
MAINTAINERS: Remove stale email address

+ Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
@@ Commit message
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
- Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>

## MAINTAINERS ##
-: ------------ > 7: 0b86fc6ed107 selftests/proc: Remove idle time monotonicity assertions




2023-02-20 12:41:51

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 1/7] timers/nohz: Restructure and reshuffle struct tick_sched

Restructure and group fields by access in order to optimize cache
layout. While at it, also add missing kernel doc for two fields:
@last_jiffies and @idle_expires.

Reported-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.h | 66 +++++++++++++++++++++++++---------------
1 file changed, 41 insertions(+), 25 deletions(-)

diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index 504649513399..c6663254d17d 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -22,65 +22,81 @@ enum tick_nohz_mode {

/**
* struct tick_sched - sched tick emulation and no idle tick control/stats
- * @sched_timer: hrtimer to schedule the periodic tick in high
- * resolution mode
- * @check_clocks: Notification mechanism about clocksource changes
- * @nohz_mode: Mode - one state of tick_nohz_mode
+ *
* @inidle: Indicator that the CPU is in the tick idle mode
* @tick_stopped: Indicator that the idle tick has been stopped
* @idle_active: Indicator that the CPU is actively in the tick idle mode;
* it is reset during irq handling phases.
- * @do_timer_lst: CPU was the last one doing do_timer before going idle
+ * @do_timer_last: CPU was the last one doing do_timer before going idle
* @got_idle_tick: Tick timer function has run with @inidle set
+ * @stalled_jiffies: Number of stalled jiffies detected across ticks
+ * @last_tick_jiffies: Value of jiffies seen on last tick
+ * @sched_timer: hrtimer to schedule the periodic tick in high
+ * resolution mode
* @last_tick: Store the last tick expiry time when the tick
* timer is modified for nohz sleeps. This is necessary
* to resume the tick timer operation in the timeline
* when the CPU returns from nohz sleep.
* @next_tick: Next tick to be fired when in dynticks mode.
* @idle_jiffies: jiffies at the entry to idle for idle time accounting
+ * @idle_waketime: Time when the idle was interrupted
+ * @idle_entrytime: Time when the idle call was entered
+ * @nohz_mode: Mode - one state of tick_nohz_mode
+ * @last_jiffies: Base jiffies snapshot when next event was last computed
+ * @timer_expires_base: Base time clock monotonic for @timer_expires
+ * @timer_expires: Anticipated timer expiration time (in case sched tick is stopped)
+ * @next_timer: Expiry time of next expiring timer for debugging purpose only
+ * @idle_expires: Next tick in idle, for debugging purpose only
* @idle_calls: Total number of idle calls
* @idle_sleeps: Number of idle calls, where the sched tick was stopped
- * @idle_entrytime: Time when the idle call was entered
- * @idle_waketime: Time when the idle was interrupted
* @idle_exittime: Time when the idle state was left
* @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
* @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @timer_expires: Anticipated timer expiration time (in case sched tick is stopped)
- * @timer_expires_base: Base time clock monotonic for @timer_expires
- * @next_timer: Expiry time of next expiring timer for debugging purpose only
* @tick_dep_mask: Tick dependency mask - is set, if someone needs the tick
- * @last_tick_jiffies: Value of jiffies seen on last tick
- * @stalled_jiffies: Number of stalled jiffies detected across ticks
+ * @check_clocks: Notification mechanism about clocksource changes
*/
struct tick_sched {
- struct hrtimer sched_timer;
- unsigned long check_clocks;
- enum tick_nohz_mode nohz_mode;
-
+ /* Common flags */
unsigned int inidle : 1;
unsigned int tick_stopped : 1;
unsigned int idle_active : 1;
unsigned int do_timer_last : 1;
unsigned int got_idle_tick : 1;

+ /* Tick handling: jiffies stall check */
+ unsigned int stalled_jiffies;
+ unsigned long last_tick_jiffies;
+
+ /* Tick handling */
+ struct hrtimer sched_timer;
ktime_t last_tick;
ktime_t next_tick;
unsigned long idle_jiffies;
- unsigned long idle_calls;
- unsigned long idle_sleeps;
- ktime_t idle_entrytime;
ktime_t idle_waketime;
- ktime_t idle_exittime;
- ktime_t idle_sleeptime;
- ktime_t iowait_sleeptime;
+
+ /* Idle entry */
+ ktime_t idle_entrytime;
+
+ /* Tick stop */
+ enum tick_nohz_mode nohz_mode;
unsigned long last_jiffies;
- u64 timer_expires;
u64 timer_expires_base;
+ u64 timer_expires;
u64 next_timer;
ktime_t idle_expires;
+ unsigned long idle_calls;
+ unsigned long idle_sleeps;
+
+ /* Idle exit */
+ ktime_t idle_exittime;
+ ktime_t idle_sleeptime;
+ ktime_t iowait_sleeptime;
+
+ /* Full dynticks handling */
atomic_t tick_dep_mask;
- unsigned long last_tick_jiffies;
- unsigned int stalled_jiffies;
+
+ /* Clocksource changes */
+ unsigned long check_clocks;
};

extern struct tick_sched *tick_get_tick_sched(int cpu);
--
2.34.1


2023-02-20 12:41:57

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 3/7] timers/nohz: Protect idle/iowait sleep time under seqcount

Reading idle/io sleep time (eg: from /proc/stat) can race with idle exit
updates because the state machine handling the stats is not atomic and
requires a coherent read batch.

As a result reading the sleep time may report irrelevant or backward
values.

Fix this with protecting the simple state machine within a seqcount.
This is expected to be cheap enough not to add measurable performance
impact on the idle path.

Note this only fixes reader VS writer condition partitially. A race
remains that involves remote updates of the CPU iowait task counter. It
can hardly be fixed.

Reported-by: Yu Liao <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 22 ++++++++++++++++------
kernel/time/tick-sched.h | 1 +
2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9058b9eb8bc1..90d9b7b29875 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -646,6 +646,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)

delta = ktime_sub(now, ts->idle_entrytime);

+ write_seqcount_begin(&ts->idle_sleeptime_seq);
if (nr_iowait_cpu(smp_processor_id()) > 0)
ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
else
@@ -653,14 +654,18 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)

ts->idle_entrytime = now;
ts->idle_active = 0;
+ write_seqcount_end(&ts->idle_sleeptime_seq);

sched_clock_idle_wakeup_event();
}

static void tick_nohz_start_idle(struct tick_sched *ts)
{
+ write_seqcount_begin(&ts->idle_sleeptime_seq);
ts->idle_entrytime = ktime_get();
ts->idle_active = 1;
+ write_seqcount_end(&ts->idle_sleeptime_seq);
+
sched_clock_idle_sleep_event();
}

@@ -668,6 +673,7 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
bool compute_delta, u64 *last_update_time)
{
ktime_t now, idle;
+ unsigned int seq;

if (!tick_nohz_active)
return -1;
@@ -676,13 +682,17 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
if (last_update_time)
*last_update_time = ktime_to_us(now);

- if (ts->idle_active && compute_delta) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+ do {
+ seq = read_seqcount_begin(&ts->idle_sleeptime_seq);

- idle = ktime_add(*sleeptime, delta);
- } else {
- idle = *sleeptime;
- }
+ if (ts->idle_active && compute_delta) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+
+ idle = ktime_add(*sleeptime, delta);
+ } else {
+ idle = *sleeptime;
+ }
+ } while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));

return ktime_to_us(idle);

diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index c6663254d17d..5ed5a9d41d5a 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -75,6 +75,7 @@ struct tick_sched {
ktime_t idle_waketime;

/* Idle entry */
+ seqcount_t idle_sleeptime_seq;
ktime_t idle_entrytime;

/* Tick stop */
--
2.34.1


2023-02-20 12:41:57

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 2/7] timers/nohz: Only ever update sleeptime from idle exit

The idle and io sleeptime statistics appearing in /proc/stat can be
currently updated from two sites: locally on idle exit and remotely
by cpufreq. However there is no synchronization mechanism protecting
concurrent updates. It is therefore possible to account the sleeptime
twice, among all the possible broken scenarios.

To prevent from breaking the sleeptime accounting source, restrict the
sleeptime updates to the local idle exit site. If there is a delta to
add since the last update, IO/Idle sleep time readers will now only
compute the delta without actually writing it back to the internal idle
statistic fields.

This fixes a writer VS writer race. Note there are still two known
reader VS writer races to handle. A subsequent patch will fix one.

Reported-by: Yu Liao <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 103 ++++++++++++++++-----------------------
1 file changed, 41 insertions(+), 62 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0e3c9205946..9058b9eb8bc1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -637,31 +637,21 @@ static void tick_nohz_update_jiffies(ktime_t now)
touch_softlockup_watchdog_sched();
}

-/*
- * Updates the per-CPU time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
-{
- ktime_t delta;
-
- if (ts->idle_active) {
- delta = ktime_sub(now, ts->idle_entrytime);
- if (nr_iowait_cpu(cpu) > 0)
- ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
- else
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
- ts->idle_entrytime = now;
- }
-
- if (last_update_time)
- *last_update_time = ktime_to_us(now);
-
-}
-
static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
- update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+ ktime_t delta;
+
+ if (WARN_ON_ONCE(!ts->idle_active))
+ return;
+
+ delta = ktime_sub(now, ts->idle_entrytime);
+
+ if (nr_iowait_cpu(smp_processor_id()) > 0)
+ ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+ else
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+
+ ts->idle_entrytime = now;
ts->idle_active = 0;

sched_clock_idle_wakeup_event();
@@ -674,6 +664,30 @@ static void tick_nohz_start_idle(struct tick_sched *ts)
sched_clock_idle_sleep_event();
}

+static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
+ bool compute_delta, u64 *last_update_time)
+{
+ ktime_t now, idle;
+
+ if (!tick_nohz_active)
+ return -1;
+
+ now = ktime_get();
+ if (last_update_time)
+ *last_update_time = ktime_to_us(now);
+
+ if (ts->idle_active && compute_delta) {
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+
+ idle = ktime_add(*sleeptime, delta);
+ } else {
+ idle = *sleeptime;
+ }
+
+ return ktime_to_us(idle);
+
+}
+
/**
* get_cpu_idle_time_us - get the total idle time of a CPU
* @cpu: CPU number to query
@@ -691,27 +705,9 @@ static void tick_nohz_start_idle(struct tick_sched *ts)
u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t now, idle;
-
- if (!tick_nohz_active)
- return -1;
-
- now = ktime_get();
- if (last_update_time) {
- update_ts_time_stats(cpu, ts, now, last_update_time);
- idle = ts->idle_sleeptime;
- } else {
- if (ts->idle_active && !nr_iowait_cpu(cpu)) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-
- idle = ktime_add(ts->idle_sleeptime, delta);
- } else {
- idle = ts->idle_sleeptime;
- }
- }
-
- return ktime_to_us(idle);

+ return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
+ !nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

@@ -732,26 +728,9 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t now, iowait;

- if (!tick_nohz_active)
- return -1;
-
- now = ktime_get();
- if (last_update_time) {
- update_ts_time_stats(cpu, ts, now, last_update_time);
- iowait = ts->iowait_sleeptime;
- } else {
- if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
- ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-
- iowait = ktime_add(ts->iowait_sleeptime, delta);
- } else {
- iowait = ts->iowait_sleeptime;
- }
- }
-
- return ktime_to_us(iowait);
+ return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
+ nr_iowait_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);

--
2.34.1


2023-02-20 12:41:59

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 4/7] timers/nohz: Add a comment about broken iowait counter update race

The per-cpu iowait task counter is incremented locally upon sleeping.
But since the task can be woken to (and by) another CPU, the counter may
then be decremented remotely. This is the source of a race involving
readers VS writer of idle/iowait sleeptime.

The following scenario shows an example where a /proc/stat reader
observes a pending sleep time as IO whereas that pending sleep time
later eventually gets accounted as non-IO.

CPU 0 CPU 1 CPU 2
----- ----- ------
//io_schedule() TASK A
current->in_iowait = 1
rq(0)->nr_iowait++
//switch to idle
// READ /proc/stat
// See nr_iowait_cpu(0) == 1
return ts->iowait_sleeptime +
ktime_sub(ktime_get(), ts->idle_entrytime)

//try_to_wake_up(TASK A)
rq(0)->nr_iowait--
//idle exit
// See nr_iowait_cpu(0) == 0
ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

As a result subsequent reads on /proc/stat may expose backward progress.

This is unfortunately hardly fixable. Just add a comment about that
condition.

Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 90d9b7b29875..edd6e9f26d16 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -705,7 +705,10 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
* counters if NULL.
*
* Return the cumulative idle time (since boot) for a given
- * CPU, in microseconds.
+ * CPU, in microseconds. Note this is partially broken due to
+ * the counter of iowait tasks that can be remotely updated without
+ * any synchronization. Therefore it is possible to observe backward
+ * values within two consecutive reads.
*
* This time is measured via accounting rather than sampling,
* and is as accurate as ktime_get() is.
@@ -728,7 +731,10 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
* counters if NULL.
*
* Return the cumulative iowait time (since boot) for a given
- * CPU, in microseconds.
+ * CPU, in microseconds. Note this is partially broken due to
+ * the counter of iowait tasks that can be remotely updated without
+ * any synchronization. Therefore it is possible to observe backward
+ * values within two consecutive reads.
*
* This time is measured via accounting rather than sampling,
* and is as accurate as ktime_get() is.
--
2.34.1


2023-02-20 12:42:03

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 5/7] timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()

There is no need for the __tick_nohz_idle_stop_tick() function between
tick_nohz_idle_stop_tick() and its implementation. Remove that
unnecessary step.

Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index edd6e9f26d16..3b53b894ca98 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1079,10 +1079,16 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
return true;
}

-static void __tick_nohz_idle_stop_tick(struct tick_sched *ts)
+/**
+ * tick_nohz_idle_stop_tick - stop the idle tick from the idle task
+ *
+ * When the next event is more than a tick into the future, stop the idle tick
+ */
+void tick_nohz_idle_stop_tick(void)
{
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+ int cpu = smp_processor_id();
ktime_t expires;
- int cpu = smp_processor_id();

/*
* If tick_nohz_get_sleep_length() ran tick_nohz_next_event(), the
@@ -1114,16 +1120,6 @@ static void __tick_nohz_idle_stop_tick(struct tick_sched *ts)
}
}

-/**
- * tick_nohz_idle_stop_tick - stop the idle tick from the idle task
- *
- * When the next event is more than a tick into the future, stop the idle tick
- */
-void tick_nohz_idle_stop_tick(void)
-{
- __tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched));
-}
-
void tick_nohz_idle_retain_tick(void)
{
tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
--
2.34.1


2023-02-20 12:42:05

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 6/7] MAINTAINERS: Remove stale email address

Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Yu Liao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
MAINTAINERS | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index fb1471cb5ed3..300ca61fa0bc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14794,7 +14794,7 @@ F: include/uapi/linux/nitro_enclaves.h
F: samples/nitro_enclaves/

NOHZ, DYNTICKS SUPPORT
-M: Frederic Weisbecker <[email protected]>
+M: Frederic Weisbecker <[email protected]>
M: Thomas Gleixner <[email protected]>
M: Ingo Molnar <[email protected]>
L: [email protected]
--
2.34.1


2023-02-20 12:42:16

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 7/7] selftests/proc: Remove idle time monotonicity assertions

Due to broken iowait task counting design (cf: comments above
get_cpu_idle_time_us() and nr_iowait()), it is not possible to provide
the guarantee that /proc/stat or /proc/uptime display monotonic idle
time values.

Remove the selftests that verify the related wrong assumption so that
testers and maintainers don't spend more time on that.

Reported-by: Yu Liao <[email protected]>
Reported-by: Thomas Gleixner <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Wei Li <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Mirsad Goran Todorovac <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
tools/testing/selftests/proc/.gitignore | 2 -
tools/testing/selftests/proc/Makefile | 2 -
.../testing/selftests/proc/proc-uptime-001.c | 45 -----------
.../testing/selftests/proc/proc-uptime-002.c | 79 -------------------
tools/testing/selftests/proc/proc-uptime.h | 60 --------------
5 files changed, 188 deletions(-)
delete mode 100644 tools/testing/selftests/proc/proc-uptime-001.c
delete mode 100644 tools/testing/selftests/proc/proc-uptime-002.c
delete mode 100644 tools/testing/selftests/proc/proc-uptime.h

diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index a156ac5dd2c6..448db45f08dc 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -13,8 +13,6 @@
/proc-self-wchan
/proc-subset-pid
/proc-tid0
-/proc-uptime-001
-/proc-uptime-002
/read
/self
/setns-dcache
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index cd95369254c0..22ff9341f97c 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -16,8 +16,6 @@ TEST_GEN_PROGS += proc-self-syscall
TEST_GEN_PROGS += proc-self-wchan
TEST_GEN_PROGS += proc-subset-pid
TEST_GEN_PROGS += proc-tid0
-TEST_GEN_PROGS += proc-uptime-001
-TEST_GEN_PROGS += proc-uptime-002
TEST_GEN_PROGS += read
TEST_GEN_PROGS += self
TEST_GEN_PROGS += setns-dcache
diff --git a/tools/testing/selftests/proc/proc-uptime-001.c b/tools/testing/selftests/proc/proc-uptime-001.c
deleted file mode 100644
index 781f7a50fc3f..000000000000
--- a/tools/testing/selftests/proc/proc-uptime-001.c
+++ /dev/null
@@ -1,45 +0,0 @@
-/*
- * Copyright © 2018 Alexey Dobriyan <[email protected]>
- *
- * Permission to use, copy, modify, and distribute this software for any
- * purpose with or without fee is hereby granted, provided that the above
- * copyright notice and this permission notice appear in all copies.
- *
- * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
- * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
- * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
- * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
- */
-// Test that values in /proc/uptime increment monotonically.
-#undef NDEBUG
-#include <assert.h>
-#include <stdint.h>
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-
-#include "proc-uptime.h"
-
-int main(void)
-{
- uint64_t start, u0, u1, i0, i1;
- int fd;
-
- fd = open("/proc/uptime", O_RDONLY);
- assert(fd >= 0);
-
- proc_uptime(fd, &u0, &i0);
- start = u0;
- do {
- proc_uptime(fd, &u1, &i1);
- assert(u1 >= u0);
- assert(i1 >= i0);
- u0 = u1;
- i0 = i1;
- } while (u1 - start < 100);
-
- return 0;
-}
diff --git a/tools/testing/selftests/proc/proc-uptime-002.c b/tools/testing/selftests/proc/proc-uptime-002.c
deleted file mode 100644
index 7d0aa22bdc12..000000000000
--- a/tools/testing/selftests/proc/proc-uptime-002.c
+++ /dev/null
@@ -1,79 +0,0 @@
-/*
- * Copyright © 2018 Alexey Dobriyan <[email protected]>
- *
- * Permission to use, copy, modify, and distribute this software for any
- * purpose with or without fee is hereby granted, provided that the above
- * copyright notice and this permission notice appear in all copies.
- *
- * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
- * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
- * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
- * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
- */
-// Test that values in /proc/uptime increment monotonically
-// while shifting across CPUs.
-#undef NDEBUG
-#include <assert.h>
-#include <errno.h>
-#include <unistd.h>
-#include <sys/syscall.h>
-#include <stdlib.h>
-#include <string.h>
-
-#include <stdint.h>
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-
-#include "proc-uptime.h"
-
-static inline int sys_sched_getaffinity(pid_t pid, unsigned int len, unsigned long *m)
-{
- return syscall(SYS_sched_getaffinity, pid, len, m);
-}
-
-static inline int sys_sched_setaffinity(pid_t pid, unsigned int len, unsigned long *m)
-{
- return syscall(SYS_sched_setaffinity, pid, len, m);
-}
-
-int main(void)
-{
- unsigned int len;
- unsigned long *m;
- unsigned int cpu;
- uint64_t u0, u1, i0, i1;
- int fd;
-
- /* find out "nr_cpu_ids" */
- m = NULL;
- len = 0;
- do {
- len += sizeof(unsigned long);
- free(m);
- m = malloc(len);
- } while (sys_sched_getaffinity(0, len, m) == -1 && errno == EINVAL);
-
- fd = open("/proc/uptime", O_RDONLY);
- assert(fd >= 0);
-
- proc_uptime(fd, &u0, &i0);
- for (cpu = 0; cpu < len * 8; cpu++) {
- memset(m, 0, len);
- m[cpu / (8 * sizeof(unsigned long))] |= 1UL << (cpu % (8 * sizeof(unsigned long)));
-
- /* CPU might not exist, ignore error */
- sys_sched_setaffinity(0, len, m);
-
- proc_uptime(fd, &u1, &i1);
- assert(u1 >= u0);
- assert(i1 >= i0);
- u0 = u1;
- i0 = i1;
- }
-
- return 0;
-}
diff --git a/tools/testing/selftests/proc/proc-uptime.h b/tools/testing/selftests/proc/proc-uptime.h
deleted file mode 100644
index dc6a42b1d6b0..000000000000
--- a/tools/testing/selftests/proc/proc-uptime.h
+++ /dev/null
@@ -1,60 +0,0 @@
-/*
- * Copyright © 2018 Alexey Dobriyan <[email protected]>
- *
- * Permission to use, copy, modify, and distribute this software for any
- * purpose with or without fee is hereby granted, provided that the above
- * copyright notice and this permission notice appear in all copies.
- *
- * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
- * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
- * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
- * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
- */
-#undef NDEBUG
-#include <assert.h>
-#include <errno.h>
-#include <string.h>
-#include <stdlib.h>
-#include <unistd.h>
-
-#include "proc.h"
-
-static void proc_uptime(int fd, uint64_t *uptime, uint64_t *idle)
-{
- uint64_t val1, val2;
- char buf[64], *p;
- ssize_t rv;
-
- /* save "p < end" checks */
- memset(buf, 0, sizeof(buf));
- rv = pread(fd, buf, sizeof(buf), 0);
- assert(0 <= rv && rv <= sizeof(buf));
- buf[sizeof(buf) - 1] = '\0';
-
- p = buf;
-
- val1 = xstrtoull(p, &p);
- assert(p[0] == '.');
- assert('0' <= p[1] && p[1] <= '9');
- assert('0' <= p[2] && p[2] <= '9');
- assert(p[3] == ' ');
-
- val2 = (p[1] - '0') * 10 + p[2] - '0';
- *uptime = val1 * 100 + val2;
-
- p += 4;
-
- val1 = xstrtoull(p, &p);
- assert(p[0] == '.');
- assert('0' <= p[1] && p[1] <= '9');
- assert('0' <= p[2] && p[2] <= '9');
- assert(p[3] == '\n');
-
- val2 = (p[1] - '0') * 10 + p[2] - '0';
- *idle = val1 * 100 + val2;
-
- assert(p + 4 == buf + rv);
-}
--
2.34.1


2023-02-20 20:16:38

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 7/7] selftests/proc: Remove idle time monotonicity assertions

On Mon, Feb 20, 2023 at 01:41:29PM +0100, Frederic Weisbecker wrote:
> Due to broken iowait task counting design (cf: comments above
> get_cpu_idle_time_us() and nr_iowait()), it is not possible to provide
> the guarantee that /proc/stat or /proc/uptime display monotonic idle
> time values.

Those test uptime _and_ idle time? You aren't sacking uptime, right?

> delete mode 100644 tools/testing/selftests/proc/proc-uptime-001.c
> delete mode 100644 tools/testing/selftests/proc/proc-uptime-002.c

2023-02-20 21:53:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 7/7] selftests/proc: Remove idle time monotonicity assertions

On Mon, Feb 20 2023 at 13:41, Frederic Weisbecker wrote:

> Due to broken iowait task counting design (cf: comments above
> get_cpu_idle_time_us() and nr_iowait()), it is not possible to provide
> the guarantee that /proc/stat or /proc/uptime display monotonic idle
> time values.
>
> Remove the selftests that verify the related wrong assumption so that
> testers and maintainers don't spend more time on that.
>
> Reported-by: Yu Liao <[email protected]>
> Reported-by: Thomas Gleixner <[email protected]>

I did really not ask you to remove the selftests alltogether.

Those tests check uptime and idle time. I asked you to remove the idle
time monotonicity assertion.

uptime is really supposed to be monotonically increasing. It's based on
CLOCK_BOOTTIME. If that goes backwards then we surely have more trouble
than /proc/uptime.

But it would be a good thing to change the test in the following way:

ut1 = parse("/proc/uptime");
bt1 = clock_gettime(CLOCK_BOOTTIME);
assert(ut1 <= bt1);
ut2 = parse("/proc/uptime");
bt2 = clock_gettime(CLOCK_BOOTTIME);
assert(ut2 <= bt2);
assert(ut1 <= ut2);
....

Hmm?

Thanks,

tglx

2023-02-22 14:50:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 7/7] selftests/proc: Remove idle time monotonicity assertions

On Mon, Feb 20, 2023 at 10:53:17PM +0100, Thomas Gleixner wrote:
> On Mon, Feb 20 2023 at 13:41, Frederic Weisbecker wrote:
>
> > Due to broken iowait task counting design (cf: comments above
> > get_cpu_idle_time_us() and nr_iowait()), it is not possible to provide
> > the guarantee that /proc/stat or /proc/uptime display monotonic idle
> > time values.
> >
> > Remove the selftests that verify the related wrong assumption so that
> > testers and maintainers don't spend more time on that.
> >
> > Reported-by: Yu Liao <[email protected]>
> > Reported-by: Thomas Gleixner <[email protected]>
>
> I did really not ask you to remove the selftests alltogether.
>
> Those tests check uptime and idle time. I asked you to remove the idle
> time monotonicity assertion.
>
> uptime is really supposed to be monotonically increasing. It's based on
> CLOCK_BOOTTIME. If that goes backwards then we surely have more trouble
> than /proc/uptime.

Right, I naively thought it wasn't worth checking CLOCK_BOOTTIME monotonicity
alone... Anyway, kept that in the new version.

>
> But it would be a good thing to change the test in the following way:
>
> ut1 = parse("/proc/uptime");
> bt1 = clock_gettime(CLOCK_BOOTTIME);
> assert(ut1 <= bt1);
> ut2 = parse("/proc/uptime");
> bt2 = clock_gettime(CLOCK_BOOTTIME);
> assert(ut2 <= bt2);
> assert(ut1 <= ut2);
> ....
>
> Hmm?

Makes sense, added to the new version.

Thanks!

>
> Thanks,
>
> tglx