Various updates for the timer/nohz side.
* Two fixes (maybe 01/10 deserves a stable tag, I should check)
* A Kconfig help change
* A debug check while enqueuing timers when the tick is stopped in idle.
* The rest is noise reduction for full nohz ("tick/nohz: Kick only
_queued_ task whose tick dependency is updated").
It relies on the fact that p->on_rq is never set to anything other than
TASK_ON_RQ_QUEUED while the task is running on a CPU, the only relevant
exception being the task dequeuing itself on schedule().
I haven't found issues in my modest reviews of deactivate_task() callers
but I lost my headlamp while visiting fair's sched entity dequeuing
and throttling (had to find my way back in the dark again).
So please double check the last patch.
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/nohz
HEAD: e469edfa00f97e5ec5d31fe31d3aef0a5c9bd607
Thanks,
Frederic
---
Frederic Weisbecker (5):
tick/nohz: Add tick_nohz_full_this_cpu()
tick/nohz: Remove superfluous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
timer: Report ignored local enqueue in nohz mode
tick/nohz: Update nohz_full Kconfig help
tick/nohz: Only wakeup a single target cpu when kicking a task
Marcelo Tosatti (2):
tick/nohz: Change signal tick dependency to wakeup CPUs of member tasks
tick/nohz: Kick only _queued_ task whose tick dependency is updated
Yunfeng Ye (2):
tick/nohz: Conditionally restart tick on idle exit
tick/nohz: Update idle_exittime on actual idle exit
Zhou Ti (x2019cwm) (1):
tick/nohz: Prevent tick_nohz_get_sleep_length() from returning negative value
include/linux/sched.h | 2 +
include/linux/tick.h | 19 ++++--
kernel/sched/core.c | 25 +++++++-
kernel/time/Kconfig | 11 ++--
kernel/time/posix-cpu-timers.c | 4 +-
kernel/time/tick-sched.c | 134 +++++++++++++++++++++++++++++------------
6 files changed, 142 insertions(+), 53 deletions(-)
From: Yunfeng Ye <[email protected]>
In nohz_full mode, switching from idle to a task will unconditionally
issue a tick restart. If the task is alone in the runqueue or is the
highest priority one, the tick will fire once and then eventually stop.
But even that single tick is undesired noise.
Therefore, only restart the tick on idle exit when it's strictly
necessary.
Signed-off-by: Yunfeng Ye <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 50 ++++++++++++++++++++++++----------------
1 file changed, 30 insertions(+), 20 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index af76cfa51b57..77dc8cd61dc8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -922,26 +922,30 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
tick_nohz_restart(ts, now);
}
-static void tick_nohz_full_update_tick(struct tick_sched *ts)
+static void __tick_nohz_full_update_tick(struct tick_sched *ts,
+ ktime_t now)
{
#ifdef CONFIG_NO_HZ_FULL
- int cpu;
-
- if (!tick_nohz_full_this_cpu())
- return;
-
- if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
- return;
-
- cpu = smp_processor_id();
+ int cpu = smp_processor_id();
if (can_stop_full_tick(cpu, ts))
tick_nohz_stop_sched_tick(ts, cpu);
else if (ts->tick_stopped)
- tick_nohz_restart_sched_tick(ts, ktime_get());
+ tick_nohz_restart_sched_tick(ts, now);
#endif
}
+static void tick_nohz_full_update_tick(struct tick_sched *ts)
+{
+ if (!tick_nohz_full_this_cpu())
+ return;
+
+ if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
+ return;
+
+ __tick_nohz_full_update_tick(ts, ktime_get());
+}
+
static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
{
/*
@@ -1209,18 +1213,24 @@ static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
#endif
}
-static void __tick_nohz_idle_restart_tick(struct tick_sched *ts, ktime_t now)
-{
- tick_nohz_restart_sched_tick(ts, now);
- tick_nohz_account_idle_ticks(ts);
-}
-
void tick_nohz_idle_restart_tick(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
- if (ts->tick_stopped)
- __tick_nohz_idle_restart_tick(ts, ktime_get());
+ if (ts->tick_stopped) {
+ tick_nohz_restart_sched_tick(ts, ktime_get());
+ tick_nohz_account_idle_ticks(ts);
+ }
+}
+
+static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
+{
+ if (tick_nohz_full_this_cpu())
+ __tick_nohz_full_update_tick(ts, now);
+ else
+ tick_nohz_restart_sched_tick(ts, now);
+
+ tick_nohz_account_idle_ticks(ts);
}
/**
@@ -1252,7 +1262,7 @@ void tick_nohz_idle_exit(void)
tick_nohz_stop_idle(ts, now);
if (tick_stopped)
- __tick_nohz_idle_restart_tick(ts, now);
+ tick_nohz_idle_update_tick(ts, now);
local_irq_enable();
}
--
2.25.1
The vtime_accounting_enabled_this_cpu() early check already makes what
follows dead code when CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is enabled.
No need to keep the ifdeffery around.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Yunfeng Ye <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/time/tick-sched.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 77dc8cd61dc8..c86b586d65e0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1194,7 +1194,6 @@ unsigned long tick_nohz_get_idle_calls(void)
static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
{
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
unsigned long ticks;
if (vtime_accounting_enabled_this_cpu())
@@ -1210,7 +1209,6 @@ static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
*/
if (ticks && ticks < LONG_MAX)
account_idle_ticks(ticks);
-#endif
}
void tick_nohz_idle_restart_tick(void)
--
2.25.1
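For reference, this is roughly why the #ifndef was only dead weight: with
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE the per-CPU vtime check is effectively a
constant, so the early return already skips the idle ticks accounting. A
simplified sketch, not the kernel's exact definitions:

#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
/* roughly what the vtime headers provide in this configuration */
static inline bool vtime_accounting_enabled_this_cpu(void) { return true; }
#endif

static void account_idle_ticks_sketch(void)
{
	if (vtime_accounting_enabled_this_cpu())
		return;		/* always taken with NATIVE accounting */

	/* idle ticks accounting: unreachable with NATIVE, hence no #ifndef */
}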
From: Yunfeng Ye <[email protected]>
The idle_exittime field of tick_sched is used to record the time when
the idle state was left. But currently idle_exittime is updated in
tick_nohz_restart_sched_tick(), which is not always called from the idle
exit path when nohz_full is configured:
tick_irq_exit
tick_nohz_irq_exit
tick_nohz_full_update_tick
tick_nohz_restart_sched_tick
ts->idle_exittime = now;
It's thus overwritten by mistake on nohz_full tick restart. Move the
update to the appropriate idle exit path instead.
Signed-off-by: Yunfeng Ye <[email protected]>
Cc: Yunfeng Ye <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c86b586d65e0..2a041d0dc3eb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -917,8 +917,6 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
* Cancel the scheduled timer and restore the tick
*/
ts->tick_stopped = 0;
- ts->idle_exittime = now;
-
tick_nohz_restart(ts, now);
}
@@ -1192,10 +1190,13 @@ unsigned long tick_nohz_get_idle_calls(void)
return ts->idle_calls;
}
-static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
+static void tick_nohz_account_idle_time(struct tick_sched *ts,
+ ktime_t now)
{
unsigned long ticks;
+ ts->idle_exittime = now;
+
if (vtime_accounting_enabled_this_cpu())
return;
/*
@@ -1216,8 +1217,9 @@ void tick_nohz_idle_restart_tick(void)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
if (ts->tick_stopped) {
- tick_nohz_restart_sched_tick(ts, ktime_get());
- tick_nohz_account_idle_ticks(ts);
+ ktime_t now = ktime_get();
+ tick_nohz_restart_sched_tick(ts, now);
+ tick_nohz_account_idle_time(ts, now);
}
}
@@ -1228,7 +1230,7 @@ static void tick_nohz_idle_update_tick(struct tick_sched *ts, ktime_t now)
else
tick_nohz_restart_sched_tick(ts, now);
- tick_nohz_account_idle_ticks(ts);
+ tick_nohz_account_idle_time(ts, now);
}
/**
--
2.25.1
Enqueuing a local timer after the tick has been stopped will result in
the timer being ignored until the next random interrupt.
Perform sanity checks to report these situations.
Reviewed-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
---
kernel/sched/core.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb629595f..24552911f92b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -674,6 +674,22 @@ int get_nohz_timer_target(void)
return cpu;
}
+/* Make sure the timer won't be ignored in dynticks-idle case */
+static void wake_idle_assert_possible(void)
+{
+#ifdef CONFIG_SCHED_DEBUG
+ /*
+ * Timers are re-evaluated after idle IRQs. In case of softirq,
+ * we assume IRQ tail. Ksoftirqd shouldn't reach here as the
+ * timer base wouldn't be idle. And inline softirq processing
+ * after a call to local_bh_enable() within idle loop sound too
+ * fun to be considered here.
+ */
+ WARN_ONCE(in_task(),
+ "Late timer enqueue may be ignored\n");
+#endif
+}
+
/*
* When add_timer_on() enqueues a timer into the timer wheel of an
* idle CPU then this timer might expire before the next timer event
@@ -688,8 +704,10 @@ static void wake_up_idle_cpu(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- if (cpu == smp_processor_id())
+ if (cpu == smp_processor_id()) {
+ wake_idle_assert_possible();
return;
+ }
if (set_nr_and_not_polling(rq->idle))
smp_send_reschedule(cpu);
--
2.25.1
When adding a tick dependency to a task, it's necessary to
wake up the CPU where the task resides to reevaluate tick
dependencies on that CPU.
However the current code wakes up all nohz_full CPUs, which
is unnecessary.
Switch to waking up a single CPU, relying on the ordering of writes
to task->cpu and task->tick_dep_mask.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Yunfeng Ye <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
kernel/time/tick-sched.c | 40 +++++++++++++++++++++++++++-------------
1 file changed, 27 insertions(+), 13 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2a041d0dc3eb..8457f15a5073 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -322,6 +322,31 @@ void tick_nohz_full_kick_cpu(int cpu)
irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
}
+static void tick_nohz_kick_task(struct task_struct *tsk)
+{
+ int cpu = task_cpu(tsk);
+
+ /*
+ * If the task concurrently migrates to another cpu,
+ * we guarantee it sees the new tick dependency upon
+ * schedule.
+ *
+ *
+ * set_task_cpu(p, cpu);
+ * STORE p->cpu = @cpu
+ * __schedule() (switch to task 'p')
+ * LOCK rq->lock
+ * smp_mb__after_spin_lock() STORE p->tick_dep_mask
+ * tick_nohz_task_switch() smp_mb() (atomic_fetch_or())
+ * LOAD p->tick_dep_mask LOAD p->cpu
+ */
+
+ preempt_disable();
+ if (cpu_online(cpu))
+ tick_nohz_full_kick_cpu(cpu);
+ preempt_enable();
+}
+
/*
* Kick all full dynticks CPUs in order to force these to re-evaluate
* their dependency on the tick and restart it if necessary.
@@ -404,19 +429,8 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu);
*/
void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
{
- if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
- if (tsk == current) {
- preempt_disable();
- tick_nohz_full_kick();
- preempt_enable();
- } else {
- /*
- * Some future tick_nohz_full_kick_task()
- * should optimize this.
- */
- tick_nohz_full_kick_all();
- }
- }
+ if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask))
+ tick_nohz_kick_task(tsk);
}
EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
--
2.25.1
From: Marcelo Tosatti <[email protected]>
Rather than waking up all nohz_full CPUs on the system, only wake up
the target CPUs of the signal's member threads.
This reduces interruptions to nohz_full CPUs.
Signed-off-by: Marcelo Tosatti <[email protected]>
Cc: Yunfeng Ye <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/tick.h | 8 ++++----
kernel/time/posix-cpu-timers.c | 4 ++--
kernel/time/tick-sched.c | 15 +++++++++++++--
3 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index bfc96cbe955c..9e71ab1889aa 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -215,7 +215,7 @@ extern void tick_nohz_dep_set_task(struct task_struct *tsk,
enum tick_dep_bits bit);
extern void tick_nohz_dep_clear_task(struct task_struct *tsk,
enum tick_dep_bits bit);
-extern void tick_nohz_dep_set_signal(struct signal_struct *signal,
+extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
enum tick_dep_bits bit);
extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
enum tick_dep_bits bit);
@@ -260,11 +260,11 @@ static inline void tick_dep_clear_task(struct task_struct *tsk,
if (tick_nohz_full_enabled())
tick_nohz_dep_clear_task(tsk, bit);
}
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
enum tick_dep_bits bit)
{
if (tick_nohz_full_enabled())
- tick_nohz_dep_set_signal(signal, bit);
+ tick_nohz_dep_set_signal(tsk, bit);
}
static inline void tick_dep_clear_signal(struct signal_struct *signal,
enum tick_dep_bits bit)
@@ -293,7 +293,7 @@ static inline void tick_dep_set_task(struct task_struct *tsk,
enum tick_dep_bits bit) { }
static inline void tick_dep_clear_task(struct task_struct *tsk,
enum tick_dep_bits bit) { }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
enum tick_dep_bits bit) { }
static inline void tick_dep_clear_signal(struct signal_struct *signal,
enum tick_dep_bits bit) { }
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index a71758e34e45..932e0cb4b57b 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -523,7 +523,7 @@ static void arm_timer(struct k_itimer *timer, struct task_struct *p)
if (CPUCLOCK_PERTHREAD(timer->it_clock))
tick_dep_set_task(p, TICK_DEP_BIT_POSIX_TIMER);
else
- tick_dep_set_signal(p->signal, TICK_DEP_BIT_POSIX_TIMER);
+ tick_dep_set_signal(p, TICK_DEP_BIT_POSIX_TIMER);
}
/*
@@ -1358,7 +1358,7 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clkid,
if (*newval < *nextevt)
*nextevt = *newval;
- tick_dep_set_signal(tsk->signal, TICK_DEP_BIT_POSIX_TIMER);
+ tick_dep_set_signal(tsk, TICK_DEP_BIT_POSIX_TIMER);
}
static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8457f15a5073..a866fd8e7bb5 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -444,9 +444,20 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_task);
* Set a per-taskgroup tick dependency. Posix CPU timers need this in order to elapse
* per process timers.
*/
-void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits bit)
+void tick_nohz_dep_set_signal(struct task_struct *tsk,
+ enum tick_dep_bits bit)
{
- tick_nohz_dep_set_all(&sig->tick_dep_mask, bit);
+ int prev;
+ struct signal_struct *sig = tsk->signal;
+
+ prev = atomic_fetch_or(BIT(bit), &sig->tick_dep_mask);
+ if (!prev) {
+ struct task_struct *t;
+
+ lockdep_assert_held(&tsk->sighand->siglock);
+ __for_each_thread(sig, t)
+ tick_nohz_kick_task(t);
+ }
}
void tick_nohz_dep_clear_signal(struct signal_struct *sig, enum tick_dep_bits bit)
--
2.25.1
From: Marcelo Tosatti <[email protected]>
When the tick dependency of a task is updated, we want it to acknowledge
the new state and restart the tick if needed. If the task is not
running, we don't need to kick it because it will observe the new
dependency upon scheduling in. But if the task is running, we may need
to send an IPI to it so that it gets notified.
Unfortunately we don't have the means to check if a task is running
in a race-free way. Checking p->on_cpu in a synchronized way against
p->tick_dep_mask would imply adding a full barrier between
prepare_task_switch() and tick_nohz_task_switch(), which we want to
avoid in this fast-path.
Therefore we blindly fire an IPI to the task's CPU.
Meanwhile we can check if the task is queued on the CPU rq because
p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its
full barrier that precedes tick_nohz_task_switch(). And if the task is
queued on a nohz_full CPU, chances are good that it is also running, since
the isolation constraints prescribe running a single task per full dynticks
CPU.
So use this as a trick to check if we can spare an IPI toward a
non-running task.
NOTE: For the ordering to be correct, it is assumed that we never
deactivate a task while it is running, the only exception being the task
deactivating itself while scheduling out.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
Cc: Yunfeng Ye <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 5 +++++
kernel/time/tick-sched.c | 19 +++++++++++++++++--
3 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef00bb22164c..64dd6f698f3a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1999,6 +1999,8 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif /* CONFIG_SMP */
+extern bool sched_task_on_rq(struct task_struct *p);
+
/*
* In order to reduce various lock holder preemption latencies provide an
* interface to see if a vCPU is currently running or not.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24552911f92b..7ec46f3e8110 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,6 +1598,11 @@ static inline void uclamp_post_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */
+bool sched_task_on_rq(struct task_struct *p)
+{
+ return task_on_rq_queued(p);
+}
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a866fd8e7bb5..015281c3dd8d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -324,8 +324,6 @@ void tick_nohz_full_kick_cpu(int cpu)
static void tick_nohz_kick_task(struct task_struct *tsk)
{
- int cpu = task_cpu(tsk);
-
/*
* If the task concurrently migrates to another cpu,
* we guarantee it sees the new tick dependency upon
@@ -340,6 +338,23 @@ static void tick_nohz_kick_task(struct task_struct *tsk)
* tick_nohz_task_switch() smp_mb() (atomic_fetch_or())
* LOAD p->tick_dep_mask LOAD p->cpu
*/
+ int cpu = task_cpu(tsk);
+
+ /*
+ * If the task is not running, run_posix_cpu_timers
+ * has nothing to elapse, so we can spare the IPI in that
+ * case.
+ *
+ * activate_task() STORE p->tick_dep_mask
+ * STORE p->on_rq
+ * __schedule() (switch to task 'p') smp_mb() (atomic_fetch_or())
+ * LOCK rq->lock LOAD p->on_rq
+ * smp_mb__after_spin_lock()
+ * tick_nohz_task_switch()
+ * LOAD p->tick_dep_mask
+ */
+ if (!sched_task_on_rq(tsk))
+ return;
preempt_disable();
if (cpu_online(cpu))
--
2.25.1
From: "Zhou Ti (x2019cwm)" <[email protected]>
If the hardware clock happens to fire its interrupts late, two possible
issues can happen while calling tick_nohz_get_sleep_length(). Either:
1) The next clockevent device event is due past the last idle entry time.
or:
2) The last timekeeping update happened before the last idle entry time
and the next timer callback expires before the last idle entry time.
Make sure that both cases are handled to avoid returning a negative
duration to the cpuidle governors.
Signed-off-by: Ti Zhou <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e10a4af88737..22b6a46818cf 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1142,6 +1142,9 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
*delta_next = ktime_sub(dev->next_event, now);
+ if (unlikely(*delta_next < 0))
+ *delta_next = 0;
+
if (!can_stop_idle_tick(cpu, ts))
return *delta_next;
@@ -1156,6 +1159,9 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
next_event = min_t(u64, next_event,
hrtimer_next_event_without(&ts->sched_timer));
+ if (unlikely(next_event < now))
+ next_event = now;
+
return ktime_sub(next_event, now);
}
--
2.25.1
Optimize further the check for local full dynticks CPU. Testing directly
tick_nohz_full_cpu(smp_processor_id()) is suboptimal because the
compiler first fetches the CPU number and only then processes the
static key.
It's best to evaluate the static branch before anything. Provide a
function that handles this correctly and convert some users along the way.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Yunfeng Ye <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
include/linux/tick.h | 11 ++++++++++-
kernel/time/tick-sched.c | 12 +++++-------
2 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..bfc96cbe955c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -193,6 +193,14 @@ static inline bool tick_nohz_full_cpu(int cpu)
return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}
+static inline bool tick_nohz_full_this_cpu(void)
+{
+ if (!tick_nohz_full_enabled())
+ return false;
+
+ return cpumask_test_cpu(smp_processor_id(), tick_nohz_full_mask);
+}
+
static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
{
if (tick_nohz_full_enabled())
@@ -271,6 +279,7 @@ extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
+static inline bool tick_nohz_full_this_cpu(void) { return false; }
static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
@@ -296,7 +305,7 @@ static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
static inline void tick_nohz_task_switch(void)
{
- if (tick_nohz_full_enabled())
+ if (tick_nohz_full_this_cpu())
__tick_nohz_task_switch();
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 22b6a46818cf..af76cfa51b57 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -304,7 +304,7 @@ static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) =
*/
static void tick_nohz_full_kick(void)
{
- if (!tick_nohz_full_cpu(smp_processor_id()))
+ if (!tick_nohz_full_this_cpu())
return;
irq_work_queue(this_cpu_ptr(&nohz_full_kick_work));
@@ -452,9 +452,6 @@ void __tick_nohz_task_switch(void)
local_irq_save(flags);
- if (!tick_nohz_full_cpu(smp_processor_id()))
- goto out;
-
ts = this_cpu_ptr(&tick_cpu_sched);
if (ts->tick_stopped) {
@@ -462,7 +459,6 @@ void __tick_nohz_task_switch(void)
atomic_read(&current->signal->tick_dep_mask))
tick_nohz_full_kick();
}
-out:
local_irq_restore(flags);
}
@@ -929,14 +925,16 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
static void tick_nohz_full_update_tick(struct tick_sched *ts)
{
#ifdef CONFIG_NO_HZ_FULL
- int cpu = smp_processor_id();
+ int cpu;
- if (!tick_nohz_full_cpu(cpu))
+ if (!tick_nohz_full_this_cpu())
return;
if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
return;
+ cpu = smp_processor_id();
+
if (can_stop_full_tick(cpu, ts))
tick_nohz_stop_sched_tick(ts, cpu);
else if (ts->tick_stopped)
--
2.25.1
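To make the codegen point above concrete, a simplified side-by-side sketch.
The helper names below are made up for illustration; tick_nohz_full_enabled()
and tick_nohz_full_mask are the existing kernel symbols:

/*
 * Old pattern: the caller computes the CPU number before the static key
 * inside tick_nohz_full_cpu() gets a chance to bail out.
 */
static inline bool full_cpu_old_pattern(void)
{
	int cpu = smp_processor_id();	/* per-CPU read done unconditionally */

	if (!tick_nohz_full_enabled())	/* static key, off on most systems */
		return false;
	return cpumask_test_cpu(cpu, tick_nohz_full_mask);
}

/*
 * New helper: the static key is evaluated first, so the smp_processor_id()
 * read is skipped entirely when nohz_full is not in use.
 */
static inline bool full_cpu_new_pattern(void)
{
	if (!tick_nohz_full_enabled())
		return false;
	return cpumask_test_cpu(smp_processor_id(), tick_nohz_full_mask);
}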
CONFIG_NO_HZ_FULL behaves just like CONFIG_NO_HZ_IDLE by default.
Reassure distros about it.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Yunfeng Ye <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/time/Kconfig | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 83e158d016ba..6649e1d2dba5 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -117,13 +117,14 @@ config NO_HZ_FULL
the task mostly runs in userspace and has few kernel activity.
You need to fill up the nohz_full boot parameter with the
- desired range of dynticks CPUs.
+ desired range of dynticks CPUs to use it. This is implemented at
+ the expense of some overhead in user <-> kernel transitions:
+ syscalls, exceptions and interrupts.
- This is implemented at the expense of some overhead in user <-> kernel
- transitions: syscalls, exceptions and interrupts. Even when it's
- dynamically off.
+ By default, without passing nohz_full parameter, this behaves just
+ like NO_HZ_IDLE.
- Say N.
+ If you're a distro say Y.
endchoice
--
2.25.1
On Tue, Mar 16, 2021 at 03:35:37PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 16, 2021 at 02:37:03PM +0100, Frederic Weisbecker wrote:
> > On Tue, Mar 16, 2021 at 01:21:29PM +0100, Peter Zijlstra wrote:
> > > On Thu, Mar 11, 2021 at 01:36:59PM +0100, Frederic Weisbecker wrote:
> > > > From: "Zhou Ti (x2019cwm)" <[email protected]>
> > > >
> > > > If the hardware clock happens to fire its interrupts late, two possible
> > > > issues can happen while calling tick_nohz_get_sleep_length(). Either:
> > > >
> > > > 1) The next clockevent device event is due past the last idle entry time.
> > > >
> > > > or:
> > > >
> > > > 2) The last timekeeping update happened before the last idle entry time
> > > > and the next timer callback expires before the last idle entry time.
> > > >
> > > > Make sure that both cases are handled to avoid returning a negative
> > > > duration to the cpuidle governors.
> > >
> > > Why? ... and wouldn't it be cheaper to fix the caller to
> > > check negative once, instead of adding two branches here?
> >
> > There are already two callers and potentially two return values to check
> > for each because the function returns two values.
> >
> > I'd rather make the API more robust instead of fixing each caller and worrying
> > about future ones.
>
> But what's the actual problem? The Changelog doesn't say why returning a
> negative value is a problem, and in fact the return value is explicitly
> signed.
>
> Anyway, I don't terribly mind the patch, I was just confused by the lack
> of actual justification.
And you're right, the motivation is pure FUD: I don't know exactly
how the cpuidle governors may react to such negative values and so this
is just to prevent a potential accident.
Rafael, does that look harmless to you?
On Thu, Mar 11, 2021 at 01:37:04PM +0100, Frederic Weisbecker wrote:
> Enqueuing a local timer after the tick has been stopped will result in
> the timer being ignored until the next random interrupt.
>
> Perform sanity checks to report these situations.
>
> Reviewed-by: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> ---
> kernel/sched/core.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca2bb629595f..24552911f92b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -674,6 +674,22 @@ int get_nohz_timer_target(void)
> return cpu;
> }
>
> +/* Make sure the timer won't be ignored in dynticks-idle case */
> +static void wake_idle_assert_possible(void)
> +{
> +#ifdef CONFIG_SCHED_DEBUG
> + /*
> + * Timers are re-evaluated after idle IRQs. In case of softirq,
> + * we assume IRQ tail. Ksoftirqd shouldn't reach here as the
> + * timer base wouldn't be idle. And inline softirq processing
> + * after a call to local_bh_enable() within idle loop sound too
> + * fun to be considered here.
> + */
> + WARN_ONCE(in_task(),
> + "Late timer enqueue may be ignored\n");
> +#endif
> +}
> +
> /*
> * When add_timer_on() enqueues a timer into the timer wheel of an
> * idle CPU then this timer might expire before the next timer event
> @@ -688,8 +704,10 @@ static void wake_up_idle_cpu(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
>
> - if (cpu == smp_processor_id())
> + if (cpu == smp_processor_id()) {
> + wake_idle_assert_possible();
> return;
> + }
>
> if (set_nr_and_not_polling(rq->idle))
> smp_send_reschedule(cpu);
I'm not entirely sure I understand this one. What's the callchain that
leads to this?
On Thu, Mar 11, 2021 at 01:37:00PM +0100, Frederic Weisbecker wrote:
> Optimize further the check for local full dynticks CPU. Testing directly
> tick_nohz_full_cpu(smp_processor_id()) is suboptimal because the
> compiler first fetches the CPU number and only then processes the
> static key.
>
> It's best to evaluate the static branch before anything.
Or you do tricky things like this ;-)
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..bd4a6b055b80 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -185,13 +185,12 @@ static inline bool tick_nohz_full_enabled(void)
return tick_nohz_full_running;
}
-static inline bool tick_nohz_full_cpu(int cpu)
-{
- if (!tick_nohz_full_enabled())
- return false;
-
- return cpumask_test_cpu(cpu, tick_nohz_full_mask);
-}
+#define tick_nohz_full_cpu(_cpu) ({ \
+ bool __ret = false; \
+ if (tick_nohz_full_enabled()) \
+ __ret = cpumask_test_cpu((_cpu), tick_nohz_full_mask); \
+ __ret; \
+})
static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
{
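The trick being that, as a macro, the (_cpu) argument, typically
smp_processor_id(), is only evaluated inside the static-key branch. So
callers can keep the tick_nohz_full_cpu(smp_processor_id()) form while the
CPU number read is still skipped when nohz_full is off. A hypothetical
caller, for illustration:

	/* smp_processor_id() is not evaluated unless the static key is on */
	if (tick_nohz_full_cpu(smp_processor_id()))
		do_nohz_full_work();	/* hypothetical function */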
On Thu, Mar 11, 2021 at 01:36:59PM +0100, Frederic Weisbecker wrote:
> From: "Zhou Ti (x2019cwm)" <[email protected]>
>
> If the hardware clock happens to fire its interrupts late, two possible
> issues can happen while calling tick_nohz_get_sleep_length(). Either:
>
> 1) The next clockevent device event is due past the last idle entry time.
>
> or:
>
> 2) The last timekeeping update happened before the last idle entry time
> and the next timer callback expires before the last idle entry time.
>
> Make sure that both cases are handled to avoid returning a negative
> duration to the cpuidle governors.
Why? ... and wouldn't it be cheaper to fix the caller to
check negative once, instead of adding two branches here?
On Tue, Mar 16, 2021 at 01:28:01PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 01:37:00PM +0100, Frederic Weisbecker wrote:
> > Optimize further the check for local full dynticks CPU. Testing directly
> > tick_nohz_full_cpu(smp_processor_id()) is suboptimal because the
> > compiler first fetches the CPU number and only then processes the
> > static key.
> >
> > It's best to evaluate the static branch before anything.
>
> Or you do tricky things like this ;-)
Good point!
I'll check the asm diff to see if that really does what we want.
I expect it will.
Thanks.
>
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 7340613c7eff..bd4a6b055b80 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -185,13 +185,12 @@ static inline bool tick_nohz_full_enabled(void)
> return tick_nohz_full_running;
> }
>
> -static inline bool tick_nohz_full_cpu(int cpu)
> -{
> - if (!tick_nohz_full_enabled())
> - return false;
> -
> - return cpumask_test_cpu(cpu, tick_nohz_full_mask);
> -}
> +#define tick_nohz_full_cpu(_cpu) ({ \
> + bool __ret = false; \
> + if (tick_nohz_full_enabled()) \
> + __ret = cpumask_test_cpu((_cpu), tick_nohz_full_mask); \
> + __ret; \
> +})
>
> static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
> {
>
>
On Tue, Mar 16, 2021 at 01:21:29PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 01:36:59PM +0100, Frederic Weisbecker wrote:
> > From: "Zhou Ti (x2019cwm)" <[email protected]>
> >
> > If the hardware clock happens to fire its interrupts late, two possible
> > issues can happen while calling tick_nohz_get_sleep_length(). Either:
> >
> > 1) The next clockevent device event is due past the last idle entry time.
> >
> > or:
> >
> > 2) The last timekeeping update happened before the last idle entry time
> > and the next timer callback expires before the last idle entry time.
> >
> > Make sure that both cases are handled to avoid returning a negative
> > duration to the cpuidle governors.
>
> Why? ... and wouldn't it be cheaper to fix the caller to
> check negative once, instead of adding two branches here?
There are already two callers and potentially two return values to check
for each because the function returns two values.
I'd rather make the API more robust instead of fixing each caller and worrying
about future ones.
On Tue, Mar 16, 2021 at 02:37:03PM +0100, Frederic Weisbecker wrote:
> On Tue, Mar 16, 2021 at 01:21:29PM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 11, 2021 at 01:36:59PM +0100, Frederic Weisbecker wrote:
> > > From: "Zhou Ti (x2019cwm)" <[email protected]>
> > >
> > > If the hardware clock happens to fire its interrupts late, two possible
> > > issues can happen while calling tick_nohz_get_sleep_length(). Either:
> > >
> > > 1) The next clockevent device event is due past the last idle entry time.
> > >
> > > or:
> > >
> > > 2) The last timekeeping update happened before the last idle entry time
> > > and the next timer callback expires before the last idle entry time.
> > >
> > > Make sure that both cases are handled to avoid returning a negative
> > > duration to the cpuidle governors.
> >
> > Why? ... and wouldn't it be cheaper to fix the caller to
> > check negative once, instead of adding two branches here?
>
> There are already two callers and potentially two return values to check
> for each because the function returns two values.
>
> I'd rather make the API more robust instead of fixing each caller and worrying
> about future ones.
But what's the actual problem? The Changelog doesn't say why returning a
negative value is a problem, and in fact the return value is explicitly
signed.
Anyway, I don't terribly mind the patch, I was just confused by the lack
of actual justification.
On 3/16/2021 3:53 PM, Frederic Weisbecker wrote:
> On Tue, Mar 16, 2021 at 03:35:37PM +0100, Peter Zijlstra wrote:
>> On Tue, Mar 16, 2021 at 02:37:03PM +0100, Frederic Weisbecker wrote:
>>> On Tue, Mar 16, 2021 at 01:21:29PM +0100, Peter Zijlstra wrote:
>>>> On Thu, Mar 11, 2021 at 01:36:59PM +0100, Frederic Weisbecker wrote:
>>>>> From: "Zhou Ti (x2019cwm)" <[email protected]>
>>>>>
>>>>> If the hardware clock happens to fire its interrupts late, two possible
>>>>> issues can happen while calling tick_nohz_get_sleep_length(). Either:
>>>>>
>>>>> 1) The next clockevent device event is due past the last idle entry time.
>>>>>
>>>>> or:
>>>>>
>>>>> 2) The last timekeeping update happened before the last idle entry time
>>>>> and the next timer callback expires before the last idle entry time.
>>>>>
>>>>> Make sure that both cases are handled to avoid returning a negative
>>>>> duration to the cpuidle governors.
>>>> Why? ... and wouldn't it be cheaper to fix the caller to
>>>> check negative once, instead of adding two branches here?
>>> There are already two callers and potentially two return values to check
>>> for each because the function returns two values.
>>>
>>> I'd rather make the API more robust instead of fixing each caller and worrying
>>> about future ones.
>> But what's the actual problem? The Changelog doesn't say why returning a
>> negative value is a problem, and in fact the return value is explicitly
>> signed.
>>
>> Anyway, I don't terribly mind the patch, I was just confused by the lack
>> of actual justification.
> And you're right, the motivation is pure FUD: I don't know exactly
> how the cpuidle governors may react to such negative values and so this
> is just to prevent a potential accident.
>
> Rafael, does that look harmless to you?
No, this is a problem now. Both governors using this assign its return
value to a u64 var, so negative values confuse them.
That said I think it's better to deal with the issue in the callers.
I can send a patch for that if needed.
Yes, the return of a negative number results in a very large unsigned integer, which idle governors use as a baseline prediction for future interrupts and to correct their own parameters. This problem can lead to the selection of idle states that are too deep, which can be detrimental to both energy and performance.
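To see the magnitude involved, a standalone userspace sketch (plain C, not
kernel code) of what happens when such a negative signed duration is stored
into a u64 by a governor:

#include <stdio.h>

int main(void)
{
	long long sleep_length = -5000;			/* ktime_t-style signed nanoseconds */
	unsigned long long duration = sleep_length;	/* governor keeps it in a u64 */

	/* prints 18446744073709546616: an absurdly long predicted sleep */
	printf("%llu\n", duration);
	return 0;
}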
________________________________________
From: Rafael J. Wysocki <[email protected]>
Sent: March 16, 2021 3:26
To: Frederic Weisbecker; Peter Zijlstra
Cc: Thomas Gleixner; LKML; Zhou Ti (x2019cwm); Yunfeng Ye; Paul E . McKenney; Marcelo Tosatti; Ingo Molnar; [email protected]
Subject: Re: [PATCH 01/10] tick/nohz: Prevent tick_nohz_get_sleep_length() from returning negative value
On Tue, Mar 16, 2021 at 04:27:56PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 01:37:04PM +0100, Frederic Weisbecker wrote:
> > Enqueuing a local timer after the tick has been stopped will result in
> > the timer being ignored until the next random interrupt.
> >
> > Perform sanity checks to report these situations.
> >
> > Reviewed-by: Rafael J. Wysocki <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > ---
> > kernel/sched/core.c | 20 +++++++++++++++++++-
> > 1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ca2bb629595f..24552911f92b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -674,6 +674,22 @@ int get_nohz_timer_target(void)
> > return cpu;
> > }
> >
> > +/* Make sure the timer won't be ignored in dynticks-idle case */
> > +static void wake_idle_assert_possible(void)
> > +{
> > +#ifdef CONFIG_SCHED_DEBUG
> > + /*
> > + * Timers are re-evaluated after idle IRQs. In case of softirq,
> > + * we assume IRQ tail. Ksoftirqd shouldn't reach here as the
> > + * timer base wouldn't be idle. And inline softirq processing
> > + * after a call to local_bh_enable() within idle loop sound too
> > + * fun to be considered here.
> > + */
> > + WARN_ONCE(in_task(),
> > + "Late timer enqueue may be ignored\n");
> > +#endif
> > +}
> > +
> > /*
> > * When add_timer_on() enqueues a timer into the timer wheel of an
> > * idle CPU then this timer might expire before the next timer event
> > @@ -688,8 +704,10 @@ static void wake_up_idle_cpu(int cpu)
> > {
> > struct rq *rq = cpu_rq(cpu);
> >
> > - if (cpu == smp_processor_id())
> > + if (cpu == smp_processor_id()) {
> > + wake_idle_assert_possible();
> > return;
> > + }
> >
> > if (set_nr_and_not_polling(rq->idle))
> > smp_send_reschedule(cpu);
>
> I'm not entirely sure I understand this one. What's the callchain that
> leads to this?
That's while calling add_timer*() or mod_timer() on an idle target.
Now the issue is only relevant when these timer functions are called
after cpuidle_select(), which arguably makes a small vulnerable window
that could be spotted in the future if the timer functions are called
after instrumentation_end()?
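To make that concrete, a hypothetical sketch (the function and timer below
are invented for illustration, not an existing kernel hook): task-context
code running from the idle loop, after cpuidle_select() has already made
its choice, arms a timer whose base sits on the local, tick-stopped CPU.
wake_up_idle_cpu() then takes the cpu == smp_processor_id() early return,
nothing reprograms the next event, and the timer may sleep until some
unrelated interrupt shows up:

static struct timer_list late_timer;		/* hypothetical */

static void idle_loop_late_enqueue(void)
{
	/* task context, tick already stopped in idle */
	mod_timer(&late_timer, jiffies + HZ);	/* may trip WARN_ONCE(in_task(), ...) */
}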
Thanks.