5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
has introduced an issue with RCU. This is a proposal to solve the
situation, after realizing that fixing it on the timers side wouldn't
be pretty, to say the least.
Oh and the last patch is absolutely irrelevant to the issue...
Frederic Weisbecker (3):
hrtimer: Report offline hrtimer enqueue
rcu: Defer RCU kthreads wakeup when CPU is dying
rcu/exp: Remove full barrier upon main thread wakeup
include/linux/hrtimer.h | 3 ++-
kernel/rcu/tree.c | 34 +++++++++++++++++++++++++++++++++-
kernel/rcu/tree_exp.h | 8 +++-----
kernel/time/hrtimer.c | 3 +++
4 files changed, 41 insertions(+), 7 deletions(-)
--
2.42.1
The hrtimers migration during the CPU-down hotplug process has been moved
earlier, before the CPU actually dies. This leaves a small window
of opportunity to queue an hrtimer in a blind spot, leaving it ignored.
For example, a practical case has been reported with RCU waking up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, thereby
queuing a sched/rt timer on the local offline CPU.
Make sure such situations never go unnoticed and warn when that happens.
Reported-by: Paul E. McKenney <[email protected]>
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/hrtimer.h | 3 ++-
kernel/time/hrtimer.c | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index f2044d5a652b..f0204630a443 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -219,7 +219,8 @@ struct hrtimer_cpu_base {
unsigned int hres_active : 1,
in_hrtirq : 1,
hang_detected : 1,
- softirq_activated : 1;
+ softirq_activated : 1,
+ online : 1;
#ifdef CONFIG_HIGH_RES_TIMERS
unsigned int nr_events;
unsigned short nr_retries;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 760793998cdd..edb0f821dcea 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1085,6 +1085,7 @@ static int enqueue_hrtimer(struct hrtimer *timer,
enum hrtimer_mode mode)
{
debug_activate(timer, mode);
+ WARN_ON_ONCE(!base->cpu_base->online);
base->cpu_base->active_bases |= 1 << base->index;
@@ -2183,6 +2184,7 @@ int hrtimers_prepare_cpu(unsigned int cpu)
cpu_base->softirq_next_timer = NULL;
cpu_base->expires_next = KTIME_MAX;
cpu_base->softirq_expires_next = KTIME_MAX;
+ cpu_base->online = 1;
hrtimer_cpu_base_init_expiry_lock(cpu_base);
return 0;
}
@@ -2250,6 +2252,7 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
raw_spin_unlock(&new_base->lock);
+ old_base->online = 0;
raw_spin_unlock(&old_base->lock);
return 0;
--
2.42.1
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwidth timer on the local offline CPU. Since this happens
after hrtimers have been migrated at the CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never actually be scheduled.
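Roughly, the problematic chain looks like this (illustrative sketch only,
the exact call path is simplified):
	rcutree_report_cpu_dead()            /* dying CPU, after CPUHP_AP_HRTIMERS_DYING */
	  -> report the final quiescent state
	    -> rcu_gp_kthread_wake() / __rcu_report_exp_rnp()
	      -> swake_up_one()              /* wakes a SCHED_FIFO kthread */
	        -> RT enqueue -> do_start_rt_bandwidth()
	          -> hrtimer_start_expires() /* hrtimer queued on an offline CPU: ignored */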
This triggers TREE03 rcutorture hangs:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <[email protected]>
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/rcu/tree.c | 34 +++++++++++++++++++++++++++++++++-
kernel/rcu/tree_exp.h | 3 +--
2 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3ac3c846105f..157f3ca2a9b5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
return needmore;
}
+static void swake_up_one_online_ipi(void *arg)
+{
+ struct swait_queue_head *wqh = arg;
+
+ swake_up_one(wqh);
+}
+
+static void swake_up_one_online(struct swait_queue_head *wqh)
+{
+ int cpu = get_cpu();
+
+ /*
+ * If called from rcutree_report_cpu_starting(), wake up
+ * is dangerous that late in the CPU-down hotplug process. The
+ * scheduler might queue an ignored hrtimer. Defer the wake up
+ * to an online CPU instead.
+ */
+ if (unlikely(cpu_is_offline(cpu))) {
+ int target;
+
+ target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
+ cpu_online_mask);
+
+ smp_call_function_single(target, swake_up_one_online_ipi,
+ wqh, 0);
+ put_cpu();
+ } else {
+ put_cpu();
+ swake_up_one(wqh);
+ }
+}
+
/*
* Awaken the grace-period kthread. Don't do a self-awaken (unless in an
* interrupt or softirq handler, in which case we just might immediately
@@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
return;
WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
- swake_up_one(&rcu_state.gp_wq);
+ swake_up_one_online(&rcu_state.gp_wq);
}
/*
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6d7cea5d591f..2ac440bc7e10 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
return ret;
}
-
/*
* Report the exit from RCU read-side critical section for the last task
* that queued itself during or before the current expedited preemptible-RCU
@@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
if (wake) {
smp_mb(); /* EGP done before wake_up(). */
- swake_up_one(&rcu_state.expedited_wq);
+ swake_up_one_online(&rcu_state.expedited_wq);
}
break;
}
--
2.42.1
When an expedited grace period is ending, care must be taken so that all
the quiescent states propagated up to the root are correctly ordered
against the wake up of the main expedited grace period workqueue.
This ordering is already carried through the root rnp locking augmented
by an smp_mb__after_unlock_lock() barrier.
Therefore the explicit smp_mb() placed before the wake up is not needed
and can be removed.
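Informally, the ordering chain referred to here looks like this (simplified
sketch of __rcu_report_exp_rnp(), not the literal code):
	/* propagate the expedited QS up the tree, towards the root */
	mask = rnp->grpmask;
	raw_spin_unlock_rcu_node(rnp);	/* UNLOCK child */
	rnp = rnp->parent;
	raw_spin_lock_rcu_node(rnp);	/* LOCK parent, implies smp_mb__after_unlock_lock() */
	WRITE_ONCE(rnp->expmask, rnp->expmask & ~mask);
	...
	/* once the root is reached: */
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	swake_up_one_online(&rcu_state.expedited_wq);
Since every raw_spin_lock_rcu_node() chains an smp_mb__after_unlock_lock()
after the acquisition, the UNLOCK+LOCK pairs on the way up already order the
expedited-GP completion before the wakeup, which is what the removed smp_mb()
was duplicating.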
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/rcu/tree_exp.h | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 2ac440bc7e10..014ddf672165 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -198,10 +198,9 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
}
if (rnp->parent == NULL) {
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
- if (wake) {
- smp_mb(); /* EGP done before wake_up(). */
+ if (wake)
swake_up_one_online(&rcu_state.expedited_wq);
- }
+
break;
}
mask = rnp->grpmask;
--
2.42.1
On Tue, Dec 19, 2023 at 12:19:13AM +0100, Frederic Weisbecker wrote:
> 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> has introduced an issue with RCU. This is a proposal to solve the
> situation after realizing that fixing that on the timers side wouldn't
> be pretty to say the least.
>
> Oh and the last patch is absolutely irrelevant to the issue...
I pulled all three in and started testing.
That last one will need some careful review, but you knew that already. ;-)
Thanx, Paul
> Frederic Weisbecker (3):
> hrtimer: Report offline hrtimer enqueue
> rcu: Defer RCU kthreads wakeup when CPU is dying
> rcu/exp: Remove full barrier upon main thread wakeup
>
> include/linux/hrtimer.h | 3 ++-
> kernel/rcu/tree.c | 34 +++++++++++++++++++++++++++++++++-
> kernel/rcu/tree_exp.h | 8 +++-----
> kernel/time/hrtimer.c | 3 +++
> 4 files changed, 41 insertions(+), 7 deletions(-)
>
> --
> 2.42.1
>
On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> When the CPU goes idle for the last time during the CPU down hotplug
> process, RCU reports a final quiescent state for the current CPU. If
> this quiescent state propagates up to the top, some tasks may then be
> woken up to complete the grace period: the main grace period kthread
> and/or the expedited main workqueue (or kworker).
>
> If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> arm the RT bandwith timer to the local offline CPU. Since this happens
> after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> bandwidth to be available, they may never be actually scheduled.
>
> This triggers TREE03 rcutorture hangs:
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
> rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
> rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> rcu: RCU grace-period kthread stack dump:
> task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x2eb/0xa80
> schedule+0x1f/0x90
> schedule_timeout+0x163/0x270
> ? __pfx_process_timeout+0x10/0x10
> rcu_gp_fqs_loop+0x37c/0x5b0
> ? __pfx_rcu_gp_kthread+0x10/0x10
> rcu_gp_kthread+0x17c/0x200
> kthread+0xde/0x110
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x2b/0x40
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
>
> The situation can't be solved with just unpinning the timer. The hrtimer
> infrastructure and the nohz heuristics involved in finding the best
> remote target for an unpinned timer would then also need to handle
> enqueues from an offline CPU in the most horrendous way.
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.
Ah, ideally we'd not run into this if sched_feat(TTWU_QUEUE) were enabled,
but then in any case there is also ttwu_queue_cond() shutting down
the remote queueing...
> Reported-by: Paul E. McKenney <[email protected]>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> kernel/rcu/tree.c | 34 +++++++++++++++++++++++++++++++++-
> kernel/rcu/tree_exp.h | 3 +--
> 2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3ac3c846105f..157f3ca2a9b5 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
> return needmore;
> }
>
> +static void swake_up_one_online_ipi(void *arg)
> +{
> + struct swait_queue_head *wqh = arg;
> +
> + swake_up_one(wqh);
> +}
Speaking of, the scheduler refuses to do remote-IPI-style wakeups
(TTWU_QUEUE) whenever the destination CPU is in a hotplug state.
static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
{
/*
* Do not complicate things with the async wake_list while the CPU is
* in hotplug state.
*/
if (!cpu_active(cpu))
return false;
...
}
Along these lines, I wonder if it is safe to do a wakeup in this fashion (as
done by this patch) if the destination CPU was also going down.
Also the same ttwu_queue_cond() checks for CPU affinities before deciding to
not do the IPI-style queue.
/* Ensure the task will still be allowed to run on the CPU. */
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
return false;
Not that anyone should be changing RCU thread priorities around while the IPI
is in flight, but...
I wonder if the reason TTWU is excessively paranoid is that the IPI can be
delayed for example, leading to race conditions.
Anyway, just my 2 cents.
Happy holidays! thanks,
- Joel
> +
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> + int cpu = get_cpu();
> +
> + /*
> + * If called from rcutree_report_cpu_starting(), wake up
> + * is dangerous that late in the CPU-down hotplug process. The
> + * scheduler might queue an ignored hrtimer. Defer the wake up
> + * to an online CPU instead.
> + */
> + if (unlikely(cpu_is_offline(cpu))) {
> + int target;
> +
> + target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> + cpu_online_mask);
> +
> + smp_call_function_single(target, swake_up_one_online_ipi,
> + wqh, 0);
> + put_cpu();
> + } else {
> + put_cpu();
> + swake_up_one(wqh);
> + }
> +}
> +
> /*
> * Awaken the grace-period kthread. Don't do a self-awaken (unless in an
> * interrupt or softirq handler, in which case we just might immediately
> @@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
> return;
> WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
> WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
> - swake_up_one(&rcu_state.gp_wq);
> + swake_up_one_online(&rcu_state.gp_wq);
> }
>
> /*
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..2ac440bc7e10 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
> return ret;
> }
>
> -
> /*
> * Report the exit from RCU read-side critical section for the last task
> * that queued itself during or before the current expedited preemptible-RCU
> @@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> if (wake) {
> smp_mb(); /* EGP done before wake_up(). */
> - swake_up_one(&rcu_state.expedited_wq);
> + swake_up_one_online(&rcu_state.expedited_wq);
> }
> break;
> }
> --
> 2.42.1
>
On Tue, 19 Dec 2023 00:19:15 +0100 Frederic Weisbecker <[email protected]>
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> + int cpu = get_cpu();
> +
> + /*
> + * If called from rcutree_report_cpu_starting(), wake up
> + * is dangerous that late in the CPU-down hotplug process. The
> + * scheduler might queue an ignored hrtimer. Defer the wake up
> + * to an online CPU instead.
> + */
But why would the scheduler have any interest in selecting a dying CPU to
add an hrtimer to in the first place?
> + if (unlikely(cpu_is_offline(cpu))) {
> + int target;
> +
> + target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> + cpu_online_mask);
> +
> + smp_call_function_single(target, swake_up_one_online_ipi,
> + wqh, 0);
> + put_cpu();
> + } else {
> + put_cpu();
> + swake_up_one(wqh);
> + }
> +}
On Mon, Dec 18, 2023 at 10:38:52PM -0500, Joel Fernandes wrote:
> On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 3ac3c846105f..157f3ca2a9b5 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
> > return needmore;
> > }
> >
> > +static void swake_up_one_online_ipi(void *arg)
> > +{
> > + struct swait_queue_head *wqh = arg;
> > +
> > + swake_up_one(wqh);
> > +}
>
> Speaking of, the scheduler refuses to do remote-IPI-style wakeups
> (TTWU_QUEUE) whenever the destination CPU is in a hotplug state.
>
> static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
> {
> /*
> * Do not complicate things with the async wake_list while the CPU is
> * in hotplug state.
> */
> if (!cpu_active(cpu))
> return false;
> ...
> }
Yes, because all irrelevant tasks must be migrated out upon
CPUHP_AP_SCHED_WAIT_EMPTY, thanks to balance_push_set().
(Though right now I'm missing the flush_smp_call_function_queue() call that flushes
the ttwu queue between sched_cpu_deactivate() and sched_cpu_wait_empty())
>
> Along these lines, I wonder if, it is safe to do a wakeup in this fashion (as
> done by this patch) if the destination CPU was also going down.
>
> Also the same ttwu_queue_cond() checks for CPU affinities before deciding to
> not do the IPI-style queue.
>
> /* Ensure the task will still be allowed to run on the CPU. */
> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> return false;
>
> Not that anyone should be changing RCU thread priorities around while the IPI
> is in flight, but...
>
> I wonder if the reason TTWU is excessively paranoid is that the IPI can be
> delayed for example, leading to race conditions.
It's because nothing irrelevant must be queued after sched_cpu_wait_empty().
But note this patch does something different: it doesn't defer the runqueue
enqueue like the ttwu queue does, it defers the whole actual wakeup. This means that the
decision as to where to queue the task is delegated to an online CPU. So the
constraints are not the same. Waking up a task _from_ a CPU that may not be active
but is at least online is supposed to be fine.
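In other words (rough pseudo-C, not actual kernel code; names borrowed from
kernel/sched/core.c and from this patch):
	/* TTWU_QUEUE: the waker has already picked the target rq, only the
	 * rq enqueue is shipped remotely, hence the cpu_active() check. */
	if (ttwu_queue_cond(p, cpu))
		__ttwu_queue_wakelist(p, cpu, wake_flags);
	/* This patch: the whole wakeup, target selection included, is shipped
	 * to some online housekeeping CPU. */
	smp_call_function_single(target, swake_up_one_online_ipi, wqh, 0);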
Thanks.
On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> When the CPU goes idle for the last time during the CPU down hotplug
> process, RCU reports a final quiescent state for the current CPU. If
> this quiescent state propagates up to the top, some tasks may then be
> woken up to complete the grace period: the main grace period kthread
> and/or the expedited main workqueue (or kworker).
>
> If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> arm the RT bandwith timer to the local offline CPU. Since this happens
> after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> bandwidth to be available, they may never be actually scheduled.
>
> This triggers TREE03 rcutorture hangs:
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
> rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
> rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> rcu: RCU grace-period kthread stack dump:
> task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x2eb/0xa80
> schedule+0x1f/0x90
> schedule_timeout+0x163/0x270
> ? __pfx_process_timeout+0x10/0x10
> rcu_gp_fqs_loop+0x37c/0x5b0
> ? __pfx_rcu_gp_kthread+0x10/0x10
> rcu_gp_kthread+0x17c/0x200
> kthread+0xde/0x110
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x2b/0x40
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
>
> The situation can't be solved with just unpinning the timer. The hrtimer
> infrastructure and the nohz heuristics involved in finding the best
> remote target for an unpinned timer would then also need to handle
> enqueues from an offline CPU in the most horrendous way.
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.
One question below...
> Reported-by: Paul E. McKenney <[email protected]>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> kernel/rcu/tree.c | 34 +++++++++++++++++++++++++++++++++-
> kernel/rcu/tree_exp.h | 3 +--
> 2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3ac3c846105f..157f3ca2a9b5 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
> return needmore;
> }
>
> +static void swake_up_one_online_ipi(void *arg)
> +{
> + struct swait_queue_head *wqh = arg;
> +
> + swake_up_one(wqh);
> +}
> +
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> + int cpu = get_cpu();
This works because get_cpu() is currently preempt_disable(). If there are plans to
make get_cpu() be some sort of read lock, we might deadlock when synchronize_rcu()
is invoked from a CPU-hotplug notifier, correct?
Might this be worth a comment somewhere at some point?
Thanx, Paul
> +
> + /*
> + * If called from rcutree_report_cpu_starting(), wake up
> + * is dangerous that late in the CPU-down hotplug process. The
> + * scheduler might queue an ignored hrtimer. Defer the wake up
> + * to an online CPU instead.
> + */
> + if (unlikely(cpu_is_offline(cpu))) {
> + int target;
> +
> + target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> + cpu_online_mask);
> +
> + smp_call_function_single(target, swake_up_one_online_ipi,
> + wqh, 0);
> + put_cpu();
> + } else {
> + put_cpu();
> + swake_up_one(wqh);
> + }
> +}
> +
> /*
> * Awaken the grace-period kthread. Don't do a self-awaken (unless in an
> * interrupt or softirq handler, in which case we just might immediately
> @@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
> return;
> WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
> WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
> - swake_up_one(&rcu_state.gp_wq);
> + swake_up_one_online(&rcu_state.gp_wq);
> }
>
> /*
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..2ac440bc7e10 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
> return ret;
> }
>
> -
> /*
> * Report the exit from RCU read-side critical section for the last task
> * that queued itself during or before the current expedited preemptible-RCU
> @@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> if (wake) {
> smp_mb(); /* EGP done before wake_up(). */
> - swake_up_one(&rcu_state.expedited_wq);
> + swake_up_one_online(&rcu_state.expedited_wq);
> }
> break;
> }
> --
> 2.42.1
>
On Tue, Dec 19, 2023 at 07:29:23AM -0800, Paul E. McKenney wrote:
> > +static void swake_up_one_online(struct swait_queue_head *wqh)
> > +{
> > + int cpu = get_cpu();
>
> This works because get_cpu() is currently preempt_disable(). If there are plans to
> make get_cpu() be some sort of read lock, we might deadlock when synchronize_rcu()
> is invoked from a CPU-hotplug notifier, correct?
>
> Might this be worth a comment somewhere at some point?
Sure, I can add that.
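For instance, something along these lines (wording is only a suggestion, not
from any posted patch):
	static void swake_up_one_online(struct swait_queue_head *wqh)
	{
		/*
		 * get_cpu() disables preemption, which keeps 'cpu' stable
		 * across the cpu_is_offline() check below. Should get_cpu()
		 * ever become a blocking/read-lock style primitive, calling
		 * this from a CPU-hotplug notifier (e.g. via
		 * synchronize_rcu()) could deadlock.
		 */
		int cpu = get_cpu();
		...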
Thanks.
On Tue, Dec 19, 2023 at 12:42:47PM +0800, Hillf Danton wrote:
> On Tue, 19 Dec 2023 00:19:15 +0100 Frederic Weisbecker <[email protected]>
> > +static void swake_up_one_online(struct swait_queue_head *wqh)
> > +{
> > + int cpu = get_cpu();
> > +
> > + /*
> > + * If called from rcutree_report_cpu_starting(), wake up
> > + * is dangerous that late in the CPU-down hotplug process. The
> > + * scheduler might queue an ignored hrtimer. Defer the wake up
> > + * to an online CPU instead.
> > + */
>
> But why is scheduler having any interest selecting a dying CPU for
> adding a hrtimer on at the first place?
So indeed that timer could be unpinned. But we tried that and it's not
enough. If we want to make the hrtimers and nohz infrastructure aware of
the fact that the current CPU is offline when it queues an hrtimer, we must
face the ugliness below. And it's still hacky because we must also find an
online target whose earliest deadline is below or equal to the scheduler
hrtimer we are trying to enqueue. And that requires even more ugliness that
isn't handled below.
So for now I assume that queuing a timer after hrtimers_cpu_dying() is
unreasonable and that RCU is the only candidate trying that. If there are
more to be reported, we shall see...
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index f2044d5a652b..9eac39fad31c 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -234,6 +234,7 @@ struct hrtimer_cpu_base {
struct hrtimer *next_timer;
ktime_t softirq_expires_next;
struct hrtimer *softirq_next_timer;
+ int online;
struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
} ____cacheline_aligned;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a708d225c28e..83c75768f290 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1088,7 +1088,7 @@ int get_nohz_timer_target(void)
struct sched_domain *sd;
const struct cpumask *hk_mask;
- if (housekeeping_cpu(cpu, HK_TYPE_TIMER)) {
+ if (housekeeping_cpu(cpu, HK_TYPE_TIMER) && cpu_online(cpu)) {
if (!idle_cpu(cpu))
return cpu;
default_cpu = cpu;
@@ -1109,7 +1109,8 @@ int get_nohz_timer_target(void)
}
if (default_cpu == -1)
- default_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+ default_cpu = cpumask_any_and(housekeeping_cpumask(HK_TYPE_TIMER),
+ cpu_online_mask);
return default_cpu;
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6aaf0a3d6081..26cb9455272a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -118,7 +118,7 @@ static inline void do_start_rt_bandwidth(struct rt_bandwidth *rt_b)
*/
hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
hrtimer_start_expires(&rt_b->rt_period_timer,
- HRTIMER_MODE_ABS_PINNED_HARD);
+ HRTIMER_MODE_ABS_HARD);
}
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 760793998cdd..82f9ace2e4fd 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -36,6 +36,7 @@
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
#include <linux/sched/deadline.h>
+#include <linux/sched/isolation.h>
#include <linux/sched/nohz.h>
#include <linux/sched/debug.h>
#include <linux/timer.h>
@@ -206,6 +207,12 @@ struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
if (static_branch_likely(&timers_migration_enabled) && !pinned)
return &per_cpu(hrtimer_bases, get_nohz_timer_target());
+#else
+ if (!base->online) {
+ int cpu = cpumask_any_and(housekeeping_cpumask(HK_TYPE_TIMER),
+ cpu_online_mask);
+ base = &per_cpu(hrtimer_bases, cpu);
+ }
#endif
return base;
}
@@ -254,7 +261,13 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
raw_spin_lock(&new_base->cpu_base->lock);
if (new_cpu_base != this_cpu_base &&
- hrtimer_check_target(timer, new_base)) {
+ hrtimer_check_target(timer, new_base) &&
+ /*
+ * Crude hack and buggy: if this CPU is offline and
+ * the timer is the earliest on the remote target, the timer
+ * will expire late...
+ */
+ this_cpu_base->online) {
raw_spin_unlock(&new_base->cpu_base->lock);
raw_spin_lock(&base->cpu_base->lock);
new_cpu_base = this_cpu_base;
@@ -2183,6 +2196,7 @@ int hrtimers_prepare_cpu(unsigned int cpu)
cpu_base->softirq_next_timer = NULL;
cpu_base->expires_next = KTIME_MAX;
cpu_base->softirq_expires_next = KTIME_MAX;
+ cpu_base->online = 1;
hrtimer_cpu_base_init_expiry_lock(cpu_base);
return 0;
}
@@ -2248,7 +2262,7 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
__hrtimer_get_next_event(new_base, HRTIMER_ACTIVE_SOFT);
/* Tell the other CPU to retrigger the next event */
smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
-
+ old_base->online = 0;
raw_spin_unlock(&new_base->lock);
raw_spin_unlock(&old_base->lock);
On Tue, Dec 19, 2023 at 12:56:50PM +0100, Frederic Weisbecker wrote:
> On Mon, Dec 18, 2023 at 10:38:52PM -0500, Joel Fernandes wrote:
> > On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 3ac3c846105f..157f3ca2a9b5 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
> > > return needmore;
> > > }
> > >
> > > +static void swake_up_one_online_ipi(void *arg)
> > > +{
> > > + struct swait_queue_head *wqh = arg;
> > > +
> > > + swake_up_one(wqh);
> > > +}
> >
> > Speaking of, the scheduler refuses to do remote-IPI-style wakeups
> > (TTWU_QUEUE) whenever the destination CPU is in a hotplug state.
> >
> > static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
> > {
> > /*
> > * Do not complicate things with the async wake_list while the CPU is
> > * in hotplug state.
> > */
> > if (!cpu_active(cpu))
> > return false;
> > ...
> > }
>
> Yes, because all irrelevant tasks must be migrated out upon
> CPUHP_AP_SCHED_WAIT_EMPTY, thanks to balance_push_set().
Ah, got it.
> (Though right now I'm missing the flush_smp_call_function_queue() call that flushes
> the ttwu queue between sched_cpu_deactivate() and sched_cpu_wait_empty())
Possible. I saw your IRC message to Peter on that as well, thanks for
following up. I need to find some time to look more into that, but that does
sound concerning.
> > Along these lines, I wonder if, it is safe to do a wakeup in this fashion (as
> > done by this patch) if the destination CPU was also going down.
> >
> > Also the same ttwu_queue_cond() checks for CPU affinities before deciding to
> > not do the IPI-style queue.
> >
> > /* Ensure the task will still be allowed to run on the CPU. */
> > if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> > return false;
> >
> > Not that anyone should be changing RCU thread priorities around while the IPI
> > is in flight, but...
> >
> > I wonder if the reason TTWU is excessively paranoid is that the IPI can be
> > delayed for example, leading to race conditions.
>
> It's because nothing irrelevant must be queued after sched_cpu_wait_empty().
Makes sense.
> But note this patch does something different, it doesn't defer the runqueue
> enqueue like ttwu queue does. It defers the whole actual wakeup. This means that the
> decision as to where to queue the task is delegated to an online CPU. So it's
> not the same constraints. Waking up a task _from_ a CPU that is active or not but
> at least online is supposed to be fine.
Agreed, thanks for the clarifications. But along similar lines (and at the
risk of oversimplifying), is it not possible to send an IPI to an online CPU
to queue the hrtimer locally there if you detect that the current CPU is
going down? In the other thread with Hillf, you mentioned the hrtimer infra has
to find a target with an equal or earlier deadline, but you can just queue the
hrtimer from the IPI handler and that should take care of it?
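Something like the following, perhaps (hypothetical sketch, not a posted
patch; hrtimer_enqueue_ipi is a made-up name and the real enqueue path would
need the proper hrtimer base locking, which is omitted here):
	static void hrtimer_enqueue_ipi(void *arg)
	{
		struct hrtimer *timer = arg;
		/* Runs on an online CPU: arm the timer locally there. */
		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_HARD);
	}
	/* On the outgoing CPU, instead of enqueueing locally: */
	if (cpu_is_offline(smp_processor_id()))
		smp_call_function_single(cpumask_any(cpu_online_mask),
					 hrtimer_enqueue_ipi, timer, 0);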
Let me know if I missed something which should make for some good holiday
reading material. ;-)
thanks,
- Joel
>
> When the CPU goes idle for the last time during the CPU down hotplug
> process, RCU reports a final quiescent state for the current CPU. If
> this quiescent state propagates up to the top, some tasks may then be
> woken up to complete the grace period: the main grace period kthread
> and/or the expedited main workqueue (or kworker).
>
> If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> arm the RT bandwith timer to the local offline CPU. Since this happens
> after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> bandwidth to be available, they may never be actually scheduled.
>
In rcutree_report_cpu_dead(), the rcuog kthread may also be woken up from
do_nocb_deferred_wakeup(). If that rcuog kthread is SCHED_FIFO and the wakeup
happens, do_start_rt_bandwidth() sets rt_period_active to 1 and enqueues the
hrtimer on the offline CPU. After that, when we invoke swake_up_one_online()
to send the IPI to an online CPU, the RT bandwidth hrtimer will not be
enqueued there because rt_period_active is already 1.
Any thoughts?
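The sequence being described, roughly (simplified from do_start_rt_bandwidth()
in kernel/sched/rt.c, details may differ):
	/* On the dying CPU, via rcutree_report_cpu_dead() ->
	 * do_nocb_deferred_wakeup() -> ... -> do_start_rt_bandwidth(): */
	raw_spin_lock(&rt_b->rt_runtime_lock);
	if (!rt_b->rt_period_active) {
		rt_b->rt_period_active = 1;	/* (1) latched on the dying CPU */
		hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
		/* (2) timer lands on the now offline CPU: ignored */
		hrtimer_start_expires(&rt_b->rt_period_timer,
				      HRTIMER_MODE_ABS_PINNED_HARD);
	}
	raw_spin_unlock(&rt_b->rt_runtime_lock);
	/* (3) The later swake_up_one_online() does run the wakeup on an online
	 *     CPU, but rt_period_active is already 1, so the bandwidth timer
	 *     never gets re-armed there. */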
Thanks
Zqiang
>
> This triggers TREE03 rcutorture hangs:
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
> rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
> rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> rcu: RCU grace-period kthread stack dump:
> task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x2eb/0xa80
> schedule+0x1f/0x90
> schedule_timeout+0x163/0x270
> ? __pfx_process_timeout+0x10/0x10
> rcu_gp_fqs_loop+0x37c/0x5b0
> ? __pfx_rcu_gp_kthread+0x10/0x10
> rcu_gp_kthread+0x17c/0x200
> kthread+0xde/0x110
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x2b/0x40
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1b/0x30
> </TASK>
>
> The situation can't be solved with just unpinning the timer. The hrtimer
> infrastructure and the nohz heuristics involved in finding the best
> remote target for an unpinned timer would then also need to handle
> enqueues from an offline CPU in the most horrendous way.
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.
>
> Reported-by: Paul E. McKenney <[email protected]>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> kernel/rcu/tree.c | 34 +++++++++++++++++++++++++++++++++-
> kernel/rcu/tree_exp.h | 3 +--
> 2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3ac3c846105f..157f3ca2a9b5 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
> return needmore;
> }
>
> +static void swake_up_one_online_ipi(void *arg)
> +{
> + struct swait_queue_head *wqh = arg;
> +
> + swake_up_one(wqh);
> +}
> +
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> + int cpu = get_cpu();
> +
> + /*
> + * If called from rcutree_report_cpu_starting(), wake up
> + * is dangerous that late in the CPU-down hotplug process. The
> + * scheduler might queue an ignored hrtimer. Defer the wake up
> + * to an online CPU instead.
> + */
> + if (unlikely(cpu_is_offline(cpu))) {
> + int target;
> +
> + target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> + cpu_online_mask);
> +
> + smp_call_function_single(target, swake_up_one_online_ipi,
> + wqh, 0);
> + put_cpu();
> + } else {
> + put_cpu();
> + swake_up_one(wqh);
> + }
> +}
> +
> /*
> * Awaken the grace-period kthread. Don't do a self-awaken (unless in an
> * interrupt or softirq handler, in which case we just might immediately
> @@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
> return;
> WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
> WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
> - swake_up_one(&rcu_state.gp_wq);
> + swake_up_one_online(&rcu_state.gp_wq);
> }
>
> /*
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..2ac440bc7e10 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
> return ret;
> }
>
> -
> /*
> * Report the exit from RCU read-side critical section for the last task
> * that queued itself during or before the current expedited preemptible-RCU
> @@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> if (wake) {
> smp_mb(); /* EGP done before wake_up(). */
> - swake_up_one(&rcu_state.expedited_wq);
> + swake_up_one_online(&rcu_state.expedited_wq);
> }
> break;
> }
> --
> 2.42.1
>
Le Wed, Dec 20, 2023 at 04:24:35PM +0800, Z qiang a écrit :
> >
> > When the CPU goes idle for the last time during the CPU down hotplug
> > process, RCU reports a final quiescent state for the current CPU. If
> > this quiescent state propagates up to the top, some tasks may then be
> > woken up to complete the grace period: the main grace period kthread
> > and/or the expedited main workqueue (or kworker).
> >
> > If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> > arm the RT bandwith timer to the local offline CPU. Since this happens
> > after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> > timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> > bandwidth to be available, they may never be actually scheduled.
> >
>
> In the rcutree_report_cpu_dead(), the rcuog kthreads may also be wakeup in
> do_nocb_deferred_wakeup(), if the rcuog kthreads is rt-fifo and wakeup happen,
> the rt_period_active is set 1 and enqueue hrtimer to offline CPU in
> do_start_rt_bandwidth(),
> after that, we invoke swake_up_one_online() send ipi to online CPU, due to the
> rt_period_active is 1, the rt-bandwith hrtimer will not enqueue to online CPU.
> any thoughts?
Duh, you're right, that one too. How many more? This hrtimer situation is scary...
Thanks.
Le Tue, Dec 19, 2023 at 10:01:55PM -0500, Joel Fernandes a écrit :
> > (Though right now I'm missing the flush_smp_call_function_queue() call that flushes
> > the ttwu queue between sched_cpu_deactivate() and sched_cpu_wait_empty())
>
> Possible. I saw your IRC message to Peter on that as well, thanks for
> following up. I need to find some time to look more into that, but that does
> sound concerning.
Found it! It's smpcfd_dying_cpu().
> > But note this patch does something different, it doesn't defer the runqueue
> > enqueue like ttwu queue does. It defers the whole actual wakeup. This means that the
> > decision as to where to queue the task is delegated to an online CPU. So it's
> > not the same constraints. Waking up a task _from_ a CPU that is active or not but
> > at least online is supposed to be fine.
>
> Agreed, thanks for the clarifications. But along similar lines (and at the
> risk of oversimplifying), is it not possible to send an IPI to an online CPU
> to queue the hrtimer locally there if you detect that the current CPU is
> going down? In the other thread to Hilf, you mentioned the hrtimer infra has
> to have equal or earlier deadline, but you can just queue the hrtimer from
> the IPI handler and that should take care of it?
This is something that Thomas wanted to avoid IIRC, because the IPI can make
it miss the deadline. But I guess in the case of an offline CPU, it can be a
last resort.
> Let me know if I missed something which should make for some good holiday
> reading material. ;-)
Let me summarize the possible fixes we can have:
1) It's RCU's fault! We must check and fix all the wake-ups performed by RCU
from rcutree_report_cpu_dead(). But beware of other possible wake-ups/timer
enqueues from the outgoing CPU after hrtimers are migrated.
2) It's the scheduler's fault! do_start_rt_bandwidth() should check if the current
CPU is offline and manually place the timer on an online CPU (through an
IPI? yuck).
3) It's the hrtimers' fault! If the current CPU is offline, the hrtimer code must
arrange for queueing to an online CPU. Not easy to do, as we must find one whose
next expiry is below or equal to the scheduler timer's. As a last resort, this
could be force-queued to any online CPU and then signalled through an IPI, even
though it's something we've tried to avoid until now.
Also it's hard for me to think of another way to fix the deadlock fixed
by 5c0930ccaad5a74d74e8b18b648c5eb21ed2fe94. Hrtimer migration can't happen
after rcutree_report_cpu_dead(), because it may use RCU...
None of the above look pretty anyway. Thoughts?
On Wed, Dec 20, 2023 at 04:50:41PM +0100, Frederic Weisbecker wrote:
> Le Tue, Dec 19, 2023 at 10:01:55PM -0500, Joel Fernandes a ?crit :
> > > (Though right now I'm missing the flush_smp_call_function_queue() call that flushes
> > > the ttwu queue between sched_cpu_deactivate() and sched_cpu_wait_empty())
> >
> > Possible. I saw your IRC message to Peter on that as well, thanks for
> > following up. I need to find some time to look more into that, but that does
> > sound concerning.
>
> Found it! It's smpcfd_dying_cpu().
>
> > > But note this patch does something different, it doesn't defer the runqueue
> > > enqueue like ttwu queue does. It defers the whole actual wakeup. This means that the
> > > decision as to where to queue the task is delegated to an online CPU. So it's
> > > not the same constraints. Waking up a task _from_ a CPU that is active or not but
> > > at least online is supposed to be fine.
> >
> > Agreed, thanks for the clarifications. But along similar lines (and at the
> > risk of oversimplifying), is it not possible to send an IPI to an online CPU
> > to queue the hrtimer locally there if you detect that the current CPU is
> > going down? In the other thread to Hilf, you mentioned the hrtimer infra has
> > to have equal or earlier deadline, but you can just queue the hrtimer from
> > the IPI handler and that should take care of it?
>
> This is something that Thomas wanted to avoid IIRC, because the IPI can make
> it miss the deadline. But I guess in the case of an offline CPU, it can be a
> last resort.
>
> > Let me know if I missed something which should make for some good holiday
> > reading material. ;-)
>
> Let me summarize the possible fixes we can have:
>
> 1) It's RCU's fault! We must check and fix all the wake ups performed by RCU
> from rcutree_report_cpu_dead(). But beware other possible wake-ups/timer
> enqueue from the outgoing CPU after hrtimers are migrated.
>
> 2) It's scheduler's fault! do_start_rt_bandwidth() should check if the current
> CPU is offline and place manually the timer to an online CPU (through an
> IPI? yuck)
>
> 3) It's hrtimer's fault! If the current CPU is offline, it must arrange for
> queueing to an online CPU. Not easy to do as we must find one whose next
> expiry is below/equal the scheduler timer. As a last resort, this could be
> force queued to any and then signalled through an IPI, even though it's
> something we've tried to avoid until now.
>
> Also It's hard for me to think about another way to fix the deadlock fixed
> by 5c0930ccaad5a74d74e8b18b648c5eb21ed2fe94. Hrtimers migration can't happen
> after rcutree_report_cpu_dead(), because it may use RCU...
>
> None of the above look pretty anyway. Thoughts?
Make one of the surviving CPUs grab any leftover timers from the outgoing
CPU, possibly checking periodically. Not pretty either, but three ugly
options deserve a fourth one!
Thanx, Paul