2024-03-18 23:08:03

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 0/2] timers: More fixes

Paul reported some rcutorture TREE07 splats. This set isn't claiming
to fix them (yet), but here are some issues I found along the way while
reproducing and tracing.

Thanks.

Frederic Weisbecker (2):
timers/migration: Fix endless timer requeue after idle interrupts
timers: Fix removed self-IPI on global timer's enqueue in nohz_full

kernel/time/timer.c | 12 +++++++++++-
kernel/time/timer_migration.c | 11 +++++++++--
2 files changed, 20 insertions(+), 3 deletions(-)

--
2.44.0



2024-03-18 23:08:12

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 1/2] timers/migration: Fix endless timer requeue after idle interrupts

When a CPU is an idle migrator, but another CPU wakes up before it does,
becomes the active migrator and handles the queue, the initial idle
migrator may end up endlessly reprogramming its clockevent, chasing ghost
timers forever, such as in the following scenario:

                       [GRP0:0]
                    migrator = 0
                    active   = 0
                    nextevt  = T1
                     /       \
                    0         1
                 active      idle (T1)

0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is
the active migrator.

                       [GRP0:0]
                    migrator = NONE
                    active   = NONE
                    nextevt  = T1
                     /       \
                    0         1
                  idle       idle (T1)
              wakeup = T1

1) CPU 0 is now idle and is therefore the idle migrator. It has
programmed its next timer interrupt to handle T1.

                       [GRP0:0]
                    migrator = 1
                    active   = 1
                    nextevt  = KTIME_MAX
                     /       \
                    0         1
                  idle       active
              wakeup = T1

2) CPU 1 has woken up, it is now active and it has just handled its own
timer T1.

3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote()
realizes it is not the migrator anymore. So it returns early, without
observing that T1 has already expired and therefore without
updating its ->wakeup value.

4) CPU 0 goes into tmigr_cpu_new_timer() which also returns early
because it doesn't queue a timer of its own. So ->wakeup is left
unchanged and the next timer interrupt is programmed to fire immediately.

5) goto 3) forever

This results in timer interrupt storms in idle and also in nohz_full (as
observed in rcutorture's TREE07 scenario).

Fix this by forcing a re-evaluation of tmc->wakeup while trying
remote timer handling when the CPU isn't the migrator anymore. The
check is inherently racy, but in the worst case the CPU just races setting
the KTIME_MAX value that a remote expiry also tries to set.

Reported-by: Paul E. McKenney <[email protected]>
Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/timer_migration.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 611cd904f035..c63a0afdcebe 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1038,8 +1038,15 @@ void tmigr_handle_remote(void)
 	 * in tmigr_handle_remote_up() anyway. Keep this check to speed up the
 	 * return when nothing has to be done.
 	 */
-	if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask))
-		return;
+	if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask)) {
+		/*
+		 * If this CPU was an idle migrator, make sure to clear its wakeup
+		 * value so it won't chase timers that have already expired elsewhere.
+		 * This avoids endless requeue from tmigr_new_timer().
+		 */
+		if (READ_ONCE(tmc->wakeup) == KTIME_MAX)
+			return;
+	}

 	data.now = get_jiffies_update(&data.basej);

--
2.44.0
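
To make the scenario above easier to follow, here is a toy userspace model of
steps 3)-5). The structure and helper names below are simplified stand-ins made
up for illustration, not the kernel's timer migration API; only the ->wakeup
re-check mirrors the logic of the fix.

/* Toy model, not kernel code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define KTIME_MAX INT64_MAX

struct cpu_state {
	int64_t wakeup;		/* next remote expiry this CPU armed for */
	bool	is_migrator;	/* is this CPU still the group's migrator? */
};

/* Old flow: the early return keeps the stale ->wakeup (= T1) forever. */
static void handle_remote_broken(struct cpu_state *c)
{
	if (!c->is_migrator)
		return;
	c->wakeup = KTIME_MAX;	/* remote queue walked, nothing left */
}

/* Fixed flow: only skip the walk once ->wakeup has been reset. */
static void handle_remote_fixed(struct cpu_state *c)
{
	if (!c->is_migrator && c->wakeup == KTIME_MAX)
		return;
	c->wakeup = KTIME_MAX;	/* re-evaluation clears the stale value */
}

int main(void)
{
	/* Step 3): CPU 0 lost the migrator role but still has wakeup = T1 */
	struct cpu_state cpu0 = { .wakeup = 1000, .is_migrator = false };

	handle_remote_broken(&cpu0);
	printf("broken: wakeup=%lld -> keeps reprogramming for T1\n",
	       (long long)cpu0.wakeup);

	cpu0.wakeup = 1000;
	handle_remote_fixed(&cpu0);
	printf("fixed:  wakeup=%lld -> no more interrupt storm\n",
	       (long long)cpu0.wakeup);
	return 0;
}

In the broken variant the stale T1 value is what the clockevent keeps being
programmed with, hence the storm; the fixed variant falls through to the
re-evaluation and ends up with KTIME_MAX.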


2024-03-18 23:08:17

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

While running in nohz_full mode, a task may enqueue a timer while the
tick is stopped. However the only places where the timer wheel,
alongside the timer migration machinery's decision, may reprogram the
next event to take that new timer's expiry into account are the idle loop
or any IRQ tail.

However neither the idle task nor an interrupt may run on the CPU if it
resumes busy work in userspace for a long while in full dynticks mode.

To solve this, the timer enqueue path raises a self-IPI that will
re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
avoids potential locking inversion.

This is supposed to happen both for local and global timers but commit:

b2cf7507e186 ("timers: Always queue timers on the local CPU")

broke the global timers case by removing the ->is_idle field handling
for the global base. As a result, a global timer's enqueue may go unnoticed
in nohz_full.

Fix this by restoring the idle tracking of the global timer's base,
allowing self-IPIs again at enqueue time.

Reported-by: Paul E. McKenney <[email protected]>
Fixes: b2cf7507e186 ("timers: Always queue timers on the local CPU")
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/timer.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index e69e75d3858c..dee29f1f5b75 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -642,7 +642,8 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
 	 * the base lock:
 	 */
 	if (base->is_idle) {
-		WARN_ON_ONCE(!(timer->flags & TIMER_PINNED));
+		WARN_ON_ONCE(!(timer->flags & TIMER_PINNED ||
+			       tick_nohz_full_cpu(base->cpu)));
 		wake_up_nohz_cpu(base->cpu);
 	}
 }
@@ -2292,6 +2293,13 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
 	 */
 	if (!base_local->is_idle && time_after(nextevt, basej + 1)) {
 		base_local->is_idle = true;
+		/*
+		 * Global timers queued locally while running in a task
+		 * in nohz_full mode need a self-IPI to kick reprogramming
+		 * in IRQ tail.
+		 */
+		if (tick_nohz_full_cpu(base_local->cpu))
+			base_global->is_idle = true;
 		trace_timer_base_idle(true, base_local->cpu);
 	}
 	*idle = base_local->is_idle;
@@ -2364,6 +2372,8 @@ void timer_clear_idle(void)
 	 * path. Required for BASE_LOCAL only.
 	 */
 	__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
+	if (tick_nohz_full_cpu(smp_processor_id()))
+		__this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
 	trace_timer_base_idle(false, smp_processor_id());

 	/* Activate without holding the timer_base->lock */
--
2.44.0
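
As an illustration of the ->is_idle bookkeeping this patch restores, here is a
toy userspace model; the helpers are made-up stand-ins rather than kernel code.
The only point is that the enqueue-time self-IPI is gated on the base being
marked idle, so a global base that is never marked idle never triggers one.

/* Toy model, not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct base { bool is_idle; };

/* Model of the enqueue path: kick the CPU only if the base was marked idle. */
static bool enqueue_sends_ipi(const struct base *b)
{
	return b->is_idle;
}

/* Model of the tick-stop path, before and after the fix. */
static void tick_stopped(struct base *local, struct base *global,
			 bool fixed, bool nohz_full)
{
	local->is_idle = true;
	if (fixed && nohz_full)
		global->is_idle = true;	/* restored by this patch */
}

int main(void)
{
	struct base local = { false }, global = { false };

	tick_stopped(&local, &global, false, true);
	printf("before: global timer enqueue kicks the CPU? %d (goes unnoticed)\n",
	       enqueue_sends_ipi(&global));

	local.is_idle = global.is_idle = false;
	tick_stopped(&local, &global, true, true);
	printf("after:  global timer enqueue kicks the CPU? %d\n",
	       enqueue_sends_ipi(&global));
	return 0;
}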


2024-03-19 09:18:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> While running in nohz_full mode, a task may enqueue a timer while the
> tick is stopped. However the only places where the timer wheel,
> alongside the timer migration machinery's decision, may reprogram the
> next event accordingly with that new timer's expiry are the idle loop or
> any IRQ tail.
>
> However neither the idle task nor an interrupt may run on the CPU if it
> resumes busy work in userspace for a long while in full dynticks mode.
>
> To solve this, the timer enqueue path raises a self-IPI that will
> re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> avoids potential locking inversion.
>
> This is supposed to happen both for local and global timers but commit:
>
> b2cf7507e186 ("timers: Always queue timers on the local CPU")
>
> broke the global timers case with removing the ->is_idle field handling
> for the global base. As a result, global timers enqueue may go unnoticed
> in nohz_full.
>
> Fix this with restoring the idle tracking of the global timer's base,
> allowing self-IPIs again on enqueue time.

Testing with the previous patch (1/2 in this series) reduced the number of
problems by about an order of magnitude, down to two sched_tick_remote()
instances and one enqueue_hrtimer() instance, very good!

I have kicked off a test including this patch. Here is hoping! ;-)

Thanx, Paul

> Reported-by: Paul E. McKenney <[email protected]>
> Fixes: b2cf7507e186 ("timers: Always queue timers on the local CPU")
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> kernel/time/timer.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index e69e75d3858c..dee29f1f5b75 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -642,7 +642,8 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
> * the base lock:
> */
> if (base->is_idle) {
> - WARN_ON_ONCE(!(timer->flags & TIMER_PINNED));
> + WARN_ON_ONCE(!(timer->flags & TIMER_PINNED ||
> + tick_nohz_full_cpu(base->cpu)));
> wake_up_nohz_cpu(base->cpu);
> }
> }
> @@ -2292,6 +2293,13 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
> */
> if (!base_local->is_idle && time_after(nextevt, basej + 1)) {
> base_local->is_idle = true;
> + /*
> + * Global timers queued locally while running in a task
> + * in nohz_full mode need a self-IPI to kick reprogramming
> + * in IRQ tail.
> + */
> + if (tick_nohz_full_cpu(base_local->cpu))
> + base_global->is_idle = true;
> trace_timer_base_idle(true, base_local->cpu);
> }
> *idle = base_local->is_idle;
> @@ -2364,6 +2372,8 @@ void timer_clear_idle(void)
> * path. Required for BASE_LOCAL only.
> */
> __this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
> + if (tick_nohz_full_cpu(smp_processor_id()))
> + __this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
> trace_timer_base_idle(false, smp_processor_id());
>
> /* Activate without holding the timer_base->lock */
> --
> 2.44.0
>

2024-03-20 11:17:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Tue, Mar 19, 2024 at 02:18:00AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> > While running in nohz_full mode, a task may enqueue a timer while the
> > tick is stopped. However the only places where the timer wheel,
> > alongside the timer migration machinery's decision, may reprogram the
> > next event accordingly with that new timer's expiry are the idle loop or
> > any IRQ tail.
> >
> > However neither the idle task nor an interrupt may run on the CPU if it
> > resumes busy work in userspace for a long while in full dynticks mode.
> >
> > To solve this, the timer enqueue path raises a self-IPI that will
> > re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> > avoids potential locking inversion.
> >
> > This is supposed to happen both for local and global timers but commit:
> >
> > b2cf7507e186 ("timers: Always queue timers on the local CPU")
> >
> > broke the global timers case with removing the ->is_idle field handling
> > for the global base. As a result, global timers enqueue may go unnoticed
> > in nohz_full.
> >
> > Fix this with restoring the idle tracking of the global timer's base,
> > allowing self-IPIs again on enqueue time.
>
> Testing with the previous patch (1/2 in this series) reduced the number of
> problems by about an order of magnitude, down to two sched_tick_remote()
> instances and one enqueue_hrtimer() instance, very good!
>
> I have kicked off a test including this patch. Here is hoping! ;-)

And 22*100 hours of TREE07 got me one run with a sched_tick_remote()
complaint and another run with a starved RCU grace-period kthread.
So this is definitely getting more reliable, but still a little ways
to go.

Thanx, Paul

> > Reported-by: Paul E. McKenney <[email protected]>
> > Fixes: b2cf7507e186 ("timers: Always queue timers on the local CPU")
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > ---
> > kernel/time/timer.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> > index e69e75d3858c..dee29f1f5b75 100644
> > --- a/kernel/time/timer.c
> > +++ b/kernel/time/timer.c
> > @@ -642,7 +642,8 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
> > * the base lock:
> > */
> > if (base->is_idle) {
> > - WARN_ON_ONCE(!(timer->flags & TIMER_PINNED));
> > + WARN_ON_ONCE(!(timer->flags & TIMER_PINNED ||
> > + tick_nohz_full_cpu(base->cpu)));
> > wake_up_nohz_cpu(base->cpu);
> > }
> > }
> > @@ -2292,6 +2293,13 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
> > */
> > if (!base_local->is_idle && time_after(nextevt, basej + 1)) {
> > base_local->is_idle = true;
> > + /*
> > + * Global timers queued locally while running in a task
> > + * in nohz_full mode need a self-IPI to kick reprogramming
> > + * in IRQ tail.
> > + */
> > + if (tick_nohz_full_cpu(base_local->cpu))
> > + base_global->is_idle = true;
> > trace_timer_base_idle(true, base_local->cpu);
> > }
> > *idle = base_local->is_idle;
> > @@ -2364,6 +2372,8 @@ void timer_clear_idle(void)
> > * path. Required for BASE_LOCAL only.
> > */
> > __this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
> > + if (tick_nohz_full_cpu(smp_processor_id()))
> > + __this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
> > trace_timer_base_idle(false, smp_processor_id());
> >
> > /* Activate without holding the timer_base->lock */
> > --
> > 2.44.0
> >

2024-03-20 16:15:56

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

Le Wed, Mar 20, 2024 at 04:14:24AM -0700, Paul E. McKenney a écrit :
> On Tue, Mar 19, 2024 at 02:18:00AM -0700, Paul E. McKenney wrote:
> > On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> > > While running in nohz_full mode, a task may enqueue a timer while the
> > > tick is stopped. However the only places where the timer wheel,
> > > alongside the timer migration machinery's decision, may reprogram the
> > > next event accordingly with that new timer's expiry are the idle loop or
> > > any IRQ tail.
> > >
> > > However neither the idle task nor an interrupt may run on the CPU if it
> > > resumes busy work in userspace for a long while in full dynticks mode.
> > >
> > > To solve this, the timer enqueue path raises a self-IPI that will
> > > re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> > > avoids potential locking inversion.
> > >
> > > This is supposed to happen both for local and global timers but commit:
> > >
> > > b2cf7507e186 ("timers: Always queue timers on the local CPU")
> > >
> > > broke the global timers case with removing the ->is_idle field handling
> > > for the global base. As a result, global timers enqueue may go unnoticed
> > > in nohz_full.
> > >
> > > Fix this with restoring the idle tracking of the global timer's base,
> > > allowing self-IPIs again on enqueue time.
> >
> > Testing with the previous patch (1/2 in this series) reduced the number of
> > problems by about an order of magnitude, down to two sched_tick_remote()
> > instances and one enqueue_hrtimer() instance, very good!
> >
> > I have kicked off a test including this patch. Here is hoping! ;-)
>
> And 22*100 hours of TREE07 got me one run with a sched_tick_remote()
> complaint and another run with a starved RCU grace-period kthread.
> So this is definitely getting more reliable, but still a little ways
> to go.

Right, there is clearly something else. Investigation continues...

2024-03-20 22:58:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Wed, Mar 20, 2024 at 05:15:48PM +0100, Frederic Weisbecker wrote:
> Le Wed, Mar 20, 2024 at 04:14:24AM -0700, Paul E. McKenney a écrit :
> > On Tue, Mar 19, 2024 at 02:18:00AM -0700, Paul E. McKenney wrote:
> > > On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> > > > While running in nohz_full mode, a task may enqueue a timer while the
> > > > tick is stopped. However the only places where the timer wheel,
> > > > alongside the timer migration machinery's decision, may reprogram the
> > > > next event accordingly with that new timer's expiry are the idle loop or
> > > > any IRQ tail.
> > > >
> > > > However neither the idle task nor an interrupt may run on the CPU if it
> > > > resumes busy work in userspace for a long while in full dynticks mode.
> > > >
> > > > To solve this, the timer enqueue path raises a self-IPI that will
> > > > re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> > > > avoids potential locking inversion.
> > > >
> > > > This is supposed to happen both for local and global timers but commit:
> > > >
> > > > b2cf7507e186 ("timers: Always queue timers on the local CPU")
> > > >
> > > > broke the global timers case with removing the ->is_idle field handling
> > > > for the global base. As a result, global timers enqueue may go unnoticed
> > > > in nohz_full.
> > > >
> > > > Fix this with restoring the idle tracking of the global timer's base,
> > > > allowing self-IPIs again on enqueue time.
> > >
> > > Testing with the previous patch (1/2 in this series) reduced the number of
> > > problems by about an order of magnitude, down to two sched_tick_remote()
> > > instances and one enqueue_hrtimer() instance, very good!
> > >
> > > I have kicked off a test including this patch. Here is hoping! ;-)
> >
> > And 22*100 hours of TREE07 got me one run with a sched_tick_remote()

Sigh. s/22/12/

> > complaint and another run with a starved RCU grace-period kthread.
> > So this is definitely getting more reliable, but still a little ways
> > to go.

An additional eight hours got another sched_tick_remote().

> Right, there is clearly something else. Investigation continues...

Please let me know if there is a debug patch that I could apply.

Thanx, Paul

Subject: [tip: timers/urgent] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: 03877039863be021a19fda307136657bb6d61f75
Gitweb: https://git.kernel.org/tip/03877039863be021a19fda307136657bb6d61f75
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 19 Mar 2024 00:07:29 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Tue, 19 Mar 2024 10:14:55 +01:00

timers: Fix removed self-IPI on global timer's enqueue in nohz_full

While running in nohz_full mode, a task may enqueue a timer while the
tick is stopped. However the only places where the timer wheel,
alongside the timer migration machinery's decision, may reprogram the
next event to take that new timer's expiry into account are the idle loop
or any IRQ tail.

However neither the idle task nor an interrupt may run on the CPU if it
resumes busy work in userspace for a long while in full dynticks mode.

To solve this, the timer enqueue path raises a self-IPI that will
re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
avoids potential locking inversion.

This is supposed to happen both for local and global timers but commit:

b2cf7507e186 ("timers: Always queue timers on the local CPU")

broke the global timers case by removing the ->is_idle field handling
for the global base. As a result, a global timer's enqueue may go unnoticed
in nohz_full.

Fix this by restoring the idle tracking of the global timer's base,
allowing self-IPIs again at enqueue time.

Fixes: b2cf7507e186 ("timers: Always queue timers on the local CPU")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
kernel/time/timer.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index e69e75d..dee29f1 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -642,7 +642,8 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
 	 * the base lock:
 	 */
 	if (base->is_idle) {
-		WARN_ON_ONCE(!(timer->flags & TIMER_PINNED));
+		WARN_ON_ONCE(!(timer->flags & TIMER_PINNED ||
+			       tick_nohz_full_cpu(base->cpu)));
 		wake_up_nohz_cpu(base->cpu);
 	}
 }
@@ -2292,6 +2293,13 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
 	 */
 	if (!base_local->is_idle && time_after(nextevt, basej + 1)) {
 		base_local->is_idle = true;
+		/*
+		 * Global timers queued locally while running in a task
+		 * in nohz_full mode need a self-IPI to kick reprogramming
+		 * in IRQ tail.
+		 */
+		if (tick_nohz_full_cpu(base_local->cpu))
+			base_global->is_idle = true;
 		trace_timer_base_idle(true, base_local->cpu);
 	}
 	*idle = base_local->is_idle;
@@ -2364,6 +2372,8 @@ void timer_clear_idle(void)
 	 * path. Required for BASE_LOCAL only.
 	 */
 	__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
+	if (tick_nohz_full_cpu(smp_processor_id()))
+		__this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
 	trace_timer_base_idle(false, smp_processor_id());

 	/* Activate without holding the timer_base->lock */

Subject: [tip: timers/urgent] timers/migration: Fix endless timer requeue after idle interrupts

The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: f55acb1e44f3d4bf1ca7926d777895a67d4ec606
Gitweb: https://git.kernel.org/tip/f55acb1e44f3d4bf1ca7926d777895a67d4ec606
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 19 Mar 2024 00:07:28 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Tue, 19 Mar 2024 10:14:55 +01:00

timers/migration: Fix endless timer requeue after idle interrupts

When a CPU is an idle migrator, but another CPU wakes up before it does,
becomes the active migrator and handles the queue, the initial idle
migrator may end up endlessly reprogramming its clockevent, chasing ghost
timers forever, such as in the following scenario:

                       [GRP0:0]
                    migrator = 0
                    active   = 0
                    nextevt  = T1
                     /       \
                    0         1
                 active      idle (T1)

0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is
the active migrator.

                       [GRP0:0]
                    migrator = NONE
                    active   = NONE
                    nextevt  = T1
                     /       \
                    0         1
                  idle       idle (T1)
              wakeup = T1

1) CPU 0 is now idle and is therefore the idle migrator. It has
programmed its next timer interrupt to handle T1.

                       [GRP0:0]
                    migrator = 1
                    active   = 1
                    nextevt  = KTIME_MAX
                     /       \
                    0         1
                  idle       active
              wakeup = T1

2) CPU 1 has woken up, it is now active and it has just handled its own
timer T1.

3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote()
realizes it is not the migrator anymore. So it returns early, without
observing that T1 has already expired and therefore without
updating its ->wakeup value.

4) CPU 0 goes into tmigr_cpu_new_timer() which also returns early
because it doesn't queue a timer of its own. So ->wakeup is left
unchanged and the next timer interrupt is programmed to fire immediately.

5) goto 3) forever

This results in timer interrupt storms in idle and also in nohz_full (as
observed in rcutorture's TREE07 scenario).

Fix this by forcing a re-evaluation of tmc->wakeup while trying
remote timer handling when the CPU isn't the migrator anymore. The
check is inherently racy, but in the worst case the CPU just races setting
the KTIME_MAX value that a remote expiry also tries to set.

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
kernel/time/timer_migration.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 611cd90..c63a0af 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1038,8 +1038,15 @@ void tmigr_handle_remote(void)
 	 * in tmigr_handle_remote_up() anyway. Keep this check to speed up the
 	 * return when nothing has to be done.
 	 */
-	if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask))
-		return;
+	if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask)) {
+		/*
+		 * If this CPU was an idle migrator, make sure to clear its wakeup
+		 * value so it won't chase timers that have already expired elsewhere.
+		 * This avoids endless requeue from tmigr_new_timer().
+		 */
+		if (READ_ONCE(tmc->wakeup) == KTIME_MAX)
+			return;
+	}

 	data.now = get_jiffies_update(&data.basej);


2024-03-21 11:42:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Wed, Mar 20, 2024 at 03:55:17PM -0700, Paul E. McKenney wrote:
> On Wed, Mar 20, 2024 at 05:15:48PM +0100, Frederic Weisbecker wrote:
> > Le Wed, Mar 20, 2024 at 04:14:24AM -0700, Paul E. McKenney a écrit :
> > > On Tue, Mar 19, 2024 at 02:18:00AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> > > > > While running in nohz_full mode, a task may enqueue a timer while the
> > > > > tick is stopped. However the only places where the timer wheel,
> > > > > alongside the timer migration machinery's decision, may reprogram the
> > > > > next event accordingly with that new timer's expiry are the idle loop or
> > > > > any IRQ tail.
> > > > >
> > > > > However neither the idle task nor an interrupt may run on the CPU if it
> > > > > resumes busy work in userspace for a long while in full dynticks mode.
> > > > >
> > > > > To solve this, the timer enqueue path raises a self-IPI that will
> > > > > re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> > > > > avoids potential locking inversion.
> > > > >
> > > > > This is supposed to happen both for local and global timers but commit:
> > > > >
> > > > > b2cf7507e186 ("timers: Always queue timers on the local CPU")
> > > > >
> > > > > broke the global timers case with removing the ->is_idle field handling
> > > > > for the global base. As a result, global timers enqueue may go unnoticed
> > > > > in nohz_full.
> > > > >
> > > > > Fix this with restoring the idle tracking of the global timer's base,
> > > > > allowing self-IPIs again on enqueue time.
> > > >
> > > > Testing with the previous patch (1/2 in this series) reduced the number of
> > > > problems by about an order of magnitude, down to two sched_tick_remote()
> > > > instances and one enqueue_hrtimer() instance, very good!
> > > >
> > > > I have kicked off a test including this patch. Here is hoping! ;-)
> > >
> > > And 22*100 hours of TREE07 got me one run with a sched_tick_remote()
>
> Sigh. s/22/12/
>
> > > complaint and another run with a starved RCU grace-period kthread.
> > > So this is definitely getting more reliable, but still a little ways
> > > to go.
>
> An additional eight hours got anohtre sched_tick_remote().
>
> > Right, there is clearly something else. Investigation continues...
>
> Please let me know if there is a debug patch that I could apply.

So there are three things:

_ The sched_tick_remote() warning. I can easily trigger this one and, while
trying a bisection, I realized it actually also triggers on v6.8.
I'm not really tempted to investigate further because the warning doesn't make
much sense to me. It computes the delta between the time the kworker got
scheduled and the time it reaches the middle of the work. It happens to be
~3s, but isn't that something to be expected with all the jitter involved in
rcutorture? (A toy model of this check is sketched after this list.)

We should probably just remove this warning. This remote tick is my most horrible
hackery anyway.

_ The hrtimer enqueue to an offline CPU warning:

[ 1054.265335] WARNING: CPU: 8 PID: 150 at kernel/time/hrtimer.c:1091 enqueue_hrtimer+0x6f/0x80
[ 1054.269166] Modules linked in:
[ 1054.270077] CPU: 8 PID: 150 Comm: rcu_torture_rea Not tainted 6.8.0-11407-ge990136580ab-dirty #21
[ 1054.272768] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[ 1054.275893] RIP: 0010:enqueue_hrtimer+0x6f/0x80
[ 1054.277252] Code: a3 05 b5 ff b7 01 73 bd 48 8b 05 44 c5 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 7b b5 ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[ 1054.282448] RSP: 0000:ffffb1480056fd70 EFLAGS: 00010046
[ 1054.283931] RAX: ffff95b7df2616c0 RBX: ffff95b7df261700 RCX: ffff95b7df2616c0
[ 1054.286091] RDX: 0000000000000001 RSI: ffff95b7df261700 RDI: ffffb1480056fde0
[ 1054.288095] RBP: ffffb1480056fde0 R08: 0000000000000001 R09: 000000000000fc03
[ 1054.290189] R10: 0000000000000001 R11: ffff95b7c1271c80 R12: 0000000000000040
[ 1054.292592] R13: ffff95b7df261700 R14: ffff95b7df261700 R15: ffff95b7df2616c0
[ 1054.294622] FS: 0000000000000000(0000) GS:ffff95b7df400000(0000) knlGS:0000000000000000
[ 1054.296884] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1054.298497] CR2: 0000000000000000 CR3: 000000001822c000 CR4: 00000000000006f0
[ 1054.300475] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1054.302516] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1054.304580] Call Trace:
[ 1054.305354] <TASK>
[ 1054.306009] ? __warn+0x77/0x120
[ 1054.306973] ? enqueue_hrtimer+0x6f/0x80
[ 1054.308177] ? report_bug+0xf1/0x1d0
[ 1054.310200] ? handle_bug+0x40/0x70
[ 1054.311252] ? exc_invalid_op+0x13/0x70
[ 1054.312437] ? asm_exc_invalid_op+0x16/0x20
[ 1054.313738] ? enqueue_hrtimer+0x6f/0x80
[ 1054.314994] hrtimer_start_range_ns+0x258/0x2f0
[ 1054.316531] ? __pfx_rcu_torture_reader+0x10/0x10
[ 1054.317983] schedule_hrtimeout_range_clock+0x95/0x120
[ 1054.319604] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 1054.320957] torture_hrtimeout_ns+0x50/0x70
[ 1054.322211] rcu_torture_reader+0x1be/0x280
[ 1054.323455] ? __pfx_rcu_torture_timer+0x10/0x10
[ 1054.329888] ? kthread+0xc4/0xf0
[ 1054.330880] ? __pfx_rcu_torture_reader+0x10/0x10
[ 1054.332334] kthread+0xc4/0xf0
[ 1054.333305] ? __pfx_kthread+0x10/0x10
[ 1054.334466] ret_from_fork+0x2b/0x40
[ 1054.335551] ? __pfx_kthread+0x10/0x10
[ 1054.336690] ret_from_fork_asm+0x1a/0x30
[ 1054.337878] </TASK>

I don't know how it happened. This is where I'm investigating now.

_ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
enqueue to an offline CPU. Let's solve that and we'll see if it still
triggers.
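
A toy userspace model of the sched_tick_remote() delta check discussed in the
first item above; the exact kernel condition may differ in detail, this only
shows that a ~3s delay in scheduling the kworker is enough to trip such a
warning.

/* Toy model, not kernel code. */
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/* Warn when the kworker reaches the work more than ~3s after being queued. */
static int delayed_tick_warns(uint64_t queued_ns, uint64_t run_ns)
{
	uint64_t delta = run_ns - queued_ns;

	return delta > 3 * NSEC_PER_SEC;
}

int main(void)
{
	/* kworker queued at t=0, only reaches the work ~3.2s later */
	printf("warns: %d\n", delayed_tick_warns(0, 3200ULL * 1000 * 1000));
	return 0;
}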

2024-03-21 12:48:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Thu, Mar 21, 2024 at 12:42:09PM +0100, Frederic Weisbecker wrote:
> On Wed, Mar 20, 2024 at 03:55:17PM -0700, Paul E. McKenney wrote:
> > On Wed, Mar 20, 2024 at 05:15:48PM +0100, Frederic Weisbecker wrote:
> > > Le Wed, Mar 20, 2024 at 04:14:24AM -0700, Paul E. McKenney a écrit :
> > > > On Tue, Mar 19, 2024 at 02:18:00AM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> > > > > > While running in nohz_full mode, a task may enqueue a timer while the
> > > > > > tick is stopped. However the only places where the timer wheel,
> > > > > > alongside the timer migration machinery's decision, may reprogram the
> > > > > > next event accordingly with that new timer's expiry are the idle loop or
> > > > > > any IRQ tail.
> > > > > >
> > > > > > However neither the idle task nor an interrupt may run on the CPU if it
> > > > > > resumes busy work in userspace for a long while in full dynticks mode.
> > > > > >
> > > > > > To solve this, the timer enqueue path raises a self-IPI that will
> > > > > > re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> > > > > > avoids potential locking inversion.
> > > > > >
> > > > > > This is supposed to happen both for local and global timers but commit:
> > > > > >
> > > > > > b2cf7507e186 ("timers: Always queue timers on the local CPU")
> > > > > >
> > > > > > broke the global timers case with removing the ->is_idle field handling
> > > > > > for the global base. As a result, global timers enqueue may go unnoticed
> > > > > > in nohz_full.
> > > > > >
> > > > > > Fix this with restoring the idle tracking of the global timer's base,
> > > > > > allowing self-IPIs again on enqueue time.
> > > > >
> > > > > Testing with the previous patch (1/2 in this series) reduced the number of
> > > > > problems by about an order of magnitude, down to two sched_tick_remote()
> > > > > instances and one enqueue_hrtimer() instance, very good!
> > > > >
> > > > > I have kicked off a test including this patch. Here is hoping! ;-)
> > > >
> > > > And 22*100 hours of TREE07 got me one run with a sched_tick_remote()
> >
> > Sigh. s/22/12/
> >
> > > > complaint and another run with a starved RCU grace-period kthread.
> > > > So this is definitely getting more reliable, but still a little ways
> > > > to go.
> >
> > An additional eight hours got anohtre sched_tick_remote().
> >
> > > Right, there is clearly something else. Investigation continues...
> >
> > Please let me know if there is a debug patch that I could apply.
>
> So there are three things:
>
> _ The sched_tick_remote() warning. I can easily trigger this one and while
> trying a bisection, I realize it actually also triggers on v6.8
> I'm not really tempted to investigate further because the warning doesn't make
> much sense to me. This computes the delta between the time the kworker got
> scheduled and the time it reaches the middle of the work. It happens to be
> ~3s but isn't it something to be expected with jitter and all involved into
> rcutorture?
>
> We should probably just remove this warning. This remote tick is my most horrible
> hackery anyway.

Would it make sense to increase it to somewhere around the timeout for
RCU CPU stall warnings, softlockup, and so on?

> _ The hrtimer enqueue to an offline CPU warning:
>
> [ 1054.265335] WARNING: CPU: 8 PID: 150 at kernel/time/hrtimer.c:1091 enqueue_hrtimer+0x6f/0x80
> [ 1054.269166] Modules linked in:
> [ 1054.270077] CPU: 8 PID: 150 Comm: rcu_torture_rea Not tainted 6.8.0-11407-ge990136580ab-dirty #21
> [ 1054.272768] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> [ 1054.275893] RIP: 0010:enqueue_hrtimer+0x6f/0x80
> [ 1054.277252] Code: a3 05 b5 ff b7 01 73 bd 48 8b 05 44 c5 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 7b b5 ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> [ 1054.282448] RSP: 0000:ffffb1480056fd70 EFLAGS: 00010046
> [ 1054.283931] RAX: ffff95b7df2616c0 RBX: ffff95b7df261700 RCX: ffff95b7df2616c0
> [ 1054.286091] RDX: 0000000000000001 RSI: ffff95b7df261700 RDI: ffffb1480056fde0
> [ 1054.288095] RBP: ffffb1480056fde0 R08: 0000000000000001 R09: 000000000000fc03
> [ 1054.290189] R10: 0000000000000001 R11: ffff95b7c1271c80 R12: 0000000000000040
> [ 1054.292592] R13: ffff95b7df261700 R14: ffff95b7df261700 R15: ffff95b7df2616c0
> [ 1054.294622] FS: 0000000000000000(0000) GS:ffff95b7df400000(0000) knlGS:0000000000000000
> [ 1054.296884] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1054.298497] CR2: 0000000000000000 CR3: 000000001822c000 CR4: 00000000000006f0
> [ 1054.300475] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1054.302516] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1054.304580] Call Trace:
> [ 1054.305354] <TASK>
> [ 1054.306009] ? __warn+0x77/0x120
> [ 1054.306973] ? enqueue_hrtimer+0x6f/0x80
> [ 1054.308177] ? report_bug+0xf1/0x1d0
> [ 1054.310200] ? handle_bug+0x40/0x70
> [ 1054.311252] ? exc_invalid_op+0x13/0x70
> [ 1054.312437] ? asm_exc_invalid_op+0x16/0x20
> [ 1054.313738] ? enqueue_hrtimer+0x6f/0x80
> [ 1054.314994] hrtimer_start_range_ns+0x258/0x2f0
> [ 1054.316531] ? __pfx_rcu_torture_reader+0x10/0x10
> [ 1054.317983] schedule_hrtimeout_range_clock+0x95/0x120
> [ 1054.319604] ? __pfx_hrtimer_wakeup+0x10/0x10
> [ 1054.320957] torture_hrtimeout_ns+0x50/0x70
> [ 1054.322211] rcu_torture_reader+0x1be/0x280
> [ 1054.323455] ? __pfx_rcu_torture_timer+0x10/0x10
> [ 1054.329888] ? kthread+0xc4/0xf0
> [ 1054.330880] ? __pfx_rcu_torture_reader+0x10/0x10
> [ 1054.332334] kthread+0xc4/0xf0
> [ 1054.333305] ? __pfx_kthread+0x10/0x10
> [ 1054.334466] ret_from_fork+0x2b/0x40
> [ 1054.335551] ? __pfx_kthread+0x10/0x10
> [ 1054.336690] ret_from_fork_asm+0x1a/0x30
> [ 1054.337878] </TASK>
>
> I don't know how it happened. This is where I'm investigating now.

It seems that rcutorture is the gift that keeps on giving? ;-)

> _ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
> enqueue to an offline CPU. Let's solve that and we'll see if it still
> triggers.

Sounds like a plan!

Thanx, Paul

2024-03-22 11:32:46

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Thu, Mar 21, 2024 at 05:47:59AM -0700, Paul E. McKenney wrote:
> > We should probably just remove this warning. This remote tick is my most horrible
> > hackery anyway.
>
> Would it make sense to increase it to somewhere around the timeout for
> RCU CPU stall warnings, softlockup, and so on?

We can also do that yes.

>
> > _ The hrtimer enqueue to an offline CPU warning:
> >
> > [ 1054.265335] WARNING: CPU: 8 PID: 150 at kernel/time/hrtimer.c:1091 enqueue_hrtimer+0x6f/0x80
> > [ 1054.269166] Modules linked in:
> > [ 1054.270077] CPU: 8 PID: 150 Comm: rcu_torture_rea Not tainted 6.8.0-11407-ge990136580ab-dirty #21
> > [ 1054.272768] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> > [ 1054.275893] RIP: 0010:enqueue_hrtimer+0x6f/0x80
> > [ 1054.277252] Code: a3 05 b5 ff b7 01 73 bd 48 8b 05 44 c5 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 7b b5 ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> > [ 1054.282448] RSP: 0000:ffffb1480056fd70 EFLAGS: 00010046
> > [ 1054.283931] RAX: ffff95b7df2616c0 RBX: ffff95b7df261700 RCX: ffff95b7df2616c0
> > [ 1054.286091] RDX: 0000000000000001 RSI: ffff95b7df261700 RDI: ffffb1480056fde0
> > [ 1054.288095] RBP: ffffb1480056fde0 R08: 0000000000000001 R09: 000000000000fc03
> > [ 1054.290189] R10: 0000000000000001 R11: ffff95b7c1271c80 R12: 0000000000000040
> > [ 1054.292592] R13: ffff95b7df261700 R14: ffff95b7df261700 R15: ffff95b7df2616c0
> > [ 1054.294622] FS: 0000000000000000(0000) GS:ffff95b7df400000(0000) knlGS:0000000000000000
> > [ 1054.296884] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1054.298497] CR2: 0000000000000000 CR3: 000000001822c000 CR4: 00000000000006f0
> > [ 1054.300475] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 1054.302516] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 1054.304580] Call Trace:
> > [ 1054.305354] <TASK>
> > [ 1054.306009] ? __warn+0x77/0x120
> > [ 1054.306973] ? enqueue_hrtimer+0x6f/0x80
> > [ 1054.308177] ? report_bug+0xf1/0x1d0
> > [ 1054.310200] ? handle_bug+0x40/0x70
> > [ 1054.311252] ? exc_invalid_op+0x13/0x70
> > [ 1054.312437] ? asm_exc_invalid_op+0x16/0x20
> > [ 1054.313738] ? enqueue_hrtimer+0x6f/0x80
> > [ 1054.314994] hrtimer_start_range_ns+0x258/0x2f0
> > [ 1054.316531] ? __pfx_rcu_torture_reader+0x10/0x10
> > [ 1054.317983] schedule_hrtimeout_range_clock+0x95/0x120
> > [ 1054.319604] ? __pfx_hrtimer_wakeup+0x10/0x10
> > [ 1054.320957] torture_hrtimeout_ns+0x50/0x70
> > [ 1054.322211] rcu_torture_reader+0x1be/0x280
> > [ 1054.323455] ? __pfx_rcu_torture_timer+0x10/0x10
> > [ 1054.329888] ? kthread+0xc4/0xf0
> > [ 1054.330880] ? __pfx_rcu_torture_reader+0x10/0x10
> > [ 1054.332334] kthread+0xc4/0xf0
> > [ 1054.333305] ? __pfx_kthread+0x10/0x10
> > [ 1054.334466] ret_from_fork+0x2b/0x40
> > [ 1054.335551] ? __pfx_kthread+0x10/0x10
> > [ 1054.336690] ret_from_fork_asm+0x1a/0x30
> > [ 1054.337878] </TASK>
> >
> > I don't know how it happened. This is where I'm investigating now.
>
> It seems that rcutorture is the gift that keeps on giving? ;-)

Right. Also I just managed to reproduce that hrtimer issue on v6.8:

[ 1484.955213] WARNING: CPU: 6 PID: 162 at kernel/time/hrtimer.c:1088 enqueue_hrtimer+0x6f/0x80
[ 1484.962513] Modules linked in:
[ 1484.966476] CPU: 6 PID: 162 Comm: rcu_torture_rea Not tainted 6.8.0 #25
[ 1484.972975] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[ 1484.977653] RIP: 0010:enqueue_hrtimer+0x6f/0x80
[ 1484.978679] Code: a3 05 75 6a b7 01 73 bd 48 8b 05 e4 47 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 eb bd ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[ 1484.983136] RSP: 0000:ffffa553805cfd70 EFLAGS: 00010046
[ 1484.984386] RAX: ffff8c395f260440 RBX: ffff8c395f260480 RCX: ffff8c395f260440
[ 1484.986121] RDX: 0000000000000000 RSI: ffff8c395f260480 RDI: ffffa553805cfde0
[ 1484.987704] RBP: ffffa553805cfde0 R08: 0000000000000001 R09: 000000000000fc03
[ 1484.989513] R10: 0000000000000001 R11: ffff8c3941248e40 R12: 0000000000000040
[ 1484.991116] R13: ffff8c395f260480 R14: ffff8c395f260480 R15: ffff8c395f260440
[ 1484.992835] FS: 0000000000000000(0000) GS:ffff8c395f380000(0000) knlGS:0000000000000000
[ 1484.994683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1484.995985] CR2: 0000000000000000 CR3: 000000001a62c000 CR4: 00000000000006f0
[ 1484.997789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1484.999376] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1485.001177] Call Trace:
[ 1485.002388] <TASK>
[ 1485.002923] ? __warn+0x78/0x120
[ 1485.003641] ? enqueue_hrtimer+0x6f/0x80
[ 1485.004641] ? report_bug+0xf1/0x1d0
[ 1485.005537] ? handle_bug+0x40/0x70
[ 1485.006318] ? exc_invalid_op+0x13/0x70
[ 1485.007167] ? asm_exc_invalid_op+0x16/0x20
[ 1485.008117] ? enqueue_hrtimer+0x6f/0x80
[ 1485.009131] hrtimer_start_range_ns+0x258/0x2f0
[ 1485.010153] ? __pfx_rcu_torture_reader+0x10/0x10
[ 1485.011185] schedule_hrtimeout_range_clock+0x96/0x120
[ 1485.012394] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 1485.013502] stutter_wait+0x7f/0x90
[ 1485.014319] rcu_torture_reader+0x10e/0x280
[ 1485.015240] ? __pfx_rcu_torture_timer+0x10/0x10
[ 1485.016317] ? kthread+0xc6/0xf0
[ 1485.017169] ? __pfx_rcu_torture_reader+0x10/0x10
[ 1485.018215] kthread+0xc6/0xf0
[ 1485.018899] ? __pfx_kthread+0x10/0x10
[ 1485.019703] ret_from_fork+0x2b/0x40
[ 1485.020546] ? __pfx_kthread+0x10/0x10
[ 1485.021428] ret_from_fork_asm+0x1b/0x30
[ 1485.022295] </TASK>

This happens within the following loop:

	for_each_domain(cpu, sd) {
		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
			if (cpu == i)
				continue;

			if (!idle_cpu(i))
				return i;
		}
	}

An offline CPU is returned from there, which is not supposed to happen
as it's within an RCU read side section. But I can't manage to find where those
rq->sd things are modified when a CPU goes offline. The code path is hard
to follow. Adding some scheduler people involved in topology, just in case.
Investigation continues...
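
For illustration, a toy userspace sketch of the loop above; plain arrays stand
in for the sched_domain span, the housekeeping mask and the idle state, and the
CPU numbers are made up. Nothing in the walk itself filters out a CPU that has
already gone offline but is still present in a stale span, which is the
situation suspected here.

/* Toy model, not kernel code. */
#include <stdio.h>

#define NR_CPUS 8

static int span[NR_CPUS]    = { 1, 1, 1, 1, 1, 1, 1, 1 };	/* stale domain span */
static int online[NR_CPUS]  = { 1, 1, 1, 1, 1, 1, 0, 1 };	/* CPU 6 already dead */
static int hk_mask[NR_CPUS] = { 1, 1, 1, 1, 1, 1, 1, 1 };
static int idle[NR_CPUS]    = { 1, 1, 1, 1, 1, 1, 0, 1 };	/* dead CPU reads as busy */

static int pick_target(int cpu)
{
	int i;

	for (i = 0; i < NR_CPUS; i++) {
		if (!span[i] || !hk_mask[i] || i == cpu)
			continue;
		if (!idle[i])
			return i;	/* nothing checks online[i] here */
	}
	return cpu;
}

int main(void)
{
	int target = pick_target(0);

	printf("picked CPU %d, online=%d\n", target, online[target]);
	return 0;
}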

Thanks.

2024-03-22 13:22:30

by Frederic Weisbecker

[permalink] [raw]
Subject: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> On Thu, Mar 21, 2024 at 05:47:59AM -0700, Paul E. McKenney wrote:
> [ 1484.955213] WARNING: CPU: 6 PID: 162 at kernel/time/hrtimer.c:1088 enqueue_hrtimer+0x6f/0x80
> [ 1484.962513] Modules linked in:
> [ 1484.966476] CPU: 6 PID: 162 Comm: rcu_torture_rea Not tainted 6.8.0 #25
> [ 1484.972975] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> [ 1484.977653] RIP: 0010:enqueue_hrtimer+0x6f/0x80
> [ 1484.978679] Code: a3 05 75 6a b7 01 73 bd 48 8b 05 e4 47 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 eb bd ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> [ 1484.983136] RSP: 0000:ffffa553805cfd70 EFLAGS: 00010046
> [ 1484.984386] RAX: ffff8c395f260440 RBX: ffff8c395f260480 RCX: ffff8c395f260440
> [ 1484.986121] RDX: 0000000000000000 RSI: ffff8c395f260480 RDI: ffffa553805cfde0
> [ 1484.987704] RBP: ffffa553805cfde0 R08: 0000000000000001 R09: 000000000000fc03
> [ 1484.989513] R10: 0000000000000001 R11: ffff8c3941248e40 R12: 0000000000000040
> [ 1484.991116] R13: ffff8c395f260480 R14: ffff8c395f260480 R15: ffff8c395f260440
> [ 1484.992835] FS: 0000000000000000(0000) GS:ffff8c395f380000(0000) knlGS:0000000000000000
> [ 1484.994683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1484.995985] CR2: 0000000000000000 CR3: 000000001a62c000 CR4: 00000000000006f0
> [ 1484.997789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1484.999376] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1485.001177] Call Trace:
> [ 1485.002388] <TASK>
> [ 1485.002923] ? __warn+0x78/0x120
> [ 1485.003641] ? enqueue_hrtimer+0x6f/0x80
> [ 1485.004641] ? report_bug+0xf1/0x1d0
> [ 1485.005537] ? handle_bug+0x40/0x70
> [ 1485.006318] ? exc_invalid_op+0x13/0x70
> [ 1485.007167] ? asm_exc_invalid_op+0x16/0x20
> [ 1485.008117] ? enqueue_hrtimer+0x6f/0x80
> [ 1485.009131] hrtimer_start_range_ns+0x258/0x2f0
> [ 1485.010153] ? __pfx_rcu_torture_reader+0x10/0x10
> [ 1485.011185] schedule_hrtimeout_range_clock+0x96/0x120
> [ 1485.012394] ? __pfx_hrtimer_wakeup+0x10/0x10
> [ 1485.013502] stutter_wait+0x7f/0x90
> [ 1485.014319] rcu_torture_reader+0x10e/0x280
> [ 1485.015240] ? __pfx_rcu_torture_timer+0x10/0x10
> [ 1485.016317] ? kthread+0xc6/0xf0
> [ 1485.017169] ? __pfx_rcu_torture_reader+0x10/0x10
> [ 1485.018215] kthread+0xc6/0xf0
> [ 1485.018899] ? __pfx_kthread+0x10/0x10
> [ 1485.019703] ret_from_fork+0x2b/0x40
> [ 1485.020546] ? __pfx_kthread+0x10/0x10
> [ 1485.021428] ret_from_fork_asm+0x1b/0x30
> [ 1485.022295] </TASK>
>
> This happens within the following loop
>
> for_each_domain(cpu, sd) {
> for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
> if (cpu == i)
> continue;
>
> if (!idle_cpu(i))
> return i;
> }
> }
>
> An offline CPU is returned from there. Which is not supposed to happen
> as it's within an RCU read side. But I can't manage to find where those
> rq->sd things are modified when a CPU is offlining. The code path is hard
> to follow. Adding some scheduler people involved in topology just in case.
> Investigation continues...

So I just tried a simple test. I applied the following PoC patch, under the
assumption that right after the exit of takedown_cpu(), the target CPU is
offline and all CPUs must see the updates performed by the now dead target
CPU. I also expect that the target CPU should have removed itself from the
scheduling domains (though I don't know where that happens in the maze):

diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
index 6d67e9a5af6b..2e0580abc73a 100644
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -9,6 +9,7 @@
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern int get_nohz_timer_target(void);
+extern void sched_assert_online_domains(int cpu);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
 #endif
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6c0a92ca6bb5..3424d1d40142 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1024,7 +1024,7 @@ void clear_tasks_mm_cpumask(int cpu)
 }

 /* Take this CPU down. */
-static int take_cpu_down(void *_param)
+int take_cpu_down(void *_param)
 {
 	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
 	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..57f483fc1037 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1055,6 +1055,22 @@ void resched_cpu(int cpu)

 #ifdef CONFIG_SMP
 #ifdef CONFIG_NO_HZ_COMMON
+void sched_assert_online_domains(int cpu)
+{
+	struct sched_domain *sd;
+	int i;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		for_each_cpu(i, sched_domain_span(sd)) {
+			if (cpu == i)
+				continue;
+
+			WARN_ON_ONCE(cpu_is_offline(i));
+		}
+	}
+	rcu_read_unlock();
+}
 /*
  * In the semi idle case, use the nearest busy CPU for migrating timers
  * from an idle CPU. This is good for power-savings.
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index cedb17ba158a..354806156bd9 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -22,6 +22,7 @@
 #include <linux/atomic.h>
 #include <linux/nmi.h>
 #include <linux/sched/wake_q.h>
+#include <linux/sched/nohz.h>

 /*
  * Structure to determine completion condition and record errors. May
@@ -199,6 +200,8 @@ notrace void __weak stop_machine_yield(const struct cpumask *cpumask)
 	cpu_relax();
 }

+extern int take_cpu_down(void *_param);
+
 /* This is the cpu_stop function which stops the CPU. */
 static int multi_cpu_stop(void *data)
 {
@@ -255,6 +258,8 @@ static int multi_cpu_stop(void *data)
 	} while (curstate != MULTI_STOP_EXIT);

 	local_irq_restore(flags);
+	if (!is_active && msdata->fn == take_cpu_down)
+		sched_assert_online_domains(smp_processor_id());
 	return err;
 }

Then with that patch I ran TREE07, just some short iterations:

tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2

And the warning triggers very quickly, at least since v6.3 but maybe since
earlier. Is this expected behaviour, or am I right to assume that
for_each_domain()/sched_domain_span() shouldn't return an offline CPU?

Thanks.

2024-03-26 17:12:19

by Valentin Schneider

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)


I've just noticed this thread while cleaning up my inbox, apologies for the
delayed response.

On 22/03/24 14:22, Frederic Weisbecker wrote:
> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>> On Thu, Mar 21, 2024 at 05:47:59AM -0700, Paul E. McKenney wrote:
>> [ 1484.955213] WARNING: CPU: 6 PID: 162 at kernel/time/hrtimer.c:1088 enqueue_hrtimer+0x6f/0x80
>> [ 1484.962513] Modules linked in:
>> [ 1484.966476] CPU: 6 PID: 162 Comm: rcu_torture_rea Not tainted 6.8.0 #25
>> [ 1484.972975] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
>> [ 1484.977653] RIP: 0010:enqueue_hrtimer+0x6f/0x80
>> [ 1484.978679] Code: a3 05 75 6a b7 01 73 bd 48 8b 05 e4 47 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 eb bd ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
>> [ 1484.983136] RSP: 0000:ffffa553805cfd70 EFLAGS: 00010046
>> [ 1484.984386] RAX: ffff8c395f260440 RBX: ffff8c395f260480 RCX: ffff8c395f260440
>> [ 1484.986121] RDX: 0000000000000000 RSI: ffff8c395f260480 RDI: ffffa553805cfde0
>> [ 1484.987704] RBP: ffffa553805cfde0 R08: 0000000000000001 R09: 000000000000fc03
>> [ 1484.989513] R10: 0000000000000001 R11: ffff8c3941248e40 R12: 0000000000000040
>> [ 1484.991116] R13: ffff8c395f260480 R14: ffff8c395f260480 R15: ffff8c395f260440
>> [ 1484.992835] FS: 0000000000000000(0000) GS:ffff8c395f380000(0000) knlGS:0000000000000000
>> [ 1484.994683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1484.995985] CR2: 0000000000000000 CR3: 000000001a62c000 CR4: 00000000000006f0
>> [ 1484.997789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1484.999376] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 1485.001177] Call Trace:
>> [ 1485.002388] <TASK>
>> [ 1485.002923] ? __warn+0x78/0x120
>> [ 1485.003641] ? enqueue_hrtimer+0x6f/0x80
>> [ 1485.004641] ? report_bug+0xf1/0x1d0
>> [ 1485.005537] ? handle_bug+0x40/0x70
>> [ 1485.006318] ? exc_invalid_op+0x13/0x70
>> [ 1485.007167] ? asm_exc_invalid_op+0x16/0x20
>> [ 1485.008117] ? enqueue_hrtimer+0x6f/0x80
>> [ 1485.009131] hrtimer_start_range_ns+0x258/0x2f0
>> [ 1485.010153] ? __pfx_rcu_torture_reader+0x10/0x10
>> [ 1485.011185] schedule_hrtimeout_range_clock+0x96/0x120
>> [ 1485.012394] ? __pfx_hrtimer_wakeup+0x10/0x10
>> [ 1485.013502] stutter_wait+0x7f/0x90
>> [ 1485.014319] rcu_torture_reader+0x10e/0x280
>> [ 1485.015240] ? __pfx_rcu_torture_timer+0x10/0x10
>> [ 1485.016317] ? kthread+0xc6/0xf0
>> [ 1485.017169] ? __pfx_rcu_torture_reader+0x10/0x10
>> [ 1485.018215] kthread+0xc6/0xf0
>> [ 1485.018899] ? __pfx_kthread+0x10/0x10
>> [ 1485.019703] ret_from_fork+0x2b/0x40
>> [ 1485.020546] ? __pfx_kthread+0x10/0x10
>> [ 1485.021428] ret_from_fork_asm+0x1b/0x30
>> [ 1485.022295] </TASK>
>>
>> This happens within the following loop
>>
>> for_each_domain(cpu, sd) {
>> for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
>> if (cpu == i)
>> continue;
>>
>> if (!idle_cpu(i))
>> return i;
>> }
>> }
>>
>> An offline CPU is returned from there. Which is not supposed to happen
>> as it's within an RCU read side. But I can't manage to find where those
>> rq->sd things are modified when a CPU is offlining. The code path is hard
>> to follow. Adding some scheduler people involved in topology just in case.
>> Investigation continues...
>
> So I just tried a simple test. I applied the following PoF patch, under the
> assumption that right after the exit of takedown_cpu(), the target CPU is
> offline and all CPUs must see the updates perfomed by the now dead target
> CPU. I expect also that the target CPU should have removed itself from the
> scheduling domains (though I don't know where that happens in the maze):
>

If it's of any help:

during hot-unplug, the scheduler will start poking the sched_domains in
CPUHP_AP_ACTIVE:sched_cpu_deactivate(). We then get to, in the "worst" case
with CONFIG_CPUSET=y:

  sched_cpu_deactivate()
  `\
    cpuset_cpu_inactive()
    `\
      cpuset_update_active_cpus() # delegated to a workqueue because cpusets are fun

  cpuset_hotplug_workfn()
  `\
    rebuild_sched_domains()
    `\
      rebuild_sched_domains_locked()

And there, eventually, we assign a NULL rq->sd to the offline CPU:

  partition_sched_domains_locked()
  `\
    detach_destroy_domains()
    `\
      for_each_cpu(i, cpu_map)
        cpu_attach_domain(NULL, &def_root_domain, i);
          rcu_assign_pointer(rq->sd, sd);

The steps can be followed with CONFIG_SCHED_DEBUG=y and sched_verbose on
the cmdline:

# echo 0 > /sys/devices/system/cpu/cpu3/online
[ 23.173551] smpboot: CPU 3 is now offline

# That's the detach_destroy_domains() loop:
[ 23.174125] CPU0 attaching NULL sched-domain.
[ 23.174570] CPU1 attaching NULL sched-domain.
[ 23.175008] CPU2 attaching NULL sched-domain.
[ 23.175410] CPU3 attaching NULL sched-domain.

# That's the build_sched_domains() loop:
[ 23.175767] CPU0 attaching sched-domain(s):
[ 23.176091] domain-0: span=0-2 level=MC
[ 23.176381] groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }
[ 23.176805] CPU1 attaching sched-domain(s):
[ 23.177127] domain-0: span=0-2 level=MC
[ 23.177426] groups: 1:{ span=1 }, 2:{ span=2 }, 0:{ span=0 }
[ 23.177849] CPU2 attaching sched-domain(s):
[ 23.178171] domain-0: span=0-2 level=MC
[ 23.178482] groups: 2:{ span=2 }, 0:{ span=0 }, 1:{ span=1 }
[ 23.178902] root domain span: 0-2

> diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
> index 6d67e9a5af6b..2e0580abc73a 100644
> --- a/include/linux/sched/nohz.h
> +++ b/include/linux/sched/nohz.h
> @@ -9,6 +9,7 @@
> #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> extern void nohz_balance_enter_idle(int cpu);
> extern int get_nohz_timer_target(void);
> +extern void sched_assert_online_domains(int cpu);
> #else
> static inline void nohz_balance_enter_idle(int cpu) { }
> #endif
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 6c0a92ca6bb5..3424d1d40142 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1024,7 +1024,7 @@ void clear_tasks_mm_cpumask(int cpu)
> }
>
> /* Take this CPU down. */
> -static int take_cpu_down(void *_param)
> +int take_cpu_down(void *_param)
> {
> struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0d18c3969f90..57f483fc1037 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1055,6 +1055,22 @@ void resched_cpu(int cpu)
>
> #ifdef CONFIG_SMP
> #ifdef CONFIG_NO_HZ_COMMON
> +void sched_assert_online_domains(int cpu)
> +{
> + struct sched_domain *sd;
> + int i;
> +
> + rcu_read_lock();
> + for_each_domain(cpu, sd) {
> + for_each_cpu(i, sched_domain_span(sd)) {
> + if (cpu == i)
> + continue;
> +
> + WARN_ON_ONCE(cpu_is_offline(i));
> + }
> + }
> + rcu_read_unlock();
> +}
> /*
> * In the semi idle case, use the nearest busy CPU for migrating timers
> * from an idle CPU. This is good for power-savings.
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index cedb17ba158a..354806156bd9 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -22,6 +22,7 @@
> #include <linux/atomic.h>
> #include <linux/nmi.h>
> #include <linux/sched/wake_q.h>
> +#include <linux/sched/nohz.h>
>
> /*
> * Structure to determine completion condition and record errors. May
> @@ -199,6 +200,8 @@ notrace void __weak stop_machine_yield(const struct cpumask *cpumask)
> cpu_relax();
> }
>
> +extern int take_cpu_down(void *_param);
> +
> /* This is the cpu_stop function which stops the CPU. */
> static int multi_cpu_stop(void *data)
> {
> @@ -255,6 +258,8 @@ static int multi_cpu_stop(void *data)
> } while (curstate != MULTI_STOP_EXIT);
>
> local_irq_restore(flags);
> + if (!is_active && msdata->fn == take_cpu_down)
> + sched_assert_online_domains(smp_processor_id());
> return err;
> }
>
> Then with that patch I ran TREE07, just some short iterations:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>
> And the warning triggers very quickly. At least since v6.3 but maybe since
> earlier. Is this expected behaviour or am I right to assume that
> for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>

I would very much assume an offline CPU shouldn't show up in a
sched_domain_span().

Now, on top of the above, there's one more thing worth noting:
cpu_up_down_serialize_trainwrecks()

This just flushes the cpuset work, so after that the sched_domain topology
should be sane. However I see it's invoked at the tail end of _cpu_down(),
IOW /after/ takedown_cpu() has run, which sounds too late. The comments
around this vs. lock ordering aren't very reassuring however, so I need to
look into this more.
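
To make the suspected window explicit, the ordering (paraphrased, not verbatim
kernel code) would be roughly:

    _cpu_down()
      ...
      takedown_cpu()                      <- CPU is actually offline from here
      ...
      cpu_up_down_serialize_trainwrecks() <- cpuset work only flushed here
                                             (the sched_domain rebuild lives in
                                              that work with CONFIG_CPUSETS=y)

so anything walking for_each_domain() in between can still see the dead CPU.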

Maybe as a "quick" test to see if this is the right culprit, you could try
that with CONFIG_CPUSET=n? Because in that case the sched_domain update is
run within sched_cpu_deactivate().

> Thanks.


2024-03-27 14:15:18

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

Le Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider a écrit :
> > Then with that patch I ran TREE07, just some short iterations:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
> >
> > And the warning triggers very quickly. At least since v6.3 but maybe since
> > earlier. Is this expected behaviour or am I right to assume that
> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
> >
>
> I would very much assume an offline CPU shouldn't show up in a
> sched_domain_span().
>
> Now, on top of the above, there's one more thing worth noting:
> cpu_up_down_serialize_trainwrecks()
>
> This just flushes the cpuset work, so after that the sched_domain topology
> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> around this vs. lock ordering aren't very reassuring however, so I need to
> look into this more.

Ouch...

>
> Maybe as a "quick" test to see if this is the right culprit, you could try
> that with CONFIG_CPUSET=n? Because in that case the sched_domain update is
> ran within sched_cpu_deactivate().

I just tried and I fear that doesn't help. It still triggers even without
cpusets :-s

Thanks.

2024-03-27 15:18:37

by Valentin Schneider

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

On 27/03/24 13:42, Frederic Weisbecker wrote:
> Le Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider a écrit :
>> > Then with that patch I ran TREE07, just some short iterations:
>> >
>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>> >
>> > And the warning triggers very quickly. At least since v6.3 but maybe since
>> > earlier. Is this expected behaviour or am I right to assume that
>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>> >
>>
>> I would very much assume an offline CPU shouldn't show up in a
>> sched_domain_span().
>>
>> Now, on top of the above, there's one more thing worth noting:
>> cpu_up_down_serialize_trainwrecks()
>>
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> Ouch...
>
>>
>> Maybe as a "quick" test to see if this is the right culprit, you could try
>> that with CONFIG_CPUSET=n? Because in that case the sched_domain update is
>> ran within sched_cpu_deactivate().
>
> I just tried and I fear that doesn't help. It still triggers even without
> cpusets :-s
>

What, you mean I can't always blame cgroups? What has the world come to?

That's interesting, it means the deferred work item isn't the (only)
issue. I'll grab your test patch and try to reproduce on TREE07.

> Thanks.


2024-03-27 20:54:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
> On 22/03/24 14:22, Frederic Weisbecker wrote:
>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> Now, on top of the above, there's one more thing worth noting:
> cpu_up_down_serialize_trainwrecks()
>
> This just flushes the cpuset work, so after that the sched_domain topology
> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> around this vs. lock ordering aren't very reassuring however, so I need to
> look into this more.

commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
the change log.

TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
cpu_hotplug_lock in the hotplug path for whatever silly reason. So
trying to flush the work from within the cpu hotplug lock write held
region is guaranteed to dead lock.

That's why all of this got deferred to a work queue. The flush at the
end of the hotplug code after dropping the hotplug lock is there to
prevent that user space sees the CPU hotplug uevent before the work is
done. But of course between bringing the CPU offline and the flush the
kernel internal state is inconsistent.
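
For illustration (a sketch of the scenario described above, not actual kernel
code), a synchronous flush under the hotplug write lock would deadlock like
this:

    hotplug writer                          cpuset_hotplug_workfn()
    --------------                          -----------------------
    cpus_write_lock()
    ...
    flush cpuset work, wait for it   --->   mutex_lock(&cpuset_mutex)
                                            cpus_read_lock()
                                              <--- blocks on the write holder

The writer waits for the work while the work waits for the writer, so the
update has to be deferred and flushed only after the write lock is dropped.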

There was an attempt to make the CPU hotplug path synchronous in commit
a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
reverted with commit 2b729fe7f3 for the very wrong reason:

https://lore.kernel.org/all/[email protected]/T/#u

(cpu_hotplug_lock){++++}-{0:0}:
lock_acquire+0xe4/0x25c
cpus_read_lock+0x50/0x154
static_key_slow_inc+0x18/0x30
mem_cgroup_css_alloc+0x824/0x8b0
cgroup_apply_control_enable+0x1d8/0x56c
cgroup_apply_control+0x40/0x344
cgroup_subtree_control_write+0x664/0x69c
cgroup_file_write+0x130/0x2e8
kernfs_fop_write+0x228/0x32c
__vfs_write+0x84/0x1d8
vfs_write+0x13c/0x1b4
ksys_write+0xb0/0x120

Instead of the revert this should have been fixed by doing:

+ cpus_read_lock();
mutex_lock();
mem_cgroup_css_alloc();
- static_key_slow_inc();
+ static_key_slow_inc_cpuslocked();

Sorry that I did not notice this back then because I was too focussed on
fixing that uevent nonsense in a halfways sane way.

After that revert cpuset locking became completely insane.

cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
advanced part of that insanity. Seriously this commit is just tasteless
and disgusting demonstration of how to paper over a problem which is not
fully understood.

After staring at it some more (including the history which led up to
these insanities) I really think that the CPU hotplug path can be made
truly synchronous and the memory hotplug path should just get the lock
ordering correct.

Can we please fix this for real and get all of this insanity out of the
way including the nonsensical comments in the cpuset code:

* Call with cpuset_mutex held. Takes cpus_read_lock().

...

lockdep_assert_cpus_held();
lockdep_assert_held(&cpuset_mutex);

Oh well...

Thanks,

tglx

2024-03-28 14:09:25

by Valentin Schneider

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

On 27/03/24 15:28, Valentin Schneider wrote:
> On 27/03/24 13:42, Frederic Weisbecker wrote:
>> Le Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider a écrit :
>>> > Then with that patch I ran TREE07, just some short iterations:
>>> >
>>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>>> >
>>> > And the warning triggers very quickly. At least since v6.3 but maybe since
>>> > earlier. Is this expected behaviour or am I right to assume that
>>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>>> >
>>>
>>> I would very much assume an offline CPU shouldn't show up in a
>>> sched_domain_span().
>>>
>>> Now, on top of the above, there's one more thing worth noting:
>>> cpu_up_down_serialize_trainwrecks()
>>>
>>> This just flushes the cpuset work, so after that the sched_domain topology
>>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>>> around this vs. lock ordering aren't very reassuring however, so I need to
>>> look into this more.
>>
>> Ouch...
>>
>>>
>>> Maybe as a "quick" test to see if this is the right culprit, you could try
>>> that with CONFIG_CPUSET=n? Because in that case the sched_domain update is
>>> ran within sched_cpu_deactivate().
>>
>> I just tried and I fear that doesn't help. It still triggers even without
>> cpusets :-s
>>
>
> What, you mean I can't always blame cgroups? What has the world come to?
>
> That's interesting, it means the deferred work item isn't the (only)
> issue. I'll grab your test patch and try to reproduce on TREE07.
>

Unfortunately I haven't been able to trigger your warning with ~20 runs of
TREE07 & CONFIG_CPUSETS=n, however it does trigger reliably with
CONFIG_CPUSETS=y, so I'm back to thinking the cpuset work is a likely
culprit...

>> Thanks.


2024-03-28 16:58:37

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

Le Thu, Mar 28, 2024 at 03:08:08PM +0100, Valentin Schneider a écrit :
> On 27/03/24 15:28, Valentin Schneider wrote:
> > On 27/03/24 13:42, Frederic Weisbecker wrote:
> >> Le Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider a écrit :
> >>> > Then with that patch I ran TREE07, just some short iterations:
> >>> >
> >>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
> >>> >
> >>> > And the warning triggers very quickly. At least since v6.3 but maybe since
> >>> > earlier. Is this expected behaviour or am I right to assume that
> >>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
> >>> >
> >>>
> >>> I would very much assume an offline CPU shouldn't show up in a
> >>> sched_domain_span().
> >>>
> >>> Now, on top of the above, there's one more thing worth noting:
> >>> cpu_up_down_serialize_trainwrecks()
> >>>
> >>> This just flushes the cpuset work, so after that the sched_domain topology
> >>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> >>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> >>> around this vs. lock ordering aren't very reassuring however, so I need to
> >>> look into this more.
> >>
> >> Ouch...
> >>
> >>>
> >>> Maybe as a "quick" test to see if this is the right culprit, you could try
> >>> that with CONFIG_CPUSET=n? Because in that case the sched_domain update is
> >>> ran within sched_cpu_deactivate().
> >>
> >> I just tried and I fear that doesn't help. It still triggers even without
> >> cpusets :-s
> >>
> >
> > What, you mean I can't always blame cgroups? What has the world come to?
> >
> > That's interesting, it means the deferred work item isn't the (only)
> > issue. I'll grab your test patch and try to reproduce on TREE07.
> >
>
> Unfortunately I haven't been able to trigger your warning with ~20 runs of
> TREE07 & CONFIG_CPUSETS=n, however it does trigger reliably with
> CONFIG_CPUSETS=y, so I'm back to thinking the cpuset work is a likely
> culprit...

Funny, I just checked again and I can still reliably reproduce with:

./tools/testing/selftests/rcutorture/bin/kvm.sh --kconfig "CONFIG_CPUSETS=n CONFIG_PROC_PID_CPUSET=n" --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2

I'm thinking there might be several culprits... ;-)

2024-03-28 20:31:36

by Valentin Schneider

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

On 28/03/24 17:58, Frederic Weisbecker wrote:
> Le Thu, Mar 28, 2024 at 03:08:08PM +0100, Valentin Schneider a écrit :
>> On 27/03/24 15:28, Valentin Schneider wrote:
>> > On 27/03/24 13:42, Frederic Weisbecker wrote:
>> >> Le Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider a écrit :
>> >>> > Then with that patch I ran TREE07, just some short iterations:
>> >>> >
>> >>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>> >>> >
>> >>> > And the warning triggers very quickly. At least since v6.3 but maybe since
>> >>> > earlier. Is this expected behaviour or am I right to assume that
>> >>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>> >>> >
>> >>>
>> >>> I would very much assume an offline CPU shouldn't show up in a
>> >>> sched_domain_span().
>> >>>
>> >>> Now, on top of the above, there's one more thing worth noting:
>> >>> cpu_up_down_serialize_trainwrecks()
>> >>>
>> >>> This just flushes the cpuset work, so after that the sched_domain topology
>> >>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> >>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> >>> around this vs. lock ordering aren't very reassuring however, so I need to
>> >>> look into this more.
>> >>
>> >> Ouch...
>> >>
>> >>>
>> >>> Maybe as a "quick" test to see if this is the right culprit, you could try
>> >>> that with CONFIG_CPUSET=n? Because in that case the sched_domain update is
>> >>> ran within sched_cpu_deactivate().
>> >>
>> >> I just tried and I fear that doesn't help. It still triggers even without
>> >> cpusets :-s
>> >>
>> >
>> > What, you mean I can't always blame cgroups? What has the world come to?
>> >
>> > That's interesting, it means the deferred work item isn't the (only)
>> > issue. I'll grab your test patch and try to reproduce on TREE07.
>> >
>>
>> Unfortunately I haven't been able to trigger your warning with ~20 runs of
>> TREE07 & CONFIG_CPUSETS=n, however it does trigger reliably with
>> CONFIG_CPUSETS=y, so I'm back to thinking the cpuset work is a likely
>> culprit...
>
> Funny, I just checked again and I can still reliably reproduce with:
>
> ./tools/testing/selftests/rcutorture/bin/kvm.sh --kconfig "CONFIG_CPUSETS=n CONFIG_PROC_PID_CPUSET=n" --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>
> I'm thinking there might be several culprits... ;-)

Hmm, frustrating that I can't seem to reproduce this...

Could you run this with CONFIG_SCHED_DEBUG=y and sched_verbose on the
cmdline? And maybe tweak the warning to show which CPU we are scanning the
sched_domain of and which one we found to be offline in the span.
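
In case it helps, something like this on top of the debug patch (an untested
sketch) would print both CPUs:

-			WARN_ON_ONCE(cpu_is_offline(i));
+			/* Report the CPU being scanned and the offline CPU in its span */
+			WARN_ONCE(cpu_is_offline(i),
+				  "CPU%d has offline CPU%d in its sched_domain span\n",
+				  cpu, i);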


2024-03-28 20:41:03

by Valentin Schneider

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

On 27/03/24 21:42, Thomas Gleixner wrote:
> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>> Now, on top of the above, there's one more thing worth noting:
>> cpu_up_down_serialize_trainwrecks()
>>
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> the change log.
>
> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> trying to flush the work from within the cpu hotplug lock write held
> region is guaranteed to dead lock.
>
> That's why all of this got deferred to a work queue. The flush at the
> end of the hotplug code after dropping the hotplug lock is there to
> prevent that user space sees the CPU hotplug uevent before the work is
> done. But of course between bringing the CPU offline and the flush the
> kernel internal state is inconsistent.
>

Thanks for the summary!

> There was an attempt to make the CPU hotplug path synchronous in commit
> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> reverted with commit 2b729fe7f3 for the very wrong reason:
>
> https://lore.kernel.org/all/[email protected]/T/#u
>
> (cpu_hotplug_lock){++++}-{0:0}:
> lock_acquire+0xe4/0x25c
> cpus_read_lock+0x50/0x154
> static_key_slow_inc+0x18/0x30
> mem_cgroup_css_alloc+0x824/0x8b0
> cgroup_apply_control_enable+0x1d8/0x56c
> cgroup_apply_control+0x40/0x344
> cgroup_subtree_control_write+0x664/0x69c
> cgroup_file_write+0x130/0x2e8
> kernfs_fop_write+0x228/0x32c
> __vfs_write+0x84/0x1d8
> vfs_write+0x13c/0x1b4
> ksys_write+0xb0/0x120
>
> Instead of the revert this should have been fixed by doing:
>
> + cpus_read_lock();
> mutex_lock();
> mem_cgroup_css_alloc();
> - static_key_slow_inc();
> + static_key_slow_inc_cpuslocked();
>

So looking at the state of things now, writing to the
cgroup.subtree_control file looks like:

cgroup_file_write()
`\
cgroup_subtree_control_write()
`\
cgroup_kn_lock_live()
`\
| cgroup_lock()
| `\
| mutex_lock(&cgroup_mutex);
|
cgroup_apply_control()
`\
cgroup_apply_control_enable()
`\
css_create()
`\
ss->css_alloc() [mem_cgroup_css_alloc()]
`\
static_branch_inc()

and same with cgroup_mkdir(). So if we want to fix the ordering that caused
the revert, we'd probably want to go for:

static inline void cgroup_lock(void)
{
+ cpus_read_lock();
mutex_lock(&cgroup_mutex);
}

static inline void cgroup_unlock(void)
{
mutex_unlock(&cgroup_mutex);
- cpus_read_unlock();
}

+ a handful of +_cpuslocked() changes.

As for cpuset, it looks like it's following this lock order:

cpus_read_lock();
mutex_lock(&cpuset_mutex);

Which AFAICT is what we want.

> Sorry that I did not notice this back then because I was too focussed on
> fixing that uevent nonsense in a halfways sane way.
>
> After that revert cpuset locking became completely insane.
>
> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> advanced part of that insanity. Seriously this commit is just tasteless
> and disgusting demonstration of how to paper over a problem which is not
> fully understood.
>
> After staring at it some more (including the history which led up to
> these insanities) I really think that the CPU hotplug path can be made
> truly synchronous and the memory hotplug path should just get the lock
> ordering correct.
>
> Can we please fix this for real and get all of this insanity out of the
> way

Yes please!

> including the nonsensical comments in the cpuset code:
>
> * Call with cpuset_mutex held. Takes cpus_read_lock().
>
> ...
>
> lockdep_assert_cpus_held();
> lockdep_assert_held(&cpuset_mutex);
>
> Oh well...
>
> Thanks,
>
> tglx


2024-03-29 02:20:50

by Tejun Heo

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

(cc'ing Waiman and quoting whole body)

Hello,

On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
> On 27/03/24 21:42, Thomas Gleixner wrote:
> > On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
> >> On 22/03/24 14:22, Frederic Weisbecker wrote:
> >>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> >> Now, on top of the above, there's one more thing worth noting:
> >> cpu_up_down_serialize_trainwrecks()
> >>
> >> This just flushes the cpuset work, so after that the sched_domain topology
> >> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> >> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> >> around this vs. lock ordering aren't very reassuring however, so I need to
> >> look into this more.
> >
> > commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> > the change log.
> >
> > TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> > cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> > trying to flush the work from within the cpu hotplug lock write held
> > region is guaranteed to dead lock.
> >
> > That's why all of this got deferred to a work queue. The flush at the
> > end of the hotplug code after dropping the hotplug lock is there to
> > prevent that user space sees the CPU hotplug uevent before the work is
> > done. But of course between bringing the CPU offline and the flush the
> > kernel internal state is inconsistent.
> >
>
> Thanks for the summary!
>
> > There was an attempt to make the CPU hotplug path synchronous in commit
> > a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> > reverted with commit 2b729fe7f3 for the very wrong reason:
> >
> > https://lore.kernel.org/all/[email protected]/T/#u
> >
> > (cpu_hotplug_lock){++++}-{0:0}:
> > lock_acquire+0xe4/0x25c
> > cpus_read_lock+0x50/0x154
> > static_key_slow_inc+0x18/0x30
> > mem_cgroup_css_alloc+0x824/0x8b0
> > cgroup_apply_control_enable+0x1d8/0x56c
> > cgroup_apply_control+0x40/0x344
> > cgroup_subtree_control_write+0x664/0x69c
> > cgroup_file_write+0x130/0x2e8
> > kernfs_fop_write+0x228/0x32c
> > __vfs_write+0x84/0x1d8
> > vfs_write+0x13c/0x1b4
> > ksys_write+0xb0/0x120
> >
> > Instead of the revert this should have been fixed by doing:
> >
> > + cpus_read_lock();
> > mutex_lock();
> > mem_cgroup_css_alloc();
> > - static_key_slow_inc();
> > + static_key_slow_inc_cpuslocked();
> >
>
> So looking at the state of things now, writing to the
> cgroup.subtree_control file looks like:
>
> cgroup_file_write()
> `\
> cgroup_subtree_control_write()
> `\
> cgroup_kn_lock_live()
> `\
> | cgroup_lock()
> | `\
> | mutex_lock(&cgroup_mutex);
> |
> cgroup_apply_control()
> `\
> cgroup_apply_control_enable()
> `\
> css_create()
> `\
> ss->css_alloc() [mem_cgroup_css_alloc()]
> `\
> static_branch_inc()
>
> and same with cgroup_mkdir(). So if we want to fix the ordering that caused
> the revert, we'd probably want to go for:
>
> static inline void cgroup_lock(void)
> {
> + cpus_read_lock();
> mutex_lock(&cgroup_mutex);
> }
>
> static inline void cgroup_unlock(void)
> {
> mutex_unlock(&cgroup_mutex);
> - cpus_read_unlock();
> }
>
> + a handful of +_cpuslocked() changes.
>
> As for cpuset, it looks like it's following this lock order:
>
> cpus_read_lock();
> mutex_lock(&cpuset_mutex);
>
> Which AFAICT is what we want.
>
> > Sorry that I did not notice this back then because I was too focussed on
> > fixing that uevent nonsense in a halfways sane way.
> >
> > After that revert cpuset locking became completely insane.
> >
> > cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> > advanced part of that insanity. Seriously this commit is just tasteless
> > and disgusting demonstration of how to paper over a problem which is not
> > fully understood.
> >
> > After staring at it some more (including the history which led up to
> > these insanities) I really think that the CPU hotplug path can be made
> > truly synchronous and the memory hotplug path should just get the lock
> > ordering correct.
> >
> > Can we please fix this for real and get all of this insanity out of the
> > way
>
> Yes please!

Yeah, making that operation synchronous would be nice. Waiman, you're a lot
more familiar with this part than I am. Can you please take a look and see
whether we can turn the sched domain updates synchronous?

Thanks.

--
tejun

2024-03-29 17:14:07

by Waiman Long

[permalink] [raw]
Subject: Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

On 3/28/24 22:08, Tejun Heo wrote:
> (cc'ing Waiman and quoting whole body)
>
> Hello,
>
> On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
>> On 27/03/24 21:42, Thomas Gleixner wrote:
>>> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>>>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>>>> Now, on top of the above, there's one more thing worth noting:
>>>> cpu_up_down_serialize_trainwrecks()
>>>>
>>>> This just flushes the cpuset work, so after that the sched_domain topology
>>>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>>>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>>>> around this vs. lock ordering aren't very reassuring however, so I need to
>>>> look into this more.
>>> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
>>> the change log.
>>>
>>> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
>>> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
>>> trying to flush the work from within the cpu hotplug lock write held
>>> region is guaranteed to dead lock.
>>>
>>> That's why all of this got deferred to a work queue. The flush at the
>>> end of the hotplug code after dropping the hotplug lock is there to
>>> prevent that user space sees the CPU hotplug uevent before the work is
>>> done. But of course between bringing the CPU offline and the flush the
>>> kernel internal state is inconsistent.
>>>
>> Thanks for the summary!
>>
>>> There was an attempt to make the CPU hotplug path synchronous in commit
>>> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
>>> reverted with commit 2b729fe7f3 for the very wrong reason:
>>>
>>> https://lore.kernel.org/all/[email protected]/T/#u
>>>
>>> (cpu_hotplug_lock){++++}-{0:0}:
>>> lock_acquire+0xe4/0x25c
>>> cpus_read_lock+0x50/0x154
>>> static_key_slow_inc+0x18/0x30
>>> mem_cgroup_css_alloc+0x824/0x8b0
>>> cgroup_apply_control_enable+0x1d8/0x56c
>>> cgroup_apply_control+0x40/0x344
>>> cgroup_subtree_control_write+0x664/0x69c
>>> cgroup_file_write+0x130/0x2e8
>>> kernfs_fop_write+0x228/0x32c
>>> __vfs_write+0x84/0x1d8
>>> vfs_write+0x13c/0x1b4
>>> ksys_write+0xb0/0x120
>>>
>>> Instead of the revert this should have been fixed by doing:
>>>
>>> + cpus_read_lock();
>>> mutex_lock();
>>> mem_cgroup_css_alloc();
>>> - static_key_slow_inc();
>>> + static_key_slow_inc_cpuslocked();
>>>
>> So looking at the state of things now, writing to the
>> cgroup.subtree_control file looks like:
>>
>> cgroup_file_write()
>> `\
>> cgroup_subtree_control_write()
>> `\
>> cgroup_kn_lock_live()
>> `\
>> | cgroup_lock()
>> | `\
>> | mutex_lock(&cgroup_mutex);
>> |
>> cgroup_apply_control()
>> `\
>> cgroup_apply_control_enable()
>> `\
>> css_create()
>> `\
>> ss->css_alloc() [mem_cgroup_css_alloc()]
>> `\
>> static_branch_inc()
>>
>> and same with cgroup_mkdir(). So if we want to fix the ordering that caused
>> the revert, we'd probably want to go for:
>>
>> static inline void cgroup_lock(void)
>> {
>> + cpus_read_lock();
>> mutex_lock(&cgroup_mutex);
>> }
>>
>> static inline void cgroup_unlock(void)
>> {
>> mutex_unlock(&cgroup_mutex);
>> - cpus_read_unlock();
>> }
>>
>> + a handful of +_cpuslocked() changes.
>>
>> As for cpuset, it looks like it's following this lock order:
>>
>> cpus_read_lock();
>> mutex_lock(&cpuset_mutex);
>>
>> Which AFAICT is what we want.
>>
>>> Sorry that I did not notice this back then because I was too focussed on
>>> fixing that uevent nonsense in a halfways sane way.
>>>
>>> After that revert cpuset locking became completely insane.
>>>
>>> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
>>> advanced part of that insanity. Seriously this commit is just tasteless
>>> and disgusting demonstration of how to paper over a problem which is not
>>> fully understood.
>>>
>>> After staring at it some more (including the history which led up to
>>> these insanities) I really think that the CPU hotplug path can be made
>>> truly synchronous and the memory hotplug path should just get the lock
>>> ordering correct.
>>>
>>> Can we please fix this for real and get all of this insanity out of the
>>> way
>> Yes please!
> Yeah, making that operation synchronous would be nice. Waiman, you're a lot
> more familiar with this part than I am. Can you please take a look and see
> whether we can turn the sched domain updates synchronous?

Sure. I will take a look at how to make the cpu hotplug sched domain update
synchronous.

Cheers,
Longman


2024-04-01 21:26:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Thu, Mar 21, 2024 at 05:47:59AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 21, 2024 at 12:42:09PM +0100, Frederic Weisbecker wrote:
> > On Wed, Mar 20, 2024 at 03:55:17PM -0700, Paul E. McKenney wrote:
> > > On Wed, Mar 20, 2024 at 05:15:48PM +0100, Frederic Weisbecker wrote:
> > > > > Le Wed, Mar 20, 2024 at 04:14:24AM -0700, Paul E. McKenney a écrit :
> > > > > On Tue, Mar 19, 2024 at 02:18:00AM -0700, Paul E. McKenney wrote:
> > > > > > On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> > > > > > > While running in nohz_full mode, a task may enqueue a timer while the
> > > > > > > tick is stopped. However the only places where the timer wheel,
> > > > > > > alongside the timer migration machinery's decision, may reprogram the
> > > > > > > next event accordingly with that new timer's expiry are the idle loop or
> > > > > > > any IRQ tail.
> > > > > > >
> > > > > > > However neither the idle task nor an interrupt may run on the CPU if it
> > > > > > > resumes busy work in userspace for a long while in full dynticks mode.
> > > > > > >
> > > > > > > To solve this, the timer enqueue path raises a self-IPI that will
> > > > > > > re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> > > > > > > avoids potential locking inversion.
> > > > > > >
> > > > > > > This is supposed to happen both for local and global timers but commit:
> > > > > > >
> > > > > > > b2cf7507e186 ("timers: Always queue timers on the local CPU")
> > > > > > >
> > > > > > > broke the global timers case with removing the ->is_idle field handling
> > > > > > > for the global base. As a result, global timers enqueue may go unnoticed
> > > > > > > in nohz_full.
> > > > > > >
> > > > > > > Fix this with restoring the idle tracking of the global timer's base,
> > > > > > > allowing self-IPIs again on enqueue time.
> > > > > >
> > > > > > Testing with the previous patch (1/2 in this series) reduced the number of
> > > > > > problems by about an order of magnitude, down to two sched_tick_remote()
> > > > > > instances and one enqueue_hrtimer() instance, very good!
> > > > > >
> > > > > > I have kicked off a test including this patch. Here is hoping! ;-)
> > > > >
> > > > > And 22*100 hours of TREE07 got me one run with a sched_tick_remote()
> > >
> > > Sigh. s/22/12/
> > >
> > > > > complaint and another run with a starved RCU grace-period kthread.
> > > > > So this is definitely getting more reliable, but still a little ways
> > > > > to go.
> > >
> > > An additional eight hours got another sched_tick_remote().
> > >
> > > > Right, there is clearly something else. Investigation continues...
> > >
> > > Please let me know if there is a debug patch that I could apply.
> >
> > So there are three things:
> >
> > _ The sched_tick_remote() warning. I can easily trigger this one and while
> > trying a bisection, I realize it actually also triggers on v6.8
> > I'm not really tempted to investigate further because the warning doesn't make
> > much sense to me. This computes the delta between the time the kworker got
> > scheduled and the time it reaches the middle of the work. It happens to be
> > ~3s but isn't it something to be expected with jitter and all involved into
> > rcutorture?
> >
> > We should probably just remove this warning. This remote tick is my most horrible
> > hackery anyway.
>
> Would it make sense to increase it to somewhere around the timeout for
> RCU CPU stall warnings, softlockup, and so on?
>
> > _ The hrtimer enqueue to an offline CPU warning:
> >
> > [ 1054.265335] WARNING: CPU: 8 PID: 150 at kernel/time/hrtimer.c:1091 enqueue_hrtimer+0x6f/0x80
> > [ 1054.269166] Modules linked in:
> > [ 1054.270077] CPU: 8 PID: 150 Comm: rcu_torture_rea Not tainted 6.8.0-11407-ge990136580ab-dirty #21
> > [ 1054.272768] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> > [ 1054.275893] RIP: 0010:enqueue_hrtimer+0x6f/0x80
> > [ 1054.277252] Code: a3 05 b5 ff b7 01 73 bd 48 8b 05 44 c5 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 7b b5 ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> > [ 1054.282448] RSP: 0000:ffffb1480056fd70 EFLAGS: 00010046
> > [ 1054.283931] RAX: ffff95b7df2616c0 RBX: ffff95b7df261700 RCX: ffff95b7df2616c0
> > [ 1054.286091] RDX: 0000000000000001 RSI: ffff95b7df261700 RDI: ffffb1480056fde0
> > [ 1054.288095] RBP: ffffb1480056fde0 R08: 0000000000000001 R09: 000000000000fc03
> > [ 1054.290189] R10: 0000000000000001 R11: ffff95b7c1271c80 R12: 0000000000000040
> > [ 1054.292592] R13: ffff95b7df261700 R14: ffff95b7df261700 R15: ffff95b7df2616c0
> > [ 1054.294622] FS: 0000000000000000(0000) GS:ffff95b7df400000(0000) knlGS:0000000000000000
> > [ 1054.296884] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1054.298497] CR2: 0000000000000000 CR3: 000000001822c000 CR4: 00000000000006f0
> > [ 1054.300475] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 1054.302516] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 1054.304580] Call Trace:
> > [ 1054.305354] <TASK>
> > [ 1054.306009] ? __warn+0x77/0x120
> > [ 1054.306973] ? enqueue_hrtimer+0x6f/0x80
> > [ 1054.308177] ? report_bug+0xf1/0x1d0
> > [ 1054.310200] ? handle_bug+0x40/0x70
> > [ 1054.311252] ? exc_invalid_op+0x13/0x70
> > [ 1054.312437] ? asm_exc_invalid_op+0x16/0x20
> > [ 1054.313738] ? enqueue_hrtimer+0x6f/0x80
> > [ 1054.314994] hrtimer_start_range_ns+0x258/0x2f0
> > [ 1054.316531] ? __pfx_rcu_torture_reader+0x10/0x10
> > [ 1054.317983] schedule_hrtimeout_range_clock+0x95/0x120
> > [ 1054.319604] ? __pfx_hrtimer_wakeup+0x10/0x10
> > [ 1054.320957] torture_hrtimeout_ns+0x50/0x70
> > [ 1054.322211] rcu_torture_reader+0x1be/0x280
> > [ 1054.323455] ? __pfx_rcu_torture_timer+0x10/0x10
> > [ 1054.329888] ? kthread+0xc4/0xf0
> > [ 1054.330880] ? __pfx_rcu_torture_reader+0x10/0x10
> > [ 1054.332334] kthread+0xc4/0xf0
> > [ 1054.333305] ? __pfx_kthread+0x10/0x10
> > [ 1054.334466] ret_from_fork+0x2b/0x40
> > [ 1054.335551] ? __pfx_kthread+0x10/0x10
> > [ 1054.336690] ret_from_fork_asm+0x1a/0x30
> > [ 1054.337878] </TASK>
> >
> > I don't know how it happened. This is where I'm investigating now.
>
> It seems that rcutorture is the gift that keeps on giving? ;-)
>
> > _ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
> > enqueue to an offline CPU. Let's solve that and we'll see if it still
> > triggers.
>
> Sounds like a plan!

Just checking in on this one. I did reproduce your RCU CPU stall report
and also saw a TREE03 OOM that might (or might not) be related. Please
let me know if hammering TREE03 harder or adding some debug would help.
Otherwise, I will assume that you are getting sufficient bug reports
from your own testing to be getting along with.

Thanx, Paul

2024-04-01 21:56:46

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

Le Mon, Apr 01, 2024 at 02:26:25PM -0700, Paul E. McKenney a écrit :
> > > _ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
> > > enqueue to an offline CPU. Let's solve that and we'll see if it still
> > > triggers.
> >
> > Sounds like a plan!
>
> Just checking in on this one. I did reproduce your RCU CPU stall report
> and also saw a TREE03 OOM that might (or might not) be related. Please
> let me know if hammering TREE03 harder or adding some debug would help.
> Otherwise, I will assume that you are getting sufficient bug reports
> from your own testing to be getting along with.

Hehe, there are a lot indeed :-)

So there has been some discussion on CPUSET VS Hotplug, as a problem there
is likely the cause of the hrtimer warning you saw, which in turn might
be the cause of the RCU stalls.

Do you always see the hrtimer warning along the RCU stalls? Because if so, this
might help:
https://lore.kernel.org/lkml/[email protected]/T/#m1bed4d298715d1a6b8289ed48e9353993c63c896

Thanks.

>
> Thanx, Paul

2024-04-02 00:04:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Mon, Apr 01, 2024 at 11:56:36PM +0200, Frederic Weisbecker wrote:
> Le Mon, Apr 01, 2024 at 02:26:25PM -0700, Paul E. McKenney a écrit :
> > > > _ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
> > > > enqueue to an offline CPU. Let's solve that and we'll see if it still
> > > > triggers.
> > >
> > > Sounds like a plan!
> >
> > Just checking in on this one. I did reproduce your RCU CPU stall report
> > and also saw a TREE03 OOM that might (or might not) be related. Please
> > let me know if hammering TREE03 harder or adding some debug would help.
> > Otherwise, I will assume that you are getting sufficient bug reports
> > from your own testing to be getting along with.
>
> Hehe, there are a lot indeed :-)
>
> So there has been some discussion on CPUSET VS Hotplug, as a problem there
> is likely the cause of the hrtimer warning you saw, which in turn might
> be the cause of the RCU stalls.
>
> Do you always see the hrtimer warning along the RCU stalls? Because if so, this
> might help:
> https://lore.kernel.org/lkml/[email protected]/T/#m1bed4d298715d1a6b8289ed48e9353993c63c896

Not always, but why not give it a shot?

Thanx, Paul

2024-04-02 16:53:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Mon, Apr 01, 2024 at 05:04:10PM -0700, Paul E. McKenney wrote:
> On Mon, Apr 01, 2024 at 11:56:36PM +0200, Frederic Weisbecker wrote:
> > Le Mon, Apr 01, 2024 at 02:26:25PM -0700, Paul E. McKenney a écrit :
> > > > > _ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
> > > > > enqueue to an offline CPU. Let's solve that and we'll see if it still
> > > > > triggers.
> > > >
> > > > Sounds like a plan!
> > >
> > > Just checking in on this one. I did reproduce your RCU CPU stall report
> > > and also saw a TREE03 OOM that might (or might not) be related. Please
> > > let me know if hammering TREE03 harder or adding some debug would help.
> > > Otherwise, I will assume that you are getting sufficient bug reports
> > > from your own testing to be getting along with.
> >
> > Hehe, there are a lot indeed :-)
> >
> > So there has been some discussion on CPUSET VS Hotplug, as a problem there
> > is likely the cause of the hrtimer warning you saw, which in turn might
> > be the cause of the RCU stalls.
> >
> > Do you always see the hrtimer warning along the RCU stalls? Because if so, this
> > might help:
> > https://lore.kernel.org/lkml/[email protected]/T/#m1bed4d298715d1a6b8289ed48e9353993c63c896
>
> Not always, but why not give it a shot?

And no failures, though I would need to run much longer for this to
mean much. These were wide-spectrum tests, so my next step will be to
run only TREE03 and TREE07.

Thanx, Paul

2024-04-03 18:09:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

On Tue, Apr 02, 2024 at 09:47:37AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 01, 2024 at 05:04:10PM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 01, 2024 at 11:56:36PM +0200, Frederic Weisbecker wrote:
> > > Le Mon, Apr 01, 2024 at 02:26:25PM -0700, Paul E. McKenney a écrit :
> > > > > > _ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
> > > > > > enqueue to an offline CPU. Let's solve that and we'll see if it still
> > > > > > triggers.
> > > > >
> > > > > Sounds like a plan!
> > > >
> > > > Just checking in on this one. I did reproduce your RCU CPU stall report
> > > > and also saw a TREE03 OOM that might (or might not) be related. Please
> > > > let me know if hammering TREE03 harder or adding some debug would help.
> > > > Otherwise, I will assume that you are getting sufficient bug reports
> > > > from your own testing to be getting along with.
> > >
> > > Hehe, there are a lot indeed :-)
> > >
> > > So there has been some discussion on CPUSET VS Hotplug, as a problem there
> > > is likely the cause of the hrtimer warning you saw, which in turn might
> > > be the cause of the RCU stalls.
> > >
> > > Do you always see the hrtimer warning along the RCU stalls? Because if so, this
> > > might help:
> > > https://lore.kernel.org/lkml/[email protected]/T/#m1bed4d298715d1a6b8289ed48e9353993c63c896
> >
> > Not always, but why not give it a shot?
>
> And no failures, though I would need to run much longer for this to
> mean much. These were wide-spectrum tests, so my next step will be to
> run only TREE03 and TREE07.

And 600 hours each of TREE03 and TREE07 got me a single TREE07 instance
of the sched_tick_remote() failure. This one:

WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);

But this is just rcutorture testing out "short" 14-second stalls, which
can only be expected to trigger this from time to time. The point of
this stall is to test the evasive actions that RCU takes when 50% of
the way to the RCU CPU stall timeout.

One approach would be to increase that "3" to "15", but that sounds
quite fragile. Another would be for rcutorture to communicate the fact
that stall testing is in progress, and then this WARN_ON_ONCE() could
silence itself in that case.
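
For the second option, a minimal sketch (the flag and its name are made up
here for illustration, not an existing rcutorture interface) might look like:

	/* Set by rcutorture around its forced-stall testing. */
	extern bool rcu_stall_test_in_progress;

	/* In sched_tick_remote(), instead of the bare warning: */
	WARN_ON_ONCE(!rcu_stall_test_in_progress &&
		     delta > (u64)NSEC_PER_SEC * 3);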

But is there a better approach?

Thanx, Paul