LinuxLists.cc - [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed

[permalink] [raw]

Subject: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work

Hello,

This patch was applied as aae17ebb53cd3da but as Chris reports with this
commit the kernel can crash at boot time because __queue_delayed_work()
doesn't check that housekeeping_any_cpu() returns a valid cpu < nr_cpu_ids.

Just boot the kernel with nohz_full=mask which includes the boot cpu, say
nohz_full=0-6 on a machine with 8 CPUs. __queue_delayed_work() will use
add_timer_on(timer, NR_CPUS /* returned by housekeeping_any_cpu */) until
start_secondary() brings CPU 7 up.

The problem is simple, but I do not know what should we do, I know nothing
about CPU isolation.

We can fix __queue_delayed_work(), this is simple, but other callers of
housekeeping_any_cpu() seem to assume it must always return a valid CPU
too. So perhaps we should change housekeeping_any_cpu()

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (cpu < nr_cpu_ids)
+ return cpu;

?

But I'm afraid this can hide other problems. May be

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+
+ WARN_ON(system_state > SYSTEM_BOOTING);

?

-------------------------------------------------------------------------------
OTOH, Documentation/timers/no_hz.rst says

Therefore, the
boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
"nohz_full=" mask that includes the boot CPU will result in a boot-time
error message, and the boot CPU will be removed from the mask.

and this doesn't match the reality.

So it seems that we should fix housekeeping_setup() ? see the patch below.

In any case the usage of cpu_present_mask doesn't look right to me.

Oleg.

--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -129,7 +154,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
cpumask_andnot(housekeeping_staging,
cpu_possible_mask, non_housekeeping_mask);

- if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
+ if (!cpumask_test_cpu(smp_processor_id(), housekeeping_staging)) {
__cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
if (!housekeeping.flags) {

On 01/29, Leonardo Bras wrote:
>
> When __queue_delayed_work() is called, it chooses a cpu for handling the
> timer interrupt. As of today, it will pick either the cpu passed as
> parameter or the last cpu used for this.
>
> This is not good if a system does use CPU isolation, because it can take
> away some valuable cpu time to:
> 1 - deal with the timer interrupt,
> 2 - schedule-out the desired task,
> 3 - queue work on a random workqueue, and
> 4 - schedule the desired task back to the cpu.
>
> So to fix this, during __queue_delayed_work(), if cpu isolation is in
> place, pick a random non-isolated cpu to handle the timer interrupt.
>
> As an optimization, if the current cpu is not isolated, use it instead
> of looking for another candidate.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> Changes since v1:
> - Make sure the CPU is isolated for any value of "cpu"
>
> Changes since RFC:
> - Do not use the same cpu from the timer for queueing the work.
> - If the current cpu is not isolated, use it's timer instead of
> looking for another candidate.
>
> kernel/workqueue.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 76e60faed8923..8dd7c01b326a4 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1958,10 +1958,18 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
> dwork->cpu = cpu;
> timer->expires = jiffies + delay;
>
> - if (unlikely(cpu != WORK_CPU_UNBOUND))
> + if (housekeeping_enabled(HK_TYPE_TIMER)) {
> + /* If the current cpu is a housekeeping cpu, use it. */
> + cpu = smp_processor_id();
> + if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
> + cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
> add_timer_on(timer, cpu);
> - else
> - add_timer(timer);
> + } else {
> + if (likely(cpu == WORK_CPU_UNBOUND))
> + add_timer(timer);
> + else
> + add_timer_on(timer, cpu);
> + }
> }
>
> /**
> --
> 2.43.0
>

2024-04-03 19:12:40

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work

(cc'ing Frederic and quoting whole body)

Hello, Oleg.

On Tue, Apr 02, 2024 at 12:58:47PM +0200, Oleg Nesterov wrote:
> This patch was applied as aae17ebb53cd3da but as Chris reports with this
> commit the kernel can crash at boot time because __queue_delayed_work()
> doesn't check that housekeeping_any_cpu() returns a valid cpu < nr_cpu_ids.
>
> Just boot the kernel with nohz_full=mask which includes the boot cpu, say
> nohz_full=0-6 on a machine with 8 CPUs. __queue_delayed_work() will use
> add_timer_on(timer, NR_CPUS /* returned by housekeeping_any_cpu */) until
> start_secondary() brings CPU 7 up.
>
> The problem is simple, but I do not know what should we do, I know nothing
> about CPU isolation.
>
> We can fix __queue_delayed_work(), this is simple, but other callers of
> housekeeping_any_cpu() seem to assume it must always return a valid CPU
> too. So perhaps we should change housekeeping_any_cpu()

Yeah, patching this up from wq side is easy but housekeeping_any_cpu()
always being able to pick a housekeeping CPU would be better.

> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (cpu < nr_cpu_ids)
> + return cpu;
>
> ?
>
> But I'm afraid this can hide other problems. May be
>
> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (cpu < nr_cpu_ids)
> + return cpu;
> +
> + WARN_ON(system_state > SYSTEM_BOOTING);
>
> ?
>
> -------------------------------------------------------------------------------
> OTOH, Documentation/timers/no_hz.rst says
>
> Therefore, the
> boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> "nohz_full=" mask that includes the boot CPU will result in a boot-time
> error message, and the boot CPU will be removed from the mask.
>
> and this doesn't match the reality.

Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
really solve the problem, right?

> So it seems that we should fix housekeeping_setup() ? see the patch below.
>
> In any case the usage of cpu_present_mask doesn't look right to me.
>
> Oleg.
>
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -129,7 +154,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> cpumask_andnot(housekeeping_staging,
> cpu_possible_mask, non_housekeeping_mask);
>
> - if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
> + if (!cpumask_test_cpu(smp_processor_id(), housekeeping_staging)) {
> __cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
> __cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
> if (!housekeeping.flags) {
>
>
> On 01/29, Leonardo Bras wrote:
> >
> > When __queue_delayed_work() is called, it chooses a cpu for handling the
> > timer interrupt. As of today, it will pick either the cpu passed as
> > parameter or the last cpu used for this.
> >
> > This is not good if a system does use CPU isolation, because it can take
> > away some valuable cpu time to:
> > 1 - deal with the timer interrupt,
> > 2 - schedule-out the desired task,
> > 3 - queue work on a random workqueue, and
> > 4 - schedule the desired task back to the cpu.
> >
> > So to fix this, during __queue_delayed_work(), if cpu isolation is in
> > place, pick a random non-isolated cpu to handle the timer interrupt.
> >
> > As an optimization, if the current cpu is not isolated, use it instead
> > of looking for another candidate.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
> > ---
> > Changes since v1:
> > - Make sure the CPU is isolated for any value of "cpu"
> >
> > Changes since RFC:
> > - Do not use the same cpu from the timer for queueing the work.
> > - If the current cpu is not isolated, use it's timer instead of
> > looking for another candidate.
> >
> > kernel/workqueue.c | 14 +++++++++++---
> > 1 file changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 76e60faed8923..8dd7c01b326a4 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -1958,10 +1958,18 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
> > dwork->cpu = cpu;
> > timer->expires = jiffies + delay;
> >
> > - if (unlikely(cpu != WORK_CPU_UNBOUND))
> > + if (housekeeping_enabled(HK_TYPE_TIMER)) {
> > + /* If the current cpu is a housekeeping cpu, use it. */
> > + cpu = smp_processor_id();
> > + if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
> > + cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
> > add_timer_on(timer, cpu);
> > - else
> > - add_timer(timer);
> > + } else {
> > + if (likely(cpu == WORK_CPU_UNBOUND))
> > + add_timer(timer);
> > + else
> > + add_timer_on(timer, cpu);
> > + }
> > }
> >
> > /**
> > --
> > 2.43.0
> >
>

--
tejun

2024-04-03 20:44:57

[permalink] [raw]

Subject: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work

Hi Tejun,

On 04/03, Tejun Heo wrote:
>
> (cc'ing Frederic and quoting whole body)
>
> Hello, Oleg.
>
> On Tue, Apr 02, 2024 at 12:58:47PM +0200, Oleg Nesterov wrote:
> > This patch was applied as aae17ebb53cd3da but as Chris reports with this
> > commit the kernel can crash at boot time because __queue_delayed_work()
> > doesn't check that housekeeping_any_cpu() returns a valid cpu < nr_cpu_ids.
> >
> > Just boot the kernel with nohz_full=mask which includes the boot cpu, say
> > nohz_full=0-6 on a machine with 8 CPUs. __queue_delayed_work() will use
> > add_timer_on(timer, NR_CPUS /* returned by housekeeping_any_cpu */) until
> > start_secondary() brings CPU 7 up.
> >
> > The problem is simple, but I do not know what should we do, I know nothing
> > about CPU isolation.
> >
> > We can fix __queue_delayed_work(), this is simple, but other callers of
> > housekeeping_any_cpu() seem to assume it must always return a valid CPU
> > too. So perhaps we should change housekeeping_any_cpu()
>
> Yeah, patching this up from wq side is easy but housekeeping_any_cpu()
> always being able to pick a housekeeping CPU would be better.
>
> > - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > + if (cpu < nr_cpu_ids)
> > + return cpu;
> >
> > ?
> >
> > But I'm afraid this can hide other problems. May be
> >
> > - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > + if (cpu < nr_cpu_ids)
> > + return cpu;
> > +
> > + WARN_ON(system_state > SYSTEM_BOOTING);
> >
> > ?
> >
> > -------------------------------------------------------------------------------
> > OTOH, Documentation/timers/no_hz.rst says
> >
> > Therefore, the
> > boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > "nohz_full=" mask that includes the boot CPU will result in a boot-time
> > error message, and the boot CPU will be removed from the mask.
> >
> > and this doesn't match the reality.
>
> Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
> really solve the problem, right?

I do not know. But I thought about this too.

In the context of this discussion we do not care if the boot CPU goes down.
But we need at least one housekeeping CPU after cpu_down(). The comment in
cpu_down_maps_locked() says

Also keep at least one housekeeping cpu onlined

but it checks HK_TYPE_DOMAIN, and I do not know (and it is too late for me
to try to read the code ;) if housekeeping.cpumasks[HK_TYPE_TIMER] can get
empty or not.

Oleg.

> > So it seems that we should fix housekeeping_setup() ? see the patch below.
> >
> > In any case the usage of cpu_present_mask doesn't look right to me.
> >
> > Oleg.
> >
> > --- a/kernel/sched/isolation.c
> > +++ b/kernel/sched/isolation.c
> > @@ -129,7 +154,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> > cpumask_andnot(housekeeping_staging,
> > cpu_possible_mask, non_housekeeping_mask);
> >
> > - if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
> > + if (!cpumask_test_cpu(smp_processor_id(), housekeeping_staging)) {
> > __cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
> > __cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
> > if (!housekeeping.flags) {
> >

2024-04-05 14:06:37

[permalink] [raw]

Subject: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work

On 04/03, Oleg Nesterov wrote:
>
> > > OTOH, Documentation/timers/no_hz.rst says
> > >
> > > Therefore, the
> > > boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > > "nohz_full=" mask that includes the boot CPU will result in a boot-time
> > > error message, and the boot CPU will be removed from the mask.
> > >
> > > and this doesn't match the reality.
> >
> > Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
> > really solve the problem, right?
>
> I do not know. But I thought about this too.
>
> In the context of this discussion we do not care if the boot CPU goes down.
> But we need at least one housekeeping CPU after cpu_down(). The comment in
> cpu_down_maps_locked() says
>
> Also keep at least one housekeeping cpu onlined
>
> but it checks HK_TYPE_DOMAIN, and I do not know (and it is too late for me
> to try to read the code ;) if housekeeping.cpumasks[HK_TYPE_TIMER] can get
> empty or not.

This nearly killed me, but I managed to convince myself we shouldn't worry
about cpu_down().

HK_FLAG_TIMER implies HK_FLAG_TICK.

HK_FLAG_TICK implies tick_nohz_full_setup() which sets
tick_nohz_full_mask = non_housekeeping_mask.

When tick_setup_device() is called on a housekeeping CPU it does

else if (tick_do_timer_boot_cpu != -1 &&
!tick_nohz_full_cpu(cpu)) {
tick_take_do_timer_from_boot();
tick_do_timer_boot_cpu = -1;

and this sets tick_do_timer_cpu = first-housekeeping-cpu.

cpu_down(tick_do_timer_cpu) will fail, tick_nohz_cpu_down() will nack it.

So cpu_down() can't make housekeeping.cpumasks[HK_FLAG_TIMER] empty and I
still think that the change below is the right approach.

But probably WARN_ON() in housekeeping_any_cpu() makes sense anyway.

What do you think?

Oleg.

> > > So it seems that we should fix housekeeping_setup() ? see the patch below.
> > >
> > > In any case the usage of cpu_present_mask doesn't look right to me.
> > >
> > > Oleg.
> > >
> > > --- a/kernel/sched/isolation.c
> > > +++ b/kernel/sched/isolation.c
> > > @@ -129,7 +154,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> > > cpumask_andnot(housekeeping_staging,
> > > cpu_possible_mask, non_housekeeping_mask);
> > >
> > > - if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
> > > + if (!cpumask_test_cpu(smp_processor_id(), housekeeping_staging)) {
> > > __cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
> > > __cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
> > > if (!housekeeping.flags) {
> > >

2024-04-05 15:50:13

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work

Hello, Oleg.

On Fri, Apr 05, 2024 at 04:04:49PM +0200, Oleg Nesterov wrote:
..
> > > Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
> > > really solve the problem, right?
> >
> > I do not know. But I thought about this too.
> >
> > In the context of this discussion we do not care if the boot CPU goes down.
> > But we need at least one housekeeping CPU after cpu_down(). The comment in
> > cpu_down_maps_locked() says
> >
> > Also keep at least one housekeeping cpu onlined
> >
> > but it checks HK_TYPE_DOMAIN, and I do not know (and it is too late for me
> > to try to read the code ;) if housekeeping.cpumasks[HK_TYPE_TIMER] can get
> > empty or not.
>
> This nearly killed me, but I managed to convince myself we shouldn't worry

Oh no, don't die. :)

> about cpu_down().
>
> HK_FLAG_TIMER implies HK_FLAG_TICK.
>
> HK_FLAG_TICK implies tick_nohz_full_setup() which sets
> tick_nohz_full_mask = non_housekeeping_mask.
>
> When tick_setup_device() is called on a housekeeping CPU it does
>
> else if (tick_do_timer_boot_cpu != -1 &&
> !tick_nohz_full_cpu(cpu)) {
> tick_take_do_timer_from_boot();
> tick_do_timer_boot_cpu = -1;
>
>
> and this sets tick_do_timer_cpu = first-housekeeping-cpu.
>
> cpu_down(tick_do_timer_cpu) will fail, tick_nohz_cpu_down() will nack it.
>
> So cpu_down() can't make housekeeping.cpumasks[HK_FLAG_TIMER] empty and I
> still think that the change below is the right approach.
>
> But probably WARN_ON() in housekeeping_any_cpu() makes sense anyway.

This would be great.

> What do you think?
>
> Oleg.
>
> > > > So it seems that we should fix housekeeping_setup() ? see the patch below.
> > > >
> > > > In any case the usage of cpu_present_mask doesn't look right to me.
> > > >
> > > > Oleg.
> > > >
> > > > --- a/kernel/sched/isolation.c
> > > > +++ b/kernel/sched/isolation.c
> > > > @@ -129,7 +154,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> > > > cpumask_andnot(housekeeping_staging,
> > > > cpu_possible_mask, non_housekeeping_mask);
> > > >
> > > > - if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
> > > > + if (!cpumask_test_cpu(smp_processor_id(), housekeeping_staging)) {
> > > > __cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
> > > > __cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
> > > > if (!housekeeping.flags) {

Ensuring the boot CPU always be a housekeeping CPU makes sense to me but I'm
not very familiar with the housekeeping code. Frederic, what do you think?

Thanks.

--
tejun

2024-04-05 22:01:12

[permalink] [raw]

Subject: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

+Cc Nick

Le Fri, Apr 05, 2024 at 04:04:49PM +0200, Oleg Nesterov a ?crit :
> On 04/03, Oleg Nesterov wrote:
> >
> > > > OTOH, Documentation/timers/no_hz.rst says
> > > >
> > > > Therefore, the
> > > > boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > > > "nohz_full=" mask that includes the boot CPU will result in a boot-time
> > > > error message, and the boot CPU will be removed from the mask.
> > > >
> > > > and this doesn't match the reality.
> > >
> > > Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
> > > really solve the problem, right?
> >
> > I do not know. But I thought about this too.
> >
> > In the context of this discussion we do not care if the boot CPU goes down.
> > But we need at least one housekeeping CPU after cpu_down(). The comment in
> > cpu_down_maps_locked() says
> >
> > Also keep at least one housekeeping cpu onlined
> >
> > but it checks HK_TYPE_DOMAIN, and I do not know (and it is too late for me
> > to try to read the code ;) if housekeeping.cpumasks[HK_TYPE_TIMER] can get
> > empty or not.
>
> This nearly killed me, but I managed to convince myself we shouldn't worry
> about cpu_down().
>
> HK_FLAG_TIMER implies HK_FLAG_TICK.
>
> HK_FLAG_TICK implies tick_nohz_full_setup() which sets
> tick_nohz_full_mask = non_housekeeping_mask.
>
> When tick_setup_device() is called on a housekeeping CPU it does
>
> else if (tick_do_timer_boot_cpu != -1 &&
> !tick_nohz_full_cpu(cpu)) {
> tick_take_do_timer_from_boot();
> tick_do_timer_boot_cpu = -1;
>
>
> and this sets tick_do_timer_cpu = first-housekeeping-cpu.
>
> cpu_down(tick_do_timer_cpu) will fail, tick_nohz_cpu_down() will nack it.
>
> So cpu_down() can't make housekeeping.cpumasks[HK_FLAG_TIMER] empty and I
> still think that the change below is the right approach.
>
> But probably WARN_ON() in housekeeping_any_cpu() makes sense anyway.
>
> What do you think?

Good analysis on this nasty housekeeping VS tick code. I promised so many
times to cleanup this mess but things keep piling up.

It is indeed possible for the boot CPU to be a nohz_full CPU and as
you can see, it's only half-working. This is so ever since:

08ae95f4fd3b (nohz_full: Allow the boot CPU to be nohz_full)

I wish I had nacked it before it got merged, especially as the changelog mentions
that the user could have solved this with modifying its setup... I would love
to revert that now but I don't know if anyone uses this and have it working
by chance somewhere... Should we continue to support a broken feature? Can we
break user ABI if it's already half-broken?

Anyway so during boot it's possible to have an empty
housekeeping_mask(HK_TYPE_TIMER) & cpu_online_mask. After boot though (provided
any CPU from the housekeeping_mask(HK_TYPE_TIMER) has actually booted, which
isn't even guaranteed if maxcpus= is passed...) the first online housekeeping can't
go down like you spotted.

Thanks.

2024-04-05 22:04:06

[permalink] [raw]

Subject: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work

Le Fri, Apr 05, 2024 at 05:38:54AM -1000, Tejun Heo a ?crit :
> > > > > So it seems that we should fix housekeeping_setup() ? see the patch below.
> > > > >
> > > > > In any case the usage of cpu_present_mask doesn't look right to me.
> > > > >
> > > > > Oleg.
> > > > >
> > > > > --- a/kernel/sched/isolation.c
> > > > > +++ b/kernel/sched/isolation.c
> > > > > @@ -129,7 +154,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> > > > > cpumask_andnot(housekeeping_staging,
> > > > > cpu_possible_mask, non_housekeeping_mask);
> > > > >
> > > > > - if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
> > > > > + if (!cpumask_test_cpu(smp_processor_id(), housekeeping_staging)) {
> > > > > __cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
> > > > > __cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
> > > > > if (!housekeeping.flags) {
>
> Ensuring the boot CPU always be a housekeeping CPU makes sense to me but I'm
> not very familiar with the housekeeping code. Frederic, what do you think?

I vote for reverting:

08ae95f4fd3b (nohz_full: Allow the boot CPU to be nohz_full)

And if that's not possible because there are actual users of it, then
the best we can do is to make housekeeping_any_cpu() return smp_processor_id()
if the housekeeping_mask() is empty. Or rather if the housekeeping_mask is
empty, return cpumask_of(smp_processor_id()).

Of course that's merely just a bandaid.

Thanks.

2024-04-07 13:11:04

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

On 04/05, Frederic Weisbecker wrote:
>
> +Cc Nick
>
> Le Fri, Apr 05, 2024 at 04:04:49PM +0200, Oleg Nesterov a ?crit :
> > On 04/03, Oleg Nesterov wrote:
> > >
> > > > > OTOH, Documentation/timers/no_hz.rst says
> > > > >
> > > > > Therefore, the
> > > > > boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > > > > "nohz_full=" mask that includes the boot CPU will result in a boot-time
> > > > > error message, and the boot CPU will be removed from the mask.
> > > > >
> > > > > and this doesn't match the reality.
> > > >
> > > > Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
> > > > really solve the problem, right?
> > >
> > > I do not know. But I thought about this too.
> > >
> > > In the context of this discussion we do not care if the boot CPU goes down.
> > > But we need at least one housekeeping CPU after cpu_down(). The comment in
> > > cpu_down_maps_locked() says
> > >
> > > Also keep at least one housekeeping cpu onlined
> > >
> > > but it checks HK_TYPE_DOMAIN, and I do not know (and it is too late for me
> > > to try to read the code ;) if housekeeping.cpumasks[HK_TYPE_TIMER] can get
> > > empty or not.
> >
> > This nearly killed me, but I managed to convince myself we shouldn't worry
> > about cpu_down().
> >
> > HK_FLAG_TIMER implies HK_FLAG_TICK.
> >
> > HK_FLAG_TICK implies tick_nohz_full_setup() which sets
> > tick_nohz_full_mask = non_housekeeping_mask.
> >
> > When tick_setup_device() is called on a housekeeping CPU it does
> >
> > else if (tick_do_timer_boot_cpu != -1 &&
> > !tick_nohz_full_cpu(cpu)) {
> > tick_take_do_timer_from_boot();
> > tick_do_timer_boot_cpu = -1;
> >
> >
> > and this sets tick_do_timer_cpu = first-housekeeping-cpu.
> >
> > cpu_down(tick_do_timer_cpu) will fail, tick_nohz_cpu_down() will nack it.
> >
> > So cpu_down() can't make housekeeping.cpumasks[HK_FLAG_TIMER] empty and I
> > still think that the change below is the right approach.
> >
> > But probably WARN_ON() in housekeeping_any_cpu() makes sense anyway.
> >
> > What do you think?
>
> Good analysis on this nasty housekeeping VS tick code. I promised so many
> times to cleanup this mess but things keep piling up.
>
> It is indeed possible for the boot CPU to be a nohz_full CPU and as
> you can see, it's only half-working. This is so ever since:
>
> 08ae95f4fd3b (nohz_full: Allow the boot CPU to be nohz_full)

Thanks... So this is intentional. I was confused by

Therefore, the
boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
"nohz_full=" mask that includes the boot CPU will result in a boot-time
error message, and the boot CPU will be removed from the mask.

from Documentation/timers/no_hz.rst

> I would love
> to revert that now but I don't know if anyone uses this and have it working
> by chance somewhere... Should we continue to support a broken feature? Can we
> break user ABI if it's already half-broken?

Well, the changelog says

nohz_full has been trialed at a large supercomputer site and found to
significantly reduce jitter. In order to deploy it in production, they
need CPU0 to be nohz_full

so I guess this feature has users.

But after the commit aae17ebb53cd3da ("workqueue: Avoid using isolated cpus'
timers on queue_delayed_work") the kernel will crash at boot time if the boot
CPU is nohz_full.

So we need a workaround at least. I am starting to think I will send a trivial
patch which changes __queue_delayed_work() to validate the cpu returned by
housekeeping_any_cpu(HK_TYPE_TIMER).

But perhaps something like below makes more sense as a (stupid) workaround?

Oleg.

--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -46,7 +46,15 @@ int housekeeping_any_cpu(enum hk_type type)
if (cpu < nr_cpu_ids)
return cpu;

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+
+ cpu = READ_ONCE(tick_do_timer_boot_cpu);
+ if (cpu >= 0)
+ return cpu;
+
+ WARN_ON_ONCE(1);
}
}
return smp_processor_id();

2024-04-07 13:54:35

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

On 04/07, Oleg Nesterov wrote:
>
> On 04/05, Frederic Weisbecker wrote:
> >
> > +Cc Nick
> >
> > Le Fri, Apr 05, 2024 at 04:04:49PM +0200, Oleg Nesterov a ?crit :
> > > On 04/03, Oleg Nesterov wrote:
> > > >
> > > > > > OTOH, Documentation/timers/no_hz.rst says
> > > > > >
> > > > > > Therefore, the
> > > > > > boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > > > > > "nohz_full=" mask that includes the boot CPU will result in a boot-time
> > > > > > error message, and the boot CPU will be removed from the mask.
> > > > > >
> > > > > > and this doesn't match the reality.
> > > > >
> > > > > Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
> > > > > really solve the problem, right?
> > > >
> > > > I do not know. But I thought about this too.
> > > >
> > > > In the context of this discussion we do not care if the boot CPU goes down.
> > > > But we need at least one housekeeping CPU after cpu_down(). The comment in
> > > > cpu_down_maps_locked() says
> > > >
> > > > Also keep at least one housekeeping cpu onlined
> > > >
> > > > but it checks HK_TYPE_DOMAIN, and I do not know (and it is too late for me
> > > > to try to read the code ;) if housekeeping.cpumasks[HK_TYPE_TIMER] can get
> > > > empty or not.
> > >
> > > This nearly killed me, but I managed to convince myself we shouldn't worry
> > > about cpu_down().
> > >
> > > HK_FLAG_TIMER implies HK_FLAG_TICK.
> > >
> > > HK_FLAG_TICK implies tick_nohz_full_setup() which sets
> > > tick_nohz_full_mask = non_housekeeping_mask.
> > >
> > > When tick_setup_device() is called on a housekeeping CPU it does
> > >
> > > else if (tick_do_timer_boot_cpu != -1 &&
> > > !tick_nohz_full_cpu(cpu)) {
> > > tick_take_do_timer_from_boot();
> > > tick_do_timer_boot_cpu = -1;
> > >
> > >
> > > and this sets tick_do_timer_cpu = first-housekeeping-cpu.
> > >
> > > cpu_down(tick_do_timer_cpu) will fail, tick_nohz_cpu_down() will nack it.
> > >
> > > So cpu_down() can't make housekeeping.cpumasks[HK_FLAG_TIMER] empty and I
> > > still think that the change below is the right approach.
> > >
> > > But probably WARN_ON() in housekeeping_any_cpu() makes sense anyway.
> > >
> > > What do you think?
> >
> > Good analysis on this nasty housekeeping VS tick code. I promised so many
> > times to cleanup this mess but things keep piling up.
> >
> > It is indeed possible for the boot CPU to be a nohz_full CPU and as
> > you can see, it's only half-working. This is so ever since:
> >
> > 08ae95f4fd3b (nohz_full: Allow the boot CPU to be nohz_full)
>
> Thanks... So this is intentional. I was confused by
>
> Therefore, the
> boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> "nohz_full=" mask that includes the boot CPU will result in a boot-time
> error message, and the boot CPU will be removed from the mask.
>
> from Documentation/timers/no_hz.rst
>
> > I would love
> > to revert that now but I don't know if anyone uses this and have it working
> > by chance somewhere... Should we continue to support a broken feature? Can we
> > break user ABI if it's already half-broken?
>
> Well, the changelog says
>
> nohz_full has been trialed at a large supercomputer site and found to
> significantly reduce jitter. In order to deploy it in production, they
> need CPU0 to be nohz_full
>
> so I guess this feature has users.
>
> But after the commit aae17ebb53cd3da ("workqueue: Avoid using isolated cpus'
> timers on queue_delayed_work") the kernel will crash at boot time if the boot
> CPU is nohz_full.
>
> So we need a workaround at least. I am starting to think I will send a trivial
> patch which changes __queue_delayed_work() to validate the cpu returned by
> housekeeping_any_cpu(HK_TYPE_TIMER).
>
> But perhaps something like below makes more sense as a (stupid) workaround?

Or simply

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 373d42c707bc..e912555c6fc8 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -46,7 +46,11 @@ int housekeeping_any_cpu(enum hk_type type)
if (cpu < nr_cpu_ids)
return cpu;

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+
+ WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
}
}
return smp_processor_id();

2024-04-09 12:04:42

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

Le Sun, Apr 07, 2024 at 03:09:14PM +0200, Oleg Nesterov a ?crit :
> On 04/05, Frederic Weisbecker wrote:
> >
> > +Cc Nick
> >
> > Le Fri, Apr 05, 2024 at 04:04:49PM +0200, Oleg Nesterov a ?crit :
> > > On 04/03, Oleg Nesterov wrote:
> > > >
> > > > > > OTOH, Documentation/timers/no_hz.rst says
> > > > > >
> > > > > > Therefore, the
> > > > > > boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > > > > > "nohz_full=" mask that includes the boot CPU will result in a boot-time
> > > > > > error message, and the boot CPU will be removed from the mask.
> > > > > >
> > > > > > and this doesn't match the reality.
> > > > >
> > > > > Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
> > > > > really solve the problem, right?
> > > >
> > > > I do not know. But I thought about this too.
> > > >
> > > > In the context of this discussion we do not care if the boot CPU goes down.
> > > > But we need at least one housekeeping CPU after cpu_down(). The comment in
> > > > cpu_down_maps_locked() says
> > > >
> > > > Also keep at least one housekeeping cpu onlined
> > > >
> > > > but it checks HK_TYPE_DOMAIN, and I do not know (and it is too late for me
> > > > to try to read the code ;) if housekeeping.cpumasks[HK_TYPE_TIMER] can get
> > > > empty or not.
> > >
> > > This nearly killed me, but I managed to convince myself we shouldn't worry
> > > about cpu_down().
> > >
> > > HK_FLAG_TIMER implies HK_FLAG_TICK.
> > >
> > > HK_FLAG_TICK implies tick_nohz_full_setup() which sets
> > > tick_nohz_full_mask = non_housekeeping_mask.
> > >
> > > When tick_setup_device() is called on a housekeeping CPU it does
> > >
> > > else if (tick_do_timer_boot_cpu != -1 &&
> > > !tick_nohz_full_cpu(cpu)) {
> > > tick_take_do_timer_from_boot();
> > > tick_do_timer_boot_cpu = -1;
> > >
> > >
> > > and this sets tick_do_timer_cpu = first-housekeeping-cpu.
> > >
> > > cpu_down(tick_do_timer_cpu) will fail, tick_nohz_cpu_down() will nack it.
> > >
> > > So cpu_down() can't make housekeeping.cpumasks[HK_FLAG_TIMER] empty and I
> > > still think that the change below is the right approach.
> > >
> > > But probably WARN_ON() in housekeeping_any_cpu() makes sense anyway.
> > >
> > > What do you think?
> >
> > Good analysis on this nasty housekeeping VS tick code. I promised so many
> > times to cleanup this mess but things keep piling up.
> >
> > It is indeed possible for the boot CPU to be a nohz_full CPU and as
> > you can see, it's only half-working. This is so ever since:
> >
> > 08ae95f4fd3b (nohz_full: Allow the boot CPU to be nohz_full)
>
> Thanks... So this is intentional. I was confused by
>
> Therefore, the
> boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> "nohz_full=" mask that includes the boot CPU will result in a boot-time
> error message, and the boot CPU will be removed from the mask.
>
> from Documentation/timers/no_hz.rst
>
> > I would love
> > to revert that now but I don't know if anyone uses this and have it working
> > by chance somewhere... Should we continue to support a broken feature? Can we
> > break user ABI if it's already half-broken?
>
> Well, the changelog says
>
> nohz_full has been trialed at a large supercomputer site and found to
> significantly reduce jitter. In order to deploy it in production, they
> need CPU0 to be nohz_full
>
> so I guess this feature has users.
>
> But after the commit aae17ebb53cd3da ("workqueue: Avoid using isolated cpus'
> timers on queue_delayed_work") the kernel will crash at boot time if the boot
> CPU is nohz_full.

Right but there are many possible calls to housekeeping on boot before the
first housekeeper becomes online.

Thanks.

2024-04-09 12:05:56

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

Le Sun, Apr 07, 2024 at 03:52:48PM +0200, Oleg Nesterov a ?crit :
> On 04/07, Oleg Nesterov wrote:
> >
> > On 04/05, Frederic Weisbecker wrote:
> > >
> > > +Cc Nick
> > >
> > > Le Fri, Apr 05, 2024 at 04:04:49PM +0200, Oleg Nesterov a ?crit :
> > > > On 04/03, Oleg Nesterov wrote:
> > > > >
> > > > > > > OTOH, Documentation/timers/no_hz.rst says
> > > > > > >
> > > > > > > Therefore, the
> > > > > > > boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > > > > > > "nohz_full=" mask that includes the boot CPU will result in a boot-time
> > > > > > > error message, and the boot CPU will be removed from the mask.
> > > > > > >
> > > > > > > and this doesn't match the reality.
> > > > > >
> > > > > > Don't some archs allow the boot CPU to go down too tho? If so, this doesn't
> > > > > > really solve the problem, right?
> > > > >
> > > > > I do not know. But I thought about this too.
> > > > >
> > > > > In the context of this discussion we do not care if the boot CPU goes down.
> > > > > But we need at least one housekeeping CPU after cpu_down(). The comment in
> > > > > cpu_down_maps_locked() says
> > > > >
> > > > > Also keep at least one housekeeping cpu onlined
> > > > >
> > > > > but it checks HK_TYPE_DOMAIN, and I do not know (and it is too late for me
> > > > > to try to read the code ;) if housekeeping.cpumasks[HK_TYPE_TIMER] can get
> > > > > empty or not.
> > > >
> > > > This nearly killed me, but I managed to convince myself we shouldn't worry
> > > > about cpu_down().
> > > >
> > > > HK_FLAG_TIMER implies HK_FLAG_TICK.
> > > >
> > > > HK_FLAG_TICK implies tick_nohz_full_setup() which sets
> > > > tick_nohz_full_mask = non_housekeeping_mask.
> > > >
> > > > When tick_setup_device() is called on a housekeeping CPU it does
> > > >
> > > > else if (tick_do_timer_boot_cpu != -1 &&
> > > > !tick_nohz_full_cpu(cpu)) {
> > > > tick_take_do_timer_from_boot();
> > > > tick_do_timer_boot_cpu = -1;
> > > >
> > > >
> > > > and this sets tick_do_timer_cpu = first-housekeeping-cpu.
> > > >
> > > > cpu_down(tick_do_timer_cpu) will fail, tick_nohz_cpu_down() will nack it.
> > > >
> > > > So cpu_down() can't make housekeeping.cpumasks[HK_FLAG_TIMER] empty and I
> > > > still think that the change below is the right approach.
> > > >
> > > > But probably WARN_ON() in housekeeping_any_cpu() makes sense anyway.
> > > >
> > > > What do you think?
> > >
> > > Good analysis on this nasty housekeeping VS tick code. I promised so many
> > > times to cleanup this mess but things keep piling up.
> > >
> > > It is indeed possible for the boot CPU to be a nohz_full CPU and as
> > > you can see, it's only half-working. This is so ever since:
> > >
> > > 08ae95f4fd3b (nohz_full: Allow the boot CPU to be nohz_full)
> >
> > Thanks... So this is intentional. I was confused by
> >
> > Therefore, the
> > boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > "nohz_full=" mask that includes the boot CPU will result in a boot-time
> > error message, and the boot CPU will be removed from the mask.
> >
> > from Documentation/timers/no_hz.rst
> >
> > > I would love
> > > to revert that now but I don't know if anyone uses this and have it working
> > > by chance somewhere... Should we continue to support a broken feature? Can we
> > > break user ABI if it's already half-broken?
> >
> > Well, the changelog says
> >
> > nohz_full has been trialed at a large supercomputer site and found to
> > significantly reduce jitter. In order to deploy it in production, they
> > need CPU0 to be nohz_full
> >
> > so I guess this feature has users.
> >
> > But after the commit aae17ebb53cd3da ("workqueue: Avoid using isolated cpus'
> > timers on queue_delayed_work") the kernel will crash at boot time if the boot
> > CPU is nohz_full.
> >
> > So we need a workaround at least. I am starting to think I will send a trivial
> > patch which changes __queue_delayed_work() to validate the cpu returned by
> > housekeeping_any_cpu(HK_TYPE_TIMER).
> >
> > But perhaps something like below makes more sense as a (stupid) workaround?
>
> Or simply
>
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c707bc..e912555c6fc8 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -46,7 +46,11 @@ int housekeeping_any_cpu(enum hk_type type)
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (cpu < nr_cpu_ids)
> + return cpu;
> +
> + WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
> }
> }
> return smp_processor_id();

Yep, looks good!

Thanks.

2024-04-09 13:09:30

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

On 04/09, Frederic Weisbecker wrote:
>
> Le Sun, Apr 07, 2024 at 03:09:14PM +0200, Oleg Nesterov a ?crit :
> > Well, the changelog says
> >
> > nohz_full has been trialed at a large supercomputer site and found to
> > significantly reduce jitter. In order to deploy it in production, they
> > need CPU0 to be nohz_full
> >
> > so I guess this feature has users.
> >
> > But after the commit aae17ebb53cd3da ("workqueue: Avoid using isolated cpus'
> > timers on queue_delayed_work") the kernel will crash at boot time if the boot
> > CPU is nohz_full.
>
> Right but there are many possible calls to housekeeping on boot before the
> first housekeeper becomes online.

Well, it seems that other callers are more or less fine in this respect...
At least the kernel boots fine with that commit reverted.

But a) I didn't try to really check, and b) this doesn't matter.

I agree, and that is why I never blamed this change in queue_delayed_work().

OK, you seem to agree with the patch below, I'll write the changelog/comment
and send it "officially".

Or did I misunderstand you?

Oleg.

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 373d42c707bc..e912555c6fc8 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -46,7 +46,11 @@ int housekeeping_any_cpu(enum hk_type type)
if (cpu < nr_cpu_ids)
return cpu;

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+
+ WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
}
}
return smp_processor_id();

2024-04-09 14:00:11

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

Le Tue, Apr 09, 2024 at 03:07:28PM +0200, Oleg Nesterov a ?crit :
> On 04/09, Frederic Weisbecker wrote:
> >
> > Le Sun, Apr 07, 2024 at 03:09:14PM +0200, Oleg Nesterov a ?crit :
> > > Well, the changelog says
> > >
> > > nohz_full has been trialed at a large supercomputer site and found to
> > > significantly reduce jitter. In order to deploy it in production, they
> > > need CPU0 to be nohz_full
> > >
> > > so I guess this feature has users.
> > >
> > > But after the commit aae17ebb53cd3da ("workqueue: Avoid using isolated cpus'
> > > timers on queue_delayed_work") the kernel will crash at boot time if the boot
> > > CPU is nohz_full.
> >
> > Right but there are many possible calls to housekeeping on boot before the
> > first housekeeper becomes online.
>
> Well, it seems that other callers are more or less fine in this respect...
> At least the kernel boots fine with that commit reverted.
>
> But a) I didn't try to really check, and b) this doesn't matter.
>
> I agree, and that is why I never blamed this change in queue_delayed_work().
>
> OK, you seem to agree with the patch below, I'll write the changelog/comment
> and send it "officially".
>
> Or did I misunderstand you?

Works for me!

Thanks.

>
> Oleg.
>
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c707bc..e912555c6fc8 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -46,7 +46,11 @@ int housekeeping_any_cpu(enum hk_type type)
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (cpu < nr_cpu_ids)
> + return cpu;
> +
> + WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
> }
> }
> return smp_processor_id();
>

2024-04-10 04:27:43

by Nicholas Piggin

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

On Tue Apr 9, 2024 at 11:07 PM AEST, Oleg Nesterov wrote:
> On 04/09, Frederic Weisbecker wrote:
> >
> > Le Sun, Apr 07, 2024 at 03:09:14PM +0200, Oleg Nesterov a écrit :
> > > Well, the changelog says
> > >
> > > nohz_full has been trialed at a large supercomputer site and found to
> > > significantly reduce jitter. In order to deploy it in production, they
> > > need CPU0 to be nohz_full
> > >
> > > so I guess this feature has users.

It was the Summit/Sierra supercomputers which I suppose are still using
it. IIRC it was an existing job scheduler system they had which ran
housekeeping work on the highest numbered core in a socket and allocated
jobs from lowest number. We certainly asked if they could change that,
but apparently that was difficult. I'm surprised nobody ran into it on
x86 though. Maybe the system had more jitter (SMT4 doesn't help), so
maybe it wasn't needed to use isolcpus= in other cases.

The other thing is powerpc can boot on arbitrary CPU number. So if boot
CPU must be in the mask then it could randomly break your boot
config if boot CPU must be in HK mask.

> > >
> > > But after the commit aae17ebb53cd3da ("workqueue: Avoid using isolated cpus'
> > > timers on queue_delayed_work") the kernel will crash at boot time if the boot
> > > CPU is nohz_full.
> >
> > Right but there are many possible calls to housekeeping on boot before the
> > first housekeeper becomes online.
>
> Well, it seems that other callers are more or less fine in this respect...
> At least the kernel boots fine with that commit reverted.
>
> But a) I didn't try to really check, and b) this doesn't matter.
>
> I agree, and that is why I never blamed this change in queue_delayed_work().
>
> OK, you seem to agree with the patch below, I'll write the changelog/comment
> and send it "officially".
>
> Or did I misunderstand you?

Thanks for this. Taking a while to page this back in, the intention is
for housekeeping to be done by boot CPU until house keeper is awake, so
returning smp_processor_id() seems like the right thing to do here for
ephemeral jobs like timers and work, provided that CPU / mask is not
stored somewhere long term by the caller.

For things that set an affinity like kthread, sched, maybe managed
irqs, and such.

There are not many callers of housekeeping_any_cpu() so that's easy
enough to verify. But similar like housekeeping_cpumask() and others
could be an issue or at least a foot-gun, I'm not sure how well I
convinced myself of those.

Could you test like this?

WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
type != HK_TYPE_TIMER);

With a comment to say other ephemeral mask types could be exempted if
needed.

It would also be nice to warn for cases that would be bugs if the boot
CPU was not in the HK mask. Could that be done by having a
housekeepers_online() call after smp_init() (maybe at the start of
sched_init_smp()) that could verify there is at least one online, and
set a flag that could be used to create warnings.

Thanks,
Nick

>
> Oleg.
>
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c707bc..e912555c6fc8 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -46,7 +46,11 @@ int housekeeping_any_cpu(enum hk_type type)
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (cpu < nr_cpu_ids)
> + return cpu;
> +
> + WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
> }
> }
> return smp_processor_id();

2024-04-10 14:04:20

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

Hi Nicholas,

On 04/10, Nicholas Piggin wrote:
>
> Thanks for this. Taking a while to page this back in, the intention is
> for housekeeping to be done by boot CPU until house keeper is awake, so
> returning smp_processor_id() seems like the right thing to do here for
> ephemeral jobs like timers and work, provided that CPU / mask is not
> stored somewhere long term by the caller.
>
> For things that set an affinity like kthread, sched, maybe managed
> irqs, and such.
>
> There are not many callers of housekeeping_any_cpu() so that's easy
> enough to verify. But similar like housekeeping_cpumask() and others
> could be an issue or at least a foot-gun, I'm not sure how well I
> convinced myself of those.
>
> Could you test like this?
>
> WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
> type != HK_TYPE_TIMER);
>
> With a comment to say other ephemeral mask types could be exempted if
> needed.

Sorry, I don't understand... Let me repeat, I know absolutely nothing
about nonhz/etc. I didn't even try to really fix the problem(s), I am
only trying to find a minimal/simple workaround to fix the problem we
hit in Red Hat.

This is what I was going to send:

--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -46,7 +46,15 @@ int housekeeping_any_cpu(enum hk_type type)
if (cpu < nr_cpu_ids)
return cpu;

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (likely(cpu < nr_cpu_ids))
+ return cpu;
+ /*
+ * Unless we have another problem this can only happen
+ * at boot time before start_secondary() brings the 1st
+ * housekeeping CPU up.
+ */
+ WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
}
}
return smp_processor_id();

Yes, this fixes the symptom, not the problem. And yes, "another problem"
mentioned in the comment is very possible, say "maxcpus" kernel-parameter
can be less than the first housekeeping cpu. But in this case the user
should blame himself (and I am not sure the kernel will boot).

I don't understand why do you suggest to add "|| type != HK_TYPE_TIMER",
currently all the callers of housekeeping_any_cpu() use type == HK_TIMER.
But OK, I can add this check. I guess for the case it finds another user
with type != HK_TYPE_TIMER which can't use smp_processor_id() even at
boot time or stores the returned CPU for the long term.

Will you agree with the change above or what do you suggest instead as
a simple workaround?

> It would also be nice to warn for cases that would be bugs if the boot
> CPU was not in the HK mask. Could that be done by having a
> housekeepers_online() call after smp_init() (maybe at the start of
> sched_init_smp()) that could verify there is at least one online, and
> set a flag that could be used to create warnings.

Again, I am not sure I understand, but I too thought that something like

housekeeping_check(void)
{
for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
if (!cpumask_intersects(cpu_online, housekeeping.cpumasks[type]))
panic();
}

after bringup_nonboot_cpus(setup_max_cpus).

But I am not sure this is correct and this is another (although related) issue.

Oleg.

2024-04-11 14:07:28

[permalink] [raw]

Subject: Re: Nohz_full on boot CPU is broken (was: Re: [PATCH v2 1/1] wq: Avoid using isolated cpus' timers on queue_delayed_work)

On 04/10, Oleg Nesterov wrote:
>
> Will you agree with the change above or what do you suggest instead as
> a simple workaround?

OK, let me send the patch.

Nicholas, Frederic, please review.

> Again, I am not sure I understand, but I too thought that something like
>
> housekeeping_check(void)
> {
> for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
> if (!cpumask_intersects(cpu_online, housekeeping.cpumasks[type]))
> panic();
> }
>
> after bringup_nonboot_cpus(setup_max_cpus).

It seems we can make a simpler patch for this. It will (should) fix another
problem when maxcpus < first-housekeeping.

I'll recheck and try to do this later, lets solve the reported problem first.

Oleg.

2024-04-11 14:43:37

[permalink] [raw]

Subject: [PATCH] sched/isolation: fix boot crash when the boot CPU is nohz_full

Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
include the boot CPU, this is no longer true after the commit 08ae95f4fd3b
("nohz_full: Allow the boot CPU to be nohz_full").

However after another commit aae17ebb53cd ("workqueue: Avoid using isolated
cpus' timers on queue_delayed_work") the kernel will crash at boot time in
this case; housekeeping_any_cpu() returns an invalid cpu nr until smp_init()
paths bring the 1st housekeeping CPU up.

Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
return smp_processor_id() in this case. Yes, this is just the simple and
backportable workaround which fixes the symptom, but smp_processor_id() at
boot time should be safe at least for type == HK_TYPE_TIMER, this more or
less matches the tick_do_timer_boot_cpu logic.

We should not worry about cpu_down(); tick_nohz_cpu_down() will not allow
to offline tick_do_timer_cpu (the 1st online housekeeping CPU).

Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
Reported-by: Chris von Recklinghausen <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Oleg Nesterov <[email protected]>
---
Documentation/timers/no_hz.rst | 7 ++-----
kernel/sched/isolation.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
index f8786be15183..7fe8ef9718d8 100644
--- a/Documentation/timers/no_hz.rst
+++ b/Documentation/timers/no_hz.rst
@@ -129,11 +129,8 @@ adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
online to handle timekeeping tasks in order to ensure that system
calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
(This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
-user processes to observe slight drifts in clock rate.) Therefore, the
-boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
-"nohz_full=" mask that includes the boot CPU will result in a boot-time
-error message, and the boot CPU will be removed from the mask. Note that
-this means that your system must have at least two CPUs in order for
+user processes to observe slight drifts in clock rate.) Note that this
+means that your system must have at least two CPUs in order for
CONFIG_NO_HZ_FULL=y to do anything for you.

Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 373d42c707bc..2a262d3ecb3d 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -46,7 +46,16 @@ int housekeeping_any_cpu(enum hk_type type)
if (cpu < nr_cpu_ids)
return cpu;

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (likely(cpu < nr_cpu_ids))
+ return cpu;
+ /*
+ * Unless we have another problem this can only happen
+ * at boot time before start_secondary() brings the 1st
+ * housekeeping CPU up.
+ */
+ WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
+ type != HK_TYPE_TIMER);
}
}
return smp_processor_id();
--
2.25.1.362.g51ebf55

2024-04-11 18:08:43

[permalink] [raw]

Subject: Re: [PATCH] sched/isolation: fix boot crash when the boot CPU is nohz_full

Just in case...

This reveals another issue, tick_setup_device() can't use
smp_call_function() due to irqs_disabled(), but lets discuss this
later/separately too.

On 04/11, Oleg Nesterov wrote:
>
> Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
> include the boot CPU, this is no longer true after the commit 08ae95f4fd3b
> ("nohz_full: Allow the boot CPU to be nohz_full").
>
> However after another commit aae17ebb53cd ("workqueue: Avoid using isolated
> cpus' timers on queue_delayed_work") the kernel will crash at boot time in
> this case; housekeeping_any_cpu() returns an invalid cpu nr until smp_init()
> paths bring the 1st housekeeping CPU up.
>
> Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
> return smp_processor_id() in this case. Yes, this is just the simple and
> backportable workaround which fixes the symptom, but smp_processor_id() at
> boot time should be safe at least for type == HK_TYPE_TIMER, this more or
> less matches the tick_do_timer_boot_cpu logic.
>
> We should not worry about cpu_down(); tick_nohz_cpu_down() will not allow
> to offline tick_do_timer_cpu (the 1st online housekeeping CPU).
>
> Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
> Reported-by: Chris von Recklinghausen <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> Documentation/timers/no_hz.rst | 7 ++-----
> kernel/sched/isolation.c | 11 ++++++++++-
> 2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
> index f8786be15183..7fe8ef9718d8 100644
> --- a/Documentation/timers/no_hz.rst
> +++ b/Documentation/timers/no_hz.rst
> @@ -129,11 +129,8 @@ adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
> online to handle timekeeping tasks in order to ensure that system
> calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
> (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
> -user processes to observe slight drifts in clock rate.) Therefore, the
> -boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> -"nohz_full=" mask that includes the boot CPU will result in a boot-time
> -error message, and the boot CPU will be removed from the mask. Note that
> -this means that your system must have at least two CPUs in order for
> +user processes to observe slight drifts in clock rate.) Note that this
> +means that your system must have at least two CPUs in order for
> CONFIG_NO_HZ_FULL=y to do anything for you.
>
> Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c707bc..2a262d3ecb3d 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -46,7 +46,16 @@ int housekeeping_any_cpu(enum hk_type type)
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (likely(cpu < nr_cpu_ids))
> + return cpu;
> + /*
> + * Unless we have another problem this can only happen
> + * at boot time before start_secondary() brings the 1st
> + * housekeeping CPU up.
> + */
> + WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
> + type != HK_TYPE_TIMER);
> }
> }
> return smp_processor_id();
> --
> 2.25.1.362.g51ebf55
>

2024-04-13 14:19:30

[permalink] [raw]

Subject: [PATCH] sched/isolation: fix boot crash when maxcpus < first-housekeeping-cpu

housekeeping_setup() checks cpumask_intersects(present, online) to ensure
that the kernel will have at least one housekeeping CPU after smp_init(),
but this doesn't work if the maxcpus= kernel parameter limits the number
of processors available after bootup.

For example, the kernel with "maxcpus=2 nohz_full=0-2" parameters crashes
at boot time on my virtual machine with 4 CPUs.

Change housekeeping_setup() to use cpumask_first_and() and check that the
returned cpu number is valid and less than setup_max_cpus.

Another corner case is "nohz_full=0" on a machine with a single CPU or
with the maxcpus=1 kernel argument. In this case non_housekeeping_mask
is empty and IIUC tick_nohz_full_setup() makes no sense. And indeed, the
kernel hits the WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().

And how should the kernel interpret the "nohz_full=" parameter? I think
it should be silently ignored, but currently cpulist_parse() happily
returns the empty cpumask and this leads to the same problem.

Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
and do nothing in this case.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/sched/isolation.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 2a262d3ecb3d..5891e715f00d 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -118,6 +118,7 @@ static void __init housekeeping_setup_type(enum hk_type type,
static int __init housekeeping_setup(char *str, unsigned long flags)
{
cpumask_var_t non_housekeeping_mask, housekeeping_staging;
+ unsigned int first_cpu;
int err = 0;

if ((flags & HK_FLAG_TICK) && !(housekeeping.flags & HK_FLAG_TICK)) {
@@ -138,7 +139,8 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
cpumask_andnot(housekeeping_staging,
cpu_possible_mask, non_housekeeping_mask);

- if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
+ first_cpu = cpumask_first_and(cpu_present_mask, housekeeping_staging);
+ if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) {
__cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
if (!housekeeping.flags) {
@@ -147,6 +149,9 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
}
}

+ if (cpumask_empty(non_housekeeping_mask))
+ goto free_housekeeping_staging;
+
if (!housekeeping.flags) {
/* First setup call ("nohz_full=" or "isolcpus=") */
enum hk_type type;
--
2.25.1.362.g51ebf55

2024-04-15 21:40:40

[permalink] [raw]

Subject: Re: [PATCH] sched/isolation: fix boot crash when the boot CPU is nohz_full

Le Thu, Apr 11, 2024 at 04:39:05PM +0200, Oleg Nesterov a ?crit :
> Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
> include the boot CPU, this is no longer true after the commit 08ae95f4fd3b
> ("nohz_full: Allow the boot CPU to be nohz_full").
>
> However after another commit aae17ebb53cd ("workqueue: Avoid using isolated
> cpus' timers on queue_delayed_work") the kernel will crash at boot time in
> this case; housekeeping_any_cpu() returns an invalid cpu nr until smp_init()
> paths bring the 1st housekeeping CPU up.
>
> Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
> return smp_processor_id() in this case. Yes, this is just the simple and
> backportable workaround which fixes the symptom, but smp_processor_id() at
> boot time should be safe at least for type == HK_TYPE_TIMER, this more or
> less matches the tick_do_timer_boot_cpu logic.
>
> We should not worry about cpu_down(); tick_nohz_cpu_down() will not allow
> to offline tick_do_timer_cpu (the 1st online housekeeping CPU).
>
> Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
> Reported-by: Chris von Recklinghausen <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Oleg Nesterov <[email protected]>

Acked-by: Frederic Weisbecker <[email protected]>

2024-04-18 14:50:22

[permalink] [raw]

Subject: Re: [PATCH] sched/isolation: fix boot crash when the boot CPU is nohz_full

On Thu, Apr 11, 2024 at 04:39:05PM +0200 Oleg Nesterov wrote:
> Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
> include the boot CPU, this is no longer true after the commit 08ae95f4fd3b
> ("nohz_full: Allow the boot CPU to be nohz_full").
>
> However after another commit aae17ebb53cd ("workqueue: Avoid using isolated
> cpus' timers on queue_delayed_work") the kernel will crash at boot time in
> this case; housekeeping_any_cpu() returns an invalid cpu nr until smp_init()
> paths bring the 1st housekeeping CPU up.
>
> Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
> return smp_processor_id() in this case. Yes, this is just the simple and
> backportable workaround which fixes the symptom, but smp_processor_id() at
> boot time should be safe at least for type == HK_TYPE_TIMER, this more or
> less matches the tick_do_timer_boot_cpu logic.
>
> We should not worry about cpu_down(); tick_nohz_cpu_down() will not allow
> to offline tick_do_timer_cpu (the 1st online housekeeping CPU).
>
> Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
> Reported-by: Chris von Recklinghausen <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Oleg Nesterov <[email protected]>

Checking the returned value instead of assuming seems safer in any case.

Reviewed-by: Phil Auld <[email protected]>

> ---
> Documentation/timers/no_hz.rst | 7 ++-----
> kernel/sched/isolation.c | 11 ++++++++++-
> 2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
> index f8786be15183..7fe8ef9718d8 100644
> --- a/Documentation/timers/no_hz.rst
> +++ b/Documentation/timers/no_hz.rst
> @@ -129,11 +129,8 @@ adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
> online to handle timekeeping tasks in order to ensure that system
> calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
> (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
> -user processes to observe slight drifts in clock rate.) Therefore, the
> -boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> -"nohz_full=" mask that includes the boot CPU will result in a boot-time
> -error message, and the boot CPU will be removed from the mask. Note that
> -this means that your system must have at least two CPUs in order for
> +user processes to observe slight drifts in clock rate.) Note that this
> +means that your system must have at least two CPUs in order for
> CONFIG_NO_HZ_FULL=y to do anything for you.
>
> Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c707bc..2a262d3ecb3d 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -46,7 +46,16 @@ int housekeeping_any_cpu(enum hk_type type)
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (likely(cpu < nr_cpu_ids))
> + return cpu;
> + /*
> + * Unless we have another problem this can only happen
> + * at boot time before start_secondary() brings the 1st
> + * housekeeping CPU up.
> + */
> + WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
> + type != HK_TYPE_TIMER);
> }
> }
> return smp_processor_id();
> --
> 2.25.1.362.g51ebf55
>
>
>

--

2024-04-18 14:55:00

[permalink] [raw]

Subject: Re: [PATCH] sched/isolation: fix boot crash when maxcpus < first-housekeeping-cpu

On Sat, Apr 13, 2024 at 04:17:46PM +0200 Oleg Nesterov wrote:
> housekeeping_setup() checks cpumask_intersects(present, online) to ensure
> that the kernel will have at least one housekeeping CPU after smp_init(),
> but this doesn't work if the maxcpus= kernel parameter limits the number
> of processors available after bootup.
>
> For example, the kernel with "maxcpus=2 nohz_full=0-2" parameters crashes
> at boot time on my virtual machine with 4 CPUs.
>
> Change housekeeping_setup() to use cpumask_first_and() and check that the
> returned cpu number is valid and less than setup_max_cpus.
>
> Another corner case is "nohz_full=0" on a machine with a single CPU or
> with the maxcpus=1 kernel argument. In this case non_housekeeping_mask
> is empty and IIUC tick_nohz_full_setup() makes no sense. And indeed, the
> kernel hits the WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().
>
> And how should the kernel interpret the "nohz_full=" parameter? I think
> it should be silently ignored, but currently cpulist_parse() happily
> returns the empty cpumask and this leads to the same problem.
>
> Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
> and do nothing in this case.
>
> Signed-off-by: Oleg Nesterov <[email protected]>

Looks good to me. One less footgun.

Reviewed-by: Phil Auld <[email protected]>

> ---
> kernel/sched/isolation.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 2a262d3ecb3d..5891e715f00d 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -118,6 +118,7 @@ static void __init housekeeping_setup_type(enum hk_type type,
> static int __init housekeeping_setup(char *str, unsigned long flags)
> {
> cpumask_var_t non_housekeeping_mask, housekeeping_staging;
> + unsigned int first_cpu;
> int err = 0;
>
> if ((flags & HK_FLAG_TICK) && !(housekeeping.flags & HK_FLAG_TICK)) {
> @@ -138,7 +139,8 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> cpumask_andnot(housekeeping_staging,
> cpu_possible_mask, non_housekeeping_mask);
>
> - if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
> + first_cpu = cpumask_first_and(cpu_present_mask, housekeeping_staging);
> + if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) {
> __cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
> __cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
> if (!housekeeping.flags) {
> @@ -147,6 +149,9 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> }
> }
>
> + if (cpumask_empty(non_housekeeping_mask))
> + goto free_housekeeping_staging;
> +
> if (!housekeeping.flags) {
> /* First setup call ("nohz_full=" or "isolcpus=") */
> enum hk_type type;
> --
> 2.25.1.362.g51ebf55
>
>
>

--

2024-04-18 15:41:07

[permalink] [raw]

Subject: Re: [PATCH] sched/isolation: fix boot crash when maxcpus < first-housekeeping-cpu

Le Sat, Apr 13, 2024 at 04:17:46PM +0200, Oleg Nesterov a ?crit :
> housekeeping_setup() checks cpumask_intersects(present, online) to ensure
> that the kernel will have at least one housekeeping CPU after smp_init(),
> but this doesn't work if the maxcpus= kernel parameter limits the number
> of processors available after bootup.
>
> For example, the kernel with "maxcpus=2 nohz_full=0-2" parameters crashes
> at boot time on my virtual machine with 4 CPUs.
>
> Change housekeeping_setup() to use cpumask_first_and() and check that the
> returned cpu number is valid and less than setup_max_cpus.
>
> Another corner case is "nohz_full=0" on a machine with a single CPU or
> with the maxcpus=1 kernel argument. In this case non_housekeeping_mask
> is empty and IIUC tick_nohz_full_setup() makes no sense. And indeed, the
> kernel hits the WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().
>
> And how should the kernel interpret the "nohz_full=" parameter? I think
> it should be silently ignored, but currently cpulist_parse() happily
> returns the empty cpumask and this leads to the same problem.
>
> Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
> and do nothing in this case.
>
> Signed-off-by: Oleg Nesterov <[email protected]>

Acked-by: Frederic Weisbecker <[email protected]>

2024-04-22 18:52:18

[permalink] [raw]

Subject: Re: [PATCH] sched/isolation: fix boot crash when the boot CPU is nohz_full

Phil, Frederic,

Thanks for your review! Who do you think can take these patches?
At least the 1st one.

To remind, there are more problems with boot CPU in nohz mask, but
can we at least fix the kernel crash?

Oleg.

On 04/18, Phil Auld wrote:
>
> On Thu, Apr 11, 2024 at 04:39:05PM +0200 Oleg Nesterov wrote:
> > Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
> > include the boot CPU, this is no longer true after the commit 08ae95f4fd3b
> > ("nohz_full: Allow the boot CPU to be nohz_full").
> >
> > However after another commit aae17ebb53cd ("workqueue: Avoid using isolated
> > cpus' timers on queue_delayed_work") the kernel will crash at boot time in
> > this case; housekeeping_any_cpu() returns an invalid cpu nr until smp_init()
> > paths bring the 1st housekeeping CPU up.
> >
> > Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
> > return smp_processor_id() in this case. Yes, this is just the simple and
> > backportable workaround which fixes the symptom, but smp_processor_id() at
> > boot time should be safe at least for type == HK_TYPE_TIMER, this more or
> > less matches the tick_do_timer_boot_cpu logic.
> >
> > We should not worry about cpu_down(); tick_nohz_cpu_down() will not allow
> > to offline tick_do_timer_cpu (the 1st online housekeeping CPU).
> >
> > Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
> > Reported-by: Chris von Recklinghausen <[email protected]>
> > Closes: https://lore.kernel.org/all/[email protected]/
> > Signed-off-by: Oleg Nesterov <[email protected]>
>
> Checking the returned value instead of assuming seems safer in any case.
>
> Reviewed-by: Phil Auld <[email protected]>
>
> > ---
> > Documentation/timers/no_hz.rst | 7 ++-----
> > kernel/sched/isolation.c | 11 ++++++++++-
> > 2 files changed, 12 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
> > index f8786be15183..7fe8ef9718d8 100644
> > --- a/Documentation/timers/no_hz.rst
> > +++ b/Documentation/timers/no_hz.rst
> > @@ -129,11 +129,8 @@ adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
> > online to handle timekeeping tasks in order to ensure that system
> > calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
> > (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
> > -user processes to observe slight drifts in clock rate.) Therefore, the
> > -boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > -"nohz_full=" mask that includes the boot CPU will result in a boot-time
> > -error message, and the boot CPU will be removed from the mask. Note that
> > -this means that your system must have at least two CPUs in order for
> > +user processes to observe slight drifts in clock rate.) Note that this
> > +means that your system must have at least two CPUs in order for
> > CONFIG_NO_HZ_FULL=y to do anything for you.
> >
> > Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
> > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > index 373d42c707bc..2a262d3ecb3d 100644
> > --- a/kernel/sched/isolation.c
> > +++ b/kernel/sched/isolation.c
> > @@ -46,7 +46,16 @@ int housekeeping_any_cpu(enum hk_type type)
> > if (cpu < nr_cpu_ids)
> > return cpu;
> >
> > - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > + if (likely(cpu < nr_cpu_ids))
> > + return cpu;
> > + /*
> > + * Unless we have another problem this can only happen
> > + * at boot time before start_secondary() brings the 1st
> > + * housekeeping CPU up.
> > + */
> > + WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
> > + type != HK_TYPE_TIMER);
> > }
> > }
> > return smp_processor_id();
> > --
> > 2.25.1.362.g51ebf55
> >
> >
> >
>
> --
>

2024-04-24 14:44:39

[permalink] [raw]

Subject: Re: [PATCH] sched/isolation: fix boot crash when the boot CPU is nohz_full

HI Oleg, Ingo,

On Mon, Apr 22, 2024 at 08:50:21PM +0200 Oleg Nesterov wrote:
> Phil, Frederic,
>
> Thanks for your review! Who do you think can take these patches?
> At least the 1st one.
>
> To remind, there are more problems with boot CPU in nohz mask, but
> can we at least fix the kernel crash?
>

I think that would be good. I don't know if Peter is at full strength.

Ingo could you take look at this, please?

Cheers,
Phil

> Oleg.
>
> On 04/18, Phil Auld wrote:
> >
> > On Thu, Apr 11, 2024 at 04:39:05PM +0200 Oleg Nesterov wrote:
> > > Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
> > > include the boot CPU, this is no longer true after the commit 08ae95f4fd3b
> > > ("nohz_full: Allow the boot CPU to be nohz_full").
> > >
> > > However after another commit aae17ebb53cd ("workqueue: Avoid using isolated
> > > cpus' timers on queue_delayed_work") the kernel will crash at boot time in
> > > this case; housekeeping_any_cpu() returns an invalid cpu nr until smp_init()
> > > paths bring the 1st housekeeping CPU up.
> > >
> > > Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
> > > return smp_processor_id() in this case. Yes, this is just the simple and
> > > backportable workaround which fixes the symptom, but smp_processor_id() at
> > > boot time should be safe at least for type == HK_TYPE_TIMER, this more or
> > > less matches the tick_do_timer_boot_cpu logic.
> > >
> > > We should not worry about cpu_down(); tick_nohz_cpu_down() will not allow
> > > to offline tick_do_timer_cpu (the 1st online housekeeping CPU).
> > >
> > > Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
> > > Reported-by: Chris von Recklinghausen <[email protected]>
> > > Closes: https://lore.kernel.org/all/[email protected]/
> > > Signed-off-by: Oleg Nesterov <[email protected]>
> >
> > Checking the returned value instead of assuming seems safer in any case.
> >
> > Reviewed-by: Phil Auld <[email protected]>
> >
> > > ---
> > > Documentation/timers/no_hz.rst | 7 ++-----
> > > kernel/sched/isolation.c | 11 ++++++++++-
> > > 2 files changed, 12 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
> > > index f8786be15183..7fe8ef9718d8 100644
> > > --- a/Documentation/timers/no_hz.rst
> > > +++ b/Documentation/timers/no_hz.rst
> > > @@ -129,11 +129,8 @@ adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
> > > online to handle timekeeping tasks in order to ensure that system
> > > calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
> > > (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
> > > -user processes to observe slight drifts in clock rate.) Therefore, the
> > > -boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> > > -"nohz_full=" mask that includes the boot CPU will result in a boot-time
> > > -error message, and the boot CPU will be removed from the mask. Note that
> > > -this means that your system must have at least two CPUs in order for
> > > +user processes to observe slight drifts in clock rate.) Note that this
> > > +means that your system must have at least two CPUs in order for
> > > CONFIG_NO_HZ_FULL=y to do anything for you.
> > >
> > > Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
> > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > > index 373d42c707bc..2a262d3ecb3d 100644
> > > --- a/kernel/sched/isolation.c
> > > +++ b/kernel/sched/isolation.c
> > > @@ -46,7 +46,16 @@ int housekeeping_any_cpu(enum hk_type type)
> > > if (cpu < nr_cpu_ids)
> > > return cpu;
> > >
> > > - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > > + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > > + if (likely(cpu < nr_cpu_ids))
> > > + return cpu;
> > > + /*
> > > + * Unless we have another problem this can only happen
> > > + * at boot time before start_secondary() brings the 1st
> > > + * housekeeping CPU up.
> > > + */
> > > + WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
> > > + type != HK_TYPE_TIMER);
> > > }
> > > }
> > > return smp_processor_id();
> > > --
> > > 2.25.1.362.g51ebf55
> > >
> > >
> > >
> >
> > --
> >
>

--

2024-04-24 20:06:55

[permalink] [raw]

Subject: [tip: sched/urgent] sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: b6ad00418eaf376b4f2a68a1696d6368c1381310
Gitweb: https://git.kernel.org/tip/b6ad00418eaf376b4f2a68a1696d6368c1381310
Author: Oleg Nesterov <[email protected]>
AuthorDate: Sat, 13 Apr 2024 16:17:46 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 24 Apr 2024 21:53:34 +02:00

sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU

housekeeping_setup() checks cpumask_intersects(present, online) to ensure
that the kernel will have at least one housekeeping CPU after smp_init(),
but this doesn't work if the maxcpus= kernel parameter limits the number of
processors available after bootup.

For example, a kernel with "maxcpus=2 nohz_full=0-2" parameters crashes at
boot time on a virtual machine with 4 CPUs.

Change housekeeping_setup() to use cpumask_first_and() and check that the
returned CPU number is valid and less than setup_max_cpus.

Another corner case is "nohz_full=0" on a machine with a single CPU or with
the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty
and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the
WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().

And how should the kernel interpret the "nohz_full=" parameter? It should
be silently ignored, but currently cpulist_parse() happily returns the
empty cpumask and this leads to the same problem.

Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
and do nothing in this case.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/isolation.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 2a262d3..5891e71 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -118,6 +118,7 @@ static void __init housekeeping_setup_type(enum hk_type type,
static int __init housekeeping_setup(char *str, unsigned long flags)
{
cpumask_var_t non_housekeeping_mask, housekeeping_staging;
+ unsigned int first_cpu;
int err = 0;

if ((flags & HK_FLAG_TICK) && !(housekeeping.flags & HK_FLAG_TICK)) {
@@ -138,7 +139,8 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
cpumask_andnot(housekeeping_staging,
cpu_possible_mask, non_housekeeping_mask);

- if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
+ first_cpu = cpumask_first_and(cpu_present_mask, housekeeping_staging);
+ if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) {
__cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
if (!housekeeping.flags) {
@@ -147,6 +149,9 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
}
}

+ if (cpumask_empty(non_housekeeping_mask))
+ goto free_housekeeping_staging;
+
if (!housekeeping.flags) {
/* First setup call ("nohz_full=" or "isolcpus=") */
enum hk_type type;

2024-04-24 20:06:55

[permalink] [raw]

Subject: [tip: sched/urgent] sched/isolation: {revent boot crash when the boot CPU is nohz_full

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 8e3101b38dfc20848a23525b1e6e80bd1641d44c
Gitweb: https://git.kernel.org/tip/8e3101b38dfc20848a23525b1e6e80bd1641d44c
Author: Oleg Nesterov <[email protected]>
AuthorDate: Thu, 11 Apr 2024 16:39:05 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 24 Apr 2024 21:53:34 +02:00

sched/isolation: {revent boot crash when the boot CPU is nohz_full

Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
include the boot CPU, which is no longer true after commit 08ae95f4fd3b
("nohz_full: Allow the boot CPU to be nohz_full").

However after commit aae17ebb53cd ("workqueue: Avoid using isolated cpus'
timers on queue_delayed_work") the kernel will crash at boot time in this
case; housekeeping_any_cpu() returns an invalid CPU number until smp_init()
brings the first housekeeping CPU up.

Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
return smp_processor_id() in this case.

This is just the simple and backportable workaround which fixes the
symptom, but smp_processor_id() at boot time should be safe at least for
type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu
logic.

There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to
offline tick_do_timer_cpu (the 1st online housekeeping CPU).

Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
Reported-by: Chris von Recklinghausen <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Closes: https://lore.kernel.org/all/[email protected]/
---
Documentation/timers/no_hz.rst | 7 ++-----
kernel/sched/isolation.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
index f8786be..7fe8ef9 100644
--- a/Documentation/timers/no_hz.rst
+++ b/Documentation/timers/no_hz.rst
@@ -129,11 +129,8 @@ adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
online to handle timekeeping tasks in order to ensure that system
calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
(This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
-user processes to observe slight drifts in clock rate.) Therefore, the
-boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
-"nohz_full=" mask that includes the boot CPU will result in a boot-time
-error message, and the boot CPU will be removed from the mask. Note that
-this means that your system must have at least two CPUs in order for
+user processes to observe slight drifts in clock rate.) Note that this
+means that your system must have at least two CPUs in order for
CONFIG_NO_HZ_FULL=y to do anything for you.

Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 373d42c..2a262d3 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -46,7 +46,16 @@ int housekeeping_any_cpu(enum hk_type type)
if (cpu < nr_cpu_ids)
return cpu;

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (likely(cpu < nr_cpu_ids))
+ return cpu;
+ /*
+ * Unless we have another problem this can only happen
+ * at boot time before start_secondary() brings the 1st
+ * housekeeping CPU up.
+ */
+ WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
+ type != HK_TYPE_TIMER);
}
}
return smp_processor_id();

2024-04-24 20:41:41

[permalink] [raw]

Subject: Re: [tip: sched/urgent] sched/isolation: {revent boot crash when the boot CPU is nohz_full

On Wed, Apr 24, 2024 at 08:05:02PM -0000 tip-bot2 for Oleg Nesterov wrote:
> The following commit has been merged into the sched/urgent branch of tip:
>
> Commit-ID: 8e3101b38dfc20848a23525b1e6e80bd1641d44c
> Gitweb: https://git.kernel.org/tip/8e3101b38dfc20848a23525b1e6e80bd1641d44c
> Author: Oleg Nesterov <[email protected]>
> AuthorDate: Thu, 11 Apr 2024 16:39:05 +02:00
> Committer: Thomas Gleixner <[email protected]>
> CommitterDate: Wed, 24 Apr 2024 21:53:34 +02:00
>
> sched/isolation: {revent boot crash when the boot CPU is nohz_full
>

Thanks Thomas, Typo in the reworded description :)

> Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
> include the boot CPU, which is no longer true after commit 08ae95f4fd3b
> ("nohz_full: Allow the boot CPU to be nohz_full").
>
> However after commit aae17ebb53cd ("workqueue: Avoid using isolated cpus'
> timers on queue_delayed_work") the kernel will crash at boot time in this
> case; housekeeping_any_cpu() returns an invalid CPU number until smp_init()
> brings the first housekeeping CPU up.
>
> Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
> return smp_processor_id() in this case.
>
> This is just the simple and backportable workaround which fixes the
> symptom, but smp_processor_id() at boot time should be safe at least for
> type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu
> logic.
>
> There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to
> offline tick_do_timer_cpu (the 1st online housekeeping CPU).
>
> Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
> Reported-by: Chris von Recklinghausen <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Reviewed-by: Phil Auld <[email protected]>
> Acked-by: Frederic Weisbecker <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Closes: https://lore.kernel.org/all/[email protected]/
> ---
> Documentation/timers/no_hz.rst | 7 ++-----
> kernel/sched/isolation.c | 11 ++++++++++-
> 2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
> index f8786be..7fe8ef9 100644
> --- a/Documentation/timers/no_hz.rst
> +++ b/Documentation/timers/no_hz.rst
> @@ -129,11 +129,8 @@ adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
> online to handle timekeeping tasks in order to ensure that system
> calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
> (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
> -user processes to observe slight drifts in clock rate.) Therefore, the
> -boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
> -"nohz_full=" mask that includes the boot CPU will result in a boot-time
> -error message, and the boot CPU will be removed from the mask. Note that
> -this means that your system must have at least two CPUs in order for
> +user processes to observe slight drifts in clock rate.) Note that this
> +means that your system must have at least two CPUs in order for
> CONFIG_NO_HZ_FULL=y to do anything for you.
>
> Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c..2a262d3 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -46,7 +46,16 @@ int housekeeping_any_cpu(enum hk_type type)
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> + if (likely(cpu < nr_cpu_ids))
> + return cpu;
> + /*
> + * Unless we have another problem this can only happen
> + * at boot time before start_secondary() brings the 1st
> + * housekeeping CPU up.
> + */
> + WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
> + type != HK_TYPE_TIMER);
> }
> }
> return smp_processor_id();
>

--

2024-04-28 08:13:31

by Ingo Molnar

[permalink] [raw]

Subject: Re: [tip: sched/urgent] sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU

* tip-bot2 for Oleg Nesterov <[email protected]> wrote:

> Another corner case is "nohz_full=0" on a machine with a single CPU or with
> the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty
> and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the
> WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().
>
> And how should the kernel interpret the "nohz_full=" parameter? It should
> be silently ignored, but currently cpulist_parse() happily returns the
> empty cpumask and this leads to the same problem.
>
> Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
> and do nothing in this case.

So arguably the user meant NOHZ_FULL to be turned off - but it is de-facto
already turned off by the fact that there's only a single CPU available,
right?

Thanks,

Ingo

2024-04-28 08:14:43

by Ingo Molnar

[permalink] [raw]

Subject: Re: [tip: sched/urgent] sched/isolation: {revent boot crash when the boot CPU is nohz_full

* Phil Auld <[email protected]> wrote:

> On Wed, Apr 24, 2024 at 08:05:02PM -0000 tip-bot2 for Oleg Nesterov wrote:
> > The following commit has been merged into the sched/urgent branch of tip:
> >
> > Commit-ID: 8e3101b38dfc20848a23525b1e6e80bd1641d44c
> > Gitweb: https://git.kernel.org/tip/8e3101b38dfc20848a23525b1e6e80bd1641d44c
> > Author: Oleg Nesterov <[email protected]>
> > AuthorDate: Thu, 11 Apr 2024 16:39:05 +02:00
> > Committer: Thomas Gleixner <[email protected]>
> > CommitterDate: Wed, 24 Apr 2024 21:53:34 +02:00
> >
> > sched/isolation: {revent boot crash when the boot CPU is nohz_full
> >
>
> Thanks Thomas, Typo in the reworded description :)

Ok, so normally we wouldn't rebase just for a typo in a changelog, but
that's an annoying typo that will show up in shortlogs - so I fixed it all
up in tip:sched/urgent.

Thanks,

Ingo

2024-04-28 08:24:51

[permalink] [raw]

Subject: [tip: sched/urgent] sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 257bf89d84121280904800acd25cc2c444c717ae
Gitweb: https://git.kernel.org/tip/257bf89d84121280904800acd25cc2c444c717ae
Author: Oleg Nesterov <[email protected]>
AuthorDate: Sat, 13 Apr 2024 16:17:46 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sun, 28 Apr 2024 10:08:21 +02:00

sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU

housekeeping_setup() checks cpumask_intersects(present, online) to ensure
that the kernel will have at least one housekeeping CPU after smp_init(),
but this doesn't work if the maxcpus= kernel parameter limits the number of
processors available after bootup.

For example, a kernel with "maxcpus=2 nohz_full=0-2" parameters crashes at
boot time on a virtual machine with 4 CPUs.

Change housekeeping_setup() to use cpumask_first_and() and check that the
returned CPU number is valid and less than setup_max_cpus.

Another corner case is "nohz_full=0" on a machine with a single CPU or with
the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty
and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the
WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().

And how should the kernel interpret the "nohz_full=" parameter? It should
be silently ignored, but currently cpulist_parse() happily returns the
empty cpumask and this leads to the same problem.

Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
and do nothing in this case.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/isolation.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 2a262d3..5891e71 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -118,6 +118,7 @@ static void __init housekeeping_setup_type(enum hk_type type,
static int __init housekeeping_setup(char *str, unsigned long flags)
{
cpumask_var_t non_housekeeping_mask, housekeeping_staging;
+ unsigned int first_cpu;
int err = 0;

if ((flags & HK_FLAG_TICK) && !(housekeeping.flags & HK_FLAG_TICK)) {
@@ -138,7 +139,8 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
cpumask_andnot(housekeeping_staging,
cpu_possible_mask, non_housekeeping_mask);

- if (!cpumask_intersects(cpu_present_mask, housekeeping_staging)) {
+ first_cpu = cpumask_first_and(cpu_present_mask, housekeeping_staging);
+ if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) {
__cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
if (!housekeeping.flags) {
@@ -147,6 +149,9 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
}
}

+ if (cpumask_empty(non_housekeeping_mask))
+ goto free_housekeeping_staging;
+
if (!housekeeping.flags) {
/* First setup call ("nohz_full=" or "isolcpus=") */
enum hk_type type;

2024-04-28 08:24:52

[permalink] [raw]

Subject: [tip: sched/urgent] sched/isolation: Prevent boot crash when the boot CPU is nohz_full

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 5097cbcb38e6e0d2627c9dde1985e91d2c9f880e
Gitweb: https://git.kernel.org/tip/5097cbcb38e6e0d2627c9dde1985e91d2c9f880e
Author: Oleg Nesterov <[email protected]>
AuthorDate: Thu, 11 Apr 2024 16:39:05 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sun, 28 Apr 2024 10:07:12 +02:00

sched/isolation: Prevent boot crash when the boot CPU is nohz_full

Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
include the boot CPU, which is no longer true after:

08ae95f4fd3b ("nohz_full: Allow the boot CPU to be nohz_full").

However after:

aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")

the kernel will crash at boot time in this case; housekeeping_any_cpu()
returns an invalid CPU number until smp_init() brings the first
housekeeping CPU up.

Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
return smp_processor_id() in this case.

This is just the simple and backportable workaround which fixes the
symptom, but smp_processor_id() at boot time should be safe at least for
type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu
logic.

There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to
offline tick_do_timer_cpu (the 1st online housekeeping CPU).

Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
Reported-by: Chris von Recklinghausen <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Closes: https://lore.kernel.org/all/[email protected]/
---
Documentation/timers/no_hz.rst | 7 ++-----
kernel/sched/isolation.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst
index f8786be..7fe8ef9 100644
--- a/Documentation/timers/no_hz.rst
+++ b/Documentation/timers/no_hz.rst
@@ -129,11 +129,8 @@ adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
online to handle timekeeping tasks in order to ensure that system
calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
(This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
-user processes to observe slight drifts in clock rate.) Therefore, the
-boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
-"nohz_full=" mask that includes the boot CPU will result in a boot-time
-error message, and the boot CPU will be removed from the mask. Note that
-this means that your system must have at least two CPUs in order for
+user processes to observe slight drifts in clock rate.) Note that this
+means that your system must have at least two CPUs in order for
CONFIG_NO_HZ_FULL=y to do anything for you.

Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded.
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 373d42c..2a262d3 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -46,7 +46,16 @@ int housekeeping_any_cpu(enum hk_type type)
if (cpu < nr_cpu_ids)
return cpu;

- return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
+ if (likely(cpu < nr_cpu_ids))
+ return cpu;
+ /*
+ * Unless we have another problem this can only happen
+ * at boot time before start_secondary() brings the 1st
+ * housekeeping CPU up.
+ */
+ WARN_ON_ONCE(system_state == SYSTEM_RUNNING ||
+ type != HK_TYPE_TIMER);
}
}
return smp_processor_id();

2024-04-28 13:18:27

[permalink] [raw]

Subject: Re: [tip: sched/urgent] sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU

On 04/28, Ingo Molnar wrote:
>
> * tip-bot2 for Oleg Nesterov <[email protected]> wrote:
>
> > Another corner case is "nohz_full=0" on a machine with a single CPU or with
> > the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty
> > and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the
> > WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().
> >
> > And how should the kernel interpret the "nohz_full=" parameter? It should
> > be silently ignored, but currently cpulist_parse() happily returns the
> > empty cpumask and this leads to the same problem.
> >
> > Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
> > and do nothing in this case.
>
> So arguably the user meant NOHZ_FULL to be turned off - but it is de-facto
> already turned off by the fact that there's only a single CPU available,
> right?

Or the user passes the empty "nohz_full=" mask on a multi-CPU machine.

In both cases (before this patch) housekeeping_setup() calls
tick_nohz_full_setup(non_housekeeping_mask) which sets
tick_nohz_full_running = true even if tick_nohz_full_mask is empty.

This doesn't look right to me and triggers the "this should not happen"
warning in tick_sched_do_timer().

But let me repeat, I know nothing about nohz/etc.

Oleg.

2024-04-29 11:58:23