2023-08-04 11:01:25

by Zhang, Rui

Subject: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Problem statement
-----------------
When using a cgroup isolated partition to isolate cpus including cpu0,
it is observed that cpu0 is woken up frequently but does nothing. This
is not good for power efficiency.

<idle>-0 [000] 616.491602: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
<idle>-0 [000] 616.491608: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
<idle>-0 [000] 616.491616: rcu_utilization: Start context switch
<idle>-0 [000] 616.491618: rcu_utilization: End context switch
<idle>-0 [000] 616.491637: tick_stop: success=1 dependency=NONE
<idle>-0 [000] 616.491637: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
<idle>-0 [000] 616.491638: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000

The above pattern repeats every one or several ticks, resulting in a
total of 2000+ wakeups on cpu0 in 60 seconds when running a workload on
the cpus that are not in the isolated partition.

Rootcause
---------
In NOHZ mode, an active cpu either sends an IPI or touches the idle
cpu's polling flag to wake it up, so that the idle cpu can pull tasks
from the busy cpu. The logic for selecting the target cpu is to use the
first idle cpu that is present in both nohz.idle_cpus_mask and the
housekeeping_cpumask.

In the above scenario, when cpu0 is in the cgroup isolated partition,
its sched domain is detached, but it is still present in both of the
above cpumasks. As a result, cpu0
1. is always selected when kicking idle load balance
2. is woken up from the idle loop
3. calls __schedule() but cannot find any task to pull because it is not
   in any sched_domain, thus it does nothing and reenters idle.
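
For reference, the selection loop that this patch touches looks roughly
like the sketch below (reconstructed around the hunk further down, with
housekeeping_cpumask(HK_TYPE_MISC) as the housekeeping mask; an
illustration, not a verbatim quote of the kernel source):

static inline int find_new_ilb(void)
{
	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_MISC);
	int ilb;

	/* Pick the first idle housekeeping cpu other than ourselves. */
	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
		if (ilb == smp_processor_id())
			continue;

		if (idle_cpu(ilb))
			return ilb;
	}

	return nr_cpu_ids;
}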

Solution
--------
Fix the problem by skipping cpus with no sched domain attached during
NOHZ idle balance.

Signed-off-by: Zhang Rui <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..ea3185a46962 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
 		if (ilb == smp_processor_id())
 			continue;
 
+		if (unlikely(on_null_domain(cpu_rq(ilb))))
+			continue;
+
 		if (idle_cpu(ilb))
 			return ilb;
 	}
--
2.34.1



2023-08-09 07:37:17

by Chen Yu

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

On 2023-08-04 at 17:08:58 +0800, Zhang Rui wrote:
> Problem statement
> -----------------
> When using cgroup isolated partition to isolate cpus including cpu0, it
> is observed that cpu0 is woken up frequenctly but doing nothing. This is
> not good for power efficiency.
>
> <idle>-0 [000] 616.491602: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
> <idle>-0 [000] 616.491608: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> <idle>-0 [000] 616.491616: rcu_utilization: Start context switch
> <idle>-0 [000] 616.491618: rcu_utilization: End context switch
> <idle>-0 [000] 616.491637: tick_stop: success=1 dependency=NONE
> <idle>-0 [000] 616.491637: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
> <idle>-0 [000] 616.491638: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
>
> The above pattern repeats every one or multiple ticks, results in total
> 2000+ wakeups on cpu0 in 60 seconds, when running workload on the
> cpus that are not in the isolated partition.
>
> Rootcause
> ---------
> In NOHZ mode, an active cpu either sends an IPI or touches the idle
> cpu's polling flag to wake it up, so that the idle cpu can pull tasks
> from the busy cpu. The logic for selecting the target cpu is to use the
> first idle cpu that presents in both nohz.idle_cpus_mask and
> housekeeping_cpumask.
>
> In the above scenario, when cpu0 is in the cgroup isolated partition,
> its sched domain is deteched, but it is still available in both of the
> above cpumasks. As a result, cpu0
> 1. is always selected when kicking idle load balance
> 2. is woken up from the idle loop
> 3. calls __schedule() but cannot find any task to pull because it is not
> in any sched_domain, thus it does nothing and reenters idle.
>
> Solution
> --------
> Fix the problem by skipping cpus with no sched domain attached during
> NOHZ idle balance.
>
> Signed-off-by: Zhang Rui <[email protected]>
> ---
> kernel/sched/fair.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3e25be58e2b..ea3185a46962 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> if (ilb == smp_processor_id())
> continue;
>
> + if (unlikely(on_null_domain(cpu_rq(ilb))))
> + continue;
> +
> if (idle_cpu(ilb))
> return ilb;
> }

Is it possible to pass a valid cpumask to kick_ilb() via nohz_balancer_kick()
and let find_new_ilb() scan within that mask? That way we could shrink the
scan range and also avoid the null-domain check in each loop iteration. CPUs
in different cpusets are in different root domains, so the busy CPU (in
cpuset0) will not ask the nohz idle CPU0 (in isolated cpuset1) to launch
idle load balance.

struct root_domain *rd = rq->rd;
...
kick_ilb(flags, rd->span)
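
A minimal sketch of that idea, assuming hypothetical signature changes
(today neither kick_ilb() nor find_new_ilb() takes a cpumask):

/* Hypothetical: restrict the ilb search to the caller's root domain span. */
static inline int find_new_ilb(const struct cpumask *search_mask)
{
	int ilb;

	for_each_cpu_and(ilb, nohz.idle_cpus_mask, search_mask) {
		if (ilb == smp_processor_id())
			continue;

		if (idle_cpu(ilb))
			return ilb;
	}

	return nr_cpu_ids;
}

kick_ilb() would then forward the mask it received from
nohz_balancer_kick(), e.g. find_new_ilb(rd->span) per the snippet above.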

thanks,
Chenyu

2023-08-11 10:11:05

by Zhang, Rui

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hi, Yu,

On Wed, 2023-08-09 at 15:00 +0800, Chen Yu wrote:
> On 2023-08-04 at 17:08:58 +0800, Zhang Rui wrote:
> > Problem statement
> > -----------------
> > When using cgroup isolated partition to isolate cpus including
> > cpu0, it
> > is observed that cpu0 is woken up frequenctly but doing nothing.
> > This is
> > not good for power efficiency.
> >
> > <idle>-0     [000]   616.491602: hrtimer_cancel:      
> > hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491608: hrtimer_start:       
> > hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
> > expires=615996000000 softexpires=615996000000
> > <idle>-0     [000]   616.491616: rcu_utilization:      Start
> > context switch
> > <idle>-0     [000]   616.491618: rcu_utilization:      End context
> > switch
> > <idle>-0     [000]   616.491637: tick_stop:            success=1
> > dependency=NONE
> > <idle>-0     [000]   616.491637: hrtimer_cancel:      
> > hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491638: hrtimer_start:       
> > hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
> > expires=616420000000 softexpires=616420000000
> >
> > The above pattern repeats every one or multiple ticks, results in
> > total
> > 2000+ wakeups on cpu0 in 60 seconds, when running workload on the
> > cpus that are not in the isolated partition.
> >
> > Rootcause
> > ---------
> > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > cpu's polling flag to wake it up, so that the idle cpu can pull
> > tasks
> > from the busy cpu. The logic for selecting the target cpu is to use
> > the
> > first idle cpu that presents in both nohz.idle_cpus_mask and
> > housekeeping_cpumask.
> >
> > In the above scenario, when cpu0 is in the cgroup isolated
> > partition,
> > its sched domain is deteched, but it is still available in both of
> > the
> > above cpumasks. As a result, cpu0
> > 1. is always selected when kicking idle load balance
> > 2. is woken up from the idle loop
> > 3. calls __schedule() but cannot find any task to pull because it
> > is not
> >    in any sched_domain, thus it does nothing and reenters idle.
> >
> > Solution
> > --------
> > Fix the problem by skipping cpus with no sched domain attached
> > during
> > NOHZ idle balance.
> >
> > Signed-off-by: Zhang Rui <[email protected]>
> > ---
> >  kernel/sched/fair.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3e25be58e2b..ea3185a46962 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> >                 if (ilb == smp_processor_id())
> >                         continue;
> >  
> > +               if (unlikely(on_null_domain(cpu_rq(ilb))))
> > +                       continue;
> > +
> >                 if (idle_cpu(ilb))
> >                         return ilb;
> >         }
>
> Is it possible to pass a valid cpumask to kick_ilb() via
> nohz_balancer_kick()
> and let find_new_ilb() scan in that mask? So we could shrink the scan
> range
> and also reduce the null domain check in each loop. CPUs in different
> cpuset are in different root domains, the busy CPU(in cpuset0) will
> not ask
> nohz idle CPU0(in isolated cpuset1) to launch idle load balance.
>
> struct root_domain *rd = rq->rd;
> ...
> kick_ilb(flags, rd->span)
>         

Yeah. This also sounds like a reasonable approach. I can make a patch
to confirm it works as expected.

thanks,
rui

2023-08-14 05:32:25

by Aaron Lu

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hi Rui,

On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
> Problem statement
> -----------------
> When using cgroup isolated partition to isolate cpus including cpu0, it
> is observed that cpu0 is woken up frequenctly but doing nothing. This is
> not good for power efficiency.
>
> <idle>-0 [000] 616.491602: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
> <idle>-0 [000] 616.491608: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> <idle>-0 [000] 616.491616: rcu_utilization: Start context switch
> <idle>-0 [000] 616.491618: rcu_utilization: End context switch
> <idle>-0 [000] 616.491637: tick_stop: success=1 dependency=NONE
> <idle>-0 [000] 616.491637: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
> <idle>-0 [000] 616.491638: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
>
> The above pattern repeats every one or multiple ticks, results in total
> 2000+ wakeups on cpu0 in 60 seconds, when running workload on the
> cpus that are not in the isolated partition.
>
> Rootcause
> ---------
> In NOHZ mode, an active cpu either sends an IPI or touches the idle
> cpu's polling flag to wake it up, so that the idle cpu can pull tasks
> from the busy cpu. The logic for selecting the target cpu is to use the
> first idle cpu that presents in both nohz.idle_cpus_mask and
> housekeeping_cpumask.
>
> In the above scenario, when cpu0 is in the cgroup isolated partition,
> its sched domain is deteched, but it is still available in both of the
> above cpumasks. As a result, cpu0

I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
set itself in nohz.idle_cpus_mask and thus should not be chosen as
ilb_cpu. I wonder what's stopping this from working?

> 1. is always selected when kicking idle load balance
> 2. is woken up from the idle loop
> 3. calls __schedule() but cannot find any task to pull because it is not
> in any sched_domain, thus it does nothing and reenters idle.
>
> Solution
> --------
> Fix the problem by skipping cpus with no sched domain attached during
> NOHZ idle balance.
>
> Signed-off-by: Zhang Rui <[email protected]>
> ---
> kernel/sched/fair.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3e25be58e2b..ea3185a46962 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> if (ilb == smp_processor_id())
> continue;
>
> + if (unlikely(on_null_domain(cpu_rq(ilb))))
> + continue;
> +
> if (idle_cpu(ilb))
> return ilb;
> }
> --
> 2.34.1
>

2023-08-14 08:51:55

by Zhang, Rui

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
> Hi Rui,
>
> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
> > Problem statement
> > -----------------
> > When using cgroup isolated partition to isolate cpus including
> > cpu0, it
> > is observed that cpu0 is woken up frequenctly but doing nothing.
> > This is
> > not good for power efficiency.
> >
> > <idle>-0     [000]   616.491602: hrtimer_cancel:      
> > hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491608: hrtimer_start:       
> > hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
> > expires=615996000000 softexpires=615996000000
> > <idle>-0     [000]   616.491616: rcu_utilization:      Start
> > context switch
> > <idle>-0     [000]   616.491618: rcu_utilization:      End context
> > switch
> > <idle>-0     [000]   616.491637: tick_stop:            success=1
> > dependency=NONE
> > <idle>-0     [000]   616.491637: hrtimer_cancel:      
> > hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491638: hrtimer_start:       
> > hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
> > expires=616420000000 softexpires=616420000000
> >
> > The above pattern repeats every one or multiple ticks, results in
> > total
> > 2000+ wakeups on cpu0 in 60 seconds, when running workload on the
> > cpus that are not in the isolated partition.
> >
> > Rootcause
> > ---------
> > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > cpu's polling flag to wake it up, so that the idle cpu can pull
> > tasks
> > from the busy cpu. The logic for selecting the target cpu is to use
> > the
> > first idle cpu that presents in both nohz.idle_cpus_mask and
> > housekeeping_cpumask.
> >
> > In the above scenario, when cpu0 is in the cgroup isolated
> > partition,
> > its sched domain is deteched, but it is still available in both of
> > the
> > above cpumasks. As a result, cpu0
>
> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
> set itself in nohz.idle_cpus_mask and thus should not be chosen as
> ilb_cpu. I wonder what's stopping this from working?

One thing I forgot to mention is that the problem is gone if we offline
and re-online those cpus. In that case, the isolated cpus are removed
from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
added back.

At runtime, the cpus can be removed from nohz.idle_cpus_mask only via
the path below:
	trigger_load_balance()
		if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
			return;
		nohz_balancer_kick(rq);
			nohz_balance_exit_idle()

My understanding is that if a cpu is in nohz.idle_cpus_mask when it is
isolated, there is no chance to remove it from that mask later, so the
check in nohz_balance_enter_idle() does not help.
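
For reference, that early return in trigger_load_balance() looks roughly
like this (a sketch of the upstream function, not a verbatim quote):

void trigger_load_balance(struct rq *rq)
{
	/*
	 * Don't need to rebalance while attached to NULL domain or
	 * runqueue CPU is not active
	 */
	if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
		return;

	if (time_after_eq(jiffies, rq->next_balance))
		raise_softirq(SCHED_SOFTIRQ);

	/*
	 * Never reached for null-domain cpus, so nohz_balance_exit_idle()
	 * (called from nohz_balancer_kick()) never runs for them either.
	 */
	nohz_balancer_kick(rq);
}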

thanks,
rui


>
> > 1. is always selected when kicking idle load balance
> > 2. is woken up from the idle loop
> > 3. calls __schedule() but cannot find any task to pull because it
> > is not
> >    in any sched_domain, thus it does nothing and reenters idle.
> >
> > Solution
> > --------
> > Fix the problem by skipping cpus with no sched domain attached
> > during
> > NOHZ idle balance.
> >
> > Signed-off-by: Zhang Rui <[email protected]>
> > ---
> >  kernel/sched/fair.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3e25be58e2b..ea3185a46962 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> >                 if (ilb == smp_processor_id())
> >                         continue;
> >  
> > +               if (unlikely(on_null_domain(cpu_rq(ilb))))
> > +                       continue;
> > +
> >                 if (idle_cpu(ilb))
> >                         return ilb;
> >         }
> > --
> > 2.34.1
> >

2023-09-08 10:04:10

by Pierre Gondois

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hello Rui,

On 8/14/23 10:30, Zhang, Rui wrote:
> On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
>> Hi Rui,
>>
>> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
>>> Problem statement
>>> -----------------
>>> When using cgroup isolated partition to isolate cpus including
>>> cpu0, it
>>> is observed that cpu0 is woken up frequenctly but doing nothing.
>>> This is
>>> not good for power efficiency.
>>>
>>> <idle>-0     [000]   616.491602: hrtimer_cancel:
>>> hrtimer=0xffff8e8fdf623c10
>>> <idle>-0     [000]   616.491608: hrtimer_start:
>>> hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
>>> expires=615996000000 softexpires=615996000000
>>> <idle>-0     [000]   616.491616: rcu_utilization:      Start
>>> context switch
>>> <idle>-0     [000]   616.491618: rcu_utilization:      End context
>>> switch
>>> <idle>-0     [000]   616.491637: tick_stop:            success=1
>>> dependency=NONE
>>> <idle>-0     [000]   616.491637: hrtimer_cancel:
>>> hrtimer=0xffff8e8fdf623c10
>>> <idle>-0     [000]   616.491638: hrtimer_start:
>>> hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
>>> expires=616420000000 softexpires=616420000000
>>>
>>> The above pattern repeats every one or multiple ticks, results in
>>> total
>>> 2000+ wakeups on cpu0 in 60 seconds, when running workload on the
>>> cpus that are not in the isolated partition.
>>>
>>> Rootcause
>>> ---------
>>> In NOHZ mode, an active cpu either sends an IPI or touches the idle
>>> cpu's polling flag to wake it up, so that the idle cpu can pull
>>> tasks
>>> from the busy cpu. The logic for selecting the target cpu is to use
>>> the
>>> first idle cpu that presents in both nohz.idle_cpus_mask and
>>> housekeeping_cpumask.
>>>
>>> In the above scenario, when cpu0 is in the cgroup isolated
>>> partition,
>>> its sched domain is deteched, but it is still available in both of
>>> the
>>> above cpumasks. As a result, cpu0
>>
>> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
>> set itself in nohz.idle_cpus_mask and thus should not be chosen as
>> ilb_cpu. I wonder what's stopping this from working?
>
> One thing I forgot to mention is that the problem is gone if we offline
> and re-online those cpus. In that case, the isolated cpus are removed
> from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
> added back.
>
> At runtime, the cpus can be removed from the nohz.idle_cpus_mask in
> below case only
> trigger_load_balance()
> if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
> return;
> nohz_balancer_kick(rq);
> nohz_balance_exit_idle()
>
> My understanding is that if a cpu is in nohz.idle_cpus_mask when it is
> isolated, there is no chance to remove it from that mask later, so the
> check in nohz_balance_enter_idle() does not help.


As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped is
never updated. Instead of avoiding isolated CPUs at each find_new_ilb() call
as this patch does, maybe it would be better to permanently remove these CPUs
from nohz.idle_cpus_mask. This would lower the number of checks done.
This could be done with something like:
@@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
 	 */
 	rq->has_blocked_load = 1;
 
+	/* If we're a completely isolated CPU, we don't play: */
+	if (on_null_domain(rq)) {
+		if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
+			cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
+			atomic_dec(&nohz.nr_cpus);
+		}
+
+		/*
+		 * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
+		 * value that was stored when rebuild_sched_domains() was called.
+		 */
+		rq->nohz_tick_stopped = 1;
+	}
+
 	/*
 	 * The tick is still stopped but load could have been added in the
 	 * meantime. We set the nohz.has_blocked flag to trig a check of the
@@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
 	if (rq->nohz_tick_stopped)
 		goto out;
 
-	/* If we're a completely isolated CPU, we don't play: */
-	if (on_null_domain(rq))
-		return;
-
 	rq->nohz_tick_stopped = 1;
 
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);

Otherwise, I could reproduce the issue and the patch solved it, so:
Tested-by: Pierre Gondois <[email protected]>





Also, your patch doesn't aim to solve that, but I think there is an issue
when updating cpuset.cpus when an isolated partition was already created:

// Create an isolated partition containing CPU0
# mkdir cgroup
# mount -t cgroup2 none cgroup/
# mkdir cgroup/Testing
# echo "+cpuset" > cgroup/cgroup.subtree_control
# echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
# echo 0 > cgroup/Testing/cpuset.cpus
# echo isolated > cgroup/Testing/cpuset.cpus.partition

// CPU0's sched domain is detached:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0 domain1

// Change the isolated partition to be CPU1
# echo 1 > cgroup/Testing/cpuset.cpus

// CPU[0-1] sched domains are not updated:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0 domain1

Regards,
Pierre

2023-09-12 14:54:32

by Pierre Gondois

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hello Rui and Aaron,

On 9/11/23 18:23, Zhang, Rui wrote:
> On Mon, 2023-09-11 at 19:42 +0800, Aaron Lu wrote:
>> Hi Pierre,
>>
>> On Fri, Sep 08, 2023 at 11:43:50AM +0200, Pierre Gondois wrote:
>>> Hello Rui,
>>>
>>> On 8/14/23 10:30, Zhang, Rui wrote:
>>>> On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
>>>>> Hi Rui,
>>>>>
>>>>> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
>>>>>> Problem statement
>>>>>> -----------------
>>>>>> When using cgroup isolated partition to isolate cpus
>>>>>> including
>>>>>> cpu0, it
>>>>>> is observed that cpu0 is woken up frequenctly but doing
>>>>>> nothing.
>>>>>> This is
>>>>>> not good for power efficiency.
>>>>>>
>>>>>> <idle>-0     [000]   616.491602: hrtimer_cancel:
>>>>>> hrtimer=0xffff8e8fdf623c10
>>>>>> <idle>-0     [000]   616.491608: hrtimer_start:
>>>>>> hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
>>>>>> expires=615996000000 softexpires=615996000000
>>>>>> <idle>-0     [000]   616.491616: rcu_utilization:      Start
>>>>>> context switch
>>>>>> <idle>-0     [000]   616.491618: rcu_utilization:      End
>>>>>> context
>>>>>> switch
>>>>>> <idle>-0     [000]   616.491637: tick_stop:
>>>>>> success=1
>>>>>> dependency=NONE
>>>>>> <idle>-0     [000]   616.491637: hrtimer_cancel:
>>>>>> hrtimer=0xffff8e8fdf623c10
>>>>>> <idle>-0     [000]   616.491638: hrtimer_start:
>>>>>> hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
>>>>>> expires=616420000000 softexpires=616420000000
>>>>>>
>>>>>> The above pattern repeats every one or multiple ticks,
>>>>>> results in
>>>>>> total
>>>>>> 2000+ wakeups on cpu0 in 60 seconds, when running workload on
>>>>>> the
>>>>>> cpus that are not in the isolated partition.
>>>>>>
>>>>>> Rootcause
>>>>>> ---------
>>>>>> In NOHZ mode, an active cpu either sends an IPI or touches
>>>>>> the idle
>>>>>> cpu's polling flag to wake it up, so that the idle cpu can
>>>>>> pull
>>>>>> tasks
>>>>>> from the busy cpu. The logic for selecting the target cpu is
>>>>>> to use
>>>>>> the
>>>>>> first idle cpu that presents in both nohz.idle_cpus_mask and
>>>>>> housekeeping_cpumask.
>>>>>>
>>>>>> In the above scenario, when cpu0 is in the cgroup isolated
>>>>>> partition,
>>>>>> its sched domain is deteched, but it is still available in
>>>>>> both of
>>>>>> the
>>>>>> above cpumasks. As a result, cpu0
>>>>>
>>>>> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it
>>>>> will not
>>>>> set itself in nohz.idle_cpus_mask and thus should not be chosen
>>>>> as
>>>>> ilb_cpu. I wonder what's stopping this from working?
>>>>
>>>> One thing I forgot to mention is that the problem is gone if we
>>>> offline
>>>> and re-online those cpus. In that case, the isolated cpus
>>>> are removed
>>>> from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are
>>>> never
>>>> added back.
>>>>
>>>> At runtime, the cpus can be removed from the nohz.idle_cpus_mask
>>>> in
>>>> below case only
>>>>         trigger_load_balance()
>>>>                 if (unlikely(on_null_domain(rq) ||
>>>> !cpu_active(cpu_of(rq))))
>>>>                         return;
>>>>                 nohz_balancer_kick(rq);
>>>>                         nohz_balance_exit_idle()
>>>>
>>>> My understanding is that if a cpu is in nohz.idle_cpus_mask when
>>>> it is
>>>> isolated, there is no chance to remove it from that mask later,
>>>> so the
>>>> check in nohz_balance_enter_idle() does not help.
>>>
>>>
>>> As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped
>>> is
>>> never updated. Instead of avoiding isolated CPUs at each
>>> find_new_ilb() call
>>> as this patch does, maybe it would be better to permanently remove
>>> these CPUs
>>> from nohz.idle_cpus_mask. This would lower the number of checks
>>> done.
>>
>> I agree.
>
> Sounds reasonable to me.
>
>>
>>> This could be done with something like:
>>> @@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
>>>           */
>>>          rq->has_blocked_load = 1;
>>> +       /* If we're a completely isolated CPU, we don't play: */
>>> +       if (on_null_domain(rq)) {
>>> +               if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask))
>>> {
>>> +                       cpumask_clear_cpu(rq->cpu,
>>> nohz.idle_cpus_mask);
>>> +                       atomic_dec(&nohz.nr_cpus);
>>> +               }
>>> +
>>
>>
>>> +               /*
>>> +                * Set nohz_tick_stopped on isolated CPUs to avoid
>>> keeping the
>>> +                * value that was stored when
>>> rebuild_sched_domains() was called.
>>> +                */
>>> +               rq->nohz_tick_stopped = 1;
>>
>> It's not immediately clear to me what the above comment and code
>> mean,
>> maybe that's because I know very little about sched domain related
>> code.
>> Other than this, your change looks good to me.
>>
>
> If we set rq->nohz_tick_stopped for the isolated cpu, next time when we
> invoke nohz_balance_exit_idle(), we clear rq->nohz_tick_stopped and
> decrease nohz.nr_cpus (for the second time). This seems like a problem.
>
> In the current code, when rq->nohz_tick_stopped is set, it means the
> cpu is in the nohz.idle_cpus_mask, and the code above breaks this
> assumption.


Yes right indeed,
This happens when putting a CPU offline (as you mentioned earlier,
putting a CPU offline clears the CPU in the idle_cpus_mask).

The load balancing related variables are unused if a CPU has a NULL
rq as it cannot pull any task. Ideally we should clear them once,
when attaching a NULL sd to the CPU.

The following snipped should do that and solve the issue you mentioned:
--- snip ---
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -9,8 +9,10 @@
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern int get_nohz_timer_target(void);
+extern void nohz_clean_sd_state(int cpu);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
+static inline void nohz_clean_sd_state(int cpu) { }
 #endif
 
 #ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..6fcabe5d08f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
 {
 	SCHED_WARN_ON(rq != this_rq());
 
+	if (on_null_domain(rq))
+		return;
+
 	if (likely(!rq->nohz_tick_stopped))
 		return;
 
@@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
 	rcu_read_unlock();
 }
 
+void nohz_clean_sd_state(int cpu) {
+	struct rq *rq = cpu_rq(cpu);
+
+	rq->nohz_tick_stopped = 0;
+	if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_dec(&nohz.nr_cpus);
+	}
+	set_cpu_sd_state_idle(cpu);
+}
+
 /*
  * This routine will record that the CPU is going idle with tick stopped.
  * This info will be used in performing idle load balancing in the future.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..d31137b5f0ce 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
 		static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
 
 	rcu_read_lock();
-	for_each_cpu(i, cpu_map)
+	for_each_cpu(i, cpu_map) {
 		cpu_attach_domain(NULL, &def_root_domain, i);
+		nohz_clean_sd_state(i);
+	}
 	rcu_read_unlock();
 }
 
--- snip ---

Regards,
Pierre

>
>>
>>> +       }
>>> +
>>>          /*
>>>           * The tick is still stopped but load could have been
>>> added in the
>>>           * meantime. We set the nohz.has_blocked flag to trig a
>>> check of the
>>> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
>>>          if (rq->nohz_tick_stopped)
>>>                  goto out;
>>> -       /* If we're a completely isolated CPU, we don't play: */
>>> -       if (on_null_domain(rq))
>>> -               return;
>>> -
>>>          rq->nohz_tick_stopped = 1;
>>>          cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>>>
>>> Otherwise I could reproduce the issue and the patch was solving it,
>>> so:
>>> Tested-by: Pierre Gondois <[email protected]>
>
> Thanks for testing, really appreciated!
>>>
>>>
>>>
>>>
>>>
>>> Also, your patch doesn't aim to solve that, but I think there is an
>>> issue
>>> when updating cpuset.cpus when an isolated partition was already
>>> created:
>>>
>>> // Create an isolated partition containing CPU0
>>> # mkdir cgroup
>>> # mount -t cgroup2 none cgroup/
>>> # mkdir cgroup/Testing
>>> # echo "+cpuset" > cgroup/cgroup.subtree_control
>>> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
>>> # echo 0 > cgroup/Testing/cpuset.cpus
>>> # echo isolated > cgroup/Testing/cpuset.cpus.partition
>>>
>>> // CPU0's sched domain is detached:
>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>> domain0  domain1
>>>
>>> // Change the isolated partition to be CPU1
>>> # echo 1 > cgroup/Testing/cpuset.cpus
>>>
>>> // CPU[0-1] sched domains are not updated:
>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>> domain0  domain1
>>>
> Interesting. Let me check and get back to you later on this. :)
>
> thanks,
> rui

2023-09-14 14:54:56

by Pierre Gondois

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance



On 9/14/23 11:23, Zhang, Rui wrote:
> Hi, Pierre,
>
>>
>> Yes right indeed,
>> This happens when putting a CPU offline (as you mentioned earlier,
>> putting a CPU offline clears the CPU in the idle_cpus_mask).
>>
>> The load balancing related variables
>
> including?

I meant the nohz idle variables in the load balancing, so I was referring to:
(struct sched_domain_shared).nr_busy_cpus
(struct sched_domain).nohz_idle
nohz.idle_cpus_mask
nohz.nr_cpus
(struct rq).nohz_tick_stopped

>
>> are unused if a CPU has a NULL
>> rq as it cannot pull any task. Ideally we should clear them once,
>> when attaching a NULL sd to the CPU.
>
> This sounds good to me. But TBH, I don't have enough confidence to do
> so because I'm not crystal clear about how these variables are used.
>
> Some questions about the code below.
>>
>> The following snipped should do that and solve the issue you
>> mentioned:
>> --- snip ---
>> --- a/include/linux/sched/nohz.h
>> +++ b/include/linux/sched/nohz.h
>> @@ -9,8 +9,10 @@
>>   #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>>   extern void nohz_balance_enter_idle(int cpu);
>>   extern int get_nohz_timer_target(void);
>> +extern void nohz_clean_sd_state(int cpu);
>>   #else
>>   static inline void nohz_balance_enter_idle(int cpu) { }
>> +static inline void nohz_clean_sd_state(int cpu) { }
>>   #endif
>>
>>   #ifdef CONFIG_NO_HZ_COMMON
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b3e25be58e2b..6fcabe5d08f5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
>>   {
>>          SCHED_WARN_ON(rq != this_rq());
>>
>> +       if (on_null_domain(rq))
>> +               return;
>> +
>>          if (likely(!rq->nohz_tick_stopped))
>>                  return;
>>
> if we force clearing rq->nohz_tick_stopped when detaching domain, why
> bother adding the first check?

Yes you're right. I added this check for safety, but this is not
mandatory.

>
>>
>> @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
>>          rcu_read_unlock();
>>   }
>>
>> +void nohz_clean_sd_state(int cpu) {
>> +       struct rq *rq = cpu_rq(cpu);
>> +
>> +       rq->nohz_tick_stopped = 0;
>> +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
>> +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
>> +               atomic_dec(&nohz.nr_cpus);
>> +       }
>> +       set_cpu_sd_state_idle(cpu);
>> +}
>> +
>
> detach_destroy_domains
> cpu_attach_domain
> update_top_cache_domain
>
> as we clears per_cpu(sd_llc, cpu) for the isolated cpu in
> cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here,
> no?

Yes, you're right: the cpu_attach_domain() and nohz_clean_sd_state() calls
have to be inverted to avoid what you just described.
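
Concretely, the detach_destroy_domains() hunk from the earlier snippet
would then become something like this (sketch only):

	rcu_read_lock();
	for_each_cpu(i, cpu_map) {
		/* Clear nohz/idle state while sd_llc is still valid. */
		nohz_clean_sd_state(i);
		cpu_attach_domain(NULL, &def_root_domain, i);
	}
	rcu_read_unlock();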

It also seems that the current kernel doesn't decrease nr_busy_cpus
when putting CPUs in an isolated partition. Indeed if a CPU is counted
in nr_busy_cpus, putting the CPU in an isolated partition doesn't trigger
any call to set_cpu_sd_state_idle().
So it might be an additional argument.

Thanks for reading the patch,
Regards,
Pierre

>
> thanks,
> rui
>>   /*
>>    * This routine will record that the CPU is going idle with tick
>> stopped.
>>    * This info will be used in performing idle load balancing in the
>> future.
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index d3a3b2646ec4..d31137b5f0ce 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const
>> struct cpumask *cpu_map)
>>
>> static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
>>
>>          rcu_read_lock();
>> -       for_each_cpu(i, cpu_map)
>> +       for_each_cpu(i, cpu_map) {
>>                  cpu_attach_domain(NULL, &def_root_domain, i);
>> +               nohz_clean_sd_state(i);
>> +       }
>>          rcu_read_unlock();
>>   }
>>
>> --- snip ---
>>
>> Regards,
>> Pierre
>>
>>>
>>>>
>>>>> +       }
>>>>> +
>>>>>           /*
>>>>>            * The tick is still stopped but load could have been
>>>>> added in the
>>>>>            * meantime. We set the nohz.has_blocked flag to trig
>>>>> a
>>>>> check of the
>>>>> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
>>>>>           if (rq->nohz_tick_stopped)
>>>>>                   goto out;
>>>>> -       /* If we're a completely isolated CPU, we don't play:
>>>>> */
>>>>> -       if (on_null_domain(rq))
>>>>> -               return;
>>>>> -
>>>>>           rq->nohz_tick_stopped = 1;
>>>>>           cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>>>>>
>>>>> Otherwise I could reproduce the issue and the patch was solving
>>>>> it,
>>>>> so:
>>>>> Tested-by: Pierre Gondois <[email protected]>
>>>
>>> Thanks for testing, really appreciated!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Also, your patch doesn't aim to solve that, but I think there
>>>>> is an
>>>>> issue
>>>>> when updating cpuset.cpus when an isolated partition was
>>>>> already
>>>>> created:
>>>>>
>>>>> // Create an isolated partition containing CPU0
>>>>> # mkdir cgroup
>>>>> # mount -t cgroup2 none cgroup/
>>>>> # mkdir cgroup/Testing
>>>>> # echo "+cpuset" > cgroup/cgroup.subtree_control
>>>>> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
>>>>> # echo 0 > cgroup/Testing/cpuset.cpus
>>>>> # echo isolated > cgroup/Testing/cpuset.cpus.partition
>>>>>
>>>>> // CPU0's sched domain is detached:
>>>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>>>> domain0  domain1
>>>>>
>>>>> // Change the isolated partition to be CPU1
>>>>> # echo 1 > cgroup/Testing/cpuset.cpus
>>>>>
>>>>> // CPU[0-1] sched domains are not updated:
>>>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>>>> domain0  domain1
>>>>>
>>> Interesting. Let me check and get back to you later on this. :)
>>>
>>> thanks,
>>> rui
>

2023-09-15 03:07:39

by Zhang, Rui

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hi, Pierre,

>
> Yes right indeed,
> This happens when putting a CPU offline (as you mentioned earlier,
> putting a CPU offline clears the CPU in the idle_cpus_mask).
>
> The load balancing related variables

including?

> are unused if a CPU has a NULL
> rq as it cannot pull any task. Ideally we should clear them once,
> when attaching a NULL sd to the CPU.

This sounds good to me. But TBH, I don't have enough confidence to do
so because I'm not crystal clear about how these variables are used.

Some questions about the code below.
>
> The following snipped should do that and solve the issue you
> mentioned:
> --- snip ---
> --- a/include/linux/sched/nohz.h
> +++ b/include/linux/sched/nohz.h
> @@ -9,8 +9,10 @@
>   #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>   extern void nohz_balance_enter_idle(int cpu);
>   extern int get_nohz_timer_target(void);
> +extern void nohz_clean_sd_state(int cpu);
>   #else
>   static inline void nohz_balance_enter_idle(int cpu) { }
> +static inline void nohz_clean_sd_state(int cpu) { }
>   #endif
>  
>   #ifdef CONFIG_NO_HZ_COMMON
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3e25be58e2b..6fcabe5d08f5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
>   {
>          SCHED_WARN_ON(rq != this_rq());
>  
> +       if (on_null_domain(rq))
> +               return;
> +
>          if (likely(!rq->nohz_tick_stopped))
>                  return;
>
If we force clearing rq->nohz_tick_stopped when detaching the domain, why
bother adding the first check?

>  
> @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
>          rcu_read_unlock();
>   }
>  
> +void nohz_clean_sd_state(int cpu) {
> +       struct rq *rq = cpu_rq(cpu);
> +
> +       rq->nohz_tick_stopped = 0;
> +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> +               atomic_dec(&nohz.nr_cpus);
> +       }
> +       set_cpu_sd_state_idle(cpu);
> +}
> +

detach_destroy_domains
	cpu_attach_domain
		update_top_cache_domain

As we clear per_cpu(sd_llc, cpu) for the isolated cpu in
cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here,
no?

thanks,
rui
>   /*
>    * This routine will record that the CPU is going idle with tick
> stopped.
>    * This info will be used in performing idle load balancing in the
> future.
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d3a3b2646ec4..d31137b5f0ce 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const
> struct cpumask *cpu_map)
>                 
> static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
>  
>          rcu_read_lock();
> -       for_each_cpu(i, cpu_map)
> +       for_each_cpu(i, cpu_map) {
>                  cpu_attach_domain(NULL, &def_root_domain, i);
> +               nohz_clean_sd_state(i);
> +       }
>          rcu_read_unlock();
>   }
>
> --- snip ---
>
> Regards,
> Pierre
>
> >
> > >
> > > > +       }
> > > > +
> > > >           /*
> > > >            * The tick is still stopped but load could have been
> > > > added in the
> > > >            * meantime. We set the nohz.has_blocked flag to trig
> > > > a
> > > > check of the
> > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
> > > >           if (rq->nohz_tick_stopped)
> > > >                   goto out;
> > > > -       /* If we're a completely isolated CPU, we don't play:
> > > > */
> > > > -       if (on_null_domain(rq))
> > > > -               return;
> > > > -
> > > >           rq->nohz_tick_stopped = 1;
> > > >           cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > >
> > > > Otherwise I could reproduce the issue and the patch was solving
> > > > it,
> > > > so:
> > > > Tested-by: Pierre Gondois <[email protected]>
> >
> > Thanks for testing, really appreciated!
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Also, your patch doesn't aim to solve that, but I think there
> > > > is an
> > > > issue
> > > > when updating cpuset.cpus when an isolated partition was
> > > > already
> > > > created:
> > > >
> > > > // Create an isolated partition containing CPU0
> > > > # mkdir cgroup
> > > > # mount -t cgroup2 none cgroup/
> > > > # mkdir cgroup/Testing
> > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > >
> > > > // CPU0's sched domain is detached:
> > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > domain0  domain1
> > > >
> > > > // Change the isolated partition to be CPU1
> > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > >
> > > > // CPU[0-1] sched domains are not updated:
> > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > domain0  domain1
> > > >
> > Interesting. Let me check and get back to you later on this. :)
> >
> > thanks,
> > rui

2023-09-20 07:27:08

by Zhang, Rui

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hi, Pierre,

Sorry for the late response. I'm still ramping up on the related code.

On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote:
>
>
> On 9/14/23 11:23, Zhang, Rui wrote:
> > Hi, Pierre,
> >
> > >
> > > Yes right indeed,
> > > This happens when putting a CPU offline (as you mentioned
> > > earlier,
> > > putting a CPU offline clears the CPU in the idle_cpus_mask).
> > >
> > > The load balancing related variables
> >
> > including?
>
> I meant the nohz idle variables in the load balancing, so I was
> referring to:
> (struct sched_domain_shared).nr_busy_cpus
> (struct sched_domain).nohz_idle
> nohz.idle_cpus_mask
> nohz.nr_cpus
> (struct rq).nohz_tick_stopped

IMO, the problem is that, for an isolated CPU,
1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared)
2. it is not a busy cpu (sds->nr_busy_cpus should be decreased)

But the current code does not have a third state to describe this, so we
need to either
1. add extra logic, like on_null_domain() checks
or
2. rely on current logic, but update all related variables correctly,
like you proposed.

But in any case, we should stick with one direction.

If we follow the first one, the original patch should be used, which
IMO is simple and straightforward.
If we follow the latter one, we'd better audit and remove the current
on_null_domain() usage at the same time. TBH, I'm not confident enough
to make such a change. But if you want to propose something, I'd be glad
to test it.

thanks,
rui

>
> >
> > >   are unused if a CPU has a NULL
> > > rq as it cannot pull any task. Ideally we should clear them once,
> > > when attaching a NULL sd to the CPU.
> >
> > This sounds good to me. But TBH, I don't have enough confidence to
> > do
> > so because I'm not crystal clear about how these variables are
> > used.
> >
> > Some questions about the code below.
> > >
> > > The following snipped should do that and solve the issue you
> > > mentioned:
> > > --- snip ---
> > > --- a/include/linux/sched/nohz.h
> > > +++ b/include/linux/sched/nohz.h
> > > @@ -9,8 +9,10 @@
> > >    #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> > >    extern void nohz_balance_enter_idle(int cpu);
> > >    extern int get_nohz_timer_target(void);
> > > +extern void nohz_clean_sd_state(int cpu);
> > >    #else
> > >    static inline void nohz_balance_enter_idle(int cpu) { }
> > > +static inline void nohz_clean_sd_state(int cpu) { }
> > >    #endif
> > >   
> > >    #ifdef CONFIG_NO_HZ_COMMON
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index b3e25be58e2b..6fcabe5d08f5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq
> > > *rq)
> > >    {
> > >           SCHED_WARN_ON(rq != this_rq());
> > >   
> > > +       if (on_null_domain(rq))
> > > +               return;
> > > +
> > >           if (likely(!rq->nohz_tick_stopped))
> > >                   return;
> > >
> > if we force clearing rq->nohz_tick_stopped when detaching domain,
> > why
> > bother adding the first check?
>
> Yes you're right. I added this check for safety, but this is not
> mandatory.
>
> >
> > >   
> > > @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int
> > > cpu)
> > >           rcu_read_unlock();
> > >    }
> > >   
> > > +void nohz_clean_sd_state(int cpu) {
> > > +       struct rq *rq = cpu_rq(cpu);
> > > +
> > > +       rq->nohz_tick_stopped = 0;
> > > +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> > > +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > > +               atomic_dec(&nohz.nr_cpus);
> > > +       }
> > > +       set_cpu_sd_state_idle(cpu);
> > > +}
> > > +
> >
> > detach_destroy_domains
> >         cpu_attach_domain
> >                 update_top_cache_domain
> >
> > as we clears per_cpu(sd_llc, cpu) for the isolated cpu in
> > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op
> > here,
> > no?
>
> Yes you're right, cpu_attach_domain() and nohz_clean_sd_state() calls
> have to be inverted to avoid what you just described.
>
> It also seems that the current kernel doesn't decrease nr_busy_cpus
> when putting CPUs in an isolated partition. Indeed if a CPU is
> counted
> in nr_busy_cpus, putting the CPU in an isolated partition doesn't
> trigger
> any call to set_cpu_sd_state_idle().
> So it might an additional argument.
>
> Thanks for reading the patch,
> Regards,
> Pierre
>
> >
> > thanks,
> > rui
> > >    /*
> > >     * This routine will record that the CPU is going idle with
> > > tick
> > > stopped.
> > >     * This info will be used in performing idle load balancing in
> > > the
> > > future.
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d3a3b2646ec4..d31137b5f0ce 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const
> > > struct cpumask *cpu_map)
> > >                 
> > > static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
> > >   
> > >           rcu_read_lock();
> > > -       for_each_cpu(i, cpu_map)
> > > +       for_each_cpu(i, cpu_map) {
> > >                   cpu_attach_domain(NULL, &def_root_domain, i);
> > > +               nohz_clean_sd_state(i);
> > > +       }
> > >           rcu_read_unlock();
> > >    }
> > >
> > > --- snip ---
> > >
> > > Regards,
> > > Pierre
> > >
> > > >
> > > > >
> > > > > > +       }
> > > > > > +
> > > > > >            /*
> > > > > >             * The tick is still stopped but load could have
> > > > > > been
> > > > > > added in the
> > > > > >             * meantime. We set the nohz.has_blocked flag to
> > > > > > trig
> > > > > > a
> > > > > > check of the
> > > > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int
> > > > > > cpu)
> > > > > >            if (rq->nohz_tick_stopped)
> > > > > >                    goto out;
> > > > > > -       /* If we're a completely isolated CPU, we don't
> > > > > > play:
> > > > > > */
> > > > > > -       if (on_null_domain(rq))
> > > > > > -               return;
> > > > > > -
> > > > > >            rq->nohz_tick_stopped = 1;
> > > > > >            cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > > > >
> > > > > > Otherwise I could reproduce the issue and the patch was
> > > > > > solving
> > > > > > it,
> > > > > > so:
> > > > > > Tested-by: Pierre Gondois <[email protected]>
> > > >
> > > > Thanks for testing, really appreciated!
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Also, your patch doesn't aim to solve that, but I think
> > > > > > there
> > > > > > is an
> > > > > > issue
> > > > > > when updating cpuset.cpus when an isolated partition was
> > > > > > already
> > > > > > created:
> > > > > >
> > > > > > // Create an isolated partition containing CPU0
> > > > > > # mkdir cgroup
> > > > > > # mount -t cgroup2 none cgroup/
> > > > > > # mkdir cgroup/Testing
> > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > > > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > > > >
> > > > > > // CPU0's sched domain is detached:
> > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > domain0  domain1
> > > > > >
> > > > > > // Change the isolated partition to be CPU1
> > > > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > > > >
> > > > > > // CPU[0-1] sched domains are not updated:
> > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > domain0  domain1
> > > > > >
> > > > Interesting. Let me check and get back to you later on this. :)
> > > >
> > > > thanks,
> > > > rui
> >

2023-09-22 16:15:29

by Pierre Gondois

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hi Rui,

On 9/20/23 09:24, Zhang, Rui wrote:
> Hi, Pierre,
>
> Sorry for the late response. I'm still ramping up on the related code.

No worries, I also need to read the code,

>
> On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote:
>>
>>
>> On 9/14/23 11:23, Zhang, Rui wrote:
>>> Hi, Pierre,
>>>
>>>>
>>>> Yes right indeed,
>>>> This happens when putting a CPU offline (as you mentioned
>>>> earlier,
>>>> putting a CPU offline clears the CPU in the idle_cpus_mask).
>>>>
>>>> The load balancing related variables
>>>
>>> including?
>>
>> I meant the nohz idle variables in the load balancing, so I was
>> referring to:
>> (struct sched_domain_shared).nr_busy_cpus
>> (struct sched_domain).nohz_idle
>> nohz.idle_cpus_mask
>> nohz.nr_cpus
>> (struct rq).nohz_tick_stopped
>
> IMO, the problem is that, for an isolated CPU,
> 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared)
> 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased)
>
> But current code does not have a third state to describe this, so we
> need to either
> 1. add extra logic, like on_null_domain() checks

I'm not sure I understand: do you mean adding on_null_domain() checks in
addition to the one in the present patch?

> or
> 2. rely on current logic, but update all related variables correctly,
> like you proposed.
>
> But in any case, we should stick with one direction.
>
> If we follow the first one, the original patch should be used, which
> IMO is simple and straight forward.
> If we follow the later one, we'd better audit and remove the current
> on_null_domain() usage at the same time. TBH, I'm not confident enough

Here as well, I'm not sure whether you are referring to the
on_null_domain() call in the present patch or to the on_null_domain()
calls in fair.c?

Regards,
Pierre

> to make such a change. But if you want to propose something, I'd glad
> to test it.
>
> thanks,
> rui
>

2023-11-15 20:03:12

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hi Rui,

On Wed, 20 Sept 2023 at 09:24, Zhang, Rui <[email protected]> wrote:
>
> Hi, Pierre,
>
> Sorry for the late response. I'm still ramping up on the related code.
>
> On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote:
> >
> >
> > On 9/14/23 11:23, Zhang, Rui wrote:
> > > Hi, Pierre,
> > >
> > > >
> > > > Yes right indeed,
> > > > This happens when putting a CPU offline (as you mentioned
> > > > earlier,
> > > > putting a CPU offline clears the CPU in the idle_cpus_mask).
> > > >
> > > > The load balancing related variables
> > >
> > > including?
> >
> > I meant the nohz idle variables in the load balancing, so I was
> > referring to:
> > (struct sched_domain_shared).nr_busy_cpus
> > (struct sched_domain).nohz_idle
> > nohz.idle_cpus_mask
> > nohz.nr_cpus
> > (struct rq).nohz_tick_stopped
>
> IMO, the problem is that, for an isolated CPU,
> 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared)
> 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased)
>
> But current code does not have a third state to describe this, so we
> need to either
> 1. add extra logic, like on_null_domain() checks
> or
> 2. rely on current logic, but update all related variables correctly,
> like you proposed.

Isn't the housekeeping cpu mask there to manage such a case? I was
expecting that your isolated cpu would be cleared from the
housekeeping cpumask used by the scheduler and the ILB.

I think the hint for your solution is in this comment of the find_new_ilb() function:
"
* - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED is not set
* anywhere yet.
"

IMO, you should look at enabling and using HK_TYPE_SCHED for isolated CPUs.
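
A rough sketch of that direction (hypothetical, since HK_TYPE_SCHED is not
populated anywhere yet, as the comment says): isolated CPUs would be
cleared from housekeeping_cpumask(HK_TYPE_SCHED) when the isolated
partition is created, and find_new_ilb() would filter on that mask instead
of HK_TYPE_MISC:

	/* Hypothetical: HK_TYPE_SCHED excludes isolated cpus. */
	hk_mask = housekeeping_cpumask(HK_TYPE_SCHED);
	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
		if (ilb == smp_processor_id())
			continue;
		if (idle_cpu(ilb))
			return ilb;
	}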

CCed Frederic to get his opinion

>
> But in any case, we should stick with one direction.
>
> If we follow the first one, the original patch should be used, which
> IMO is simple and straight forward.
> If we follow the later one, we'd better audit and remove the current
> on_null_domain() usage at the same time. TBH, I'm not confident enough
> to make such a change. But if you want to propose something, I'd glad
> to test it.
>
> thanks,
> rui
>
> >
> > >
> > > > are unused if a CPU has a NULL
> > > > rq as it cannot pull any task. Ideally we should clear them once,
> > > > when attaching a NULL sd to the CPU.
> > >
> > > This sounds good to me. But TBH, I don't have enough confidence to
> > > do
> > > so because I'm not crystal clear about how these variables are
> > > used.
> > >
> > > Some questions about the code below.
> > > >
> > > > The following snipped should do that and solve the issue you
> > > > mentioned:
> > > > --- snip ---
> > > > --- a/include/linux/sched/nohz.h
> > > > +++ b/include/linux/sched/nohz.h
> > > > @@ -9,8 +9,10 @@
> > > > #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> > > > extern void nohz_balance_enter_idle(int cpu);
> > > > extern int get_nohz_timer_target(void);
> > > > +extern void nohz_clean_sd_state(int cpu);
> > > > #else
> > > > static inline void nohz_balance_enter_idle(int cpu) { }
> > > > +static inline void nohz_clean_sd_state(int cpu) { }
> > > > #endif
> > > >
> > > > #ifdef CONFIG_NO_HZ_COMMON
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index b3e25be58e2b..6fcabe5d08f5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq
> > > > *rq)
> > > > {
> > > > SCHED_WARN_ON(rq != this_rq());
> > > >
> > > > + if (on_null_domain(rq))
> > > > + return;
> > > > +
> > > > if (likely(!rq->nohz_tick_stopped))
> > > > return;
> > > >
> > > if we force clearing rq->nohz_tick_stopped when detaching domain,
> > > why
> > > bother adding the first check?
> >
> > Yes you're right. I added this check for safety, but this is not
> > mandatory.
> >
> > >
> > > >
> > > > @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int
> > > > cpu)
> > > > rcu_read_unlock();
> > > > }
> > > >
> > > > +void nohz_clean_sd_state(int cpu) {
> > > > + struct rq *rq = cpu_rq(cpu);
> > > > +
> > > > + rq->nohz_tick_stopped = 0;
> > > > + if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> > > > + cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > > > + atomic_dec(&nohz.nr_cpus);
> > > > + }
> > > > + set_cpu_sd_state_idle(cpu);
> > > > +}
> > > > +
> > >
> > > detach_destroy_domains
> > > cpu_attach_domain
> > > update_top_cache_domain
> > >
> > > as we clears per_cpu(sd_llc, cpu) for the isolated cpu in
> > > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op
> > > here,
> > > no?
> >
> > Yes you're right, cpu_attach_domain() and nohz_clean_sd_state() calls
> > have to be inverted to avoid what you just described.
> >
> > It also seems that the current kernel doesn't decrease nr_busy_cpus
> > when putting CPUs in an isolated partition. Indeed if a CPU is
> > counted
> > in nr_busy_cpus, putting the CPU in an isolated partition doesn't
> > trigger
> > any call to set_cpu_sd_state_idle().
> > So it might be an additional argument.
> >
> > Thanks for reading the patch,
> > Regards,
> > Pierre
> >
> > >
> > > thanks,
> > > rui
> > > > /*
> > > > * This routine will record that the CPU is going idle with
> > > > tick
> > > > stopped.
> > > > * This info will be used in performing idle load balancing in
> > > > the
> > > > future.
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index d3a3b2646ec4..d31137b5f0ce 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const
> > > > struct cpumask *cpu_map)
> > > >
> > > > static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
> > > >
> > > > rcu_read_lock();
> > > > - for_each_cpu(i, cpu_map)
> > > > + for_each_cpu(i, cpu_map) {
> > > > cpu_attach_domain(NULL, &def_root_domain, i);
> > > > + nohz_clean_sd_state(i);
> > > > + }
> > > > rcu_read_unlock();
> > > > }
> > > >
> > > > --- snip ---
> > > >
> > > > Regards,
> > > > Pierre
> > > >
> > > > >
> > > > > >
> > > > > > > + }
> > > > > > > +
> > > > > > > /*
> > > > > > > * The tick is still stopped but load could have
> > > > > > > been
> > > > > > > added in the
> > > > > > > * meantime. We set the nohz.has_blocked flag to
> > > > > > > trig
> > > > > > > a
> > > > > > > check of the
> > > > > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int
> > > > > > > cpu)
> > > > > > > if (rq->nohz_tick_stopped)
> > > > > > > goto out;
> > > > > > > - /* If we're a completely isolated CPU, we don't
> > > > > > > play:
> > > > > > > */
> > > > > > > - if (on_null_domain(rq))
> > > > > > > - return;
> > > > > > > -
> > > > > > > rq->nohz_tick_stopped = 1;
> > > > > > > cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > > > > >
> > > > > > > Otherwise I could reproduce the issue and the patch was
> > > > > > > solving
> > > > > > > it,
> > > > > > > so:
> > > > > > > Tested-by: Pierre Gondois <[email protected]>
> > > > >
> > > > > Thanks for testing, really appreciated!
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Also, your patch doesn't aim to solve that, but I think
> > > > > > > there
> > > > > > > is an
> > > > > > > issue
> > > > > > > when updating cpuset.cpus when an isolated partition was
> > > > > > > already
> > > > > > > created:
> > > > > > >
> > > > > > > // Create an isolated partition containing CPU0
> > > > > > > # mkdir cgroup
> > > > > > > # mount -t cgroup2 none cgroup/
> > > > > > > # mkdir cgroup/Testing
> > > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > > > > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > > > > >
> > > > > > > // CPU0's sched domain is detached:
> > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > domain0 domain1
> > > > > > >
> > > > > > > // Change the isolated partition to be CPU1
> > > > > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > > > > >
> > > > > > > // CPU[0-1] sched domains are not updated:
> > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > domain0 domain1
> > > > > > >
> > > > > Interesting. Let me check and get back to you later on this. :)
> > > > >
> > > > > thanks,
> > > > > rui
> > >
>

2023-11-16 07:32:31

by Zhang, Rui

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hi, Vincent,

Really appreciate your comments on this.

On Wed, 2023-11-15 at 21:01 +0100, Vincent Guittot wrote:
> Hi Rui,
>
> On Wed, 20 Sept 2023 at 09:24, Zhang, Rui <[email protected]>
> wrote:
> >
> > Hi, Pierre,
> >
> > Sorry for the late response. I'm still ramping up on the related
> > code.
> >
> > On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote:
> > >
> > >
> > > On 9/14/23 11:23, Zhang, Rui wrote:
> > > > Hi, Pierre,
> > > >
> > > > >
> > > > > Yes right indeed,
> > > > > This happens when putting a CPU offline (as you mentioned
> > > > > earlier,
> > > > > putting a CPU offline clears the CPU in the idle_cpus_mask).
> > > > >
> > > > > The load balancing related variables
> > > >
> > > > including?
> > >
> > > I meant the nohz idle variables in the load balancing, so I was
> > > referring to:
> > > (struct sched_domain_shared).nr_busy_cpus
> > > (struct sched_domain).nohz_idle
> > > nohz.idle_cpus_mask
> > > nohz.nr_cpus
> > > (struct rq).nohz_tick_stopped
> >
> > IMO, the problem is that, for an isolated CPU,
> > 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared)
> > 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased)
> >
> > But current code does not have a third state to describe this, so
> > we
> > need to either
> > 1. add extra logic, like on_null_domain() checks
> > or
> > 2. rely on current logic, but update all related variables
> > correctly,
> > like you proposed.
>
> Isn't the housekeeping cpu mask there to manage such a case?

This is true for CPUs isolated with the "nohz_full=" boot option.

But for CPUs in the cgroup isolated partition, the housekeeping cpumask
is not updated.

I don't know if this is intended or not.
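
To illustrate the difference, a minimal sketch using the
sched/isolation.h helpers (not part of the patch, only meant to show
the current behaviour):

--- snip ---
	/*
	 * "nohz_full=" removes its CPUs from the HK_TYPE_MISC
	 * housekeeping mask at boot, so find_new_ilb() never picks
	 * them.  A CPU moved into a cgroup isolated partition is
	 * still reported as a housekeeping CPU and thus stays an
	 * ILB candidate.
	 */
	if (housekeeping_test_cpu(0, HK_TYPE_MISC))
		pr_info("cpu0 is still a HK_TYPE_MISC housekeeping cpu\n");
--- snip ---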

> I was
> expecting that your isolated cpu should be cleared from the
> housekeeping cpumask used by scheduler and ILB

This patch is a direct fix for the case I found, where the isolated CPUs
are woken up by this piece of code.

>
> I think that your solution is in the comment of the find_new_ilb()
> function:
> "
>  * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED
> is not set
>  *   anywhere yet.
> "
>
> IMO, you should look at enabling and using HK_TYPE_SCHED for
> isolated CPUs

yeah, this seems reasonable.

I'm new to cgroup and I'm not sure what the proper behavior should be
for CPUs in an isolated partition.
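
If we go that route, IIUC the cpuset side would also have to take the
CPUs of an isolated partition out of the HK_TYPE_SCHED housekeeping
mask when the partition is created, and add them back when it is
destroyed. Purely hypothetical sketch below: the housekeeping cpumasks
are only set up at boot today, so the update helper used here does not
exist.

--- snip ---
/*
 * Hypothetical sketch, not working code: there is currently no API to
 * update the housekeeping cpumasks at runtime.
 */
static void isolated_partition_update_hk(const struct cpumask *isolated_cpus)
{
	/*
	 * Drop the isolated CPUs from HK_TYPE_SCHED so that ILB
	 * selection and other scheduler paths skip them.
	 */
	housekeeping_clear_cpus(isolated_cpus, HK_TYPE_SCHED); /* does not exist today */
}
--- snip ---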

>
> CCed Frederic to get his opinion

Thanks.

-rui


>
> > But in any case, we should stick with one direction.
> >
> > If we follow the first one, the original patch should be used,
> > which
> > IMO is simple and straight forward.
> > If we follow the latter one, we'd better audit and remove the
> > current
> > on_null_domain() usage at the same time. TBH, I'm not confident
> > enough
> > to make such a change. But if you want to propose something, I'd be
> > glad to test it.
> >
> > thanks,
> > rui
> >
> > >
> > > >
> > > > >   are unused if a CPU has a NULL
> > > > > sd attached, as it cannot pull any task. Ideally we should clear them
> > > > > once,
> > > > > when attaching a NULL sd to the CPU.
> > > >
> > > > This sounds good to me. But TBH, I don't have enough confidence
> > > > to
> > > > do
> > > > so because I'm not crystal clear about how these variables are
> > > > used.
> > > >
> > > > Some questions about the code below.
> > > > >
> > > > > The following snipped should do that and solve the issue you
> > > > > mentioned:
> > > > > --- snip ---
> > > > > --- a/include/linux/sched/nohz.h
> > > > > +++ b/include/linux/sched/nohz.h
> > > > > @@ -9,8 +9,10 @@
> > > > >    #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> > > > >    extern void nohz_balance_enter_idle(int cpu);
> > > > >    extern int get_nohz_timer_target(void);
> > > > > +extern void nohz_clean_sd_state(int cpu);
> > > > >    #else
> > > > >    static inline void nohz_balance_enter_idle(int cpu) { }
> > > > > +static inline void nohz_clean_sd_state(int cpu) { }
> > > > >    #endif
> > > > >
> > > > >    #ifdef CONFIG_NO_HZ_COMMON
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index b3e25be58e2b..6fcabe5d08f5 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq
> > > > > *rq)
> > > > >    {
> > > > >           SCHED_WARN_ON(rq != this_rq());
> > > > >
> > > > > +       if (on_null_domain(rq))
> > > > > +               return;
> > > > > +
> > > > >           if (likely(!rq->nohz_tick_stopped))
> > > > >                   return;
> > > > >
> > > > if we force clearing rq->nohz_tick_stopped when detaching
> > > > domain,
> > > > why
> > > > bother adding the first check?
> > >
> > > Yes you're right. I added this check for safety, but this is not
> > > mandatory.
> > >
> > > >
> > > > >
> > > > > @@ -11551,6 +11554,17 @@ static void
> > > > > set_cpu_sd_state_idle(int
> > > > > cpu)
> > > > >           rcu_read_unlock();
> > > > >    }
> > > > >
> > > > > +void nohz_clean_sd_state(int cpu) {
> > > > > +       struct rq *rq = cpu_rq(cpu);
> > > > > +
> > > > > +       rq->nohz_tick_stopped = 0;
> > > > > +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> > > > > +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > > > > +               atomic_dec(&nohz.nr_cpus);
> > > > > +       }
> > > > > +       set_cpu_sd_state_idle(cpu);
> > > > > +}
> > > > > +
> > > >
> > > > detach_destroy_domains
> > > >         cpu_attach_domain
> > > >                 update_top_cache_domain
> > > >
> > > > as we clear per_cpu(sd_llc, cpu) for the isolated cpu in
> > > > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-
> > > > op
> > > > here,
> > > > no?
> > >
> > > Yes you're right, cpu_attach_domain() and nohz_clean_sd_state()
> > > calls
> > > have to be inverted to avoid what you just described.
> > >
> > > It also seems that the current kernel doesn't decrease
> > > nr_busy_cpus
> > > when putting CPUs in an isolated partition. Indeed if a CPU is
> > > counted
> > > in nr_busy_cpus, putting the CPU in an isolated partition doesn't
> > > trigger
> > > any call to set_cpu_sd_state_idle().
> > > So it might be an additional argument.
> > >
> > > Thanks for reading the patch,
> > > Regards,
> > > Pierre
> > >
> > > >
> > > > thanks,
> > > > rui
> > > > >    /*
> > > > >     * This routine will record that the CPU is going idle
> > > > > with
> > > > > tick
> > > > > stopped.
> > > > >     * This info will be used in performing idle load
> > > > > balancing in
> > > > > the
> > > > > future.
> > > > > diff --git a/kernel/sched/topology.c
> > > > > b/kernel/sched/topology.c
> > > > > index d3a3b2646ec4..d31137b5f0ce 100644
> > > > > --- a/kernel/sched/topology.c
> > > > > +++ b/kernel/sched/topology.c
> > > > > @@ -2584,8 +2584,10 @@ static void
> > > > > detach_destroy_domains(const
> > > > > struct cpumask *cpu_map)
> > > > >
> > > > > static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
> > > > >
> > > > >           rcu_read_lock();
> > > > > -       for_each_cpu(i, cpu_map)
> > > > > +       for_each_cpu(i, cpu_map) {
> > > > >                   cpu_attach_domain(NULL, &def_root_domain,
> > > > > i);
> > > > > +               nohz_clean_sd_state(i);
> > > > > +       }
> > > > >           rcu_read_unlock();
> > > > >    }
> > > > >
> > > > > --- snip ---
> > > > >
> > > > > Regards,
> > > > > Pierre
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > +       }
> > > > > > > > +
> > > > > > > >            /*
> > > > > > > >             * The tick is still stopped but load could
> > > > > > > > have
> > > > > > > > been
> > > > > > > > added in the
> > > > > > > >             * meantime. We set the nohz.has_blocked
> > > > > > > > flag to
> > > > > > > > trig
> > > > > > > > a
> > > > > > > > check of the
> > > > > > > > @@ -11585,10 +11609,6 @@ void
> > > > > > > > nohz_balance_enter_idle(int
> > > > > > > > cpu)
> > > > > > > >            if (rq->nohz_tick_stopped)
> > > > > > > >                    goto out;
> > > > > > > > -       /* If we're a completely isolated CPU, we don't
> > > > > > > > play:
> > > > > > > > */
> > > > > > > > -       if (on_null_domain(rq))
> > > > > > > > -               return;
> > > > > > > > -
> > > > > > > >            rq->nohz_tick_stopped = 1;
> > > > > > > >            cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > > > > > >
> > > > > > > > Otherwise I could reproduce the issue and the patch was
> > > > > > > > solving
> > > > > > > > it,
> > > > > > > > so:
> > > > > > > > Tested-by: Pierre Gondois <[email protected]>
> > > > > >
> > > > > > Thanks for testing, really appreciated!
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Also, your patch doesn't aim to solve that, but I think
> > > > > > > > there
> > > > > > > > is an
> > > > > > > > issue
> > > > > > > > when updating cpuset.cpus when an isolated partition
> > > > > > > > was
> > > > > > > > already
> > > > > > > > created:
> > > > > > > >
> > > > > > > > // Create an isolated partition containing CPU0
> > > > > > > > # mkdir cgroup
> > > > > > > > # mount -t cgroup2 none cgroup/
> > > > > > > > # mkdir cgroup/Testing
> > > > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > > > > > # echo "+cpuset" >
> > > > > > > > cgroup/Testing/cgroup.subtree_control
> > > > > > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > > > > > >
> > > > > > > > // CPU0's sched domain is detached:
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > > domain0  domain1
> > > > > > > >
> > > > > > > > // Change the isolated partition to be CPU1
> > > > > > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > > > > > >
> > > > > > > > // CPU[0-1] sched domains are not updated:
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > > domain0  domain1
> > > > > > > >
> > > > > > Interesting. Let me check and get back to you later on
> > > > > > this. :)
> > > > > >
> > > > > > thanks,
> > > > > > rui
> > > >
> >