2024-04-03 15:09:33

by Pierre Gondois

Subject: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs

Zhang Rui reported that find_new_ilb() was iterating over CPUs in
isolated cgroup partitions. This triggered spurious wakeups for
these CPUs. [1]
The initial approach was to ignore CPUs on NULL sched domains, as
isolated CPUs have a NULL sched domain. However, a CPU:
- with its tick disabled, and thus accounted for in
  nohz.[idle_cpus_mask|nr_cpus]
- which is then placed in an isolated cgroup partition
will never update nohz.[idle_cpus_mask|nr_cpus] again.

To avoid that, the following variables should be cleared
when a CPU is placed in an isolated cgroup partition:
- nohz.idle_cpus_mask
- nohz.nr_cpus
- rq->nohz_tick_stopped
This avoids taking stale nohz.* values into account during
idle load balancing.

As suggested in [2], and to avoid calling nohz_balance_[enter|exit]_idle()
from a remote CPU (which would create concurrency issues), leverage the
existing housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e.
CPUs on NULL sched domains).
Indeed, the HK_TYPE_SCHED mask is currently never set by the
isolcpus/nohz_full kernel parameters, so it defaults to cpu_online_mask.
Moreover, its current usage fits CPUs that are isolated and should
not take part in load balancing.

Making use of HK_TYPE_SCHED for this purpose implies creating a
housekeeping mask which can be modified at runtime.
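For reference, the interface this implies looks roughly as follows. The
HKR_TYPE_SCHED / housekeeping_runtime_*() names are the ones used in this
series, but the bodies below are only an illustrative sketch (the default
"all CPUs are housekeeping" initialization and any locking are omitted),
not the actual patch 1/7 implementation:

#include <linux/cpumask.h>

/* Illustrative sketch only -- see patch 1/7 for the real implementation. */
enum hkr_type {
	HKR_TYPE_SCHED,
	HKR_TYPE_MAX
};

/* One runtime-modifiable cpumask per type. */
static struct cpumask housekeeping_runtime_masks[HKR_TYPE_MAX];

void housekeeping_runtime_set_cpu(int cpu, enum hkr_type type)
{
	cpumask_set_cpu(cpu, &housekeeping_runtime_masks[type]);
}

void housekeeping_runtime_clear_cpu(int cpu, enum hkr_type type)
{
	cpumask_clear_cpu(cpu, &housekeeping_runtime_masks[type]);
}

bool housekeeping_runtime_test_cpu(int cpu, enum hkr_type type)
{
	return cpumask_test_cpu(cpu, &housekeeping_runtime_masks[type]);
}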

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/

Pierre Gondois (7):
  sched/isolation: Introduce housekeeping_runtime isolation
  sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
  sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
  sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
  sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
  sched/fair: Remove on_null_domain() and redundant checks
  sched/fair: Clear idle_cpus_mask for CPUs with NULL sd

 include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
 include/linux/sched/nohz.h      |  2 ++
 kernel/sched/fair.c             | 44 +++++++++++++++++-------------
 kernel/sched/isolation.c        | 48 ++++++++++++++++++++++++++++++++-
 kernel/sched/topology.c         |  7 +++++
 5 files changed, 110 insertions(+), 21 deletions(-)

--
2.25.1



2024-04-03 15:10:53

by Pierre Gondois

Subject: [PATCH 7/7] sched/fair: Clear idle_cpus_mask for CPUs with NULL sd

As reported in [1], an isolated CPU keeps the values of:
- rq->nohz_tick_stopped
- nohz.idle_cpus_mask
- nohz.nr_cpus
when a NULL sd is attached to the CPU. Clear these values.

[1] https://lore.kernel.org/all/[email protected]/

Signed-off-by: Pierre Gondois <[email protected]>
---
 include/linux/sched/nohz.h |  2 ++
 kernel/sched/fair.c        | 11 +++++++++++
 kernel/sched/topology.c    |  6 ++++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
index 6d67e9a5af6b..18e620715c9d 100644
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -9,8 +9,10 @@
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern int get_nohz_timer_target(void);
+extern void nohz_clear_state(int cpu);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
+static inline void nohz_clear_state(int cpu) { }
 #endif

#ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9657c8f2176b..6786d4d78e41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12014,6 +12014,17 @@ static void nohz_balancer_kick(struct rq *rq)
 	kick_ilb(flags);
 }
 
+void nohz_clear_state(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (rq->nohz_tick_stopped) {
+		rq->nohz_tick_stopped = 0;
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_dec(&nohz.nr_cpus);
+	}
+}
+
 static void set_cpu_sd_state_busy(int cpu)
 {
 	struct sched_domain *sd;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b4fc212ccfb0..e8e40b7d964b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -775,10 +775,12 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)

 	sched_domain_debug(sd, cpu);
 
-	if (sd)
+	if (sd) {
 		housekeeping_runtime_set_cpu(cpu, HKR_TYPE_SCHED);
-	else
+	} else {
 		housekeeping_runtime_clear_cpu(cpu, HKR_TYPE_SCHED);
+		nohz_clear_state(cpu);
+	}
 
 	rq_attach_root(rq, rd);
 	tmp = rq->sd;
--
2.25.1


2024-04-03 15:39:33

by Pierre Gondois

Subject: [PATCH 6/7] sched/fair: Remove on_null_domain() and redundant checks

CPUs with a NULL sched domain are removed from the HKR_TYPE_SCHED
isolation mask. The following two checks are equivalent:
- !housekeeping_runtime_test_cpu(cpu, HKR_TYPE_SCHED)
- on_null_domain(rq)

Remove on_null_domain() and the redundant checks.

Signed-off-by: Pierre Gondois <[email protected]>
---
kernel/sched/fair.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e0f2a0f153f..9657c8f2176b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11830,11 +11830,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)

 }
 
-static inline int on_null_domain(struct rq *rq)
-{
-	return unlikely(!rcu_dereference_sched(rq->sd));
-}
-
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * NOHZ idle load balancing (ILB) details:
@@ -12040,7 +12035,7 @@ void nohz_balance_exit_idle(struct rq *rq)
 	SCHED_WARN_ON(rq != this_rq());
 
 	/* If we're a completely isolated CPU, we don't play: */
-	if (on_null_domain(rq))
+	if (!housekeeping_runtime_test_cpu(cpu_of(rq), HKR_TYPE_SCHED))
 		return;
 
 	if (likely(!rq->nohz_tick_stopped))
@@ -12090,12 +12085,8 @@ void nohz_balance_enter_idle(int cpu)
 	 */
 	rq->has_blocked_load = 1;
 
-	/* Spare idle load balancing on CPUs that don't want to be disturbed: */
-	if (!housekeeping_runtime_test_cpu(cpu, HKR_TYPE_SCHED))
-		return;
-
 	/* If we're a completely isolated CPU, we don't play: */
-	if (on_null_domain(rq))
+	if (!housekeeping_runtime_test_cpu(cpu, HKR_TYPE_SCHED))
 		return;
 
 	/*
@@ -12504,11 +12495,14 @@ static __latent_entropy void sched_balance_softirq(struct softirq_action *h)
  */
 void sched_balance_trigger(struct rq *rq)
 {
+	int cpu = cpu_of(rq);
+
 	/*
 	 * Don't need to rebalance while attached to NULL domain or
 	 * runqueue CPU is not active
 	 */
-	if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
+	if (unlikely(!housekeeping_runtime_test_cpu(cpu, HKR_TYPE_SCHED)) ||
+	    !cpu_active(cpu))
 		return;
 
 	if (time_after_eq(jiffies, rq->next_balance))
--
2.25.1


2024-04-04 03:01:20

by Waiman Long

Subject: Re: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs

On 4/3/24 11:05, Pierre Gondois wrote:
> Zhang Rui reported that find_new_ilb() was iterating over CPUs in
> isolated cgroup partitions. This triggered spurious wakeups for
> theses CPUs. [1]
> The initial approach was to ignore CPUs on NULL sched domains, as
> isolated CPUs have a NULL sched domain. However a CPU:
> - with its tick disabled, so taken into account in
> nohz.[idle_cpus_mask|nr_cpus]
> - which is placed in an isolated cgroup partition
> will never update nohz.[idle_cpus_mask|nr_cpus] again.
>
> To avoid that, the following variables should be cleared
> when a CPU is placed in an isolated cgroup partition:
> - nohz.idle_cpus_mask
> - nohz.nr_cpus
> - rq->nohz_tick_stopped
> This would allow to avoid considering wrong nohz.* values during
> idle load balance.
>
> As suggested in [2] and to avoid calling nohz_balance_[enter|exit]_idle()
> from a remote CPU and create concurrency issues, leverage the existing
> housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
> sched domains).
> Indeed the HK_TYPE_SCHED mask is currently never set by the
> isolcpus/nohz_full kernel parameters, so it defaults to cpu_online_mask.
> Plus it's current usage fits CPUs that are isolated and should
> not take part in load balancing.
>
> Making use of HK_TYPE_SCHED for this purpose implies creating a
> housekeeping mask which can be modified at runtime.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/
>
> Pierre Gondois (7):
> sched/isolation: Introduce housekeeping_runtime isolation
> sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
> sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
> sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
> sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
> sched/fair: Remove on_null_domain() and redundant checks
> sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>
> include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
> include/linux/sched/nohz.h | 2 ++
> kernel/sched/fair.c | 44 +++++++++++++++++-------------
> kernel/sched/isolation.c | 48 ++++++++++++++++++++++++++++++++-
> kernel/sched/topology.c | 7 +++++
> 5 files changed, 110 insertions(+), 21 deletions(-)
>
I had also posted a patch series on excluding isolated CPUs in isolated
partitions from housekeeping cpumasks earlier this year. See

https://lore.kernel.org/lkml/[email protected]/

It took a different approach from this series. It looks like I should
include HK_TYPE_MISC as well.

Cheers,
Longman


2024-04-04 07:28:36

by Peter Zijlstra

Subject: Re: [PATCH 6/7] sched/fair: Remove on_null_domain() and redundant checks

On Wed, Apr 03, 2024 at 05:05:38PM +0200, Pierre Gondois wrote:
> CPUs with a NULL sched domain are removed from the HKR_TYPE_SCHED
> isolation mask. The two following checks are equialent:
> - !housekeeping_runtime_test_cpu(cpu, HKR_TYPE_SCHED)
> - on_null_domain(rq)
>
> Remove on_null_domain() and the redundant checks.
>
> Signed-off-by: Pierre Gondois <[email protected]>
> ---
> kernel/sched/fair.c | 18 ++++++------------
> 1 file changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e0f2a0f153f..9657c8f2176b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11830,11 +11830,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> }
>
> -static inline int on_null_domain(struct rq *rq)
> -{
> - return unlikely(!rcu_dereference_sched(rq->sd));
> -}
> -
> #ifdef CONFIG_NO_HZ_COMMON
> /*
> * NOHZ idle load balancing (ILB) details:
> @@ -12040,7 +12035,7 @@ void nohz_balance_exit_idle(struct rq *rq)
> SCHED_WARN_ON(rq != this_rq());
>
> /* If we're a completely isolated CPU, we don't play: */
> - if (on_null_domain(rq))
> + if (!housekeeping_runtime_test_cpu(cpu_of(rq), HKR_TYPE_SCHED))
> return;
>
> if (likely(!rq->nohz_tick_stopped))

This seems broken, the whole null domain can happen with cpusets, but
this housekeeping nonsense is predicated on CPU_ISOLATION and none of
that is mandatory for CPUSETS.

2024-04-04 13:04:47

by Pierre Gondois

Subject: Re: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs

Hello Waiman,
Thanks for the link, I hadn't seen that patchset previously.

On 4/4/24 05:01, Waiman Long wrote:
> On 4/3/24 11:05, Pierre Gondois wrote:
>> Zhang Rui reported that find_new_ilb() was iterating over CPUs in
>> isolated cgroup partitions. This triggered spurious wakeups for
>> theses CPUs. [1]
>> The initial approach was to ignore CPUs on NULL sched domains, as
>> isolated CPUs have a NULL sched domain. However a CPU:
>> - with its tick disabled, so taken into account in
>> nohz.[idle_cpus_mask|nr_cpus]
>> - which is placed in an isolated cgroup partition
>> will never update nohz.[idle_cpus_mask|nr_cpus] again.
>>
>> To avoid that, the following variables should be cleared
>> when a CPU is placed in an isolated cgroup partition:
>> - nohz.idle_cpus_mask
>> - nohz.nr_cpus
>> - rq->nohz_tick_stopped
>> This would allow to avoid considering wrong nohz.* values during
>> idle load balance.
>>
>> As suggested in [2] and to avoid calling nohz_balance_[enter|exit]_idle()
>> from a remote CPU and create concurrency issues, leverage the existing
>> housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
>> sched domains).
>> Indeed the HK_TYPE_SCHED mask is currently never set by the
>> isolcpus/nohz_full kernel parameters, so it defaults to cpu_online_mask.
>> Plus it's current usage fits CPUs that are isolated and should
>> not take part in load balancing.
>>
>> Making use of HK_TYPE_SCHED for this purpose implies creating a
>> housekeeping mask which can be modified at runtime.
>>
>> [1] https://lore.kernel.org/all/[email protected]/
>> [2] https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/
>>
>> Pierre Gondois (7):
>> sched/isolation: Introduce housekeeping_runtime isolation
>> sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
>> sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
>> sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
>> sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
>> sched/fair: Remove on_null_domain() and redundant checks
>> sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>>
>> include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
>> include/linux/sched/nohz.h | 2 ++
>> kernel/sched/fair.c | 44 +++++++++++++++++-------------
>> kernel/sched/isolation.c | 48 ++++++++++++++++++++++++++++++++-
>> kernel/sched/topology.c | 7 +++++
>> 5 files changed, 110 insertions(+), 21 deletions(-)
>>
> I had also posted a patch series on excluding isolated CPUs in isolated
> partitions from housekeeping cpumasks earlier this year. See
>
> https://lore.kernel.org/lkml/[email protected]/
>
> It took a different approach from this series. It looks like I should
> include HK_TYPE_MISC as well.

The common point between the two patchsets is that find_new_ilb() won't
take isolated CPUs into account.
The present patchset also:
- clears the nohz.[idle_cpus_mask|nr_cpus] variables when a CPU becomes isolated,
  cf. [PATCH 7/7] sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
- tries to clean up/gather the on_null_domain()/HK_TYPE_SCHED/HK_TYPE_MISC
  mask checks, as the HK_TYPE_SCHED/HK_TYPE_MISC masks are currently never
  set.
but it also:
- updates the housekeeping mask from sched/topology.c. It might be better
  to do it from cpuset.c as you did, since the update originally comes from
  there and it is unlikely another place would require updating housekeeping
  CPUs.
  A new housekeeping_runtime type is also created, but I think the way you
  handle updating the housekeeping mask at runtime is better.
- adds a dependency of sched/fair.c on CPU_ISOLATION (cf. the housekeeping_*
  calls), as Peter noted (IIUC) [1].

Should I re-spin the patchset and try to correct those points? Or do you
think this should be done differently?

Regards,
Pierre

[1] https://lore.kernel.org/lkml/[email protected]/

>
> Cheers,
> Longman
>

2024-04-05 00:29:23

by Waiman Long

Subject: Re: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs

On 4/4/24 08:55, Pierre Gondois wrote:
> Hello Waiman,
> Thanks for the link, I didn't see the patchset previously.
>
> On 4/4/24 05:01, Waiman Long wrote:
>> On 4/3/24 11:05, Pierre Gondois wrote:
>>> Zhang Rui reported that find_new_ilb() was iterating over CPUs in
>>> isolated cgroup partitions. This triggered spurious wakeups for
>>> theses CPUs. [1]
>>> The initial approach was to ignore CPUs on NULL sched domains, as
>>> isolated CPUs have a NULL sched domain. However a CPU:
>>> - with its tick disabled, so taken into account in
>>>     nohz.[idle_cpus_mask|nr_cpus]
>>> - which is placed in an isolated cgroup partition
>>> will never update nohz.[idle_cpus_mask|nr_cpus] again.
>>>
>>> To avoid that, the following variables should be cleared
>>> when a CPU is placed in an isolated cgroup partition:
>>> - nohz.idle_cpus_mask
>>> - nohz.nr_cpus
>>> - rq->nohz_tick_stopped
>>> This would allow to avoid considering wrong nohz.* values during
>>> idle load balance.
>>>
>>> As suggested in [2] and to avoid calling
>>> nohz_balance_[enter|exit]_idle()
>>> from a remote CPU and create concurrency issues, leverage the existing
>>> housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
>>> sched domains).
>>> Indeed the HK_TYPE_SCHED mask is currently never set by the
>>> isolcpus/nohz_full kernel parameters, so it defaults to
>>> cpu_online_mask.
>>> Plus it's current usage fits CPUs that are isolated and should
>>> not take part in load balancing.
>>>
>>> Making use of HK_TYPE_SCHED for this purpose implies creating a
>>> housekeeping mask which can be modified at runtime.
>>>
>>> [1]
>>> https://lore.kernel.org/all/[email protected]/
>>> [2]
>>> https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/
>>>
>>> Pierre Gondois (7):
>>>     sched/isolation: Introduce housekeeping_runtime isolation
>>>     sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
>>>     sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
>>>     sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
>>>     sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
>>>     sched/fair: Remove on_null_domain() and redundant checks
>>>     sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>>>
>>>    include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
>>>    include/linux/sched/nohz.h      |  2 ++
>>>    kernel/sched/fair.c             | 44 +++++++++++++++++-------------
>>>    kernel/sched/isolation.c        | 48
>>> ++++++++++++++++++++++++++++++++-
>>>    kernel/sched/topology.c         |  7 +++++
>>>    5 files changed, 110 insertions(+), 21 deletions(-)
>>>
>> I had also posted a patch series on excluding isolated CPUs in isolated
>> partitions from housekeeping cpumasks earlier this year. See
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> It took a different approach from this series. It looks like I should
>> include HK_TYPE_MISC as well.
>
> The common point between the 2 patchset is that find_new_ilb() won't
> take into account isolated CPUs.
> The present patchset also:
> - clears nohz.[idle_cpus_mask|nr_cpus] variable when a CPU becomes
> isolated,
>   cf. [PATCH 7/7] sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
> - tries to clean up/gather on_null_domain()/HK_TYPE_SCHED/HK_TYPE_MISC
>   mask checks, as HK_TYPE_SCHED/HK_TYPE_MISC masks are currently never
>   set.
> but it also:
> - updates the housekeeping mask from sched/topology.c. It might be better
>   to do it from cpuset.c as you did as the update originally comes from
>   here and it is unlikely another place would require updating
> housekeeping
>   CPUs.
>   A new housekeeping_runtime type is also created, but I think the way
> you
>   handle updating housekeeping mask at runtime is better.
> - adds a dependency of sched/fair.c over CPU_ISOLATION (cf.
> housekeeping_*
>   calls), as Peter noted (IIUC) [1].

That is true. Without CONFIG_CPU_ISOLATION, all the housekeeping* calls
are essentially no-ops.

OTOH, without CONFIG_CPU_ISOLATION, a number of isolation capabilities
won't be there. Most distros will have this config option set anyway.

BTW, a number of the HK_TYPE_* are also used at runtime, like
HK_TYPE_TIMER and HK_TYPE_RCU. So it is hard to artificially distinguish
between runtime and boot time.

I don't believe you need to add a direct dependency on
CONFIG_CPU_ISOLATION, but you do have to add any housekeeping check as
an additional check, not as a replacement of the existing check.
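
IOW, keep on_null_domain() and add the housekeeping test on top of it.
Roughly something like the following sketch (not tested), e.g. in
nohz_balance_exit_idle():

	/* If we're a completely isolated CPU, we don't play: */
	if (on_null_domain(rq) ||
	    !housekeeping_runtime_test_cpu(cpu_of(rq), HKR_TYPE_SCHED))
		return;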

Cheers,
Longman


2024-04-05 09:48:29

by Pierre Gondois

Subject: Re: [PATCH 6/7] sched/fair: Remove on_null_domain() and redundant checks

Hello Peter,

On 4/4/24 09:27, Peter Zijlstra wrote:
> On Wed, Apr 03, 2024 at 05:05:38PM +0200, Pierre Gondois wrote:
>> CPUs with a NULL sched domain are removed from the HKR_TYPE_SCHED
>> isolation mask. The two following checks are equialent:
>> - !housekeeping_runtime_test_cpu(cpu, HKR_TYPE_SCHED)
>> - on_null_domain(rq)
>>
>> Remove on_null_domain() and the redundant checks.
>>
>> Signed-off-by: Pierre Gondois <[email protected]>
>> ---
>> kernel/sched/fair.c | 18 ++++++------------
>> 1 file changed, 6 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3e0f2a0f153f..9657c8f2176b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11830,11 +11830,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>
>> }
>>
>> -static inline int on_null_domain(struct rq *rq)
>> -{
>> - return unlikely(!rcu_dereference_sched(rq->sd));
>> -}
>> -
>> #ifdef CONFIG_NO_HZ_COMMON
>> /*
>> * NOHZ idle load balancing (ILB) details:
>> @@ -12040,7 +12035,7 @@ void nohz_balance_exit_idle(struct rq *rq)
>> SCHED_WARN_ON(rq != this_rq());
>>
>> /* If we're a completely isolated CPU, we don't play: */
>> - if (on_null_domain(rq))
>> + if (!housekeeping_runtime_test_cpu(cpu_of(rq), HKR_TYPE_SCHED))
>> return;
>>
>> if (likely(!rq->nohz_tick_stopped))
>
> This seems broken, the whole null domain can happen with cpusets, but
> this housekeeping nonsense is predicated on CPU_ISOLATION and none of
> that is mandatory for CPUSETS.

Ok, right.
I will try to remove this implicit dependency.

Regards,
Pierre

2024-04-09 13:37:51

by Pierre Gondois

Subject: Re: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs



On 4/5/24 02:23, Waiman Long wrote:
> On 4/4/24 08:55, Pierre Gondois wrote:
>> Hello Waiman,
>> Thanks for the link, I didn't see the patchset previously.
>>
>> On 4/4/24 05:01, Waiman Long wrote:
>>> On 4/3/24 11:05, Pierre Gondois wrote:
>>>> Zhang Rui reported that find_new_ilb() was iterating over CPUs in
>>>> isolated cgroup partitions. This triggered spurious wakeups for
>>>> theses CPUs. [1]
>>>> The initial approach was to ignore CPUs on NULL sched domains, as
>>>> isolated CPUs have a NULL sched domain. However a CPU:
>>>> - with its tick disabled, so taken into account in
>>>>     nohz.[idle_cpus_mask|nr_cpus]
>>>> - which is placed in an isolated cgroup partition
>>>> will never update nohz.[idle_cpus_mask|nr_cpus] again.
>>>>
>>>> To avoid that, the following variables should be cleared
>>>> when a CPU is placed in an isolated cgroup partition:
>>>> - nohz.idle_cpus_mask
>>>> - nohz.nr_cpus
>>>> - rq->nohz_tick_stopped
>>>> This would allow to avoid considering wrong nohz.* values during
>>>> idle load balance.
>>>>
>>>> As suggested in [2] and to avoid calling
>>>> nohz_balance_[enter|exit]_idle()
>>>> from a remote CPU and create concurrency issues, leverage the existing
>>>> housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
>>>> sched domains).
>>>> Indeed the HK_TYPE_SCHED mask is currently never set by the
>>>> isolcpus/nohz_full kernel parameters, so it defaults to
>>>> cpu_online_mask.
>>>> Plus it's current usage fits CPUs that are isolated and should
>>>> not take part in load balancing.
>>>>
>>>> Making use of HK_TYPE_SCHED for this purpose implies creating a
>>>> housekeeping mask which can be modified at runtime.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/[email protected]/
>>>> [2]
>>>> https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/
>>>>
>>>> Pierre Gondois (7):
>>>>     sched/isolation: Introduce housekeeping_runtime isolation
>>>>     sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
>>>>     sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
>>>>     sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
>>>>     sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
>>>>     sched/fair: Remove on_null_domain() and redundant checks
>>>>     sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>>>>
>>>>    include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
>>>>    include/linux/sched/nohz.h      |  2 ++
>>>>    kernel/sched/fair.c             | 44 +++++++++++++++++-------------
>>>>    kernel/sched/isolation.c        | 48
>>>> ++++++++++++++++++++++++++++++++-
>>>>    kernel/sched/topology.c         |  7 +++++
>>>>    5 files changed, 110 insertions(+), 21 deletions(-)
>>>>
>>> I had also posted a patch series on excluding isolated CPUs in isolated
>>> partitions from housekeeping cpumasks earlier this year. See
>>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> It took a different approach from this series. It looks like I should
>>> include HK_TYPE_MISC as well.

Yes right, but as noted, if CONFIG_CPUSET is set without CPU_ISOLATION,
adding HK_TYPE_MISC to HOUSEKEEPING_FLAGS in your patchset would have no
effect, right?
The only check that would still be effective is on_null_domain() from [1].
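
(Without CONFIG_CPU_ISOLATION, the housekeeping helpers in
include/linux/sched/isolation.h are compiled out to stubs roughly like the
one below, simplified for illustration, so any HK_TYPE_* test trivially
passes and only the on_null_domain() check remains meaningful:)

static inline bool housekeeping_cpu(int cpu, enum hk_type type)
{
	return true;
}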

>>
>> The common point between the 2 patchset is that find_new_ilb() won't
>> take into account isolated CPUs.
>> The present patchset also:
>> - clears nohz.[idle_cpus_mask|nr_cpus] variable when a CPU becomes
>> isolated,
>>   cf. [PATCH 7/7] sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>> - tries to clean up/gather on_null_domain()/HK_TYPE_SCHED/HK_TYPE_MISC
>>   mask checks, as HK_TYPE_SCHED/HK_TYPE_MISC masks are currently never
>>   set.
>> but it also:
>> - updates the housekeeping mask from sched/topology.c. It might be better
>>   to do it from cpuset.c as you did as the update originally comes from
>>   here and it is unlikely another place would require updating
>> housekeeping
>>   CPUs.
>>   A new housekeeping_runtime type is also created, but I think the way
>> you
>>   handle updating housekeeping mask at runtime is better.
>> - adds a dependency of sched/fair.c over CPU_ISOLATION (cf.
>> housekeeping_*
>>   calls), as Peter noted (IIUC) [1].
>
> That is true. Without CONFIG_CPU_ISOLATION, all the housekeeping* calls
> are essentially no-ops.
>
> OTOH, without CONFIG_CPU_ISOLATION, a number of isolation capabilities
> won't be there. Most distros will have this config option set anyway.
>
> BTW, a number of the HK_TYPE_* are also used at runtime, like
> HK_TYPE_TIMER and HK_TYPE_RCU. So it is hard to artificially distinguish
> between runtime or boot time.
>
> I don't believe you need to add direct dependency on
> CONFIG_CPU_ISOLATION, but you do have to add any housekeeping check as
> an additional check, not as a replacement of the existing check.

(on another topic)

Isolated CPUs currently keep the state they were in when the isolation was
done, i.e. if the tick was stopped when the CPU was added to the
isolated cpumask, then the CPU stays in nohz.idle_cpus_mask forever.
Similarly, if the tick was not stopped, the CPU stays cleared from
nohz.idle_cpus_mask forever.

This patchset also intended to clear isolated CPUs from nohz.idle_cpus_mask
to leave them in a known state. This might not be a good approach.

nohz.idle_cpus_mask is also used in:
nohz_run_idle_balance()
\-_nohz_idle_balance()
  \-for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1)
    \-update_nohz_stats()
This is apparently done to update 'the blocked load of already idle CPUs'.
However, isolated CPUs might not have their blocked load updated as they are:
- currently: randomly part of nohz.idle_cpus_mask.
- after this patch: never part of nohz.idle_cpus_mask.

I am also wondering whether it makes sense for isolated CPUs to update
the blocked load of other CPUs, i.e. whether something like the following
wouldn't be needed:



diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1dd37168da50..9b92700564b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12291,6 +12291,9 @@ void nohz_run_idle_balance(int cpu)
 {
 	unsigned int flags;
 
+	if (on_null_domain(cpu_rq(cpu)))
+		return;
+
 	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
 
 	/*


Regards,
Pierre