2024-06-05 17:19:24

by Waiman Long

Subject: [PATCH-cgroup 0/2] cgroup/cpuset: Fix remote root partition creation problem

While reviewing the generate_sched_domains() function, I found a bug
in generating sched domains for remote non-isolating partitions. After
extending test_cpuset_prs.sh to cover those cases, the bug was confirmed.

The first patch fixes the remote partition sched domain generation
problem and the second patch updates the test.

Waiman Long (2):
cgroup/cpuset: Fix remote root partition creation problem
selftest/cgroup: Dump expected sched-domain data to console

kernel/cgroup/cpuset.c | 55 ++++++++++++++-----
.../selftests/cgroup/test_cpuset_prs.sh | 29 +++++++++-
2 files changed, 68 insertions(+), 16 deletions(-)

--
2.39.3



2024-06-05 17:19:43

by Waiman Long

Subject: [PATCH-cgroup 2/2] selftest/cgroup: Dump expected sched-domain data to console

Unlike the list of isolated CPUs, it is not easy to programmatically
determine what sched domains are being created by the scheduler just
by examining the data in various kernfs filesystems. The easiest way
to get this information is to enable the /sys/kernel/debug/sched/verbose
file so that the information is displayed in the console. This is also
what the test_cpuset_prs.sh script does when the -v flag is given.

It is rather hard to fetch the data from the console and compare it to
the expected result. An easier way is to dump the expected sched-domain
information out to the console so that it can be visually compared
with the actual sched domain data. However, this has to be done manually
by visual inspection and so will only be done once in a while.

Signed-off-by: Waiman Long <[email protected]>
---
.../selftests/cgroup/test_cpuset_prs.sh | 29 +++++++++++++++++--
1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index b5eb1be2248c..c5464ee4e17e 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -161,6 +161,14 @@ test_add_proc()
# T = put a task into cgroup
# O<c>=<v> = Write <v> to CPU online file of <c>
#
+# ECPUs - effective CPUs of cpusets
+# Pstate - partition root state
+# ISOLCPUS - isolated CPUs (<icpus>[,<icpus2>])
+#
+# Note that if there are 2 fields in ISOLCPUS, the first one is for
+# sched-debug matching which includes offline CPUs and single-CPU partitions
+# while the second one is for matching cpuset.cpus.isolated.
+#
SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
TEST_MATRIX=(
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
@@ -233,10 +241,14 @@ TEST_MATRIX=(
A1:P0,A2:P1,A3:P2,B1:P1 2-3"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4 \
A1:P0,A2:P1,A3:P2,B1:P1 2-4,2-3"
+ " C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 . P1 0 A1:0-1,A2:2-3,A3:2-3,B1:4 \
+ A1:P0,A2:P1,A3:P0,B1:P1"
" C0-3:S+ C1-3:S+ C3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:2,A3:3,B1:4 \
A1:P0,A2:P1,A3:P2,B1:P1 2-4,3"
" C0-4:S+ C1-4:S+ C2-4 . X2-4 X2-4:P2 X4:P1 . 0 A1:0-1,A2:2-3,A3:4 \
A1:P0,A2:P2,A3:P1 2-4,2-3"
+ " C0-4:S+ C1-4:S+ C2-4 . X2-4 X2-4:P2 X3-4:P1 . 0 A1:0-1,A2:2,A3:3-4 \
+ A1:P0,A2:P2,A3:P1 2"
" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
. . X5 . . 0 A1:0-4,A2:1-4,A3:2-4 \
A1:P0,A2:P-2,A3:P-1"
@@ -556,14 +568,15 @@ check_cgroup_states()
do
set -- $(echo $CHK | sed -e "s/:/ /g")
CGRP=$1
+ CGRP_DIR=$CGRP
STATE=$2
FILE=
EVAL=$(expr substr $STATE 2 2)
- [[ $CGRP = A2 ]] && CGRP=A1/A2
- [[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+ [[ $CGRP = A2 ]] && CGRP_DIR=A1/A2
+ [[ $CGRP = A3 ]] && CGRP_DIR=A1/A2/A3

case $STATE in
- P*) FILE=$CGRP/cpuset.cpus.partition
+ P*) FILE=$CGRP_DIR/cpuset.cpus.partition
;;
*) echo "Unknown state: $STATE!"
exit 1
@@ -587,6 +600,16 @@ check_cgroup_states()
;;
esac
[[ $EVAL != $VAL ]] && return 1
+
+ #
+ # For root partition, dump sched-domains info to console if
+ # verbose mode set for manual comparison with sched debug info.
+ #
+ [[ $VAL -eq 1 && $VERBOSE -gt 0 ]] && {
+ DOMS=$(cat $CGRP_DIR/cpuset.cpus.effective)
+ [[ -n "$DOMS" ]] &&
+ echo " [$CGRP] sched-domain: $DOMS" > /dev/console
+ }
done
return 0
}
--
2.39.3


2024-06-05 17:19:44

by Waiman Long

Subject: [PATCH-cgroup 1/2] cgroup/cpuset: Fix remote root partition creation problem

Since commit 181c8e091aae ("cgroup/cpuset: Introduce remote partition"),
a remote partition can be created underneath a non-partition root cpuset
as long as its exclusive_cpus are set to distribute exclusive CPUs down
to its children. The generate_sched_domains() function, however, doesn't
take into account this new behavior and hence will fail to create the
sched domain needed for a remote root (non-isolated) partition.

There are two issues related to remote partition support. First of
all, generate_sched_domains() has a fast path that is activated if
root_load_balance is true and top_cpuset.nr_subparts is zero. The
latter condition isn't quite correct for remote partitions as nr_subparts
only counts the local child partitions directly underneath top_cpuset.
There can be no local child partition under top_cpuset even if there are
remote partitions further down the hierarchy. Fix that by checking
whether subpartitions_cpus, which contains the exclusive CPUs allocated
to both local and remote partitions, is empty.

Secondly, the valid partition check for subtree skipping in the csa[]
generation loop isn't enough as a remote partition does not need to
have a partition root parent. Fix this problem by splitting the csa[]
array generation loop of generate_sched_domains() into v1 and v2 specific
parts and checking a cpuset's exclusive_cpus before skipping its subtree
in the v2 case.

Also simplify generate_sched_domains() for cgroup v2 as only
non-isolating partition roots should be included in building the cpuset
array and none of the v1 scheduling attributes other than a different
way to create an isolated partition are supported.

Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition")
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 55 ++++++++++++++++++++++++++++++++----------
1 file changed, 42 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f9b97f65e204..fb71d710a603 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -169,7 +169,7 @@ struct cpuset {
/* for custom sched domain */
int relax_domain_level;

- /* number of valid sub-partitions */
+ /* number of valid local child partitions */
int nr_subparts;

/* partition root state */
@@ -957,13 +957,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
int nslot; /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
bool root_load_balance = is_sched_load_balance(&top_cpuset);
+ bool cgrpv2 = cgroup_subsys_on_dfl(cpuset_cgrp_subsys);

doms = NULL;
dattr = NULL;
csa = NULL;

/* Special case for the 99% of systems with one, full, sched domain */
- if (root_load_balance && !top_cpuset.nr_subparts) {
+ if (root_load_balance && cpumask_empty(subpartitions_cpus)) {
single_root_domain:
ndoms = 1;
doms = alloc_sched_domains(ndoms);
@@ -992,16 +993,18 @@ static int generate_sched_domains(cpumask_var_t **domains,
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
+
+ if (cgrpv2)
+ goto v2;
+
/*
+ * v1:
* Continue traversing beyond @cp iff @cp has some CPUs and
* isn't load balancing. The former is obvious. The
* latter: All child cpusets contain a subset of the
* parent's cpus, so just skip them, and then we call
* update_domain_attr_tree() to calc relax_domain_level of
* the corresponding sched domain.
- *
- * If root is load-balancing, we can skip @cp if it
- * is a subset of the root's effective_cpus.
*/
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
@@ -1009,16 +1012,28 @@ static int generate_sched_domains(cpumask_var_t **domains,
housekeeping_cpumask(HK_TYPE_DOMAIN))))
continue;

- if (root_load_balance &&
- cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
- continue;
-
if (is_sched_load_balance(cp) &&
!cpumask_empty(cp->effective_cpus))
csa[csn++] = cp;

- /* skip @cp's subtree if not a partition root */
- if (!is_partition_valid(cp))
+ /* skip @cp's subtree */
+ pos_css = css_rightmost_descendant(pos_css);
+ continue;
+
+v2:
+ /*
+ * Only valid partition roots that are not isolated and with
+ * non-empty effective_cpus will be saved into csa[].
+ */
+ if ((cp->partition_root_state == PRS_ROOT) &&
+ !cpumask_empty(cp->effective_cpus))
+ csa[csn++] = cp;
+
+ /*
+ * Skip @cp's subtree if not a partition root and has no
+ * exclusive CPUs to be granted to child cpusets.
+ */
+ if (!is_partition_valid(cp) && cpumask_empty(cp->exclusive_cpus))
pos_css = css_rightmost_descendant(pos_css);
}
rcu_read_unlock();
@@ -1072,6 +1087,20 @@ static int generate_sched_domains(cpumask_var_t **domains,
dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr),
GFP_KERNEL);

+ /*
+ * Cgroup v2 doesn't support domain attributes, just set all of them
+ * to SD_ATTR_INIT. Also non-isolating partition root CPUs are a
+ * subset of HK_TYPE_DOMAIN housekeeping CPUs.
+ */
+ if (cgrpv2) {
+ for (i = 0; i < ndoms; i++) {
+ cpumask_copy(doms[i], csa[i]->effective_cpus);
+ if (dattr)
+ dattr[i] = SD_ATTR_INIT;
+ }
+ goto done;
+ }
+
for (nslot = 0, i = 0; i < csn; i++) {
struct cpuset *a = csa[i];
struct cpumask *dp;
@@ -1231,7 +1260,7 @@ static void rebuild_sched_domains_locked(void)
* root should be only a subset of the active CPUs. Since a CPU in any
* partition root could be offlined, all must be checked.
*/
- if (top_cpuset.nr_subparts) {
+ if (!cpumask_empty(subpartitions_cpus)) {
rcu_read_lock();
cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
if (!is_partition_valid(cs)) {
@@ -4575,7 +4604,7 @@ static void cpuset_handle_hotplug(void)
* In the rare case that hotplug removes all the cpus in
* subpartitions_cpus, we assumed that cpus are updated.
*/
- if (!cpus_updated && top_cpuset.nr_subparts)
+ if (!cpus_updated && !cpumask_empty(subpartitions_cpus))
cpus_updated = true;

/* For v1, synchronize cpus_allowed to cpu_active_mask */
--
2.39.3


2024-06-06 13:05:16

by Xavier

Subject: Re: [PATCH-cgroup 1/2] cgroup/cpuset: Fix remote root partition creation problem

Hi Longman,

I have a small question about your new patch.
In cgroup v2, can there be any overlap between valid partition roots and top_cpuset? If not, the section starting with 'restart:' that searches for overlapping cpusets can be skipped for cgroup v2. Otherwise, if there can be overlap, the assignment to 'dom' may need to perform a cpumask_or operation?

Best regards,
Xavier

>----- Original Message -----
>From: Waiman Long <[email protected]>
>To: Tejun Heo <[email protected]>, Zefan Li <[email protected]>, Johannes Weiner <[email protected]>
>Cc: [email protected], [email protected], Xavier <[email protected]>, Waiman Long <[email protected]>
>Subject: [PATCH-cgroup 1/2] cgroup/cpuset: Fix remote root partition creation problem
>Date: 2024-06-06 01:19
>
>@@ -1072,6 +1087,20 @@ static int generate_sched_domains(cpumask_var_t **domains,
> dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr),
> GFP_KERNEL);
>
>+ /*
>+ * Cgroup v2 doesn't support domain attributes, just set all of them
>+ * to SD_ATTR_INIT. Also non-isolating partition root CPUs are a
>+ * subset of HK_TYPE_DOMAIN housekeeping CPUs.
>+ */
>+ if (cgrpv2) {
>+ for (i = 0; i < ndoms; i++) {
>+ cpumask_copy(doms[i], csa[i]->effective_cpus); /*****************************************/
>+ if (dattr)
>+ dattr[i] = SD_ATTR_INIT;
>+ }
>+ goto done;
>+ }
>+

2024-06-11 18:07:53

by Waiman Long

Subject: Re: [PATCH-cgroup 1/2] cgroup/cpuset: Fix remote root partition creation problem

On 6/6/24 09:01, Xavier wrote:
> Hi Longman,
>
> I have a small question about your new patch.
> In cgroup v2, can there be any overlap between valid partition roots and top_cpuset? If not, the section starting with 'restart:' that searches for overlapping cpusets can be skipped for cgroup v2. Otherwise, if there can be overlap, the assignment to 'dom' may need to perform a cpumask_or operation?

In cgroup v2, the top_cpuset is a non-isolating partition root by itself
and its partition root state cannot be changed.

The reason for the introduction of the partition feature in cgroup v2
was to support the creation of a separate sched domain. cgroup v1
supports that indirectly via clever use of the cpuset.sched_load_balance
flag. So by definition, the presence of a non-isolating partition root
defines a new sched domain.

Cheers,
Longman
