2024-05-31 02:49:38

by Xavier

Subject: [PATCH v2] cpuset: Optimize the number of iterations in the scheduling domain construction process

The process of constructing scheduling domains involves multiple loops
and repeated evaluations, leading to numerous redundant and ineffective
assessments that impact code efficiency.

Here, we use Union-Find to optimize the merging of cpumasks. By employing
path compression and union by rank, we effectively reduce the number of
lookups and merge comparisons.

Signed-off-by: Xavier <[email protected]>
---
kernel/cgroup/cpuset.c | 117 +++++++++++++++++++++++------------------
1 file changed, 66 insertions(+), 51 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c12b9fdb2..4bea1c2db 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -891,6 +891,44 @@ static inline int nr_cpusets(void)
return static_key_count(&cpusets_enabled_key.key) + 1;
}

+/*define a union find node struct*/
+struct uf_node {
+ int parent;
+ int rank;
+};
+
+static int find_root(struct uf_node *nodes, int x)
+{
+ int root = x;
+ int parent;
+
+ /*Find the root node and perform path compression at the same time*/
+ while (nodes[root].parent != root) {
+ parent = nodes[root].parent;
+ nodes[root].parent = nodes[parent].parent;
+ root = parent;
+ }
+ return root;
+}
+
+/*Function to merge two sets, using union by rank*/
+static void union_sets(struct uf_node *nodes, int a, int b)
+{
+ int root_a = find_root(nodes, a);
+ int root_b = find_root(nodes, b);
+
+ if (root_a != root_b) {
+ if (nodes[root_a].rank < nodes[root_b].rank) {
+ nodes[root_a].parent = root_b;
+ } else if (nodes[root_a].rank > nodes[root_b].rank) {
+ nodes[root_b].parent = root_a;
+ } else {
+ nodes[root_b].parent = root_a;
+ nodes[root_a].rank++;
+ }
+ }
+}
+
/*
* generate_sched_domains()
*
@@ -950,13 +988,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
struct cpuset *cp; /* top-down scan of cpusets */
struct cpuset **csa; /* array of all cpuset ptrs */
int csn; /* how many cpuset ptrs in csa so far */
- int i, j, k; /* indices for partition finding loops */
+ int i, j; /* indices for partition finding loops */
cpumask_var_t *doms; /* resulting partition; i.e. sched domains */
struct sched_domain_attr *dattr; /* attributes for custom domains */
int ndoms = 0; /* number of sched domains in result */
int nslot; /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
bool root_load_balance = is_sched_load_balance(&top_cpuset);
+ struct uf_node *nodes;

doms = NULL;
dattr = NULL;
@@ -1022,33 +1061,31 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
rcu_read_unlock();

- for (i = 0; i < csn; i++)
- csa[i]->pn = i;
- ndoms = csn;
-
-restart:
- /* Find the best partition (set of sched domains) */
- for (i = 0; i < csn; i++) {
- struct cpuset *a = csa[i];
- int apn = a->pn;
+ nodes = kmalloc_array(csn, sizeof(struct uf_node), GFP_KERNEL);
+ if (!nodes)
+ goto done;

- for (j = 0; j < csn; j++) {
- struct cpuset *b = csa[j];
- int bpn = b->pn;

- if (apn != bpn && cpusets_overlap(a, b)) {
- for (k = 0; k < csn; k++) {
- struct cpuset *c = csa[k];
+ /* Each node is initially its own parent */
+ for (i = 0; i < csn; i++) {
+ nodes[i].parent = i;
+ nodes[i].rank = 0;
+ }

- if (c->pn == bpn)
- c->pn = apn;
- }
- ndoms--; /* one less element */
- goto restart;
- }
+ /* Merge overlapping cpusets */
+ for (i = 0; i < csn; i++) {
+ for (j = i + 1; j < csn; j++) {
+ if (cpusets_overlap(csa[i], csa[j]))
+ union_sets(nodes, i, j);
}
}

+ /* Calculate the number of domains after merging */
+ for (i = 0; i < csn; i++) {
+ if (nodes[i].parent == i)
+ ndoms++;
+ }
+
/*
* Now we know how many domains to create.
* Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
@@ -1065,47 +1102,25 @@ static int generate_sched_domains(cpumask_var_t **domains,
GFP_KERNEL);

for (nslot = 0, i = 0; i < csn; i++) {
- struct cpuset *a = csa[i];
- struct cpumask *dp;
- int apn = a->pn;
-
- if (apn < 0) {
- /* Skip completed partitions */
- continue;
- }
-
- dp = doms[nslot];
-
- if (nslot == ndoms) {
- static int warnings = 10;
- if (warnings) {
- pr_warn("rebuild_sched_domains confused: nslot %d, ndoms %d, csn %d, i %d, apn %d\n",
- nslot, ndoms, csn, i, apn);
- warnings--;
- }
- continue;
- }
+ struct cpumask *dp = doms[nslot];

cpumask_clear(dp);
if (dattr)
*(dattr + nslot) = SD_ATTR_INIT;
for (j = i; j < csn; j++) {
- struct cpuset *b = csa[j];
+ if (find_root(nodes, j) == i) {
+ if (i == j)
+ nslot++;

- if (apn == b->pn) {
- cpumask_or(dp, dp, b->effective_cpus);
+ cpumask_or(dp, dp, csa[j]->effective_cpus);
cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
if (dattr)
- update_domain_attr_tree(dattr + nslot, b);
-
- /* Done with this partition */
- b->pn = -1;
+ update_domain_attr_tree(dattr + nslot, csa[j]);
}
}
- nslot++;
}
BUG_ON(nslot != ndoms);
-
+ kfree(nodes);
done:
kfree(csa);

--
2.34.1



2024-05-31 21:15:29

by Waiman Long

Subject: Re: [PATCH v2] cpuset: Optimize the number of iterations in the scheduling domain construction process

On 5/30/24 22:48, Xavier wrote:
> The process of constructing scheduling domains involves multiple loops
> and repeated evaluations, leading to numerous redundant and ineffective
> assessments that impact code efficiency.
>
> Here, we use Union-Find to optimize the merging of cpumasks. By employing
> path compression and union by rank, we effectively reduce the number of
> lookups and merge comparisons.
>
> Signed-off-by: Xavier <[email protected]>
> [...]

I have several issues with this patch.

1) The function comment describes the algorithm to find the set of
domains. If you change the algorithm, you have to update the comment as
well.

2) generate_sched_domains() is not in a performance critical path, so
its performance is not as important. Also the csn array is typically not
that big. Changing the algorithm may introduce new bugs which leads to
the next point.

3) How do you test your code to ensure its correctness?

I applied your patch and run the ./test_cpuset_prs.sh got the following:

[   87.850866] BUG: kernel NULL pointer dereference, address:
0000000000000000
[   87.857832] #PF: supervisor write access in kernel mode
[   87.863067] #PF: error_code(0x0002) - not-present page
[   87.868203] PGD 0 P4D 0
[   87.870745] Oops: 0002 [#1] PREEMPT SMP NOPTI
[   87.875105] CPU: 34 PID: 8789 Comm: test_cpuset_prs Kdump: loaded Not
tainted 6.9.0-rc2-test+ #12
[   87.883976] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS
SE5C620.86B.0X.02.0001.043020191705 04/30/2019
[   87.894402] RIP: 0010:memset_orig+0x72/0xb0
[   87.898597] Code: 47 28 48 89 47 30 48 89 47 38 48 8d 7f 40 75 d8 0f
1f 84 00 00 00 00 00 89 d1 83 e1 38 74 14 c1 e9 03 66 0f 1f 44 00 00 ff
c9 <48> 89 07 48 8d 7f 08 75 f5 83 e2 07 74 0a ff ca 88 07 48 8d 7f 01
[   87.917343] RSP: 0018:ffffa1b2a29b7d28 EFLAGS: 00010202
[   87.922568] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
0000000000000001
[   87.929701] RDX: 0000000000000010 RSI: 0000000000000000 RDI:
0000000000000000
[   87.936834] RBP: ffff9184d9647620 R08: 0000000000000008 R09:
0000000000000000
[   87.943967] R10: 0000000000000000 R11: 0000000000000001 R12:
0000000000000002
[   87.951100] R13: 0000000000000000 R14: ffff9184ccca5798 R15:
0000000000000003
[   87.958231] FS:  00007fb51944a740(0000) GS:ffff91947eb00000(0000)
knlGS:0000000000000000
[   87.966316] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   87.972061] CR2: 0000000000000000 CR3: 000000109627c005 CR4:
00000000007706f0
[   87.979194] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   87.986327] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[   87.993460] PKRU: 55555554
[   87.996174] Call Trace:
[   87.998626]  <TASK>
[   88.000733]  ? __die+0x24/0x70
[   88.003802]  ? page_fault_oops+0x14a/0x510
[   88.007909]  ? __schedule+0x3d1/0x1710
[   88.011659]  ? exc_page_fault+0x77/0x170
[   88.015584]  ? asm_exc_page_fault+0x26/0x30
[   88.019772]  ? memset_orig+0x72/0xb0
[   88.023349]  rebuild_sched_domains_locked+0x4f5/0x920
[   88.028403]  update_prstate+0x137/0x4e0
[   88.032245]  sched_partition_write+0x96/0x180
[   88.036613]  kernfs_fop_write_iter+0x12c/0x1c0
[   88.041067]  vfs_write+0x30c/0x430
[   88.044481]  ksys_write+0x63/0xe0
[   88.047801]  do_syscall_64+0x87/0x170
[   88.051474]  ? exc_page_fault+0x77/0x170
[   88.055400]  entry_SYSCALL_64_after_hwframe+0x71/0x79
[   88.060453] RIP: 0033:0x7fb5192fda57
[   88.064047] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f
1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f
05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   88.082794] RSP: 002b:00007ffc6e57d2d8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[   88.090361] RAX: ffffffffffffffda RBX: 0000000000000007 RCX:
00007fb5192fda57
[   88.097490] RDX: 0000000000000007 RSI: 000055aee3c38f40 RDI:
0000000000000001
[   88.104623] RBP: 000055aee3c38f40 R08: 0000000000000000 R09:
00007fb5193b14e0
[   88.111757] R10: 00007fb5193b13e0 R11: 0000000000000246 R12:
0000000000000007
[   88.118890] R13: 00007fb5193fb780 R14: 0000000000000007 R15:
00007fb5193f69e0
[   88.126024]  </TASK>

So it is not ready for prime time yet.

Regards,
Longman


2024-06-03 12:31:57

by Xavier

Subject: [PATCH v3] cpuset: use Union-Find to optimize the merging of cpumasks

The process of constructing scheduling domains involves multiple loops
and repeated evaluations, leading to numerous redundant and ineffective
assessments that impact code efficiency.

Here, we use Union-Find to optimize the merging of cpumasks. By employing
path compression and union by rank, we effectively reduce the number of
lookups and merge comparisons.

Signed-off-by: Xavier <[email protected]>

Hi Longman,

Thank you for your feedback on the previous version of the patch.

Now I will respond to the three questions you raised:
1) The function comment describes the algorithm to find the set of
domains. If you change the algorithm, you have to update the comment as
well.

Reply: Sorry for not paying attention to the comments before. The new patch (v3) updates the comment accordingly.

2) generate_sched_domains() is not in a performance critical path, so
its performance is not as important. Also the csn array is typically not
that big. Changing the algorithm may introduce new bugs which leads to
the next point.

Reply: Indeed, this function is not on a performance-critical path, but it is always good to improve efficiency. The optimization is limited to the internals of this function and does not affect other modules, so the risk of introducing new bugs should be manageable.

3) How do you test your code to ensure its correctness?
I applied your patch and run the ./test_cpuset_prs.sh got the following...

Reply: I'm very sorry, this is my first time submitting a kernel patch and I did not know which test cases needed to be run. I had only constructed some scenarios locally to test, so the testing scope was limited. Thank you for providing the test case. I have reproduced and resolved the issue, and have passed several other cgroup test cases. However, I currently only have a QEMU simulation environment, so my testing ability is limited. I hope you can help with some comprehensive testing of the new version of the patch. Thank you very much.

I hope you can pay further attention to the new patch.

Best regards,
Xavier

---
kernel/cgroup/cpuset.c | 148 +++++++++++++++++++++++------------------
1 file changed, 82 insertions(+), 66 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c12b9fdb2..c2d030a30 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -891,6 +891,44 @@ static inline int nr_cpusets(void)
return static_key_count(&cpusets_enabled_key.key) + 1;
}

+/*define a union find node struct*/
+struct uf_node {
+ int parent;
+ int rank;
+};
+
+static int find_root(struct uf_node *nodes, int x)
+{
+ int root = x;
+ int parent;
+
+ /*Find the root node and perform path compression at the same time*/
+ while (nodes[root].parent != root) {
+ parent = nodes[root].parent;
+ nodes[root].parent = nodes[parent].parent;
+ root = parent;
+ }
+ return root;
+}
+
+/*Function to merge two sets, using union by rank*/
+static void union_sets(struct uf_node *nodes, int a, int b)
+{
+ int root_a = find_root(nodes, a);
+ int root_b = find_root(nodes, b);
+
+ if (root_a != root_b) {
+ if (nodes[root_a].rank < nodes[root_b].rank) {
+ nodes[root_a].parent = root_b;
+ } else if (nodes[root_a].rank > nodes[root_b].rank) {
+ nodes[root_b].parent = root_a;
+ } else {
+ nodes[root_b].parent = root_a;
+ nodes[root_a].rank++;
+ }
+ }
+}
+
/*
* generate_sched_domains()
*
@@ -930,17 +968,13 @@ static inline int nr_cpusets(void)
* value to determine what partition elements (sched domains)
* were changed (added or removed.)
*
- * Finding the best partition (set of domains):
- * The triple nested loops below over i, j, k scan over the
- * load balanced cpusets (using the array of cpuset pointers in
- * csa[]) looking for pairs of cpusets that have overlapping
- * cpus_allowed, but which don't have the same 'pn' partition
- * number and gives them in the same partition number. It keeps
- * looping on the 'restart' label until it can no longer find
- * any such pairs.
+ * By creating a union-find data structure for all load-balanced cpusets,
+ * cpusets with overlapping cpus_allowed are found and marked with the
+ * same parent node. Path compression is performed during the search to
+ * improve comparison efficiency.
*
* The union of the cpus_allowed masks from the set of
- * all cpusets having the same 'pn' value then form the one
+ * all cpusets having the same parent then form the one
* element of the partition (one sched domain) to be passed to
* partition_sched_domains().
*/
@@ -950,13 +984,15 @@ static int generate_sched_domains(cpumask_var_t **domains,
struct cpuset *cp; /* top-down scan of cpusets */
struct cpuset **csa; /* array of all cpuset ptrs */
int csn; /* how many cpuset ptrs in csa so far */
- int i, j, k; /* indices for partition finding loops */
+ int i, j; /* indices for partition finding loops */
cpumask_var_t *doms; /* resulting partition; i.e. sched domains */
struct sched_domain_attr *dattr; /* attributes for custom domains */
int ndoms = 0; /* number of sched domains in result */
int nslot; /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
bool root_load_balance = is_sched_load_balance(&top_cpuset);
+ struct uf_node *nodes;
+ int nslot_update;

doms = NULL;
dattr = NULL;
@@ -1022,40 +1058,38 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
rcu_read_unlock();

- for (i = 0; i < csn; i++)
- csa[i]->pn = i;
- ndoms = csn;
-
-restart:
- /* Find the best partition (set of sched domains) */
- for (i = 0; i < csn; i++) {
- struct cpuset *a = csa[i];
- int apn = a->pn;
+ nodes = kmalloc_array(csn, sizeof(struct uf_node), GFP_KERNEL);
+ if (!nodes)
+ goto done;

- for (j = 0; j < csn; j++) {
- struct cpuset *b = csa[j];
- int bpn = b->pn;

- if (apn != bpn && cpusets_overlap(a, b)) {
- for (k = 0; k < csn; k++) {
- struct cpuset *c = csa[k];
+ /* Each node is initially its own parent */
+ for (i = 0; i < csn; i++) {
+ nodes[i].parent = i;
+ nodes[i].rank = 0;
+ }

- if (c->pn == bpn)
- c->pn = apn;
- }
- ndoms--; /* one less element */
- goto restart;
- }
+ /* Merge overlapping cpusets */
+ for (i = 0; i < csn; i++) {
+ for (j = i + 1; j < csn; j++) {
+ if (cpusets_overlap(csa[i], csa[j]))
+ union_sets(nodes, i, j);
}
}

+ /* Calculate the number of domains after merging */
+ for (i = 0; i < csn; i++) {
+ if (nodes[i].parent == i)
+ ndoms++;
+ }
+
/*
* Now we know how many domains to create.
* Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
*/
doms = alloc_sched_domains(ndoms);
if (!doms)
- goto done;
+ goto free;

/*
* The rest of the code, including the scheduler, can deal with
@@ -1065,47 +1099,29 @@ static int generate_sched_domains(cpumask_var_t **domains,
GFP_KERNEL);

for (nslot = 0, i = 0; i < csn; i++) {
- struct cpuset *a = csa[i];
- struct cpumask *dp;
- int apn = a->pn;
-
- if (apn < 0) {
- /* Skip completed partitions */
- continue;
- }
-
- dp = doms[nslot];
-
- if (nslot == ndoms) {
- static int warnings = 10;
- if (warnings) {
- pr_warn("rebuild_sched_domains confused: nslot %d, ndoms %d, csn %d, i %d, apn %d\n",
- nslot, ndoms, csn, i, apn);
- warnings--;
- }
- continue;
- }
-
- cpumask_clear(dp);
- if (dattr)
- *(dattr + nslot) = SD_ATTR_INIT;
+ nslot_update = 0;
for (j = i; j < csn; j++) {
- struct cpuset *b = csa[j];
-
- if (apn == b->pn) {
- cpumask_or(dp, dp, b->effective_cpus);
+ if (find_root(nodes, j) == i) {
+ struct cpumask *dp = doms[nslot];
+
+ if (i == j) {
+ nslot_update = 1;
+ cpumask_clear(dp);
+ if (dattr)
+ *(dattr + nslot) = SD_ATTR_INIT;
+ }
+ cpumask_or(dp, dp, csa[j]->effective_cpus);
cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
if (dattr)
- update_domain_attr_tree(dattr + nslot, b);
-
- /* Done with this partition */
- b->pn = -1;
+ update_domain_attr_tree(dattr + nslot, csa[j]);
}
}
- nslot++;
+ if (nslot_update)
+ nslot++;
}
BUG_ON(nslot != ndoms);
-
+free:
+ kfree(nodes);
done:
kfree(csa);

--
2.34.1


2024-06-04 15:04:11

by Waiman Long

Subject: Re: [PATCH v3] cpuset: use Union-Find to optimize the merging of cpumasks

On 6/3/24 08:31, Xavier wrote:
> The process of constructing scheduling domains involves multiple loops
> and repeated evaluations, leading to numerous redundant and ineffective
> assessments that impact code efficiency.
>
> Here, we use Union-Find to optimize the merging of cpumasks. By employing
> path compression and union by rank, we effectively reduce the number of
> lookups and merge comparisons.
>
> Signed-off-by: Xavier <[email protected]>
>
> Hi Longman,
>
> Thank you for your feedback on the previous version of the patch.
>
> Now I will respond to the three questions you raised:
> 1) The function comment describes the algorithm to find the set of
> domains. If you change the algorithm, you have to update the comment as
> well.
>
> Reply: Sorry for not paying attention to the comments before. The new patch (v3) has updated the comment content.
>
> 2) generate_sched_domains() is not in a performance critical path, so
> its performance is not as important. Also the csn array is typically not
> that big. Changing the algorithm may introduce new bugs which leads to
> the next point.
>
> Reply: Indeed, this function is not a critical path impacting performance, but it's always good to optimize efficiency. The optimization is limited to the internals of this function and does not affect other modules, so fixing the internal bug should have manageable risks.

In terms of efficiency, your patch does eliminate the third iteration (k)
in the csn iteration loop. However, the new union_sets() function may walk
up the node hierarchy, which can be considered a third iteration in some
way. So there is some saving, but not as significant as it looks. It
does simplify the code and make it a bit easier to read.

>
> 3) How do you test your code to ensure its correctness?
> I applied your patch and run the ./test_cpuset_prs.sh got the following...
>
> Reply: I'm very sorry, this is my first time submitting a kernel patch and I don't know which test cases need to be run. I just constructed some scenarios locally to test, so the testing scope is limited. Thank you for providing the test cases. I have reproduced and resolved the issue, and have passed several other test cases in CGroup. But currently, I only have QEMU simulation environment, so my testing ability is limited. I hope you can help me with some comprehensive testing of my new version patch. Thank you very much.
>
> I hope you can pay further attention to the new patch.

Also, your patch eliminates all use of the cpuset->pn variable, so
you can remove it as it is no longer needed.

After a harder look at the generate_sched_domains() code, I have found a
bug in the code with respect to the support of remote partitions. I will
send another patch to fix it. I also realize that the function was
originally designed to support v1 cpuset. v2 cpuset is quite different
and the code can be simplified for the v2 use case.

You are welcome to send a v4 patch on top of the new cpuset code base.

Thanks,
Longman


2024-06-05 02:32:43

by Xavier

Subject: Re: [PATCH v3] cpuset: use Union-Find to optimize the merging of cpumasks

Hi Longman,

I am ready to prepare and send a v4 patch based on the new cpuset codebase. Thanks for your guidance and support.

Best regards,
Xavier


2024-06-10 17:19:05

by Michal Koutný

Subject: Re: [PATCH v3] cpuset: use Union-Find to optimize the merging of cpumasks

Hello.

On Mon, Jun 03, 2024 at 08:31:01PM GMT, Xavier <[email protected]> wrote:
> The process of constructing scheduling domains involves multiple loops
> and repeated evaluations, leading to numerous redundant and ineffective
> assessments that impact code efficiency.
>
> Here, we use Union-Find to optimize the merging of cpumasks. By employing
> path compression and union by rank, we effectively reduce the number of
> lookups and merge comparisons.

Nice that you found such an application. (As Waiman wrote, the
efficiency is not so important here and may not even be increased, but I
still think using standard data structures makes the code more
understandable.)

Have you looked whether there are other instances of U-F in the kernel?
(My quick search didn't show any.) Still, I think it'd be a good idea to
decouple this into two commits -- 1) implementation of the new U-F (into
lib/), 2) application within cpuset.

> +/*define a union find node struct*/
> +struct uf_node {
> + int parent;

I think this would be better as `struct uf_node *`.

> + int rank;
> +};

`unsigned int` if rank cannot be negative?

> + /* Each node is initially its own parent */
> + for (i = 0; i < csn; i++) {
> + nodes[i].parent = i;
> + nodes[i].rank = 0;
> + }

With the suggestion above, nodes could start with parent = NULL, and the
self-parent could be corrected during the first find_root -- thus the whole
array could simply be init'd to zeroes with kzalloc.


My 0.02€,
Michal



2024-06-14 03:05:31

by Xavier

Subject: Re: [PATCH v3] cpuset: use Union-Find to optimize the merging of cpumasks

Hi Michal Koutný,

Thank you for your feedback. I think your idea is good, and I will resubmit after Longman's new cpuset patch is merged.

Best regards,
Xavier
