v8:
- Remove cpuset.cpus.isolated and add a new cpuset.sched.domain flag
and rework the code accordingly.
v7:
- Add a root-only cpuset.cpus.isolated control file for CPU isolation.
- Enforce that load_balancing can only be turned off on cpusets with
CPUs from the isolated list.
- Update sched domain generation to allow cpusets with CPUs only
from the isolated CPU list to be in separate root domains.
v6:
- Hide cpuset control knobs in root cgroup.
- Rename effective_cpus and effective_mems to cpus.effective and
mems.effective respectively.
- Remove cpuset.flags and add cpuset.sched_load_balance instead,
as the behavior of sched_load_balance has changed and it is no
longer a simple flag.
- Update cgroup-v2.txt accordingly.
v5:
- Add patch 2 to provide the cpuset.flags control knob for the
sched_load_balance flag, which should be the only essential feature
needed to replace the "isolcpus" kernel boot parameter.
v4:
- Further minimize the feature set by removing the flags control knob.
v3:
- Further trim the additional features down to just memory_migrate.
- Update Documentation/cgroup-v2.txt.
v6 patch: https://lkml.org/lkml/2018/3/21/530
v7 patch: https://lkml.org/lkml/2018/4/19/448
The purpose of this patchset is to provide a basic set of cpuset control
files for cgroup v2. This basic set includes the non-root "cpus",
"mems", "sched.load_balance" and "sched.domain". The "cpus.effective"
and "mems.effective" will appear in all cpuset-enabled cgroups.
The new control file that is unique to v2 is "sched.domain". It is a
boolean flag file that designates whether a cgroup is a scheduling
domain with its own exclusive list of CPUs that is, from the scheduler's
perspective, disjoint from those of other scheduling domains. The root
cgroup is always a scheduling domain. Multiple levels of scheduling
domains are supported, with some limitations, so that a container's
scheduling domain root can behave like a real root.
When a scheduling domain cgroup is removed, its list of exclusive CPUs
will be returned to the parent's cpus.effective automatically.
The "sched.load_balance" flag can only be changed in a scheduling domain.
with no child cpuset-enabled cgroups.
This patchset supports isolated CPUs in a child scheduling domain with
load balancing off. It also allows easy setup of multiple scheduling
domains without requiring the trick of turning load balancing off in the
root cgroup.
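For illustration, the basic flow looks like the following sketch. This is
only an example, assuming cgroup v2 is mounted at /sys/fs/cgroup and the
whole series is applied; the cgroup name "rt-part" and CPUs 2-3 are
arbitrary choices.
  # Hypothetical example: "rt-part" and CPUs 2-3 are made up.
  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/rt-part
  echo 2-3 > /sys/fs/cgroup/rt-part/cpuset.cpus
  echo 1 > /sys/fs/cgroup/rt-part/cpuset.sched.domain       # child becomes a scheduling domain
  echo 0 > /sys/fs/cgroup/rt-part/cpuset.sched.load_balance # no load balancing => isolated CPUs
  cat /sys/fs/cgroup/cpuset.cpus.effective                  # CPUs 2-3 no longer listed at root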
This patchset does not exclude the possibility of adding more features
in the future after careful consideration.
Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective counterparts.
Patch 2 adds a new "sched.domain" control file for setting up multiple
scheduling domains. A scheduling domain implies cpu_exclusive.
Patch 3 adds a "sched.load_balance" flag to turn off load balancing in
a scheduling domain.
Patch 4 updates the scheduling domain generation code to work with
the new scheduling domain feature.
Patch 5 exposes cpus.effective and mems.effective to the root cgroup, as
enabling child scheduling domains will take CPUs away from the root cgroup,
so it is useful to monitor what CPUs are left there.
Patch 6 enables printing of debug information about scheduling
domain generation.
Waiman Long (6):
cpuset: Enable cpuset controller in default hierarchy
cpuset: Add new v2 cpuset.sched.domain flag
cpuset: Add cpuset.sched.load_balance flag to v2
cpuset: Make generate_sched_domains() recognize isolated_cpus
cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
cpuset: Allow reporting of sched domain generation info
Documentation/cgroup-v2.txt | 136 +++++++++++++++-
kernel/cgroup/cpuset.c | 375 ++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 492 insertions(+), 19 deletions(-)
--
1.8.3.1
Because setting the "cpuset.sched.domain" flag in a direct child of
root can remove CPUs from the root's effective CPU list, it makes
sense to know what CPUs are left in the root cgroup for scheduling
purposes. So the "cpuset.cpus.effective" control file is now exposed in
the v2 cgroup root.
For consistency, the "cpuset.mems.effective" control file is exposed
as well.
Signed-off-by: Waiman Long <[email protected]>
---
Documentation/cgroup-v2.txt | 4 ++--
kernel/cgroup/cpuset.c | 2 --
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 071b634d..8739b10 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1474,7 +1474,7 @@ Cpuset Interface Files
and won't be affected by any CPU hotplug events.
cpuset.cpus.effective
- A read-only multiple values file which exists on non-root
+ A read-only multiple values file which exists on all
cpuset-enabled cgroups.
It lists the onlined CPUs that are actually allowed to be
@@ -1504,7 +1504,7 @@ Cpuset Interface Files
and won't be affected by any memory nodes hotplug events.
cpuset.mems.effective
- A read-only multiple values file which exists on non-root
+ A read-only multiple values file which exists on all
cpuset-enabled cgroups.
It lists the onlined memory nodes that are actually allowed to
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0e75f83..fb8aa82b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2203,14 +2203,12 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
.name = "cpus.effective",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_CPULIST,
- .flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "mems.effective",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_MEMLIST,
- .flags = CFTYPE_NOT_ON_ROOT,
},
{
--
1.8.3.1
Given that thread mode has been merged into 4.14, it is now time to
enable cpuset to be used in the default hierarchy (cgroup v2) as it is
clearly threaded.
The cpuset controller has experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from userspace. Some of
these features are of doubtful usefulness and may not be actively used.
This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts. We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.
Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or maybe to the cpu controller instead of keeping
them in cpuset.
Signed-off-by: Waiman Long <[email protected]>
---
Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++++++++++++++++++++---
kernel/cgroup/cpuset.c | 48 ++++++++++++++++++++++--
2 files changed, 130 insertions(+), 8 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..cf7bac6 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
5-3-2. Writeback
5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
- 5-6-1. RDMA Interface Files
- 5-7. Misc
- 5-7-1. perf_event
+ 5-5. Cpuset
+ 5.5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+ 5-7-1. RDMA Interface Files
+ 5-8. Misc
+ 5-8-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
of a new process would cause a cgroup policy to be violated.
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical. That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+ cpuset.cpus
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the CPUs allowed to be used by tasks within this
+ cgroup. The CPU numbers are comma-separated numbers or
+ ranges. For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.cpus" or all the available CPUs if none is found.
+
+ The value of "cpuset.cpus" stays constant until the next update
+ and won't be affected by any CPU hotplug events.
+
+ cpuset.cpus.effective
+ A read-only multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the onlined CPUs that are actually allowed to be
+ used by tasks within the current cgroup. If "cpuset.cpus"
+ is empty, it shows all the CPUs from the parent cgroup that
+ will be available to be used by this cgroup. Otherwise, it is
+ a subset of "cpuset.cpus". Its value will be affected by CPU
+ hotplug events.
+
+ cpuset.mems
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the memory nodes allowed to be used by tasks within
+ this cgroup. The memory node numbers are comma-separated
+ numbers or ranges. For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.mems" or all the available memory nodes if none
+ is found.
+
+ The value of "cpuset.mems" stays constant until the next update
+ and won't be affected by any memory nodes hotplug events.
+
+ cpuset.mems.effective
+ A read-only multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the onlined memory nodes that are actually allowed to
+ be used by tasks within the current cgroup. If "cpuset.mems"
+ is empty, it shows all the memory nodes from the parent cgroup
+ that will be available to be used by this cgroup. Otherwise,
+ it is a subset of "cpuset.mems". Its value will be affected
+ by memory nodes hotplug events.
+
+
Device controller
-----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b42037e..419b758 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1823,12 +1823,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
return 0;
}
-
/*
* for the common functions, 'private' gives the type of file
*/
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
{
.name = "cpus",
.seq_show = cpuset_common_seq_show,
@@ -1931,6 +1930,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
};
/*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+ {
+ .name = "cpus",
+ .seq_show = cpuset_common_seq_show,
+ .write = cpuset_write_resmask,
+ .max_write_len = (100U + 6 * NR_CPUS),
+ .private = FILE_CPULIST,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+
+ {
+ .name = "mems",
+ .seq_show = cpuset_common_seq_show,
+ .write = cpuset_write_resmask,
+ .max_write_len = (100U + 6 * MAX_NUMNODES),
+ .private = FILE_MEMLIST,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+
+ {
+ .name = "cpus.effective",
+ .seq_show = cpuset_common_seq_show,
+ .private = FILE_EFFECTIVE_CPULIST,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+
+ {
+ .name = "mems.effective",
+ .seq_show = cpuset_common_seq_show,
+ .private = FILE_EFFECTIVE_MEMLIST,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+
+ { } /* terminate */
+};
+
+
+/*
* cpuset_css_alloc - allocate a cpuset css
* cgrp: control group that the new cpuset will be part of
*/
@@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
.post_attach = cpuset_post_attach,
.bind = cpuset_bind,
.fork = cpuset_fork,
- .legacy_cftypes = files,
+ .legacy_cftypes = legacy_files,
+ .dfl_cftypes = dfl_files,
.early_init = true,
+ .threaded = true,
};
/**
--
1.8.3.1
This patch enables us to report sched domain generation information.
If DYNAMIC_DEBUG is enabled, issuing the following command
echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control
and setting loglevel to 8 will allow the kernel to show what scheduling
domain changes are being made.
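As a rough end-to-end sequence (assuming debugfs is mounted at
/sys/kernel/debug; raising the console loglevel via /proc/sys/kernel/printk
is just one way to make the pr_debug() output visible):
  echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control
  echo 8 > /proc/sys/kernel/printk   # raise console loglevel
  # ...change cpuset.cpus or cpuset.sched.domain, then check dmesg for
  # the "generate_sched_domains dom ..." lines.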
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fb8aa82b..8f586e8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -820,6 +820,12 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
BUG_ON(nslot != ndoms);
+#ifdef CONFIG_DEBUG_KERNEL
+ for (i = 0; i < ndoms; i++)
+ pr_debug("generate_sched_domains dom %d: %*pbl\n", i,
+ cpumask_pr_args(doms[i]));
+#endif
+
done:
kfree(csa);
--
1.8.3.1
The generate_sched_domains() function and the hotplug code are modified
to make them use the newly introduced isolated_cpus mask for sched
domain generation.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 368e1b7..0e75f83 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
int ndoms = 0; /* number of sched domains in result */
int nslot; /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
+ bool root_load_balance = is_sched_load_balance(&top_cpuset);
doms = NULL;
dattr = NULL;
csa = NULL;
/* Special case for the 99% of systems with one, full, sched domain */
- if (is_sched_load_balance(&top_cpuset)) {
+ if (root_load_balance && !top_cpuset.isolation_count) {
ndoms = 1;
doms = alloc_sched_domains(ndoms);
if (!doms)
@@ -701,6 +702,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
csn = 0;
rcu_read_lock();
+ if (root_load_balance)
+ csa[csn++] = &top_cpuset;
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
@@ -711,6 +714,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
* parent's cpus, so just skip them, and then we call
* update_domain_attr_tree() to calc relax_domain_level of
* the corresponding sched domain.
+ *
+ * If root is load-balancing, we can skip @cp if it
+ * is a subset of the root's effective_cpus.
*/
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
@@ -718,11 +724,16 @@ static int generate_sched_domains(cpumask_var_t **domains,
housekeeping_cpumask(HK_FLAG_DOMAIN))))
continue;
+ if (root_load_balance &&
+ cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+ continue;
+
if (is_sched_load_balance(cp))
csa[csn++] = cp;
- /* skip @cp's subtree */
- pos_css = css_rightmost_descendant(pos_css);
+ /* skip @cp's subtree if not a scheduling domain */
+ if (!is_sched_domain(cp))
+ pos_css = css_rightmost_descendant(pos_css);
}
rcu_read_unlock();
@@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
* passing doms with offlined cpu to partition_sched_domains().
* Anyways, hotplug work item will rebuild sched domains.
*/
- if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+ if (!top_cpuset.isolation_count &&
+ !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+ goto out;
+
+ if (top_cpuset.isolation_count &&
+ !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
/* Generate domain masks and attrs */
@@ -2624,6 +2640,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
cpumask_copy(&new_cpus, cpu_active_mask);
new_mems = node_states[N_MEMORY];
+ /*
+ * If isolated_cpus is populated, it is likely that the check below
+ * will produce a false positive on cpus_updated when the cpu list
+ * isn't changed. It is extra work, but it is better to be safe.
+ */
cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
@@ -2632,6 +2653,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
spin_lock_irq(&callback_lock);
if (!on_dfl)
cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+
+ if (top_cpuset.isolation_count)
+ cpumask_andnot(&new_cpus, &new_cpus,
+ top_cpuset.isolated_cpus);
cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
spin_unlock_irq(&callback_lock);
/* we don't mess with cpumasks of tasks in top_cpuset */
--
1.8.3.1
The sched.load_balance flag is needed to enable CPU isolation similar to
what can be done with the "isolcpus" kernel boot parameter. Its value
can only be changed in a scheduling domain with no child cpusets. On
a non-scheduling domain cpuset, the value of sched.load_balance is
inherited from its parent.
This flag is set by the parent and is not delegatable.
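A small sketch of the resulting behavior (paths and cgroup names are only
examples, assuming the whole series is applied):
  # "dom" already has cpuset.sched.domain=1 and no cpuset-enabled children;
  # "plain" is an ordinary non-domain child cgroup.
  echo 0 > /sys/fs/cgroup/dom/cpuset.sched.load_balance    # allowed
  echo 0 > /sys/fs/cgroup/plain/cpuset.sched.load_balance  # expected to fail (-EINVAL)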
Signed-off-by: Waiman Long <[email protected]>
---
Documentation/cgroup-v2.txt | 24 ++++++++++++++++++++
kernel/cgroup/cpuset.c | 53 +++++++++++++++++++++++++++++++++++++++++----
2 files changed, 73 insertions(+), 4 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 54d9e22..071b634d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1536,6 +1536,30 @@ Cpuset Interface Files
CPUs of the parent cgroup. Once it is set, this flag cannot be
cleared if there are any child cgroups with cpuset enabled.
+ A parent cgroup cannot distribute all its CPUs to child
+ scheduling domain cgroups unless its load balancing flag is
+ turned off.
+
+ cpuset.sched.load_balance
+ A read-write single value file which exists on non-root
+ cpuset-enabled cgroups. It is a binary value flag that accepts
+ either "0" (off) or a non-zero value (on). This flag is set
+ by the parent and is not delegatable.
+
+ When it is on, tasks within this cpuset will be load-balanced
+ by the kernel scheduler. Tasks will be moved from CPUs with
+ high load to other CPUs within the same cpuset with less load
+ periodically.
+
+ When it is off, there will be no load balancing among CPUs on
+ this cgroup. Tasks will stay in the CPUs they are running on
+ and will not be moved to other CPUs.
+
+ The initial value of this flag is "1". This flag is then
+ inherited by child cgroups with cpuset enabled. Its state
+ can only be changed on a scheduling domain cgroup with no
+ cpuset-enabled children.
+
Device controller
-----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e1a1af0..368e1b7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -510,7 +510,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
par = parent_cs(cur);
- /* On legacy hiearchy, we must be a subset of our parent cpuset. */
+ /* On legacy hierarchy, we must be a subset of our parent cpuset. */
ret = -EACCES;
if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
goto out;
@@ -1061,6 +1061,14 @@ static int update_isolated_cpumask(struct cpuset *cpuset,
goto out;
/*
+ * A parent can't distribute all its CPUs to child scheduling
+ * domain cpusets unless load balancing is off.
+ */
+ if (adding & !deleting && is_sched_load_balance(parent) &&
+ cpumask_equal(addmask, parent->effective_cpus))
+ goto out;
+
+ /*
* Check if any CPUs in addmask or delmask are in a sibling cpuset.
* An empty sibling cpus_allowed means it is the same as parent's
* effective_cpus. This checking is skipped if the cpuset is dying.
@@ -1531,6 +1539,16 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
domain_flag_changed = (is_sched_domain(cs) != is_sched_domain(trialcs));
+ /*
+ * On default hierachy, a load balance flag change is only allowed
+ * in a scheduling domain with no child cpuset.
+ */
+ if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
+ (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
+ err = -EINVAL;
+ goto out;
+ }
+
if (domain_flag_changed) {
err = turning_on
? update_isolated_cpumask(cs, NULL, cs->cpus_allowed)
@@ -2187,6 +2205,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "sched.load_balance",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_SCHED_LOAD_BALANCE,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+
{ } /* terminate */
};
@@ -2200,19 +2226,38 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct cpuset *cs;
+ struct cgroup_subsys_state *errptr = ERR_PTR(-ENOMEM);
if (!parent_css)
return &top_cpuset.css;
cs = kzalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
- return ERR_PTR(-ENOMEM);
+ return errptr;
if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
goto free_cs;
if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
goto free_cpus;
- set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+ /*
+ * On default hierarchy, inherit parent's CS_SCHED_LOAD_BALANCE flag.
+ * Creating new cpuset is also not allowed if the effective_cpus of
+ * its parent is empty.
+ */
+ if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+ struct cpuset *parent = css_cs(parent_css);
+
+ if (test_bit(CS_SCHED_LOAD_BALANCE, &parent->flags))
+ set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+
+ if (cpumask_empty(parent->effective_cpus)) {
+ errptr = ERR_PTR(-EINVAL);
+ goto free_cpus;
+ }
+ } else {
+ set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+ }
+
cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
cpumask_clear(cs->effective_cpus);
@@ -2226,7 +2271,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
free_cpumask_var(cs->cpus_allowed);
free_cs:
kfree(cs);
- return ERR_PTR(-ENOMEM);
+ return errptr;
}
static int cpuset_css_online(struct cgroup_subsys_state *css)
--
1.8.3.1
A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
flag indicates that the CPUs in the current cpuset should be treated
as a separate scheduling domain. This new flag is owned by the parent
and will cause the CPUs in the cpuset to be removed from the effective
CPUs of its parent.
This is implemented internally by adding a new isolated_cpus mask that
holds the CPUs belonging to child scheduling domain cpusets so that:
isolated_cpus | effective_cpus = cpus_allowed
isolated_cpus & effective_cpus = 0
This new flag can only be turned on in a cpuset if its parent is either
the root or itself a scheduling domain with a non-empty cpu list. The
state of this flag cannot be changed if the cpuset has children.
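For example (names and CPU numbers here are illustrative only; the root's
"cpuset.cpus.effective" file is only exposed by a later patch in this
series, and the cpuset controller is assumed to be enabled in the root's
cgroup.subtree_control):
  cat /sys/fs/cgroup/cpuset.cpus.effective        # e.g. 0-7
  echo 4-7 > /sys/fs/cgroup/child/cpuset.cpus
  echo 1 > /sys/fs/cgroup/child/cpuset.sched.domain
  cat /sys/fs/cgroup/cpuset.cpus.effective        # now 0-3, 4-7 went to the child domain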
Signed-off-by: Waiman Long <[email protected]>
---
Documentation/cgroup-v2.txt | 22 ++++
kernel/cgroup/cpuset.c | 237 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 256 insertions(+), 3 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index cf7bac6..54d9e22 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,28 @@ Cpuset Interface Files
it is a subset of "cpuset.mems". Its value will be affected
by memory nodes hotplug events.
+ cpuset.sched.domain
+ A read-write single value file which exists on non-root
+ cpuset-enabled cgroups. It is a binary value flag that accepts
+ either "0" (off) or a non-zero value (on). This flag is set
+ by the parent and is not delegatable.
+
+ If set, it indicates that the CPUs in the current cgroup will
+ be the root of a scheduling domain. The root cgroup is always
+ a scheduling domain. There are constraints on where this flag
+ can be set. It can only be set in a cgroup if all the following
+ conditions are true.
+
+ 1) The parent cgroup is also a scheduling domain with a non-empty
+ cpu list.
+ 2) The list of CPUs are exclusive, i.e. they are not shared by
+ any of its siblings.
+ 3) There is no child cgroups with cpuset enabled.
+
+ Setting this flag will take the CPUs away from the effective
+ CPUs of the parent cgroup. Once it is set, this flag cannot be
+ cleared if there are any child cgroups with cpuset enabled.
+
Device controller
-----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..e1a1af0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,9 @@ struct cpuset {
cpumask_var_t effective_cpus;
nodemask_t effective_mems;
+ /* Isolated CPUs for scheduling domain children */
+ cpumask_var_t isolated_cpus;
+
/*
* This is old Memory Nodes tasks took on.
*
@@ -134,6 +137,9 @@ struct cpuset {
/* for custom sched domain */
int relax_domain_level;
+
+ /* for isolated_cpus */
+ int isolation_count;
};
static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -175,6 +181,7 @@ static inline bool task_has_mempolicy(struct task_struct *task)
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+ CS_SCHED_DOMAIN,
} cpuset_flagbits_t;
/* convenient tests for these bits */
@@ -203,6 +210,11 @@ static inline int is_sched_load_balance(const struct cpuset *cs)
return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
}
+static inline int is_sched_domain(const struct cpuset *cs)
+{
+ return test_bit(CS_SCHED_DOMAIN, &cs->flags);
+}
+
static inline int is_memory_migrate(const struct cpuset *cs)
{
return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
@@ -220,7 +232,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
static struct cpuset top_cpuset = {
.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
- (1 << CS_MEM_EXCLUSIVE)),
+ (1 << CS_MEM_EXCLUSIVE) | (1 << CS_SCHED_DOMAIN)),
};
/**
@@ -902,7 +914,19 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
cpuset_for_each_descendant_pre(cp, pos_css, cs) {
struct cpuset *parent = parent_cs(cp);
- cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+ /*
+ * If parent has isolated CPUs, include them in the list
+ * of allowable CPUs.
+ */
+ if (parent->isolation_count) {
+ cpumask_or(new_cpus, parent->effective_cpus,
+ parent->isolated_cpus);
+ cpumask_and(new_cpus, new_cpus, cpu_online_mask);
+ cpumask_and(new_cpus, new_cpus, cp->cpus_allowed);
+ } else {
+ cpumask_and(new_cpus, cp->cpus_allowed,
+ parent->effective_cpus);
+ }
/*
* If it becomes empty, inherit the effective mask of the
@@ -948,6 +972,154 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
}
/**
+ * update_isolated_cpumask - update the isolated_cpus mask of parent cpuset
+ * @cpuset: The cpuset that requests CPU isolation
+ * @oldmask: The old isolated cpumask to be removed from the parent
+ * @newmask: The new isolated cpumask to be added to the parent
+ * Return: 0 if successful, an error code otherwise
+ *
+ * Changes to the isolated CPUs are not allowed if any of CPUs changing
+ * state are in any of the child cpusets of the parent except the requesting
+ * child.
+ *
+ * If the sched_domain flag changes, either the oldmask (0=>1) or the
+ * newmask (1=>0) will be NULL.
+ *
+ * Called with cpuset_mutex held.
+ */
+static int update_isolated_cpumask(struct cpuset *cpuset,
+ struct cpumask *oldmask, struct cpumask *newmask)
+{
+ int retval;
+ int adding, deleting;
+ cpumask_var_t addmask, delmask;
+ struct cpuset *parent = parent_cs(cpuset);
+ struct cpuset *sibling;
+ struct cgroup_subsys_state *pos_css;
+ int old_count = parent->isolation_count;
+ bool dying = cpuset->css.flags & CSS_DYING;
+
+ /*
+ * Parent must be a scheduling domain with non-empty cpus_allowed.
+ */
+ if (!is_sched_domain(parent) || cpumask_empty(parent->cpus_allowed))
+ return -EINVAL;
+
+ /*
+ * The oldmask, if present, must be a subset of parent's isolated
+ * CPUs.
+ */
+ if (oldmask && !cpumask_empty(oldmask) && (!parent->isolation_count ||
+ !cpumask_subset(oldmask, parent->isolated_cpus))) {
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+
+ /*
+ * A sched_domain state change is not allowed if there are
+ * online children and the cpuset is not dying.
+ */
+ if (!dying && (!oldmask || !newmask) &&
+ css_has_online_children(&cpuset->css))
+ return -EBUSY;
+
+ if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
+ return -ENOMEM;
+ if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
+ free_cpumask_var(addmask);
+ return -ENOMEM;
+ }
+
+ if (!old_count) {
+ if (!zalloc_cpumask_var(&parent->isolated_cpus, GFP_KERNEL)) {
+ retval = -ENOMEM;
+ goto out;
+ }
+ old_count = 1;
+ }
+
+ retval = -EBUSY;
+ adding = deleting = false;
+ if (newmask)
+ cpumask_copy(addmask, newmask);
+ if (oldmask)
+ deleting = cpumask_andnot(delmask, oldmask, addmask);
+ if (newmask)
+ adding = cpumask_andnot(addmask, newmask, delmask);
+
+ if (!adding && !deleting)
+ goto out_ok;
+
+ /*
+ * The cpus to be added must be in the parent's effective_cpus mask
+ * but not in the isolated_cpus mask.
+ */
+ if (!cpumask_subset(addmask, parent->effective_cpus))
+ goto out;
+ if (parent->isolation_count &&
+ cpumask_intersects(parent->isolated_cpus, addmask))
+ goto out;
+
+ /*
+ * Check if any CPUs in addmask or delmask are in a sibling cpuset.
+ * An empty sibling cpus_allowed means it is the same as parent's
+ * effective_cpus. This checking is skipped if the cpuset is dying.
+ */
+ if (dying)
+ goto updated_isolated_cpus;
+
+ cpuset_for_each_child(sibling, pos_css, parent) {
+ if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
+ continue;
+ if (cpumask_empty(sibling->cpus_allowed))
+ goto out;
+ if (adding &&
+ cpumask_intersects(sibling->cpus_allowed, addmask))
+ goto out;
+ if (deleting &&
+ cpumask_intersects(sibling->cpus_allowed, delmask))
+ goto out;
+ }
+
+ /*
+ * Change the isolated CPU list.
+ * Newly added isolated CPUs will be removed from effective_cpus
+ * and newly deleted ones will be added back if they are online.
+ */
+updated_isolated_cpus:
+ spin_lock_irq(&callback_lock);
+ if (adding)
+ cpumask_or(parent->isolated_cpus,
+ parent->isolated_cpus, addmask);
+
+ if (deleting)
+ cpumask_andnot(parent->isolated_cpus,
+ parent->isolated_cpus, delmask);
+
+ /*
+ * New effective_cpus = (cpus_allowed & ~isolated_cpus) &
+ * cpu_online_mask
+ */
+ cpumask_andnot(parent->effective_cpus, parent->cpus_allowed,
+ parent->isolated_cpus);
+ cpumask_and(parent->effective_cpus, parent->effective_cpus,
+ cpu_online_mask);
+
+ parent->isolation_count = cpumask_weight(parent->isolated_cpus);
+ spin_unlock_irq(&callback_lock);
+
+out_ok:
+ retval = 0;
+out:
+ free_cpumask_var(addmask);
+ free_cpumask_var(delmask);
+ if (old_count && !parent->isolation_count)
+ free_cpumask_var(parent->isolated_cpus);
+
+ return retval;
+}
+
+/**
* update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
* @cs: the cpuset to consider
* @trialcs: trial cpuset
@@ -988,6 +1160,13 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
if (retval < 0)
return retval;
+ if (is_sched_domain(cs)) {
+ retval = update_isolated_cpumask(cs, cs->cpus_allowed,
+ trialcs->cpus_allowed);
+ if (retval < 0)
+ return retval;
+ }
+
spin_lock_irq(&callback_lock);
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
spin_unlock_irq(&callback_lock);
@@ -1316,6 +1495,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
struct cpuset *trialcs;
int balance_flag_changed;
int spread_flag_changed;
+ int domain_flag_changed;
int err;
trialcs = alloc_trial_cpuset(cs);
@@ -1327,6 +1507,18 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
else
clear_bit(bit, &trialcs->flags);
+ /*
+ * Turning on sched.domain flag (default hierarchy only) implies
+ * an implicit cpu_exclusive. Turning off sched.domain will clear
+ * the cpu_exclusive flag.
+ */
+ if (bit == CS_SCHED_DOMAIN) {
+ if (turning_on)
+ set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+ else
+ clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+ }
+
err = validate_change(cs, trialcs);
if (err < 0)
goto out;
@@ -1337,11 +1529,26 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
|| (is_spread_page(cs) != is_spread_page(trialcs)));
+ domain_flag_changed = (is_sched_domain(cs) != is_sched_domain(trialcs));
+
+ if (domain_flag_changed) {
+ err = turning_on
+ ? update_isolated_cpumask(cs, NULL, cs->cpus_allowed)
+ : update_isolated_cpumask(cs, cs->cpus_allowed, NULL);
+ if (err < 0)
+ goto out;
+ /*
+ * At this point, the state has been changed.
+ * So we can't back out with error anymore.
+ */
+ }
+
spin_lock_irq(&callback_lock);
cs->flags = trialcs->flags;
spin_unlock_irq(&callback_lock);
- if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
+ if (!cpumask_empty(trialcs->cpus_allowed) &&
+ (balance_flag_changed || domain_flag_changed))
rebuild_sched_domains_locked();
if (spread_flag_changed)
@@ -1596,6 +1803,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL,
FILE_SCHED_LOAD_BALANCE,
+ FILE_SCHED_DOMAIN,
FILE_SCHED_RELAX_DOMAIN_LEVEL,
FILE_MEMORY_PRESSURE_ENABLED,
FILE_MEMORY_PRESSURE,
@@ -1629,6 +1837,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
case FILE_SCHED_LOAD_BALANCE:
retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val);
break;
+ case FILE_SCHED_DOMAIN:
+ retval = update_flag(CS_SCHED_DOMAIN, cs, val);
+ break;
case FILE_MEMORY_MIGRATE:
retval = update_flag(CS_MEMORY_MIGRATE, cs, val);
break;
@@ -1790,6 +2001,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
return is_mem_hardwall(cs);
case FILE_SCHED_LOAD_BALANCE:
return is_sched_load_balance(cs);
+ case FILE_SCHED_DOMAIN:
+ return is_sched_domain(cs);
case FILE_MEMORY_MIGRATE:
return is_memory_migrate(cs);
case FILE_MEMORY_PRESSURE_ENABLED:
@@ -1966,6 +2179,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "sched.domain",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_SCHED_DOMAIN,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+
{ } /* terminate */
};
@@ -2075,6 +2296,9 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
* will call rebuild_sched_domains_locked().
+ *
+ * If the cpuset has the 'sched_domain' flag enabled, simulate
+ * turning sched_domain off.
*/
static void cpuset_css_offline(struct cgroup_subsys_state *css)
@@ -2083,6 +2307,13 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
mutex_lock(&cpuset_mutex);
+ /*
+ * Calling update_flag() may fail, so we have to call
+ * update_isolated_cpumask directly to be sure.
+ */
+ if (is_sched_domain(cs))
+ update_isolated_cpumask(cs, cs->cpus_allowed, NULL);
+
if (is_sched_load_balance(cs))
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
--
1.8.3.1
Hi Waiman!
I've started looking at the possibility of moving Android to cgroups
v2, and the availability of the cpuset controller makes this even more
promising.
I'll try to give this series a run on Android; meanwhile I have
some (hopefully not too dumb) questions below.
On 17-May 16:55, Waiman Long wrote:
> Given the fact that thread mode had been merged into 4.14, it is now
> time to enable cpuset to be used in the default hierarchy (cgroup v2)
> as it is clearly threaded.
>
> The cpuset controller had experienced feature creep since its
> introduction more than a decade ago. Besides the core cpus and mems
> control files to limit cpus and memory nodes, there are a bunch of
> additional features that can be controlled from the userspace. Some of
> the features are of doubtful usefulness and may not be actively used.
>
> This patch enables cpuset controller in the default hierarchy with
> a minimal set of features, namely just the cpus and mems and their
> effective_* counterparts. We can certainly add more features to the
> default hierarchy in the future if there is a real user need for them
> later on.
>
> Alternatively, with the unified hiearachy, it may make more sense
> to move some of those additional cpuset features, if desired, to
> memory controller or may be to the cpu controller instead of staying
> with cpuset.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++++++++++++++++++++---
> kernel/cgroup/cpuset.c | 48 ++++++++++++++++++++++--
> 2 files changed, 130 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 74cdeae..cf7bac6 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
> 5-3-2. Writeback
> 5-4. PID
> 5-4-1. PID Interface Files
> - 5-5. Device
> - 5-6. RDMA
> - 5-6-1. RDMA Interface Files
> - 5-7. Misc
> - 5-7-1. perf_event
> + 5-5. Cpuset
> + 5.5-1. Cpuset Interface Files
> + 5-6. Device
> + 5-7. RDMA
> + 5-7-1. RDMA Interface Files
> + 5-8. Misc
> + 5-8-1. perf_event
> 5-N. Non-normative information
> 5-N-1. CPU controller root cgroup process behaviour
> 5-N-2. IO controller root cgroup process behaviour
> @@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
> of a new process would cause a cgroup policy to be violated.
>
>
> +Cpuset
> +------
> +
> +The "cpuset" controller provides a mechanism for constraining
> +the CPU and memory node placement of tasks to only the resources
> +specified in the cpuset interface files in a task's current cgroup.
> +This is especially valuable on large NUMA systems where placing jobs
> +on properly sized subsets of the systems with careful processor and
> +memory placement to reduce cross-node memory access and contention
> +can improve overall system performance.
Another quite important use-case for cpuset is Android, where they are
actively used to do both power-saving as well as performance tunings.
For example, depending on the status of an application, its threads
can be allowed to run on all available CPUs (e.g. foreground apps) or
be restricted to only a few energy-efficient CPUs (e.g. background apps).
Since we are "rewriting" cpusets for v2 here, I think it's important
to take this mobile-world scenario into consideration.
For example, in this context, we are looking at the possibility of
updating/tuning cpuset.cpus at a relatively high rate, i.e. tens of
times per second. Not sure that's the same update rate usually
required for the large NUMA systems you cite above. However, in this
case it's quite important to have really small overheads for these
operations.
> +
> +The "cpuset" controller is hierarchical. That means the controller
> +cannot use CPUs or memory nodes not allowed in its parent.
> +
> +
> +Cpuset Interface Files
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> + cpuset.cpus
> + A read-write multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the CPUs allowed to be used by tasks within this
> + cgroup. The CPU numbers are comma-separated numbers or
> + ranges. For example:
> +
> + # cat cpuset.cpus
> + 0-4,6,8-10
> +
> + An empty value indicates that the cgroup is using the same
> + setting as the nearest cgroup ancestor with a non-empty
> + "cpuset.cpus" or all the available CPUs if none is found.
Does that mean that we can move tasks into a newly created group for
which we have not yet configured this value?
AFAIK, that's a different behavior wrt v1... and I like it better.
> +
> + The value of "cpuset.cpus" stays constant until the next update
> + and won't be affected by any CPU hotplug events.
This also sounds interesting; does it mean that we use the
cpuset.cpus mask to restrict online CPUs, whatever they are?
I'll have a better look at the code, but my understanding of v1 is
that we spent a lot of effort to keep task cpu-affinity masks aligned
with the cpuset in which they live, and we do something similar at each
HP event, which ultimately generates a lot of overhead in systems
where you have many HP events and/or cpuset.cpus changes quite
frequently.
I hope to find some better behavior in this series.
> +
> + cpuset.cpus.effective
> + A read-only multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the onlined CPUs that are actually allowed to be
> + used by tasks within the current cgroup. If "cpuset.cpus"
> + is empty, it shows all the CPUs from the parent cgroup that
> + will be available to be used by this cgroup. Otherwise, it is
> + a subset of "cpuset.cpus". Its value will be affected by CPU
> + hotplug events.
This looks similar to v1, doesn't it?
> +
> + cpuset.mems
> + A read-write multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the memory nodes allowed to be used by tasks within
> + this cgroup. The memory node numbers are comma-separated
> + numbers or ranges. For example:
> +
> + # cat cpuset.mems
> + 0-1,3
> +
> + An empty value indicates that the cgroup is using the same
> + setting as the nearest cgroup ancestor with a non-empty
> + "cpuset.mems" or all the available memory nodes if none
> + is found.
> +
> + The value of "cpuset.mems" stays constant until the next update
> + and won't be affected by any memory nodes hotplug events.
> +
> + cpuset.mems.effective
> + A read-only multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the onlined memory nodes that are actually allowed to
> + be used by tasks within the current cgroup. If "cpuset.mems"
> + is empty, it shows all the memory nodes from the parent cgroup
> + that will be available to be used by this cgroup. Otherwise,
> + it is a subset of "cpuset.mems". Its value will be affected
> + by memory nodes hotplug events.
> +
> +
> Device controller
> -----------------
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b42037e..419b758 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1823,12 +1823,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
> return 0;
> }
>
> -
> /*
> * for the common functions, 'private' gives the type of file
> */
>
> -static struct cftype files[] = {
> +static struct cftype legacy_files[] = {
> {
> .name = "cpus",
> .seq_show = cpuset_common_seq_show,
> @@ -1931,6 +1930,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
> };
>
> /*
> + * This is currently a minimal set for the default hierarchy. It can be
> + * expanded later on by migrating more features and control files from v1.
> + */
> +static struct cftype dfl_files[] = {
> + {
> + .name = "cpus",
> + .seq_show = cpuset_common_seq_show,
> + .write = cpuset_write_resmask,
> + .max_write_len = (100U + 6 * NR_CPUS),
> + .private = FILE_CPULIST,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> +
> + {
> + .name = "mems",
> + .seq_show = cpuset_common_seq_show,
> + .write = cpuset_write_resmask,
> + .max_write_len = (100U + 6 * MAX_NUMNODES),
> + .private = FILE_MEMLIST,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> +
> + {
> + .name = "cpus.effective",
> + .seq_show = cpuset_common_seq_show,
> + .private = FILE_EFFECTIVE_CPULIST,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> +
> + {
> + .name = "mems.effective",
> + .seq_show = cpuset_common_seq_show,
> + .private = FILE_EFFECTIVE_MEMLIST,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> +
> + { } /* terminate */
> +};
> +
> +
> +/*
> * cpuset_css_alloc - allocate a cpuset css
> * cgrp: control group that the new cpuset will be part of
> */
> @@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
> .post_attach = cpuset_post_attach,
> .bind = cpuset_bind,
> .fork = cpuset_fork,
> - .legacy_cftypes = files,
> + .legacy_cftypes = legacy_files,
> + .dfl_cftypes = dfl_files,
> .early_init = true,
> + .threaded = true,
Which means that by default we can attach tasks instead of only
processes, right?
> };
>
> /**
> --
> 1.8.3.1
>
--
#include <best/regards.h>
Patrick Bellasi
On 05/21/2018 07:55 AM, Patrick Bellasi wrote:
> Hi Waiman!
>
> I've started looking at the possibility to move Android to use cgroups
> v2 and the availability of the cpuset controller makes this even more
> promising.
>
> I'll try to give a run to this series on Android, meanwhile I have
> some (hopefully not too much dummy) questions below.
>
> On 17-May 16:55, Waiman Long wrote:
>> Given the fact that thread mode had been merged into 4.14, it is now
>> time to enable cpuset to be used in the default hierarchy (cgroup v2)
>> as it is clearly threaded.
>>
>> The cpuset controller had experienced feature creep since its
>> introduction more than a decade ago. Besides the core cpus and mems
>> control files to limit cpus and memory nodes, there are a bunch of
>> additional features that can be controlled from the userspace. Some of
>> the features are of doubtful usefulness and may not be actively used.
>>
>> This patch enables cpuset controller in the default hierarchy with
>> a minimal set of features, namely just the cpus and mems and their
>> effective_* counterparts. We can certainly add more features to the
>> default hierarchy in the future if there is a real user need for them
>> later on.
>>
>> Alternatively, with the unified hiearachy, it may make more sense
>> to move some of those additional cpuset features, if desired, to
>> memory controller or may be to the cpu controller instead of staying
>> with cpuset.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++++++++++++++++++++---
>> kernel/cgroup/cpuset.c | 48 ++++++++++++++++++++++--
>> 2 files changed, 130 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index 74cdeae..cf7bac6 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
>> 5-3-2. Writeback
>> 5-4. PID
>> 5-4-1. PID Interface Files
>> - 5-5. Device
>> - 5-6. RDMA
>> - 5-6-1. RDMA Interface Files
>> - 5-7. Misc
>> - 5-7-1. perf_event
>> + 5-5. Cpuset
>> + 5.5-1. Cpuset Interface Files
>> + 5-6. Device
>> + 5-7. RDMA
>> + 5-7-1. RDMA Interface Files
>> + 5-8. Misc
>> + 5-8-1. perf_event
>> 5-N. Non-normative information
>> 5-N-1. CPU controller root cgroup process behaviour
>> 5-N-2. IO controller root cgroup process behaviour
>> @@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
>> of a new process would cause a cgroup policy to be violated.
>>
>>
>> +Cpuset
>> +------
>> +
>> +The "cpuset" controller provides a mechanism for constraining
>> +the CPU and memory node placement of tasks to only the resources
>> +specified in the cpuset interface files in a task's current cgroup.
>> +This is especially valuable on large NUMA systems where placing jobs
>> +on properly sized subsets of the systems with careful processor and
>> +memory placement to reduce cross-node memory access and contention
>> +can improve overall system performance.
> Another quite important use-case for cpuset is Android, where they are
> actively used to do both power-saving as well as performance tunings.
> For example, depending on the status of an application, its threads
> can be allowed to run on all available CPUS (e.g. foreground apps) or
> be restricted only on few energy efficient CPUs (e.g. backgroud apps).
>
> Since here we are at "rewriting" cpusets for v2, I think it's important
> to keep this mobile world scenario into consideration.
>
> For example, in this context, we are looking at the possibility to
> update/tune cpuset.cpus with a relatively high rate, i.e. tens of
> times per second. Not sure that's the same update rate usually
> required for the large NUMA systems you cite above. However, in this
> case it's quite important to have really small overheads for these
> operations.
The cgroup interface isn't designed for high update throughput. Changing
cpuset.cpus will require searching for all the tasks in the cpuset
and changing their cpu masks. That isn't a fast operation, but it shouldn't
be too bad either, depending on how many tasks are in the cpuset.
I would not suggest making rapid changes to cpuset.cpus as a means to tune
the behavior of a task. So what exactly is the tuning you are thinking
about? Is it moving a task from a high-power cpu to a low-power one
or vice versa? If so, it is probably better to move the task from one
cpuset of high-power cpus to another cpuset of low-power cpus.
>> +
>> +The "cpuset" controller is hierarchical. That means the controller
>> +cannot use CPUs or memory nodes not allowed in its parent.
>> +
>> +
>> +Cpuset Interface Files
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +
>> + cpuset.cpus
>> + A read-write multiple values file which exists on non-root
>> + cpuset-enabled cgroups.
>> +
>> + It lists the CPUs allowed to be used by tasks within this
>> + cgroup. The CPU numbers are comma-separated numbers or
>> + ranges. For example:
>> +
>> + # cat cpuset.cpus
>> + 0-4,6,8-10
>> +
>> + An empty value indicates that the cgroup is using the same
>> + setting as the nearest cgroup ancestor with a non-empty
>> + "cpuset.cpus" or all the available CPUs if none is found.
> Does that means that we can move tasks into a newly created group for
> which we have not yet configured this value?
> AFAIK, that's a different behavior wrt v1... and I like it better.
>
For v2, if you haven't set up the cpuset.cpus, it defaults to the
effective cpu list of its parent.
>> +
>> + The value of "cpuset.cpus" stays constant until the next update
>> + and won't be affected by any CPU hotplug events.
> This also sounds interesting, does it means that we use the
> cpuset.cpus mask to restrict online CPUs, whatever they are?
cpuset.cpus holds the cpu list written by the users.
cpuset.cpus.effective is the actual cpu mask that is being used. The
effective cpu mask is always a subset of cpuset.cpus. They differ if not
all the CPUs in cpuset.cpus are online.
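A quick sketch of what that means (the CPU numbers are made up, and this
assumes CPU 3 is in this cgroup's cpuset.cpus and can be offlined through
the usual sysfs hotplug interface):
  cat cpuset.cpus              # 0-3
  cat cpuset.cpus.effective    # 0-3
  echo 0 > /sys/devices/system/cpu/cpu3/online
  cat cpuset.cpus              # still 0-3
  cat cpuset.cpus.effective    # now 0-2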
> I'll have a better look at the code, but my understanding of v1 is
> that we spent a lot of effort to keep task cpu-affinity masks aligned
> with the cpuset in which they live, and we do something similar at each
> HP event, which ultimately generates a lot of overheads in systems
> where: you have many HP events and/or cpuset.cpus change quite
> frequently.
>
> I hope to find some better behavior in this series.
>
The behavior of a CPU offline event should be similar in v2. Any HP event
will cause the system to reset the cpu masks of the tasks affected by the
event. The online event, however, will be a bit different between v1 and
v2. For v1, the online event won't restore the CPU back to those cpusets
that previously had the onlined CPU. For v2, the onlined CPU will
be restored back to those cpusets. So there is less work for the
management layer, but the overhead of doing the restore is still there
in the kernel.
>> +
>> + cpuset.cpus.effective
>> + A read-only multiple values file which exists on non-root
>> + cpuset-enabled cgroups.
>> +
>> + It lists the onlined CPUs that are actually allowed to be
>> + used by tasks within the current cgroup. If "cpuset.cpus"
>> + is empty, it shows all the CPUs from the parent cgroup that
>> + will be available to be used by this cgroup. Otherwise, it is
>> + a subset of "cpuset.cpus". Its value will be affected by CPU
>> + hotplug events.
> This looks similar to v1, isn't it?
For v1, cpuset.cpus.effective is the same as cpuset.cpus unless you turn
on the v2 mode when mounting the v1 cpuset. For v2, they differ. Please
see the explanation above.
>> +
>> + cpuset.mems
>> + A read-write multiple values file which exists on non-root
>> + cpuset-enabled cgroups.
>> +
>> + It lists the memory nodes allowed to be used by tasks within
>> + this cgroup. The memory node numbers are comma-separated
>> + numbers or ranges. For example:
>> +
>> + # cat cpuset.mems
>> + 0-1,3
>> +
>> + An empty value indicates that the cgroup is using the same
>> + setting as the nearest cgroup ancestor with a non-empty
>> + "cpuset.mems" or all the available memory nodes if none
>> + is found.
>> +
>> + The value of "cpuset.mems" stays constant until the next update
>> + and won't be affected by any memory nodes hotplug events.
>> +
>> + cpuset.mems.effective
>> + A read-only multiple values file which exists on non-root
>> + cpuset-enabled cgroups.
>> +
>> + It lists the onlined memory nodes that are actually allowed to
>> + be used by tasks within the current cgroup. If "cpuset.mems"
>> + is empty, it shows all the memory nodes from the parent cgroup
>> + that will be available to be used by this cgroup. Otherwise,
>> + it is a subset of "cpuset.mems". Its value will be affected
>> + by memory nodes hotplug events.
>> +
>> +
>> Device controller
>> -----------------
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index b42037e..419b758 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1823,12 +1823,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
>> return 0;
>> }
>>
>> -
>> /*
>> * for the common functions, 'private' gives the type of file
>> */
>>
>> -static struct cftype files[] = {
>> +static struct cftype legacy_files[] = {
>> {
>> .name = "cpus",
>> .seq_show = cpuset_common_seq_show,
>> @@ -1931,6 +1930,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
>> };
>>
>> /*
>> + * This is currently a minimal set for the default hierarchy. It can be
>> + * expanded later on by migrating more features and control files from v1.
>> + */
>> +static struct cftype dfl_files[] = {
>> + {
>> + .name = "cpus",
>> + .seq_show = cpuset_common_seq_show,
>> + .write = cpuset_write_resmask,
>> + .max_write_len = (100U + 6 * NR_CPUS),
>> + .private = FILE_CPULIST,
>> + .flags = CFTYPE_NOT_ON_ROOT,
>> + },
>> +
>> + {
>> + .name = "mems",
>> + .seq_show = cpuset_common_seq_show,
>> + .write = cpuset_write_resmask,
>> + .max_write_len = (100U + 6 * MAX_NUMNODES),
>> + .private = FILE_MEMLIST,
>> + .flags = CFTYPE_NOT_ON_ROOT,
>> + },
>> +
>> + {
>> + .name = "cpus.effective",
>> + .seq_show = cpuset_common_seq_show,
>> + .private = FILE_EFFECTIVE_CPULIST,
>> + .flags = CFTYPE_NOT_ON_ROOT,
>> + },
>> +
>> + {
>> + .name = "mems.effective",
>> + .seq_show = cpuset_common_seq_show,
>> + .private = FILE_EFFECTIVE_MEMLIST,
>> + .flags = CFTYPE_NOT_ON_ROOT,
>> + },
>> +
>> + { } /* terminate */
>> +};
>> +
>> +
>> +/*
>> * cpuset_css_alloc - allocate a cpuset css
>> * cgrp: control group that the new cpuset will be part of
>> */
>> @@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
>> .post_attach = cpuset_post_attach,
>> .bind = cpuset_bind,
>> .fork = cpuset_fork,
>> - .legacy_cftypes = files,
>> + .legacy_cftypes = legacy_files,
>> + .dfl_cftypes = dfl_files,
>> .early_init = true,
>> + .threaded = true,
> Which means that by default we can attach tasks instead of only
> processes, right?
Yes, you can control task placement at the thread level, not just the process level.
Regards,
Longman
On 21-May 09:55, Waiman Long wrote:
> On 05/21/2018 07:55 AM, Patrick Bellasi wrote:
> > Hi Waiman!
[...]
> >> +Cpuset
> >> +------
> >> +
> >> +The "cpuset" controller provides a mechanism for constraining
> >> +the CPU and memory node placement of tasks to only the resources
> >> +specified in the cpuset interface files in a task's current cgroup.
> >> +This is especially valuable on large NUMA systems where placing jobs
> >> +on properly sized subsets of the systems with careful processor and
> >> +memory placement to reduce cross-node memory access and contention
> >> +can improve overall system performance.
> > Another quite important use-case for cpuset is Android, where they are
> > actively used to do both power-saving as well as performance tunings.
> > For example, depending on the status of an application, its threads
> > can be allowed to run on all available CPUS (e.g. foreground apps) or
> > be restricted only on few energy efficient CPUs (e.g. backgroud apps).
> >
> > Since here we are at "rewriting" cpusets for v2, I think it's important
> > to keep this mobile world scenario into consideration.
> >
> > For example, in this context, we are looking at the possibility to
> > update/tune cpuset.cpus with a relatively high rate, i.e. tens of
> > times per second. Not sure that's the same update rate usually
> > required for the large NUMA systems you cite above. However, in this
> > case it's quite important to have really small overheads for these
> > operations.
>
> The cgroup interface isn't designed for high update throughput.
Indeed, I had the same impression...
> Changing cpuset.cpus will require searching for all the tasks in
> the cpuset and changing their cpu masks.
... I'm wondering if that has to be the case. In principle there can
be a different solution which is: update on demand. In the wakeup
path, once we know a task really needs a CPU and we want to find one
for it, at that point we can align the cpuset mask with the task's
one. Sort of using the cpuset mask as a clamp on top of the task's
affinity mask.
The main downside of such an approach could be the overheads in the
wakeup path... but, still... that should be measured.
The advantage is that we do not spend time changing attributes of
tasks which, potentially, could be sleeping for a long time.
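Just to make the idea more concrete, a rough and untested sketch of the kind
of clamping I have in mind (the helper name is made up; only cpumask_and(),
task_cs() and the effective_cpus mask are existing kernel bits):

    /*
     * Hypothetical helper, not part of this series: clamp the task's own
     * affinity with its cpuset's effective mask on demand, e.g. from the
     * wakeup path, instead of walking every task on each cpuset update.
     */
    static void cpuset_clamp_allowed_cpus(struct task_struct *p,
                                          struct cpumask *dst)
    {
            rcu_read_lock();
            cpumask_and(dst, &p->cpus_allowed, task_cs(p)->effective_cpus);
            rcu_read_unlock();
    }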
> That isn't a fast operation, but it shouldn't be too bad either
> depending on how many tasks are in the cpuset.
Indeed, although it still seems a bit odd and overkill to update the
task affinity of tasks which are not currently RUNNABLE. Doesn't it?
> I would not suggest doing rapid changes to cpuset.cpus as a means to tune
> the behavior of a task. So what exactly is the tuning you are thinking
> about? Is it moving a task from a high-power cpu to a low-power one
> or vice versa?
That's definitely a possible use case. In Android, for example, we
usually assign more resources to TOP_APP tasks (those belonging to the
application you are currently using) while we restrict the resources
once we switch an app to BACKGROUND.
More generally, think of a generic Run-Time Resource Management
framework which assigns resources to the tasks of multiple
applications and wants fine-grained control over them.
> If so, it is probably better to move the task from one cpuset of
> high-power cpus to another cpuset of low-power cpus.
This is what Android does now, but it is also what we would possibly
like to change, for two main reasons:
1. it does not fit with the "number one guideline" for proper
CGroups usage, which is "Organize Once and Control":
https://elixir.bootlin.com/linux/latest/source/Documentation/cgroup-v2.txt#L518
where it says that:
migrating processes across cgroups frequently as a means to
apply different resource restrictions is discouraged.
Despite this guideline, it turns out that, in v1 at least, it seems
to be faster to move tasks across cpusets than to tune cpuset
attributes... even when all the tasks are sleeping.
2. it does not let us take advantage of accounting controllers such
as the memory controller where, by moving tasks around, we cannot
properly account for and control the amount of memory a task can use.
Thus, for these reasons, and also to possibly migrate to the unified
hierarchy schema proposed by CGroups v2... we would like a
low-overhead mechanism for setting/tuning cpusets at run-time at
whatever frequency you like.
> >> +
> >> +The "cpuset" controller is hierarchical. That means the controller
> >> +cannot use CPUs or memory nodes not allowed in its parent.
> >> +
> >> +
> >> +Cpuset Interface Files
> >> +~~~~~~~~~~~~~~~~~~~~~~
> >> +
> >> + cpuset.cpus
> >> + A read-write multiple values file which exists on non-root
> >> + cpuset-enabled cgroups.
> >> +
> >> + It lists the CPUs allowed to be used by tasks within this
> >> + cgroup. The CPU numbers are comma-separated numbers or
> >> + ranges. For example:
> >> +
> >> + # cat cpuset.cpus
> >> + 0-4,6,8-10
> >> +
> >> + An empty value indicates that the cgroup is using the same
> >> + setting as the nearest cgroup ancestor with a non-empty
> >> + "cpuset.cpus" or all the available CPUs if none is found.
> > Does that mean that we can move tasks into a newly created group for
> > which we have not yet configured this value?
> > AFAIK, that's a different behavior wrt v1... and I like it better.
> >
>
> For v2, if you haven't set up the cpuset.cpus, it defaults to the
> effective cpu list of its parent.
+1
>
> >> +
> >> + The value of "cpuset.cpus" stays constant until the next update
> >> + and won't be affected by any CPU hotplug events.
> > This also sounds interesting, does it mean that we use the
> > cpuset.cpus mask to restrict online CPUs, whatever they are?
>
> cpuset.cpus holds the cpu list written by the users.
> cpuset.cpus.effective is the actual cpu mask that is being used. The
> effective cpu mask is always a subset of cpuset.cpus. They differ if not
> all the CPUs in cpuset.cpus are online.
And that's fine: the effective mask is updated based on HP events.
The main limitation on this side, so far, is that in
update_tasks_cpumask() we walk all the tasks and call
set_cpus_allowed_ptr() on them regardless of whether they are
RUNNABLE or not. Isn't that so?
Thus, this ensures we have a valid mask at wakeup time, but perhaps
it would not be such a big overhead to do that update on the wakeup
path instead... thus speeding up update_cpumasks_hier() quite a lot,
especially when you have many SLEEPING tasks in a cpuset.
A first measurement and tracing shows that this update could cost up
to 4ms on a Pixel2 device where you update the cpus for a cpuset
containing a single task always sleeping.
> > I'll have a better look at the code, but my understanding of v1 is
> > that we spent a lot of effort to keep task cpu-affinity masks aligned
> > with the cpuset in which they live, and we do something similar at each
> > HP event, which ultimately generates a lot of overheads in systems
> > where: you have many HP events and/or cpuset.cpus change quite
> > frequently.
> >
> > I hope to find some better behavior in this series.
> >
>
> The behavior of the CPU offline event should be similar in v2. Any HP event
> will cause the system to reset the cpu masks of the tasks affected by the
> event. The online event, however, will be a bit different between v1 and
> v2. For v1, the online event won't restore the CPU back to those cpusets
> that had the onlined CPU previously. For v2, the online CPU will
> be restored back to those cpusets. So there is less work for the
> management layer, but the overhead of doing the restore is still there
> in the kernel.
On that side, I still have to better look into the v1 and v2
implementations, but for the util_clamp extension of the cpu
controller:
https://lkml.org/lkml/2018/4/9/601
I'm proposing a different update schema which, it seems, can give you
the benefit of "restoring the mask" after an UP event as well as a
fast update/tuning path at run-time.
Along the line of the above implementation, it would mean that the
task affinity mask is constrained/clamped/masked by the TG's affinity
mask. This should be an operation performed "on-demand" whenever it
makes sense.
However, to be honest, I never measured the overhead of combining two
cpu masks and it could very well be overkill for the wakeup
path. I don't think the AND by itself should be an issue, since it's
already used in the fast wakeup path, e.g.
select_task_rq_fair()
select_idle_sibling()
select_idle_core()
cpumask_and(cpus, sched_domain_span(sd),
&p->cpus_allowed);
What could eventually be an issue is the race between the scheduler
looking at the cpuset cpumasks and cgroups changing them... but perhaps
that's something that could be fixed with a proper locking mechanism.
I will try to run some experiments to at least collect some overheads
numbers.
[...]
> >> @@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
> >> .post_attach = cpuset_post_attach,
> >> .bind = cpuset_bind,
> >> .fork = cpuset_fork,
> >> - .legacy_cftypes = files,
> >> + .legacy_cftypes = legacy_files,
> >> + .dfl_cftypes = dfl_files,
> >> .early_init = true,
> >> + .threaded = true,
> > Which means that by default we can attach tasks instead of only
> > processes, right?
>
> Yes, you can control task placement on the thread level, not just process.
+1
--
#include <best/regards.h>
Patrick Bellasi
On 05/21/2018 11:09 AM, Patrick Bellasi wrote:
> On 21-May 09:55, Waiman Long wrote:
>
>> Changing cpuset.cpus will require searching for all the tasks in
>> the cpuset and changing their cpu masks.
> ... I'm wondering if that has to be the case. In principle there can
> be a different solution which is: update on demand. In the wakeup
> path, once we know a task really needs a CPU and we want to find one
> for it, at that point we can align the cpuset mask with the task's
> one. Sort of using the cpuset mask as a clamp on top of the task's
> affinity mask.
>
> The main downside of such an approach could be the overheads in the
> wakeup path... but, still... that should be measured.
> The advantage is that we do not spend time changing attributes of
> tasks which, potentially, could be sleeping for a long time.
We already have a linked list of tasks in a cgroup, so it isn't too hard
to find them. Doing the update on demand will require adding a bunch of code
to the wakeup path. So unless there is a good reason to do it, I don't
see it as necessary at this point.
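For reference, the walk is roughly the following (simplified from what
update_tasks_cpumask() does in kernel/cgroup/cpuset.c; details may differ
slightly):

    static void update_tasks_cpumask(struct cpuset *cs)
    {
            struct css_task_iter it;
            struct task_struct *task;

            /* iterate over every task attached to this cpuset's css */
            css_task_iter_start(&cs->css, 0, &it);
            while ((task = css_task_iter_next(&it)))
                    set_cpus_allowed_ptr(task, cs->effective_cpus);
            css_task_iter_end(&it);
    }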
>
>> That isn't a fast operation, but it shouldn't be too bad either
>> depending on how many tasks are in the cpuset.
> Indeed, although it still seems a bit odd and overkill to update the
> task affinity of tasks which are not currently RUNNABLE. Doesn't it?
>
>> I would not suggest doing rapid changes to cpuset.cpus as a means to tune
>> the behavior of a task. So what exactly is the tuning you are thinking
>> about? Is it moving a task from a high-power cpu to a low-power one
>> or vice versa?
> That's definitely a possible use case. In Android, for example, we
> usually assign more resources to TOP_APP tasks (those belonging to the
> application you are currently using) while we restrict the resources
> once we switch an app to BACKGROUND.
Switching an app from foreground to background and vice versa shouldn't
happen that frequently. Maybe once every few seconds, at most. I am just
wondering what use cases will require changing cpuset attributes tens of
times per second.
> More generally, think of a generic Run-Time Resource Management
> framework which assigns resources to the tasks of multiple
> applications and wants fine-grained control over them.
>
>> If so, it is probably better to move the task from one cpuset of
>> high-power cpus to another cpuset of low-power cpus.
> This is what Android does now, but it is also what we would possibly
> like to change, for two main reasons:
>
> 1. it does not fit with the "number one guideline" for proper
> CGroups usage, which is "Organize Once and Control":
> https://elixir.bootlin.com/linux/latest/source/Documentation/cgroup-v2.txt#L518
> where it says that:
> migrating processes across cgroups frequently as a means to
> apply different resource restrictions is discouraged.
>
> Despite this guideline, it turns out that, in v1 at least, it seems
> to be faster to move tasks across cpusets than to tune cpuset
> attributes... even when all the tasks are sleeping.
It is probably similar in v2 as the core logic is almost the same.
> 2. it does not let us take advantage of accounting controllers such
> as the memory controller where, by moving tasks around, we cannot
> properly account for and control the amount of memory a task can use.
For v1, the memory controller and the cpuset controller can be in different
hierarchies. For v2, we have a unified hierarchy. However, we don't need
to enable all the controllers at every level of the hierarchy. For
example,

    A (memory, cpuset) -- B1 (cpuset)
                       \-- B2 (cpuset)

Cgroup A has the memory and cpuset controllers enabled. The child cgroups B1
and B2 only have cpuset enabled. You can move tasks between B1 and B2
and they will be subjected to the same memory limitation as imposed by
the memory controller in A. So there are ways to work around that.
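A minimal sketch of that layout on the unified hierarchy (the cgroup names are
made up; cgroup2 is assumed to be mounted at /sys/fs/cgroup and this series
applied so that cpuset is available there):

    # cd /sys/fs/cgroup
    # mkdir A
    # echo "+memory +cpuset" > cgroup.subtree_control    # A gets memory + cpuset
    # mkdir A/B1 A/B2
    # echo "+cpuset" > A/cgroup.subtree_control          # B1/B2 get cpuset only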
> Thus, for these reasons, and also to possibly migrate to the unified
> hierarchy schema proposed by CGroups v2... we would like a
> low-overhead mechanism for setting/tuning cpusets at run-time at
> whatever frequency you like.
We may be able to improve the performance of changing cpuset attributes
somewhat, but I don't believe there will be much improvement here.
>>>> +
>>>> +The "cpuset" controller is hierarchical. That means the controller
>>>> +cannot use CPUs or memory nodes not allowed in its parent.
>>>> +
>>>> +
>>>> +Cpuset Interface Files
>>>> +~~~~~~~~~~~~~~~~~~~~~~
>>>> +
>>>> + cpuset.cpus
>>>> + A read-write multiple values file which exists on non-root
>>>> + cpuset-enabled cgroups.
>>>> +
>>>> + It lists the CPUs allowed to be used by tasks within this
>>>> + cgroup. The CPU numbers are comma-separated numbers or
>>>> + ranges. For example:
>>>> +
>>>> + # cat cpuset.cpus
>>>> + 0-4,6,8-10
>>>> +
>>>> + An empty value indicates that the cgroup is using the same
>>>> + setting as the nearest cgroup ancestor with a non-empty
>>>> + "cpuset.cpus" or all the available CPUs if none is found.
>>> Does that mean that we can move tasks into a newly created group for
>>> which we have not yet configured this value?
>>> AFAIK, that's a different behavior wrt v1... and I like it better.
>>>
>> For v2, if you haven't set up the cpuset.cpus, it defaults to the
>> effective cpu list of its parent.
> +1
>
>>>> +
>>>> + The value of "cpuset.cpus" stays constant until the next update
>>>> + and won't be affected by any CPU hotplug events.
>>> This also sounds interesting, does it mean that we use the
>>> cpuset.cpus mask to restrict online CPUs, whatever they are?
>> cpuset.cpus holds the cpu list written by the users.
>> cpuset.cpus.effective is the actual cpu mask that is being used. The
>> effective cpu mask is always a subset of cpuset.cpus. They differ if not
>> all the CPUs in cpuset.cpus are online.
> And that's fine: the effective mask is updated based on HP events.
>
> The main limitation on this side, so far, is that in
> update_tasks_cpumask() we walk all the tasks and call
> set_cpus_allowed_ptr() on them regardless of whether they are
> RUNNABLE or not. Isn't that so?
That is true.
> Thus, this ensures we have a valid mask at wakeup time, but perhaps
> it would not be such a big overhead to do that update on the wakeup
> path instead... thus speeding up update_cpumasks_hier() quite a lot,
> especially when you have many SLEEPING tasks in a cpuset.
>
> A first measurement and tracing shows that this update could cost up
> to 4ms on a Pixel2 device where you update the cpus for a cpuset
> containing a single task always sleeping.
The 4ms cost is more than what I would have expected. If you think
delaying the update until wakeup time is the right move, you can create
a patch to do that and we can discuss the merit of doing so in LKML.
>
>>> I'll have a better look at the code, but my understanding of v1 is
>>> that we spent a lot of effort to keep task cpu-affinity masks aligned
>>> with the cpuset in which they live, and we do something similar at each
>>> HP event, which ultimately generates a lot of overheads in systems
>>> where: you have many HP events and/or cpuset.cpus change quite
>>> frequently.
>>>
>>> I hope to find some better behavior in this series.
>>>
>> The behavior of the CPU offline event should be similar in v2. Any HP event
>> will cause the system to reset the cpu masks of the tasks affected by the
>> event. The online event, however, will be a bit different between v1 and
>> v2. For v1, the online event won't restore the CPU back to those cpusets
>> that had the onlined CPU previously. For v2, the online CPU will
>> be restored back to those cpusets. So there is less work for the
>> management layer, but the overhead of doing the restore is still there
>> in the kernel.
> On that side, I still have to better look into the v1 and v2
> implementations, but for the util_clamp extension of the cpu
> controller:
> https://lkml.org/lkml/2018/4/9/601
> I'm proposing a different update schema which, it seems, can give you
> the benefit of "restoring the mask" after an UP event as well as a
> fast update/tuning path at run-time.
>
> Along the line of the above implementation, it would mean that the
> task affinity mask is constrained/clamped/masked by the TG's affinity
> mask. This should be an operation performed "on-demand" whenever it
> makes sense.
>
> However, to be honest, I never measured the overhead of combining two
> cpu masks and it could very well be overkill for the wakeup
> path. I don't think the AND by itself should be an issue, since it's
> already used in the fast wakeup path, e.g.
>
> select_task_rq_fair()
> select_idle_sibling()
> select_idle_core()
> cpumask_and(cpus, sched_domain_span(sd),
> &p->cpus_allowed);
>
> What could eventually be an issue is the race between the scheduler
> looking at the cpuset cpumasks and cgroups changing them... but perhaps
> that's something that could be fixed with a proper locking mechanism.
>
> I will try to run some experiments to at least collect some overheads
> numbers.
Collecting more information on where the slowdown is will be helpful.
-Longman
Hi,
On 17/05/18 16:55, Waiman Long wrote:
[...]
> /**
> + * update_isolated_cpumask - update the isolated_cpus mask of parent cpuset
> + * @cpuset: The cpuset that requests CPU isolation
> + * @oldmask: The old isolated cpumask to be removed from the parent
> + * @newmask: The new isolated cpumask to be added to the parent
> + * Return: 0 if successful, an error code otherwise
> + *
> + * Changes to the isolated CPUs are not allowed if any of CPUs changing
> + * state are in any of the child cpusets of the parent except the requesting
> + * child.
> + *
> + * If the sched_domain flag changes, either the oldmask (0=>1) or the
> + * newmask (1=>0) will be NULL.
> + *
> + * Called with cpuset_mutex held.
> + */
> +static int update_isolated_cpumask(struct cpuset *cpuset,
> + struct cpumask *oldmask, struct cpumask *newmask)
> +{
> + int retval;
> + int adding, deleting;
> + cpumask_var_t addmask, delmask;
> + struct cpuset *parent = parent_cs(cpuset);
> + struct cpuset *sibling;
> + struct cgroup_subsys_state *pos_css;
> + int old_count = parent->isolation_count;
> + bool dying = cpuset->css.flags & CSS_DYING;
> +
> + /*
> + * Parent must be a scheduling domain with non-empty cpus_allowed.
> + */
> + if (!is_sched_domain(parent) || cpumask_empty(parent->cpus_allowed))
> + return -EINVAL;
> +
> + /*
> + * The oldmask, if present, must be a subset of parent's isolated
> + * CPUs.
> + */
> + if (oldmask && !cpumask_empty(oldmask) && (!parent->isolation_count ||
> + !cpumask_subset(oldmask, parent->isolated_cpus))) {
> + WARN_ON_ONCE(1);
> + return -EINVAL;
> + }
> +
> + /*
> + * A sched_domain state change is not allowed if there are
> + * online children and the cpuset is not dying.
> + */
> + if (!dying && (!oldmask || !newmask) &&
> + css_has_online_children(&cpuset->css))
> + return -EBUSY;
> +
> + if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
> + return -ENOMEM;
> + if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
> + free_cpumask_var(addmask);
> + return -ENOMEM;
> + }
> +
> + if (!old_count) {
> + if (!zalloc_cpumask_var(&parent->isolated_cpus, GFP_KERNEL)) {
> + retval = -ENOMEM;
> + goto out;
> + }
> + old_count = 1;
> + }
> +
> + retval = -EBUSY;
> + adding = deleting = false;
> + if (newmask)
> + cpumask_copy(addmask, newmask);
> + if (oldmask)
> + deleting = cpumask_andnot(delmask, oldmask, addmask);
> + if (newmask)
> + adding = cpumask_andnot(addmask, newmask, delmask);
> +
> + if (!adding && !deleting)
> + goto out_ok;
> +
> + /*
> + * The cpus to be added must be in the parent's effective_cpus mask
> + * but not in the isolated_cpus mask.
> + */
> + if (!cpumask_subset(addmask, parent->effective_cpus))
> + goto out;
> + if (parent->isolation_count &&
> + cpumask_intersects(parent->isolated_cpus, addmask))
> + goto out;
> +
> + /*
> + * Check if any CPUs in addmask or delmask are in a sibling cpuset.
> + * An empty sibling cpus_allowed means it is the same as parent's
> + * effective_cpus. This checking is skipped if the cpuset is dying.
> + */
> + if (dying)
> + goto updated_isolated_cpus;
> +
> + cpuset_for_each_child(sibling, pos_css, parent) {
> + if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
> + continue;
> + if (cpumask_empty(sibling->cpus_allowed))
> + goto out;
> + if (adding &&
> + cpumask_intersects(sibling->cpus_allowed, addmask))
> + goto out;
> + if (deleting &&
> + cpumask_intersects(sibling->cpus_allowed, delmask))
> + goto out;
> + }
Just got the below by echoing 1 into cpuset.sched.domain of a sibling with
"isolated" cpuset.cpus. I guess you are missing proper locking around the
sibling walk above.
--->8---
[ 7509.905005] =============================
[ 7509.905009] WARNING: suspicious RCU usage
[ 7509.905014] 4.17.0-rc5+ #11 Not tainted
[ 7509.905017] -----------------------------
[ 7509.905023] /home/juri/work/kernel/linux/kernel/cgroup/cgroup.c:3826 cgroup_mutex or RCU read lock required!
[ 7509.905026]
other info that might help us debug this:
[ 7509.905031]
rcu_scheduler_active = 2, debug_locks = 1
[ 7509.905036] 4 locks held by bash/1480:
[ 7509.905039] #0: 00000000bf288709 (sb_writers#6){.+.+}, at: vfs_write+0x18a/0x1b0
[ 7509.905072] #1: 00000000ebf23fc9 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
[ 7509.905098] #2: 00000000de7c626e (kn->count#302){.+.+}, at: kernfs_fop_write+0xeb/0x1a0
[ 7509.905124] #3: 00000000a6a2bd9f (cpuset_mutex){+.+.}, at: cpuset_write_u64+0x23/0x140
[ 7509.905149]
stack backtrace:
[ 7509.905156] CPU: 6 PID: 1480 Comm: bash Not tainted 4.17.0-rc5+ #11
[ 7509.905160] Hardware name: LENOVO 30B6S2F900/1030, BIOS S01KT56A 01/15/2018
[ 7509.905164] Call Trace:
[ 7509.905176] dump_stack+0x85/0xcb
[ 7509.905187] css_next_child+0x90/0xd0
[ 7509.905195] update_isolated_cpumask+0x18f/0x2e0
[ 7509.905208] update_flag+0x1f3/0x210
[ 7509.905220] cpuset_write_u64+0xff/0x140
[ 7509.905230] cgroup_file_write+0x178/0x230
[ 7509.905244] kernfs_fop_write+0x113/0x1a0
[ 7509.905254] __vfs_write+0x36/0x180
[ 7509.905264] ? rcu_read_lock_sched_held+0x6b/0x80
[ 7509.905270] ? rcu_sync_lockdep_assert+0x2e/0x60
[ 7509.905278] ? __sb_start_write+0x13e/0x1a0
[ 7509.905283] ? vfs_write+0x18a/0x1b0
[ 7509.905293] vfs_write+0xc1/0x1b0
[ 7509.905302] ksys_write+0x55/0xc0
[ 7509.905317] do_syscall_64+0x60/0x200
[ 7509.905327] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 7509.905333] RIP: 0033:0x7fee4fdfe414
[ 7509.905338] RSP: 002b:00007fff364a80a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 7509.905346] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fee4fdfe414
[ 7509.905350] RDX: 0000000000000002 RSI: 000055eb12f93740 RDI: 0000000000000001
[ 7509.905354] RBP: 000055eb12f93740 R08: 000000000000000a R09: 00007fff364a7c30
[ 7509.905358] R10: 000000000000000a R11: 0000000000000246 R12: 00007fee500cd760
[ 7509.905361] R13: 0000000000000002 R14: 00007fee500c8760 R15: 0000000000000002
--->8---
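Presumably something along these lines would be needed (untested sketch,
since css_next_child() wants cgroup_mutex or the RCU read lock held for the
child walk):

    rcu_read_lock();
    cpuset_for_each_child(sibling, pos_css, parent) {
            if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
                    continue;
            /*
             * ... existing addmask/delmask checks, making sure the read
             * lock is dropped before any "goto out" ...
             */
    }
    rcu_read_unlock();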
Best,
- Juri
On 05/22/2018 08:57 AM, Juri Lelli wrote:
> Hi,
>
> On 17/05/18 16:55, Waiman Long wrote:
>
> [...]
>
>> /**
>> + * update_isolated_cpumask - update the isolated_cpus mask of parent cpuset
>> + * @cpuset: The cpuset that requests CPU isolation
>> + * @oldmask: The old isolated cpumask to be removed from the parent
>> + * @newmask: The new isolated cpumask to be added to the parent
>> + * Return: 0 if successful, an error code otherwise
>> + *
>> + * Changes to the isolated CPUs are not allowed if any of CPUs changing
>> + * state are in any of the child cpusets of the parent except the requesting
>> + * child.
>> + *
>> + * If the sched_domain flag changes, either the oldmask (0=>1) or the
>> + * newmask (1=>0) will be NULL.
>> + *
>> + * Called with cpuset_mutex held.
>> + */
>> +static int update_isolated_cpumask(struct cpuset *cpuset,
>> + struct cpumask *oldmask, struct cpumask *newmask)
>> +{
>> + int retval;
>> + int adding, deleting;
>> + cpumask_var_t addmask, delmask;
>> + struct cpuset *parent = parent_cs(cpuset);
>> + struct cpuset *sibling;
>> + struct cgroup_subsys_state *pos_css;
>> + int old_count = parent->isolation_count;
>> + bool dying = cpuset->css.flags & CSS_DYING;
>> +
>> + /*
>> + * Parent must be a scheduling domain with non-empty cpus_allowed.
>> + */
>> + if (!is_sched_domain(parent) || cpumask_empty(parent->cpus_allowed))
>> + return -EINVAL;
>> +
>> + /*
>> + * The oldmask, if present, must be a subset of parent's isolated
>> + * CPUs.
>> + */
>> + if (oldmask && !cpumask_empty(oldmask) && (!parent->isolation_count ||
>> + !cpumask_subset(oldmask, parent->isolated_cpus))) {
>> + WARN_ON_ONCE(1);
>> + return -EINVAL;
>> + }
>> +
>> + /*
>> + * A sched_domain state change is not allowed if there are
>> + * online children and the cpuset is not dying.
>> + */
>> + if (!dying && (!oldmask || !newmask) &&
>> + css_has_online_children(&cpuset->css))
>> + return -EBUSY;
>> +
>> + if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
>> + return -ENOMEM;
>> + if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
>> + free_cpumask_var(addmask);
>> + return -ENOMEM;
>> + }
>> +
>> + if (!old_count) {
>> + if (!zalloc_cpumask_var(&parent->isolated_cpus, GFP_KERNEL)) {
>> + retval = -ENOMEM;
>> + goto out;
>> + }
>> + old_count = 1;
>> + }
>> +
>> + retval = -EBUSY;
>> + adding = deleting = false;
>> + if (newmask)
>> + cpumask_copy(addmask, newmask);
>> + if (oldmask)
>> + deleting = cpumask_andnot(delmask, oldmask, addmask);
>> + if (newmask)
>> + adding = cpumask_andnot(addmask, newmask, delmask);
>> +
>> + if (!adding && !deleting)
>> + goto out_ok;
>> +
>> + /*
>> + * The cpus to be added must be in the parent's effective_cpus mask
>> + * but not in the isolated_cpus mask.
>> + */
>> + if (!cpumask_subset(addmask, parent->effective_cpus))
>> + goto out;
>> + if (parent->isolation_count &&
>> + cpumask_intersects(parent->isolated_cpus, addmask))
>> + goto out;
>> +
>> + /*
>> + * Check if any CPUs in addmask or delmask are in a sibling cpuset.
>> + * An empty sibling cpus_allowed means it is the same as parent's
>> + * effective_cpus. This checking is skipped if the cpuset is dying.
>> + */
>> + if (dying)
>> + goto updated_isolated_cpus;
>> +
>> + cpuset_for_each_child(sibling, pos_css, parent) {
>> + if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
>> + continue;
>> + if (cpumask_empty(sibling->cpus_allowed))
>> + goto out;
>> + if (adding &&
>> + cpumask_intersects(sibling->cpus_allowed, addmask))
>> + goto out;
>> + if (deleting &&
>> + cpumask_intersects(sibling->cpus_allowed, delmask))
>> + goto out;
>> + }
> Just got the below by echoing 1 into cpuset.sched.domain of a sibling with
> "isolated" cpuset.cpus. I guess you are missing proper locking around the
> sibling walk above.
>
> --->8---
> [ 7509.905005] =============================
> [ 7509.905009] WARNING: suspicious RCU usage
> [ 7509.905014] 4.17.0-rc5+ #11 Not tainted
> [ 7509.905017] -----------------------------
> [ 7509.905023] /home/juri/work/kernel/linux/kernel/cgroup/cgroup.c:3826 cgroup_mutex or RCU read lock required!
> [ 7509.905026]
> other info that might help us debug this:
>
> [ 7509.905031]
> rcu_scheduler_active = 2, debug_locks = 1
> [ 7509.905036] 4 locks held by bash/1480:
> [ 7509.905039] #0: 00000000bf288709 (sb_writers#6){.+.+}, at: vfs_write+0x18a/0x1b0
> [ 7509.905072] #1: 00000000ebf23fc9 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
> [ 7509.905098] #2: 00000000de7c626e (kn->count#302){.+.+}, at: kernfs_fop_write+0xeb/0x1a0
> [ 7509.905124] #3: 00000000a6a2bd9f (cpuset_mutex){+.+.}, at: cpuset_write_u64+0x23/0x140
> [ 7509.905149]
> stack backtrace:
> [ 7509.905156] CPU: 6 PID: 1480 Comm: bash Not tainted 4.17.0-rc5+ #11
> [ 7509.905160] Hardware name: LENOVO 30B6S2F900/1030, BIOS S01KT56A 01/15/2018
> [ 7509.905164] Call Trace:
> [ 7509.905176] dump_stack+0x85/0xcb
> [ 7509.905187] css_next_child+0x90/0xd0
> [ 7509.905195] update_isolated_cpumask+0x18f/0x2e0
> [ 7509.905208] update_flag+0x1f3/0x210
> [ 7509.905220] cpuset_write_u64+0xff/0x140
> [ 7509.905230] cgroup_file_write+0x178/0x230
> [ 7509.905244] kernfs_fop_write+0x113/0x1a0
> [ 7509.905254] __vfs_write+0x36/0x180
> [ 7509.905264] ? rcu_read_lock_sched_held+0x6b/0x80
> [ 7509.905270] ? rcu_sync_lockdep_assert+0x2e/0x60
> [ 7509.905278] ? __sb_start_write+0x13e/0x1a0
> [ 7509.905283] ? vfs_write+0x18a/0x1b0
> [ 7509.905293] vfs_write+0xc1/0x1b0
> [ 7509.905302] ksys_write+0x55/0xc0
> [ 7509.905317] do_syscall_64+0x60/0x200
> [ 7509.905327] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 7509.905333] RIP: 0033:0x7fee4fdfe414
> [ 7509.905338] RSP: 002b:00007fff364a80a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 7509.905346] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fee4fdfe414
> [ 7509.905350] RDX: 0000000000000002 RSI: 000055eb12f93740 RDI: 0000000000000001
> [ 7509.905354] RBP: 000055eb12f93740 R08: 000000000000000a R09: 00007fff364a7c30
> [ 7509.905358] R10: 000000000000000a R11: 0000000000000246 R12: 00007fee500cd760
> [ 7509.905361] R13: 0000000000000002 R14: 00007fee500c8760 R15: 0000000000000002
> --->8---
>
> Best,
>
> - Juri
>
Thanks for the testing. I will fix that in the next version.
-Longman
Hi,
On 17/05/18 16:55, Waiman Long wrote:
> This patch enables us to report sched domain generation information.
>
> If DYNAMIC_DEBUG is enabled, issuing the following command
>
> echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control
>
> and setting loglevel to 8 will allow the kernel to show what scheduling
> domain changes are being made.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/cgroup/cpuset.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index fb8aa82b..8f586e8 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -820,6 +820,12 @@ static int generate_sched_domains(cpumask_var_t **domains,
> }
> BUG_ON(nslot != ndoms);
>
> +#ifdef CONFIG_DEBUG_KERNEL
> + for (i = 0; i < ndoms; i++)
> + pr_debug("generate_sched_domains dom %d: %*pbl\n", i,
> + cpumask_pr_args(doms[i]));
> +#endif
> +
While I'm always in favor of adding debug output, in this case I'm not
sure it's adding much to what we already print when booting with
sched_debug kernel command-line param, e.g.
--->8---
Kernel command line: BOOT_IMAGE=/vmlinuz-4.17.0-rc5+ ... sched_debug
[...]
smp: Bringing up secondary CPUs ...
x86: Booting SMP configuration:
.... node #0, CPUs: #1 #2 #3 #4 #5
.... node #1, CPUs: #6 #7 #8 #9 #10 #11
smp: Brought up 2 nodes, 12 CPUs
smpboot: Max logical packages: 2
smpboot: Total of 12 processors activated (45636.50 BogoMIPS)
CPU0 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }
domain-1: span=0-11 level=NUMA
groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
CPU1 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }
domain-1: span=0-11 level=NUMA
groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
CPU2 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }
domain-1: span=0-11 level=NUMA
groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
CPU3 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }
domain-1: span=0-11 level=NUMA
groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
CPU4 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }
domain-1: span=0-11 level=NUMA
groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
CPU5 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }
domain-1: span=0-11 level=NUMA
groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
CPU6 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }
domain-1: span=0-11 level=NUMA
groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
CPU7 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }
domain-1: span=0-11 level=NUMA
groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
CPU8 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }
domain-1: span=0-11 level=NUMA
groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
CPU9 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }
domain-1: span=0-11 level=NUMA
groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
CPU10 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }
domain-1: span=0-11 level=NUMA
groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
CPU11 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }
domain-1: span=0-11 level=NUMA
groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
span: 0-11 (max cpu_capacity = 1024)
[...]
generate_sched_domains dom 0: 6-11 <-- this and the one below is what
generate_sched_domains dom 1: 0-5 you are adding
CPU0 attaching NULL sched-domain.
CPU1 attaching NULL sched-domain.
CPU2 attaching NULL sched-domain.
CPU3 attaching NULL sched-domain.
CPU4 attaching NULL sched-domain.
CPU5 attaching NULL sched-domain.
CPU6 attaching NULL sched-domain.
CPU7 attaching NULL sched-domain.
CPU8 attaching NULL sched-domain.
CPU9 attaching NULL sched-domain.
CPU10 attaching NULL sched-domain.
CPU11 attaching NULL sched-domain.
CPU6 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }
CPU7 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }
CPU8 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }
CPU9 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }
CPU10 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }
CPU11 attaching sched-domain(s):
domain-0: span=6-11 level=MC
groups: 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }
span: 6-11 (max cpu_capacity = 1024)
CPU0 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }
CPU1 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }
CPU2 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }, 1:{ span=1 }
CPU3 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }
CPU4 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }
CPU5 attaching sched-domain(s):
domain-0: span=0-5 level=MC
groups: 5:{ span=5 }, 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }
span: 0-5 (max cpu_capacity = 1024)
--->8---
Do you think there is still a benefit in printing out what
generate_sched_domains does?
Best,
- Juri
Hi Waiman,
On 17-May 16:55, Waiman Long wrote:
[...]
> @@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
> int ndoms = 0; /* number of sched domains in result */
> int nslot; /* next empty doms[] struct cpumask slot */
> struct cgroup_subsys_state *pos_css;
> + bool root_load_balance = is_sched_load_balance(&top_cpuset);
>
> doms = NULL;
> dattr = NULL;
> csa = NULL;
>
> /* Special case for the 99% of systems with one, full, sched domain */
> - if (is_sched_load_balance(&top_cpuset)) {
> + if (root_load_balance && !top_cpuset.isolation_count) {
Perhaps I'm missing something but, it seems to me that, when the two
conditions above are true, then we are going to destroy and rebuild
the exact same scheduling domains.
IOW, on 99% of systems where:
is_sched_load_balance(&top_cpuset)
top_cpuset.isolation_count = 0
since boot time and forever, then every time we update a value for
cpuset.cpus we keep rebuilding the same SDs.
It's not strictly related to this patch, the same already happens in
mainline based just on the first condition, but since you are extending
that optimization, perhaps you can tell me where I'm possibly wrong or
which cases I'm not considering.
I'm interested mainly because on Android systems those conditions
are always true and we see SDs rebuilds every time we write
something in cpuset.cpus, which ultimately accounts for almost all the
6-7[ms] time required for the write to return, depending on the CPU
frequency.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 05/23/2018 01:34 PM, Patrick Bellasi wrote:
> Hi Waiman,
>
> On 17-May 16:55, Waiman Long wrote:
>
> [...]
>
>> @@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
>> int ndoms = 0; /* number of sched domains in result */
>> int nslot; /* next empty doms[] struct cpumask slot */
>> struct cgroup_subsys_state *pos_css;
>> + bool root_load_balance = is_sched_load_balance(&top_cpuset);
>>
>> doms = NULL;
>> dattr = NULL;
>> csa = NULL;
>>
>> /* Special case for the 99% of systems with one, full, sched domain */
>> - if (is_sched_load_balance(&top_cpuset)) {
>> + if (root_load_balance && !top_cpuset.isolation_count) {
> Perhaps I'm missing something but, it seems to me that, when the two
> conditions above are true, then we are going to destroy and rebuild
> the exact same scheduling domains.
>
> IOW, on 99% of systems where:
>
> is_sched_load_balance(&top_cpuset)
> top_cpuset.isolation_count = 0
>
> since boot time and forever, then every time we update a value for
> cpuset.cpus we keep rebuilding the same SDs.
>
> It's not strictly related to this patch, the same already happens in
> mainline based just on the first condition, but since you are extending
> that optimization, perhaps you can tell me where I'm possibly wrong or
> which cases I'm not considering.
>
> I'm interested mainly because on Android systems those conditions
> are always true and we see SDs rebuilds every time we write
> something in cpuset.cpus, which ultimately accounts for almost all the
> 6-7[ms] time required for the write to return, depending on the CPU
> frequency.
>
> Cheers Patrick
>
Yes, that is true. I will look into how to further optimize this. Thanks
for the suggestion.
-Longman
On 23-May 16:18, Waiman Long wrote:
> On 05/23/2018 01:34 PM, Patrick Bellasi wrote:
> > Hi Waiman,
> >
> > On 17-May 16:55, Waiman Long wrote:
> >
> > [...]
> >
> >> @@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
> >> int ndoms = 0; /* number of sched domains in result */
> >> int nslot; /* next empty doms[] struct cpumask slot */
> >> struct cgroup_subsys_state *pos_css;
> >> + bool root_load_balance = is_sched_load_balance(&top_cpuset);
> >>
> >> doms = NULL;
> >> dattr = NULL;
> >> csa = NULL;
> >>
> >> /* Special case for the 99% of systems with one, full, sched domain */
> >> - if (is_sched_load_balance(&top_cpuset)) {
> >> + if (root_load_balance && !top_cpuset.isolation_count) {
> > Perhaps I'm missing something but, it seems to me that, when the two
> > conditions above are true, then we are going to destroy and rebuild
> > the exact same scheduling domains.
> >
> > IOW, on 99% of systems where:
> >
> > is_sched_load_balance(&top_cpuset)
> > top_cpuset.isolation_count = 0
> >
> > since boot time and forever, then every time we update a value for
> > cpuset.cpus we keep rebuilding the same SDs.
> >
> > It's not strictly related to this patch, the same already happens in
> > mainline based just on the first condition, but since you are extending
> > that optimization, perhaps you can tell me where I'm possibly wrong or
> > which cases I'm not considering.
> >
> > I'm interested mainly because on Android systems those conditions
> > are always true and we see SDs rebuilds every time we write
> > something in cpuset.cpus, which ultimately accounts for almost all the
> > 6-7[ms] time required for the write to return, depending on the CPU
> > frequency.
> >
> > Cheers Patrick
> >
> Yes, that is true. I will look into how to further optimize this. Thanks
> for the suggestion.
FWIW, the following is my take on top of your series.
With the following patch applied I see a reduction of the average
execution time of rebuild_sched_domains_locked() from 1.4[ms] to
40[us] while running 60 /tg1/cpuset.cpus switches in a loop on a
Juno R2 Arm board using the performance cpufreq governor.
---8<---
From 84bb8137ce79f74849d97e30871cf67d06d8d682 Mon Sep 17 00:00:00 2001
From: Patrick Bellasi <[email protected]>
Date: Wed, 23 May 2018 16:33:06 +0100
Subject: [PATCH 1/1] cgroup/cpuset: disable sched domain rebuild when not
required
The generate_sched_domains() already addresses the "special case for 99%
of systems" which require a single full sched domain at the root,
spanning all the CPUs. However, the current support is based on an
expensive sequence of operations which destroy and recreate the exact
same scheduling domain configuration.
If we notice that:
1) CPUs in "cpuset.isolcpus" are excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced
regardless of the value of "cpuset.sched_load_balance" in any
cpuset.
2) the root cpuset has load_balance enabled by default at boot and
it's the only parameter which userspace can change at run-time.
we know that, by default, every system comes up with a complete and
properly configured set of scheduling domains covering all the CPUs.
Thus, on every system, unless the user explicitly disables load balance
for the top_cpuset, the scheduling domains already configured at boot
time by the scheduler/topology code and updated in consequence of
hotplug events, are already properly configured for cpuset too.
This configuration is the default one for 99% of the systems,
and it's also the one used by most of the Android devices which never
disable load balance from the top_cpuset.
Thus, while load balance is enabled for the top_cpuset,
destroying/rebuilding the scheduling domains at every cpuset.cpus
reconfiguration is a useless operation which will always produce the
same result.
Let's anticipate the "special" optimization within:
rebuild_sched_domains_locked()
thus completely skipping the expensive:
generate_sched_domains()
partition_sched_domains()
for all the cases we know that the scheduling domains already defined
will not be affected by whatsoever value of cpuset.cpus.
The proposed solution is the minimal variation to optimize the case for
systems with load balance enabled at the root level and without isolated
CPUs. As soon as one of these conditions is no longer valid, we fall back
to the original behavior.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Tejun Heo <[email protected]>,
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
kernel/cgroup/cpuset.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8f586e8bdc98..cff14be94678 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -874,6 +874,11 @@ static void rebuild_sched_domains_locked(void)
!cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
+ /* Special case for the 99% of systems with one, full, sched domain */
+ if (!top_cpuset.isolation_count &&
+ is_sched_load_balance(&top_cpuset))
+ goto out;
+
/* Generate domain masks and attrs */
ndoms = generate_sched_domains(&doms, &attr);
--
2.15.1
---8<---
--
#include <best/regards.h>
Patrick Bellasi
On 17/05/18 16:55, Waiman Long wrote:
[...]
> @@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
> * passing doms with offlined cpu to partition_sched_domains().
> * Anyways, hotplug work item will rebuild sched domains.
> */
> - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> + if (!top_cpuset.isolation_count &&
> + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> + goto out;
> +
> + if (top_cpuset.isolation_count &&
> + !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
> goto out;
Do we cover the case in which hotplug removed one of the isolated cpus
from cpu_active_mask?
On 24/05/18 10:04, Patrick Bellasi wrote:
[...]
> From 84bb8137ce79f74849d97e30871cf67d06d8d682 Mon Sep 17 00:00:00 2001
> From: Patrick Bellasi <[email protected]>
> Date: Wed, 23 May 2018 16:33:06 +0100
> Subject: [PATCH 1/1] cgroup/cpuset: disable sched domain rebuild when not
> required
>
> The generate_sched_domains() already addresses the "special case for 99%
> of systems" which require a single full sched domain at the root,
> spanning all the CPUs. However, the current support is based on an
> expensive sequence of operations which destroy and recreate the exact
> same scheduling domain configuration.
>
> If we notice that:
>
> 1) CPUs in "cpuset.isolcpus" are excluded from load balancing by the
> isolcpus= kernel boot option, and will never be load balanced
> regardless of the value of "cpuset.sched_load_balance" in any
> cpuset.
>
> 2) the root cpuset has load_balance enabled by default at boot and
> it's the only parameter which userspace can change at run-time.
>
> we know that, by default, every system comes up with a complete and
> properly configured set of scheduling domains covering all the CPUs.
>
> Thus, on every system, unless the user explicitly disables load balance
> for the top_cpuset, the scheduling domains already configured at boot
> time by the scheduler/topology code and updated in consequence of
> hotplug events, are already properly configured for cpuset too.
>
> This configuration is the default one for 99% of the systems,
> and it's also the one used by most of the Android devices which never
> disable load balance from the top_cpuset.
>
> Thus, while load balance is enabled for the top_cpuset,
> destroying/rebuilding the scheduling domains at every cpuset.cpus
> reconfiguration is a useless operation which will always produce the
> same result.
>
> Let's anticipate the "special" optimization within:
>
> rebuild_sched_domains_locked()
>
> thus completely skipping the expensive:
>
> generate_sched_domains()
> partition_sched_domains()
>
> for all the cases we know that the scheduling domains already defined
> will not be affected by whatsoever value of cpuset.cpus.
[...]
> + /* Special case for the 99% of systems with one, full, sched domain */
> + if (!top_cpuset.isolation_count &&
> + is_sched_load_balance(&top_cpuset))
> + goto out;
> +
Mmm, looks like we still need to destroy and recreate if there is a
new_topology (see arch_update_cpu_topology() in
partition_sched_domains()).
Maybe we could move the check you are proposing into
update_cpumasks_hier()?
On 17/05/18 16:55, Waiman Long wrote:
[...]
> + A parent cgroup cannot distribute all its CPUs to child
> + scheduling domain cgroups unless its load balancing flag is
> + turned off.
> +
> + cpuset.sched.load_balance
> + A read-write single value file which exists on non-root
> + cpuset-enabled cgroups. It is a binary value flag that accepts
> + either "0" (off) or a non-zero value (on). This flag is set
> + by the parent and is not delegatable.
> +
> + When it is on, tasks within this cpuset will be load-balanced
> + by the kernel scheduler. Tasks will be moved from CPUs with
> + high load to other CPUs within the same cpuset with less load
> + periodically.
> +
> + When it is off, there will be no load balancing among CPUs on
> + this cgroup. Tasks will stay in the CPUs they are running on
> + and will not be moved to other CPUs.
> +
> + The initial value of this flag is "1". This flag is then
> + inherited by child cgroups with cpuset enabled. Its state
> + can only be changed on a scheduling domain cgroup with no
> + cpuset-enabled children.
[...]
> + /*
> + * On default hierachy, a load balance flag change is only allowed
> + * in a scheduling domain with no child cpuset.
> + */
> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
> + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
> + err = -EINVAL;
> + goto out;
> + }
The rule is actually
- no child cpuset
- and it must be a scheduling domain
Right?
On 05/24/2018 10:36 AM, Juri Lelli wrote:
> On 17/05/18 16:55, Waiman Long wrote:
>
> [...]
>
>> + A parent cgroup cannot distribute all its CPUs to child
>> + scheduling domain cgroups unless its load balancing flag is
>> + turned off.
>> +
>> + cpuset.sched.load_balance
>> + A read-write single value file which exists on non-root
>> + cpuset-enabled cgroups. It is a binary value flag that accepts
>> + either "0" (off) or a non-zero value (on). This flag is set
>> + by the parent and is not delegatable.
>> +
>> + When it is on, tasks within this cpuset will be load-balanced
>> + by the kernel scheduler. Tasks will be moved from CPUs with
>> + high load to other CPUs within the same cpuset with less load
>> + periodically.
>> +
>> + When it is off, there will be no load balancing among CPUs on
>> + this cgroup. Tasks will stay in the CPUs they are running on
>> + and will not be moved to other CPUs.
>> +
>> + The initial value of this flag is "1". This flag is then
>> + inherited by child cgroups with cpuset enabled. Its state
>> + can only be changed on a scheduling domain cgroup with no
>> + cpuset-enabled children.
> [...]
>
>> + /*
>> + * On default hierachy, a load balance flag change is only allowed
>> + * in a scheduling domain with no child cpuset.
>> + */
>> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
>> + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
>> + err = -EINVAL;
>> + goto out;
>> + }
> The rule is actually
>
> - no child cpuset
> - and it must be a scheduling domain
>
> Right?
Yes, because it doesn't make sense to have a cpu in one cpuset that has
load balancing off while, at the same time, having it in another cpuset with
load balancing turned on. This restriction is there to make sure that the
above condition will not happen. I may be wrong if there is a realistic
use case where the above condition is desired.
-Longman
On 24/05/18 11:09, Waiman Long wrote:
> On 05/24/2018 10:36 AM, Juri Lelli wrote:
> > On 17/05/18 16:55, Waiman Long wrote:
> >
> > [...]
> >
> >> + A parent cgroup cannot distribute all its CPUs to child
> >> + scheduling domain cgroups unless its load balancing flag is
> >> + turned off.
> >> +
> >> + cpuset.sched.load_balance
> >> + A read-write single value file which exists on non-root
> >> + cpuset-enabled cgroups. It is a binary value flag that accepts
> >> + either "0" (off) or a non-zero value (on). This flag is set
> >> + by the parent and is not delegatable.
> >> +
> >> + When it is on, tasks within this cpuset will be load-balanced
> >> + by the kernel scheduler. Tasks will be moved from CPUs with
> >> + high load to other CPUs within the same cpuset with less load
> >> + periodically.
> >> +
> >> + When it is off, there will be no load balancing among CPUs on
> >> + this cgroup. Tasks will stay in the CPUs they are running on
> >> + and will not be moved to other CPUs.
> >> +
> >> + The initial value of this flag is "1". This flag is then
> >> + inherited by child cgroups with cpuset enabled. Its state
> >> + can only be changed on a scheduling domain cgroup with no
> >> + cpuset-enabled children.
> > [...]
> >
> >> + /*
> >> + * On default hierachy, a load balance flag change is only allowed
> >> + * in a scheduling domain with no child cpuset.
> >> + */
> >> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
> >> + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
> >> + err = -EINVAL;
> >> + goto out;
> >> + }
> > The rule is actually
> >
> > - no child cpuset
> > - and it must be a scheduling domain
> >
> > Right?
>
> Yes, because it doesn't make sense to have a cpu in one cpuset that has
> load balancing off while, at the same time, having it in another cpuset with
> load balancing turned on. This restriction is there to make sure that the
> above condition will not happen. I may be wrong if there is a realistic
> use case where the above condition is desired.
Yep, makes sense to me.
Maybe add the second condition to the comment and documentation.
On 05/24/2018 11:16 AM, Juri Lelli wrote:
> On 24/05/18 11:09, Waiman Long wrote:
>> On 05/24/2018 10:36 AM, Juri Lelli wrote:
>>> On 17/05/18 16:55, Waiman Long wrote:
>>>
>>> [...]
>>>
>>>> + A parent cgroup cannot distribute all its CPUs to child
>>>> + scheduling domain cgroups unless its load balancing flag is
>>>> + turned off.
>>>> +
>>>> + cpuset.sched.load_balance
>>>> + A read-write single value file which exists on non-root
>>>> + cpuset-enabled cgroups. It is a binary value flag that accepts
>>>> + either "0" (off) or a non-zero value (on). This flag is set
>>>> + by the parent and is not delegatable.
>>>> +
>>>> + When it is on, tasks within this cpuset will be load-balanced
>>>> + by the kernel scheduler. Tasks will be moved from CPUs with
>>>> + high load to other CPUs within the same cpuset with less load
>>>> + periodically.
>>>> +
>>>> + When it is off, there will be no load balancing among CPUs on
>>>> + this cgroup. Tasks will stay in the CPUs they are running on
>>>> + and will not be moved to other CPUs.
>>>> +
>>>> + The initial value of this flag is "1". This flag is then
>>>> + inherited by child cgroups with cpuset enabled. Its state
>>>> + can only be changed on a scheduling domain cgroup with no
>>>> + cpuset-enabled children.
>>> [...]
>>>
>>>> + /*
>>>> + * On default hierachy, a load balance flag change is only allowed
>>>> + * in a scheduling domain with no child cpuset.
>>>> + */
>>>> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
>>>> + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
>>>> + err = -EINVAL;
>>>> + goto out;
>>>> + }
>>> The rule is actually
>>>
>>> - no child cpuset
>>> - and it must be a scheduling domain
>>>
>>> Right?
>> Yes, because it doesn't make sense to have a cpu in one cpuset that has
>> load balancing off while, at the same time, having it in another cpuset with
>> load balancing turned on. This restriction is there to make sure that the
>> above condition will not happen. I may be wrong if there is a realistic
>> use case where the above condition is desired.
> Yep, makes sense to me.
>
> Maybe add the second condition to the comment and documentation.
Sure. Will do.
-Longman
On Thu, May 17, 2018 at 04:55:41PM -0400, Waiman Long wrote:
> A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
> flag indicates that the CPUs in the current cpuset should be treated
> as a separate scheduling domain.
The traditional name for this is a partition.
> This new flag is owned by the parent
> and will cause the CPUs in the cpuset to be removed from the effective
> CPUs of its parent.
This is a significant departure from existing behaviour, but one I can
appreciate. I don't immediately see something terribly wrong with it.
> This is implemented internally by adding a new isolated_cpus mask that
> holds the CPUs belonging to child scheduling domain cpusets so that:
>
> isolated_cpus | effective_cpus = cpus_allowed
> isolated_cpus & effective_cpus = 0
>
> This new flag can only be turned on in a cpuset if its parent is either
> root or a scheduling domain itself with non-empty cpu list. The state
> of this flag cannot be changed if the cpuset has children.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> Documentation/cgroup-v2.txt | 22 ++++
> kernel/cgroup/cpuset.c | 237 +++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 256 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index cf7bac6..54d9e22 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
> it is a subset of "cpuset.mems". Its value will be affected
> by memory nodes hotplug events.
>
> + cpuset.sched.domain
> + A read-write single value file which exists on non-root
> + cpuset-enabled cgroups. It is a binary value flag that accepts
> + either "0" (off) or a non-zero value (on).
I would be conservative and only allow 0/1.
> This flag is set
> + by the parent and is not delegatable.
> +
> + If set, it indicates that the CPUs in the current cgroup will
> + be the root of a scheduling domain. The root cgroup is always
> + a scheduling domain. There are constraints on where this flag
> + can be set. It can only be set in a cgroup if all the following
> + conditions are true.
> +
> + 1) The parent cgroup is also a scheduling domain with a non-empty
> + cpu list.
Ah, so initially I was confused by the requirement for root to have it
always set, but you'll allow child domains to steal _all_ CPUs, such
that root ends up with an empty effective set?
What about the (kernel) threads that cannot be moved out of the root
group?
> + 2) The list of CPUs are exclusive, i.e. they are not shared by
> + any of its siblings.
Right.
> + 3) There is no child cgroups with cpuset enabled.
> +
> + Setting this flag will take the CPUs away from the effective
> + CPUs of the parent cgroup. Once it is set, this flag cannot be
> + cleared if there are any child cgroups with cpuset enabled.
This I'm not clear on. Why?
On Thu, May 17, 2018 at 04:55:42PM -0400, Waiman Long wrote:
> The sched.load_balance flag is needed to enable CPU isolation similar to
> what can be done with the "isolcpus" kernel boot parameter. Its value
> can only be changed in a scheduling domain with no child cpusets. On
> a non-scheduling domain cpuset, the value of sched.load_balance is
> inherited from its parent.
>
> This flag is set by the parent and is not delegatable.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> Documentation/cgroup-v2.txt | 24 ++++++++++++++++++++
> kernel/cgroup/cpuset.c | 53 +++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 73 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 54d9e22..071b634d 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1536,6 +1536,30 @@ Cpuset Interface Files
> CPUs of the parent cgroup. Once it is set, this flag cannot be
> cleared if there are any child cgroups with cpuset enabled.
>
> + A parent cgroup cannot distribute all its CPUs to child
> + scheduling domain cgroups unless its load balancing flag is
> + turned off.
> +
> + cpuset.sched.load_balance
> + A read-write single value file which exists on non-root
> + cpuset-enabled cgroups. It is a binary value flag that accepts
> + either "0" (off) or a non-zero value (on). This flag is set
> + by the parent and is not delegatable.
> +
> + When it is on, tasks within this cpuset will be load-balanced
> + by the kernel scheduler. Tasks will be moved from CPUs with
> + high load to other CPUs within the same cpuset with less load
> + periodically.
> +
> + When it is off, there will be no load balancing among CPUs on
> + this cgroup. Tasks will stay in the CPUs they are running on
> + and will not be moved to other CPUs.
> +
> + The initial value of this flag is "1". This flag is then
> + inherited by child cgroups with cpuset enabled. Its state
> + can only be changed on a scheduling domain cgroup with no
> + cpuset-enabled children.
I'm confused... why exactly do we have both domain and load_balance ?
On 05/24/2018 11:43 AM, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 04:55:42PM -0400, Waiman Long wrote:
>> The sched.load_balance flag is needed to enable CPU isolation similar to
>> what can be done with the "isolcpus" kernel boot parameter. Its value
>> can only be changed in a scheduling domain with no child cpusets. On
>> a non-scheduling domain cpuset, the value of sched.load_balance is
>> inherited from its parent.
>>
>> This flag is set by the parent and is not delegatable.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> Documentation/cgroup-v2.txt | 24 ++++++++++++++++++++
>> kernel/cgroup/cpuset.c | 53 +++++++++++++++++++++++++++++++++++++++++----
>> 2 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index 54d9e22..071b634d 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1536,6 +1536,30 @@ Cpuset Interface Files
>> CPUs of the parent cgroup. Once it is set, this flag cannot be
>> cleared if there are any child cgroups with cpuset enabled.
>>
>> + A parent cgroup cannot distribute all its CPUs to child
>> + scheduling domain cgroups unless its load balancing flag is
>> + turned off.
>> +
>> + cpuset.sched.load_balance
>> + A read-write single value file which exists on non-root
>> + cpuset-enabled cgroups. It is a binary value flag that accepts
>> + either "0" (off) or a non-zero value (on). This flag is set
>> + by the parent and is not delegatable.
>> +
>> + When it is on, tasks within this cpuset will be load-balanced
>> + by the kernel scheduler. Tasks will be moved from CPUs with
>> + high load to other CPUs within the same cpuset with less load
>> + periodically.
>> +
>> + When it is off, there will be no load balancing among CPUs on
>> + this cgroup. Tasks will stay in the CPUs they are running on
>> + and will not be moved to other CPUs.
>> +
>> + The initial value of this flag is "1". This flag is then
>> + inherited by child cgroups with cpuset enabled. Its state
>> + can only be changed on a scheduling domain cgroup with no
>> + cpuset-enabled children.
> I'm confused... why exactly do we have both domain and load_balance ?
The domain is for partitioning the CPUs only. It doesn't change the load
balancing state. So the load_balance flag is still needed to turn load
balancing on and off.
Cheers,
Longman
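[ For illustration, a minimal userspace sketch (not part of the patchset) of
  how the two knobs above are meant to be combined: carve CPUs out of the
  parent into a child scheduling domain, then turn load balancing off there
  to get isolcpus-like isolation. The "rt" group name, the CPU range, the
  /sys/fs/cgroup mount point, and the assumption that the cpuset controller
  is already enabled in the parent's cgroup.subtree_control are all made up
  for the example. ]

---8<---
#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/rt";	/* hypothetical child cgroup */
	char path[256];

	/* Give the child an exclusive CPU list. */
	snprintf(path, sizeof(path), "%s/cpuset.cpus", cg);
	write_file(path, "2-3");

	/* Make the child a separate scheduling domain (a partition). */
	snprintf(path, sizeof(path), "%s/cpuset.sched.domain", cg);
	write_file(path, "1");

	/*
	 * Disable load balancing inside that domain: CPUs 2-3 then behave
	 * like isolcpus= CPUs, without touching the root cgroup's flag.
	 */
	snprintf(path, sizeof(path), "%s/cpuset.sched.load_balance", cg);
	write_file(path, "0");

	return 0;
}
---8<---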
On 05/24/2018 11:41 AM, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 04:55:41PM -0400, Waiman Long wrote:
>> A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
>> flag indicates that the CPUs in the current cpuset should be treated
>> as a separate scheduling domain.
> The traditional name for this is a partition.
Do you want to call it cpuset.sched.partition? That name sounds strange
to me.
>> This new flag is owned by the parent
>> and will cause the CPUs in the cpuset to be removed from the effective
>> CPUs of its parent.
> This is a significant departure from existing behaviour, but one I can
> appreciate. I don't immediately see something terribly wrong with it.
>
>> This is implemented internally by adding a new isolated_cpus mask that
>> holds the CPUs belonging to child scheduling domain cpusets so that:
>>
>> isolated_cpus | effective_cpus = cpus_allowed
>> isolated_cpus & effective_cpus = 0
>>
>> This new flag can only be turned on in a cpuset if its parent is either
>> root or a scheduling domain itself with non-empty cpu list. The state
>> of this flag cannot be changed if the cpuset has children.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> Documentation/cgroup-v2.txt | 22 ++++
>> kernel/cgroup/cpuset.c | 237 +++++++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 256 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index cf7bac6..54d9e22 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>> it is a subset of "cpuset.mems". Its value will be affected
>> by memory nodes hotplug events.
>>
>> + cpuset.sched.domain
>> + A read-write single value file which exists on non-root
>> + cpuset-enabled cgroups. It is a binary value flag that accepts
>> + either "0" (off) or a non-zero value (on).
> I would be conservative and only allow 0/1.
I stated that because echoing another integer value like 2 into the flag
file won't return an error. I will modify it to say just 0 and 1.
>> This flag is set
>> + by the parent and is not delegatable.
>> +
>> + If set, it indicates that the CPUs in the current cgroup will
>> + be the root of a scheduling domain. The root cgroup is always
>> + a scheduling domain. There are constraints on where this flag
>> + can be set. It can only be set in a cgroup if all the following
>> + conditions are true.
>> +
>> + 1) The parent cgroup is also a scheduling domain with a non-empty
>> + cpu list.
> Ah, so initially I was confused by the requirement for root to have it
> always set, but you'll allow child domains to steal _all_ CPUs, such
> that root ends up with an empty effective set?
>
> What about the (kernel) threads that cannot be moved out of the root
> group?
Actually, the current code won't allow you to take all the CPUs from a
scheduling domain cpuset with load balancing on. So there must be at
least 1 CPU left. You can take them all away if load balancing is off.
>> + 2) The list of CPUs are exclusive, i.e. they are not shared by
>> + any of its siblings.
> Right.
>
>> + 3) There is no child cgroups with cpuset enabled.
>> +
>> + Setting this flag will take the CPUs away from the effective
>> + CPUs of the parent cgroup. Once it is set, this flag cannot be
>> + cleared if there are any child cgroups with cpuset enabled.
> This I'm not clear on. Why?
>
That is for pragmatic reasons, as it is easier to code this way. We could
remove this restriction, but that would make the code more complex.
Cheers,
Longman
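[ The mask arithmetic quoted from the changelog above can be checked with a
  trivial userspace sketch; plain unsigned long bitmasks stand in for the
  kernel's cpumask API and the CPU numbers are made up. ]

---8<---
#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned long cpus_allowed   = 0xff;	/* parent owns CPUs 0-7     */
	unsigned long isolated_cpus  = 0x0c;	/* CPUs 2-3 went to a child */
	unsigned long effective_cpus = cpus_allowed & ~isolated_cpus;

	/* isolated_cpus | effective_cpus = cpus_allowed */
	assert((isolated_cpus | effective_cpus) == cpus_allowed);
	/* isolated_cpus & effective_cpus = 0 */
	assert((isolated_cpus & effective_cpus) == 0);

	printf("effective_cpus = 0x%lx\n", effective_cpus);	/* prints 0xf3 */
	return 0;
}
---8<---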
On Thu, May 24, 2018 at 02:53:31PM -0400, Waiman Long wrote:
> On 05/24/2018 11:41 AM, Peter Zijlstra wrote:
> > On Thu, May 17, 2018 at 04:55:41PM -0400, Waiman Long wrote:
> >> A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
> >> flag indicates that the CPUs in the current cpuset should be treated
> >> as a separate scheduling domain.
> > The traditional name for this is a partition.
>
> Do you want to call it cpuset.sched.partition? That name sounds strange
> to me.
Let me explore the whole domain x load-balance space first. I'm thinking
the two parameters are mostly redundant, but I might be overlooking
something (trivial or otherwise).
> >> + cpuset.sched.domain
> >> + A read-write single value file which exists on non-root
> >> + cpuset-enabled cgroups. It is a binary value flag that accepts
> >> + either "0" (off) or a non-zero value (on).
> > I would be conservative and only allow 0/1.
>
> I stated that because echoing other integer value like 2 into the flag
> file won't return any error. I will modify it to say just 0 and 1.
Ah, I would make the file error on >1.
Because then you can always extend the meaning afterwards, since you
know the file cannot already have been written with the new value.
> >> + 3) There is no child cgroups with cpuset enabled.
> >> +
> >> + Setting this flag will take the CPUs away from the effective
> >> + CPUs of the parent cgroup. Once it is set, this flag cannot be
> >> + cleared if there are any child cgroups with cpuset enabled.
> > This I'm not clear on. Why?
> >
> That is for pragmatic reason as it is easier to code this way. We could
> remove this restriction but that will make the code more complex.
Best to mention that I think.
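[ A sketch of the strict parsing being suggested, written as plain userspace
  C rather than the actual kernel write handler: accept exactly "0" or "1"
  and return -EINVAL for anything else, so larger values remain available
  for future extensions. The function name is made up for the example. ]

---8<---
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static int parse_domain_flag(const char *buf, int *val)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno || end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;
	if (v != 0 && v != 1)	/* reject anything but 0 and 1 */
		return -EINVAL;
	*val = (int)v;
	return 0;
}

int main(void)
{
	int val;

	printf("\"1\"  -> %d\n", parse_domain_flag("1\n", &val));	/* 0       */
	printf("\"2\"  -> %d\n", parse_domain_flag("2\n", &val));	/* -EINVAL */
	printf("\"-1\" -> %d\n", parse_domain_flag("-1", &val));	/* -EINVAL */
	return 0;
}
---8<---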
On 24-May 11:22, Waiman Long wrote:
> On 05/24/2018 11:16 AM, Juri Lelli wrote:
> > On 24/05/18 11:09, Waiman Long wrote:
> >> On 05/24/2018 10:36 AM, Juri Lelli wrote:
> >>> On 17/05/18 16:55, Waiman Long wrote:
> >>>
> >>> [...]
> >>>
> >>>> + A parent cgroup cannot distribute all its CPUs to child
> >>>> + scheduling domain cgroups unless its load balancing flag is
> >>>> + turned off.
> >>>> +
> >>>> + cpuset.sched.load_balance
> >>>> + A read-write single value file which exists on non-root
> >>>> + cpuset-enabled cgroups. It is a binary value flag that accepts
> >>>> + either "0" (off) or a non-zero value (on). This flag is set
> >>>> + by the parent and is not delegatable.
> >>>> +
> >>>> + When it is on, tasks within this cpuset will be load-balanced
> >>>> + by the kernel scheduler. Tasks will be moved from CPUs with
> >>>> + high load to other CPUs within the same cpuset with less load
> >>>> + periodically.
> >>>> +
> >>>> + When it is off, there will be no load balancing among CPUs on
> >>>> + this cgroup. Tasks will stay in the CPUs they are running on
> >>>> + and will not be moved to other CPUs.
> >>>> +
> >>>> + The initial value of this flag is "1". This flag is then
> >>>> + inherited by child cgroups with cpuset enabled. Its state
> >>>> + can only be changed on a scheduling domain cgroup with no
> >>>> + cpuset-enabled children.
> >>> [...]
> >>>
> >>>> + /*
> >>>> + * On default hierachy, a load balance flag change is only allowed
> >>>> + * in a scheduling domain with no child cpuset.
> >>>> + */
> >>>> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
> >>>> + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
> >>>> + err = -EINVAL;
> >>>> + goto out;
> >>>> + }
> >>> The rule is actually
> >>>
> >>> - no child cpuset
> >>> - and it must be a scheduling domain
I'm always a bit confused by the usage of "scheduling domain", which
overlaps with the SD concept from the scheduler's standpoint.
AFAIU a cpuset sched domain is not guaranteed to be turned into an
actual scheduler SD, am I wrong?
If that's the case, why not disambiguate these two concepts by
calling the cpuset one a "cpus partition" or eventually a "cpuset domain"?
--
#include <best/regards.h>
Patrick Bellasi
Hi Juri,
the following are some notes I took while trying to understand what's going on...
they could be useful to check whether I have a correct view of all the different
components and how they come together.
At the end there are also a couple of possible updates and a question on your
proposal.
Cheers, Patrick
On 24-May 12:39, Juri Lelli wrote:
> On 24/05/18 10:04, Patrick Bellasi wrote:
>
> [...]
>
> > From 84bb8137ce79f74849d97e30871cf67d06d8d682 Mon Sep 17 00:00:00 2001
> > From: Patrick Bellasi <[email protected]>
> > Date: Wed, 23 May 2018 16:33:06 +0100
> > Subject: [PATCH 1/1] cgroup/cpuset: disable sched domain rebuild when not
> > required
> >
> > The generate_sched_domains() already addresses the "special case for 99%
> > of systems" which require a single full sched domain at the root,
> > spanning all the CPUs. However, the current support is based on an
> > expensive sequence of operations which destroy and recreate the exact
> > same scheduling domain configuration.
> >
> > If we notice that:
> >
> > 1) CPUs in "cpuset.isolcpus" are excluded from load balancing by the
> > isolcpus= kernel boot option, and will never be load balanced
> > regardless of the value of "cpuset.sched_load_balance" in any
> > cpuset.
> >
> > 2) the root cpuset has load_balance enabled by default at boot and
> > it's the only parameter which userspace can change at run-time.
> >
> > we know that, by default, every system comes up with a complete and
> > properly configured set of scheduling domains covering all the CPUs.
> >
> > Thus, on every system, unless the user explicitly disables load balance
> > for the top_cpuset, the scheduling domains already configured at boot
> > time by the scheduler/topology code and updated in consequence of
> > hotplug events, are already properly configured for cpuset too.
> >
> > This configuration is the default one for 99% of the systems,
> > and it's also the one used by most of the Android devices which never
> > disable load balance from the top_cpuset.
> >
> > Thus, while load balance is enabled for the top_cpuset,
> > destroying/rebuilding the scheduling domains at every cpuset.cpus
> > reconfiguration is a useless operation which will always produce the
> > same result.
> >
> > Let's anticipate the "special" optimization within:
> >
> > rebuild_sched_domains_locked()
> >
> > thus completely skipping the expensive:
> >
> > generate_sched_domains()
> > partition_sched_domains()
> >
> > for all the cases we know that the scheduling domains already defined
> > will not be affected by whatsoever value of cpuset.cpus.
>
> [...]
>
> > + /* Special case for the 99% of systems with one, full, sched domain */
> > + if (!top_cpuset.isolation_count &&
> > + is_sched_load_balance(&top_cpuset))
> > + goto out;
> > +
>
> Mmm, looks like we still need to destroy and recreate if there is a
> new_topology (see arch_update_cpu_topology() in partition_sched_
> domains).
Right, so the problem seems to be that we "need" to call
arch_update_cpu_topology() and we do that by calling
partition_sched_domains() which was initially introduced by:
029190c515f1 ("cpuset sched_load_balance flag")
back in 2007, where the reasons behind the sched_load_balance flag and
the idea of having "partitioned" SDs are also quite well explained.
I also (hopefully) understood that there are at least two actors involved:
- A) arch code
which creates SDs and SGs, usually to group CPUs depending on the
memory hierarchy, to support different time granularity of load
balancing operations
Special cases here are HP and hibernation which, by on-/off-lining
CPUs, directly affect the SD/SG definitions.
- B) cpusets
which expose to userspace the possibility to define,
_if possible_, a finer granularity set of SGs to further restrict the
scope of load balancing operations
Since B is a "possible finer granularity" refinement of A, we trigger
A's reconfigurations based on B's constraints.
That's why, for example, in consequence of an HP online event,
we have:
--- core.c -------------------
HP[sched:active]
| sched_cpu_activate()
| cpuset_cpu_active()
--- cpuset.c -----------------
| cpuset_update_active_cpus()
| schedule_work(&cpuset_hotplug_work)
\.. System Kworker \
| cpuset_hotplug_workfn()
if (cpus_updated || force_rebuild)
| rebuild_sched_domains()
| rebuild_sched_domains_locked()
| generate_sched_domains()
--- topology.c ---------------
| partition_sched_domains()
| arch_update_cpu_topology()
IOW, we need to go via cpusets to rebuild the SDs whenever there are
HP events or we "need" to do an arch_update_cpu_topology() via the
arch topology driver (drivers/base/arch_topology.c).
This last bit is also interesting: whenever we detect arch topology
information that requires an SD rebuild, we need to force a
partition_sched_domains(). But, for that, in:
commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")
support for setting the "force_rebuild" flag was just introduced.
Thus, potentially we can just extend the check I've proposed to consider the
force rebuild flag, to be something like:
---8<---
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8f586e8bdc98..1f051fafaa3a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -874,11 +874,19 @@ static void rebuild_sched_domains_locked(void)
!cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
+ /* Special case for the 99% of systems with one, full, sched domain */
+ if (!force_rebuild &&
+ !top_cpuset.isolation_count &&
+ is_sched_load_balance(&top_cpuset))
+ goto out;
+ force_rebuild = false;
+
/* Generate domain masks and attrs */
ndoms = generate_sched_domains(&doms, &attr);
/* Have scheduler rebuild the domains */
partition_sched_domains(ndoms, doms, attr);
out:
put_online_cpus();
---8<---
Which would still allow us to use something like:
cpuset_force_rebuild()
rebuild_sched_domains()
to actually rebuild SDs in consequence of arch topology changes.
>
> Maybe we could move the check you are proposing in update_cpumasks_
> hier() ?
Yes, that's another option... although there we are outside of
get_online_cpus(). Could be a problem?
However, in general, I would say that all around:
rebuild_sched_domains
rebuild_sched_domains_locked
update_cpumask
update_cpumasks_hier
a nice refactoring would be really deserved :)
--
#include <best/regards.h>
Patrick Bellasi
On 25/05/18 11:31, Patrick Bellasi wrote:
[...]
> Right, so the problem seems to be that we "need" to call
> arch_update_cpu_topology() and we do that by calling
> partition_sched_domains() which was initially introduced by:
>
> 029190c515f1 ("cpuset sched_load_balance flag")
>
> back in 2007, where it's also quite well explained the reasons behind
> the sched_load_balance flag and the idea to have "partitioned" SDs.
>
> I also (hopefully) understood that there are at least two actors involved:
>
> - A) arch code
> which creates SDs and SGs, usually to group CPUs depending on the
> memory hierarchy, to support different time granularity of load
> balancing operations
>
> Special case here are HP and hibernation which, by on-/off-lining
> CPUs they directly affect the SDs/SGs definitions.
>
> - B) cpusets
> which expose to userspace the possibility to define,
> _if possible_, a finer granularity set of SGs to further restrict the
> scope of load balancing operations
>
> Since B is a "possible finer granularity" refinement of A, then we
> trigger A's reconfigurations based on B's constraints.
>
> That's why, for example, in consequence of an HP online event,
> we have:
>
> --- core.c -------------------
> HP[sched:active]
> | sched_cpu_activate()
> | cpuset_cpu_active()
> --- cpuset.c -----------------
> | cpuset_update_active_cpus()
> | schedule_work(&cpuset_hotplug_work)
> \.. System Kworker \
> | cpuset_hotplug_workfn()
> if (cpus_updated || force_rebuild)
> | rebuild_sched_domains()
> | rebuild_sched_domains_locked()
> | generate_sched_domains()
> --- topology.c ---------------
> | partition_sched_domains()
> | arch_update_cpu_topology()
>
>
> IOW, we need to pass via cpusets to rebuild the SDs whenever we
> there are HP events or we "need" to do an arch_update_cpu_topology()
> via the arch topology driver (drivers/base/arch_topology.c).
I don't think the arch topology driver is always involved in this (e.g.,
arch/x86/kernel/itmt::sched_itmt_update_handler()).
Still we need to check if topology changed, as you say.
> This last bit is also interesting, whenever we detect arch topology
> information that required an SD rebuild, we need to force a
> partition_sched_domains(). But, for that, in:
>
> commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")
>
> we just introduced the support for the "force_rebuild" flag to be set.
>
> Thus, potentially we can just extend the check I've proposed to consider the
> force rebuild flag, to be something like:
>
> ---8<---
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 8f586e8bdc98..1f051fafaa3a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -874,11 +874,19 @@ static void rebuild_sched_domains_locked(void)
> !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
> goto out;
>
> + /* Special case for the 99% of systems with one, full, sched domain */
> + if (!force_rebuild &&
> + !top_cpuset.isolation_count &&
> + is_sched_load_balance(&top_cpuset))
> + goto out;
> + force_rebuild = false;
> +
> /* Generate domain masks and attrs */
> ndoms = generate_sched_domains(&doms, &attr);
>
> /* Have scheduler rebuild the domains */
> partition_sched_domains(ndoms, doms, attr);
> out:
> put_online_cpus();
> ---8<---
>
>
> Which would still allow to use something like:
>
> cpuset_force_rebuild()
> rebuild_sched_domains()
>
> to actually rebuild SD in consequence of arch topology changes.
That might work.
>
> >
> > Maybe we could move the check you are proposing in update_cpumasks_
> > hier() ?
>
> Yes, that's another option... although there we are outside of
> get_online_cpus(). Could be a problem?
Mmm, using force_rebuild flag seems safer indeed.
On 05/25/2018 05:40 AM, Patrick Bellasi wrote:
> On 24-May 11:22, Waiman Long wrote:
>> On 05/24/2018 11:16 AM, Juri Lelli wrote:
>>> On 24/05/18 11:09, Waiman Long wrote:
>>>> On 05/24/2018 10:36 AM, Juri Lelli wrote:
>>>>> On 17/05/18 16:55, Waiman Long wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>> + A parent cgroup cannot distribute all its CPUs to child
>>>>>> + scheduling domain cgroups unless its load balancing flag is
>>>>>> + turned off.
>>>>>> +
>>>>>> + cpuset.sched.load_balance
>>>>>> + A read-write single value file which exists on non-root
>>>>>> + cpuset-enabled cgroups. It is a binary value flag that accepts
>>>>>> + either "0" (off) or a non-zero value (on). This flag is set
>>>>>> + by the parent and is not delegatable.
>>>>>> +
>>>>>> + When it is on, tasks within this cpuset will be load-balanced
>>>>>> + by the kernel scheduler. Tasks will be moved from CPUs with
>>>>>> + high load to other CPUs within the same cpuset with less load
>>>>>> + periodically.
>>>>>> +
>>>>>> + When it is off, there will be no load balancing among CPUs on
>>>>>> + this cgroup. Tasks will stay in the CPUs they are running on
>>>>>> + and will not be moved to other CPUs.
>>>>>> +
>>>>>> + The initial value of this flag is "1". This flag is then
>>>>>> + inherited by child cgroups with cpuset enabled. Its state
>>>>>> + can only be changed on a scheduling domain cgroup with no
>>>>>> + cpuset-enabled children.
>>>>> [...]
>>>>>
>>>>>> + /*
>>>>>> + * On default hierachy, a load balance flag change is only allowed
>>>>>> + * in a scheduling domain with no child cpuset.
>>>>>> + */
>>>>>> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
>>>>>> + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
>>>>>> + err = -EINVAL;
>>>>>> + goto out;
>>>>>> + }
>>>>> The rule is actually
>>>>>
>>>>> - no child cpuset
>>>>> - and it must be a scheduling domain
> I always a bit confused by the usage of "scheduling domain", which
> overlaps with the SD concept from the scheduler standpoint.
It is supposed to mimic the SD concept of the scheduler.
>
> AFAIU a cpuset sched domain is not granted to be turned into an
> actual scheduler SD, am I wrong?
>
> If that's the case, why not better disambiguate these two concept by
> calling the cpuset one a "cpus partition" or eventually "cpuset domain"?
Good point. Peter has a similar comment. I will probably change the name
and clarify it better in the documentation.
Cheers,
Longman
On Thu, May 24, 2018 at 02:55:25PM -0400, Waiman Long wrote:
> On 05/24/2018 11:43 AM, Peter Zijlstra wrote:
> > I'm confused... why exactly do we have both domain and load_balance ?
>
> The domain is for partitioning the CPUs only. It doesn't change the load
> balancing state. So the load_balance flag is still need to turn on and
> off load balancing.
OK, so we have two boolean flags, giving 4 possible states. Let's just
go through them one by one:
A) domain:0 load_balance:0 -- we have no exclusive domain, but have
load-balancing disabled across them. AFAICT this should be an invalid
state.
B) domain:0 load_balance:1 -- we have no exclusive domain, but have
load-balancing enabled. AFAICT this is the default state and is a
no-op.
C) domain:1 load_balance:0 -- we have an exclusive domain, and have
load-balancing disabled across it. This is, AFAICT, identical to
having a bunch of sub/sibling groups each with a single CPU domain.
D) domain:1 load_balance:1 -- we have an exclusive domain, and have
load-balancing enabled. This is a partition.
Now, I think I've overlooked the fact that load_balance==1 only really
means something when the parent's load_balance==0, but I'm not sure that
really changes anything.
So, afaict, the above only have two useful states: B and D. Which again
raises the question, why two knobs? What useful configurations does it
allow?
On 05/28/2018 08:45 AM, Peter Zijlstra wrote:
> On Thu, May 24, 2018 at 02:55:25PM -0400, Waiman Long wrote:
>> On 05/24/2018 11:43 AM, Peter Zijlstra wrote:
>>> I'm confused... why exactly do we have both domain and load_balance ?
>> The domain is for partitioning the CPUs only. It doesn't change the load
>> balancing state. So the load_balance flag is still need to turn on and
>> off load balancing.
> OK, so we have two boolean flags, giving 4 possible states. Let's just
> go through them one by one:
>
> A) domain:0 load_balance:0 -- we have no exclusive domain, but have
> load-balancing disabled across them. AFAICT this should be an invalid
> state.
>
> B) domain:0 load_balance:1 -- we have no exclusive domain, but have
> load-balancing enabled. AFAICT this is the default state and is a
> no-op.
>
> C) domain:1 load_balance:0 -- we have an exclusive domain, and have
> load-balancing disabled across it. This is, AFAICT, identical to
> having a bunch of sub/sibling groups each with a single CPU domain.
>
> D) domain:1 load_balance:1 -- we have an exclusive domain, and have
> load-balancing enabled. This is a partition.
>
> Now, I think I've overlooked the fact that load_balance==1 only really
> means something when the parent's load_balance==0, but I'm not sure that
> really changes anything.
>
> So, afaict, the above only have two useful states: B and D. Which again
> raises the question, why two knobs? What useful configurations does it
> allow?
I am working on the v9 patch, and below is the current draft of the
documentation. Hopefully that will clarify some of the concepts that we
are discussing here.
cpuset.sched.domain_root
A read-write single value file which exists on non-root
cpuset-enabled cgroups. It is a binary value flag that accepts
either "0" (off) or "1" (on). This flag is set by the parent
and is not delegatable.
If set, it indicates that the current cgroup is the root of a
new scheduling domain or partition that comprises itself and
all its descendants except those that are scheduling domain
roots themselves and their descendants. The root cgroup is
always a scheduling domain root.
There are constraints on where this flag can be set. It can
only be set in a cgroup if all the following conditions are true.
1) The "cpuset.cpus" is not empty and the list of CPUs are
exclusive, i.e. they are not shared by any of its siblings.
2) The parent cgroup is also a scheduling domain root.
3) There are no child cgroups with cpuset enabled. This is
to eliminate corner cases that would have to be handled if such
a condition were allowed.
Setting this flag will take the CPUs away from the effective
CPUs of the parent cgroup. Once it is set, this flag cannot
be cleared if there are any child cgroups with cpuset enabled.
Further changes made to "cpuset.cpus" are allowed as long as
the first condition above remains true.
A parent scheduling domain root cgroup cannot distribute all
its CPUs to its child scheduling domain root cgroups unless
its load balancing flag is turned off.
cpuset.sched.load_balance
A read-write single value file which exists on non-root
cpuset-enabled cgroups. It is a binary value flag that accepts
either "0" (off) or "1" (on). This flag is set by the parent
and is not delegatable. It is on by default in the root cgroup.
When it is on, tasks within this cpuset will be load-balanced
by the kernel scheduler. Tasks will be moved from CPUs with
high load to other CPUs within the same cpuset with less load
periodically.
When it is off, there will be no load balancing among CPUs on
this cgroup. Tasks will stay in the CPUs they are running on
and will not be moved to other CPUs.
The load balancing state of a cgroup can only be changed on a
scheduling domain root cgroup with no cpuset-enabled children.
All cgroups within a scheduling domain or partition must have
the same load balancing state. As descendant cgroups of a
scheduling domain root are created, they inherit the same load
balancing state of their root.
The main purpose of using a new domain_root flag is to enable users to
create new partitions without the trick of disabling load_balance in the
parent and enabling it in the child. Now, we can create as many
partitions as we want without ever turning off load balancing in any of
the cpusets. I find it to be more straightforward and easier to
understand than using the load_balance trick.
Of course, turning off load balancing is still useful in some use cases,
so it is supported. To simplify things, it is mandated that all the
cpusets within a partition must have the same load balancing state. This
is to ensure that we can't use the load_balance trick to create an
additional partition underneath it. The domain_root flag is the only way
to create a partition.
A) domain_root: 0, load_balance: 0 -- a non-domain root cpuset within a
no load balancing partition.
B) domain_root: 0, load_balance: 1 -- a non-domain root cpuset within a
load balancing partition.
C) domain_root: 1, load_balance: 0 -- a domain root cpuset of a no load
balancing partition.
D) domain_root: 1, load_balance: 1 -- a domain root cpuset of a load
balancing partition.
Hope this helps.
Cheers,
Longman
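[ To make the v9 draft above concrete, here is a hypothetical userspace
  sketch (not from the patchset) that creates two sibling partitions by
  setting only the proposed cpuset.sched.domain_root flag, leaving
  cpuset.sched.load_balance at its default of "1" in both (state D above).
  The group names, CPU ranges, the /sys/fs/cgroup mount point, and the
  assumption that the cpuset controller is already enabled in the parent's
  cgroup.subtree_control are all made up for the example. ]

---8<---
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *dir, const char *file, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, file);
	f = fopen(path, "w");
	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	const char *groups[] = { "/sys/fs/cgroup/rt", "/sys/fs/cgroup/batch" };
	const char *cpus[]   = { "2-3", "4-7" };	/* exclusive CPU lists */
	int i;

	for (i = 0; i < 2; i++) {
		mkdir(groups[i], 0755);		/* create the child cgroup */
		write_str(groups[i], "cpuset.cpus", cpus[i]);
		write_str(groups[i], "cpuset.sched.domain_root", "1");
		/*
		 * cpuset.sched.load_balance stays at its default of "1",
		 * so each group is a load-balanced partition (state D).
		 */
	}
	return 0;
}
---8<---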
On 05/22/2018 08:57 AM, Juri Lelli wrote:
> Hi,
>
> On 17/05/18 16:55, Waiman Long wrote:
>
> [...]
>
>> /**
>> + * update_isolated_cpumask - update the isolated_cpus mask of parent cpuset
>> + * @cpuset: The cpuset that requests CPU isolation
>> + * @oldmask: The old isolated cpumask to be removed from the parent
>> + * @newmask: The new isolated cpumask to be added to the parent
>> + * Return: 0 if successful, an error code otherwise
>> + *
>> + * Changes to the isolated CPUs are not allowed if any of CPUs changing
>> + * state are in any of the child cpusets of the parent except the requesting
>> + * child.
>> + *
>> + * If the sched_domain flag changes, either the oldmask (0=>1) or the
>> + * newmask (1=>0) will be NULL.
>> + *
>> + * Called with cpuset_mutex held.
>> + */
>> +static int update_isolated_cpumask(struct cpuset *cpuset,
>> + struct cpumask *oldmask, struct cpumask *newmask)
>> +{
>> + int retval;
>> + int adding, deleting;
>> + cpumask_var_t addmask, delmask;
>> + struct cpuset *parent = parent_cs(cpuset);
>> + struct cpuset *sibling;
>> + struct cgroup_subsys_state *pos_css;
>> + int old_count = parent->isolation_count;
>> + bool dying = cpuset->css.flags & CSS_DYING;
>> +
>> + /*
>> + * Parent must be a scheduling domain with non-empty cpus_allowed.
>> + */
>> + if (!is_sched_domain(parent) || cpumask_empty(parent->cpus_allowed))
>> + return -EINVAL;
>> +
>> + /*
>> + * The oldmask, if present, must be a subset of parent's isolated
>> + * CPUs.
>> + */
>> + if (oldmask && !cpumask_empty(oldmask) && (!parent->isolation_count ||
>> + !cpumask_subset(oldmask, parent->isolated_cpus))) {
>> + WARN_ON_ONCE(1);
>> + return -EINVAL;
>> + }
>> +
>> + /*
>> + * A sched_domain state change is not allowed if there are
>> + * online children and the cpuset is not dying.
>> + */
>> + if (!dying && (!oldmask || !newmask) &&
>> + css_has_online_children(&cpuset->css))
>> + return -EBUSY;
>> +
>> + if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
>> + return -ENOMEM;
>> + if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
>> + free_cpumask_var(addmask);
>> + return -ENOMEM;
>> + }
>> +
>> + if (!old_count) {
>> + if (!zalloc_cpumask_var(&parent->isolated_cpus, GFP_KERNEL)) {
>> + retval = -ENOMEM;
>> + goto out;
>> + }
>> + old_count = 1;
>> + }
>> +
>> + retval = -EBUSY;
>> + adding = deleting = false;
>> + if (newmask)
>> + cpumask_copy(addmask, newmask);
>> + if (oldmask)
>> + deleting = cpumask_andnot(delmask, oldmask, addmask);
>> + if (newmask)
>> + adding = cpumask_andnot(addmask, newmask, delmask);
>> +
>> + if (!adding && !deleting)
>> + goto out_ok;
>> +
>> + /*
>> + * The cpus to be added must be in the parent's effective_cpus mask
>> + * but not in the isolated_cpus mask.
>> + */
>> + if (!cpumask_subset(addmask, parent->effective_cpus))
>> + goto out;
>> + if (parent->isolation_count &&
>> + cpumask_intersects(parent->isolated_cpus, addmask))
>> + goto out;
>> +
>> + /*
>> + * Check if any CPUs in addmask or delmask are in a sibling cpuset.
>> + * An empty sibling cpus_allowed means it is the same as parent's
>> + * effective_cpus. This checking is skipped if the cpuset is dying.
>> + */
>> + if (dying)
>> + goto updated_isolated_cpus;
>> +
>> + cpuset_for_each_child(sibling, pos_css, parent) {
>> + if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
>> + continue;
>> + if (cpumask_empty(sibling->cpus_allowed))
>> + goto out;
>> + if (adding &&
>> + cpumask_intersects(sibling->cpus_allowed, addmask))
>> + goto out;
>> + if (deleting &&
>> + cpumask_intersects(sibling->cpus_allowed, delmask))
>> + goto out;
>> + }
> Just got the below by echoing 1 into cpuset.sched.domain of a sibling with
> "isolated" cpuset.cpus. Guess you are missing proper locking around the
> code above.
>
> --->8---
> [ 7509.905005] =============================
> [ 7509.905009] WARNING: suspicious RCU usage
> [ 7509.905014] 4.17.0-rc5+ #11 Not tainted
> [ 7509.905017] -----------------------------
> [ 7509.905023] /home/juri/work/kernel/linux/kernel/cgroup/cgroup.c:3826 cgroup_mutex or RCU read lock required!
> [ 7509.905026]
> other info that might help us debug this:
The cause is a missing rcu_read_lock()/rcu_read_unlock() in that section
of the code. It will be fixed in the next version.
Cheers,
Longman
On 05/22/2018 09:53 AM, Juri Lelli wrote:
> Hi,
>
> On 17/05/18 16:55, Waiman Long wrote:
>> This patch enables us to report sched domain generation information.
>>
>> If DYNAMIC_DEBUG is enabled, issuing the following command
>>
>> echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control
>>
>> and setting loglevel to 8 will allow the kernel to show what scheduling
>> domain changes are being made.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> kernel/cgroup/cpuset.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index fb8aa82b..8f586e8 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -820,6 +820,12 @@ static int generate_sched_domains(cpumask_var_t **domains,
>> }
>> BUG_ON(nslot != ndoms);
>>
>> +#ifdef CONFIG_DEBUG_KERNEL
>> + for (i = 0; i < ndoms; i++)
>> + pr_debug("generate_sched_domains dom %d: %*pbl\n", i,
>> + cpumask_pr_args(doms[i]));
>> +#endif
>> +
> While I'm always in favor of adding debug output, in this case I'm not
> sure it's adding much to what we already print when booting with
> sched_debug kernel command-line param, e.g.
>
> --->8---
> Kernel command line: BOOT_IMAGE=/vmlinuz-4.17.0-rc5+ ... sched_debug
>
> [...]
>
> smp: Bringing up secondary CPUs ...
> x86: Booting SMP configuration:
> .... node #0, CPUs: #1 #2 #3 #4 #5
> .... node #1, CPUs: #6 #7 #8 #9 #10 #11
> smp: Brought up 2 nodes, 12 CPUs
> smpboot: Max logical packages: 2
> smpboot: Total of 12 processors activated (45636.50 BogoMIPS)
> CPU0 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }
> domain-1: span=0-11 level=NUMA
> groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
> CPU1 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }
> domain-1: span=0-11 level=NUMA
> groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
> CPU2 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }
> domain-1: span=0-11 level=NUMA
> groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
> CPU3 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }
> domain-1: span=0-11 level=NUMA
> groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
> CPU4 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }
> domain-1: span=0-11 level=NUMA
> groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
> CPU5 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }
> domain-1: span=0-11 level=NUMA
> groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
> CPU6 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }
> domain-1: span=0-11 level=NUMA
> groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
> CPU7 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }
> domain-1: span=0-11 level=NUMA
> groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
> CPU8 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }
> domain-1: span=0-11 level=NUMA
> groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
> CPU9 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }
> domain-1: span=0-11 level=NUMA
> groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
> CPU10 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }
> domain-1: span=0-11 level=NUMA
> groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
> CPU11 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }
> domain-1: span=0-11 level=NUMA
> groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
> span: 0-11 (max cpu_capacity = 1024)
>
> [...]
>
> generate_sched_domains dom 0: 6-11 <-- this and the one below is what
> generate_sched_domains dom 1: 0-5 you are adding
> CPU0 attaching NULL sched-domain.
> CPU1 attaching NULL sched-domain.
> CPU2 attaching NULL sched-domain.
> CPU3 attaching NULL sched-domain.
> CPU4 attaching NULL sched-domain.
> CPU5 attaching NULL sched-domain.
> CPU6 attaching NULL sched-domain.
> CPU7 attaching NULL sched-domain.
> CPU8 attaching NULL sched-domain.
> CPU9 attaching NULL sched-domain.
> CPU10 attaching NULL sched-domain.
> CPU11 attaching NULL sched-domain.
> CPU6 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }
> CPU7 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }
> CPU8 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }
> CPU9 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }
> CPU10 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }
> CPU11 attaching sched-domain(s):
> domain-0: span=6-11 level=MC
> groups: 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }
> span: 6-11 (max cpu_capacity = 1024)
> CPU0 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }
> CPU1 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }
> CPU2 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }, 1:{ span=1 }
> CPU3 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }
> CPU4 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }
> CPU5 attaching sched-domain(s):
> domain-0: span=0-5 level=MC
> groups: 5:{ span=5 }, 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }
> span: 0-5 (max cpu_capacity = 1024)
> --->8---
>
> Do you think there is still a benefit in printing out what
> generate_sched_domains does?
For testing purposes, the debug messages from generate_sched_domains() are
more concise. The sched_debug output is too verbose from my point of
view. So I think it is still useful.
Cheers,
Longman
On 05/24/2018 06:28 AM, Juri Lelli wrote:
> On 17/05/18 16:55, Waiman Long wrote:
>
> [...]
>
>> @@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
>> * passing doms with offlined cpu to partition_sched_domains().
>> * Anyways, hotplug work item will rebuild sched domains.
>> */
>> - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
>> + if (!top_cpuset.isolation_count &&
>> + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
>> + goto out;
>> +
>> + if (top_cpuset.isolation_count &&
>> + !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
>> goto out;
> Do we cover the case in which hotplug removed one of the isolated cpus
> from cpu_active_mask?
Yes, you are right. That is a remnant of my original patch, which allowed
only one isolated_cpus mask at the root. Thanks for spotting that.
Cheers,
Longman
On 05/28/2018 09:12 PM, Waiman Long wrote:
> On 05/24/2018 06:28 AM, Juri Lelli wrote:
>> On 17/05/18 16:55, Waiman Long wrote:
>>
>> [...]
>>
>>> @@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
>>> * passing doms with offlined cpu to partition_sched_domains().
>>> * Anyways, hotplug work item will rebuild sched domains.
>>> */
>>> - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
>>> + if (!top_cpuset.isolation_count &&
>>> + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
>>> + goto out;
>>> +
>>> + if (top_cpuset.isolation_count &&
>>> + !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
>>> goto out;
>> Do we cover the case in which hotplug removed one of the isolated cpus
>> from cpu_active_mask?
> Yes, you are right. That is the remnant of my original patch that allows
> only one isolated_cpus at root. Thanks for spotting that.
I am sorry, I would like to take back my previous comment. The code
above looks for an inconsistency in the state of the effective_cpus mask
to find out if it is racing with a hotplug event. If it is, we can skip
the domain generation as the hotplug event will do that too. The checks
are still valid with the current patchset. So I don't think we need to
make any change here.
Cheers,
Longman
On 28/05/18 21:24, Waiman Long wrote:
> On 05/28/2018 09:12 PM, Waiman Long wrote:
> > On 05/24/2018 06:28 AM, Juri Lelli wrote:
> >> On 17/05/18 16:55, Waiman Long wrote:
> >>
> >> [...]
> >>
> >>> @@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
> >>> * passing doms with offlined cpu to partition_sched_domains().
> >>> * Anyways, hotplug work item will rebuild sched domains.
> >>> */
> >>> - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> >>> + if (!top_cpuset.isolation_count &&
> >>> + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> >>> + goto out;
> >>> +
> >>> + if (top_cpuset.isolation_count &&
> >>> + !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
> >>> goto out;
> >> Do we cover the case in which hotplug removed one of the isolated cpus
> >> from cpu_active_mask?
> > Yes, you are right. That is the remnant of my original patch that allows
> > only one isolated_cpus at root. Thanks for spotting that.
>
> I am sorry. I would like to take it back my previous comment. The code
> above looks for inconsistency in the state of the effective_cpus mask to
> find out if it is racing with a hotplug event. If it is, we can skip the
> domain generation as the hotplug event will do that too. The checks are
> still valid with the current patchset. So I don't think we need to make
> any change here.
Yes, these checks are valid, but don't we also need to check for hotplug
races w.r.t. isolated CPUs (of some other sub domain)?
On 05/29/2018 02:27 AM, Juri Lelli wrote:
> On 28/05/18 21:24, Waiman Long wrote:
>> On 05/28/2018 09:12 PM, Waiman Long wrote:
>>> On 05/24/2018 06:28 AM, Juri Lelli wrote:
>>>> On 17/05/18 16:55, Waiman Long wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
>>>>> * passing doms with offlined cpu to partition_sched_domains().
>>>>> * Anyways, hotplug work item will rebuild sched domains.
>>>>> */
>>>>> - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
>>>>> + if (!top_cpuset.isolation_count &&
>>>>> + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
>>>>> + goto out;
>>>>> +
>>>>> + if (top_cpuset.isolation_count &&
>>>>> + !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
>>>>> goto out;
>>>> Do we cover the case in which hotplug removed one of the isolated cpus
>>>> from cpu_active_mask?
>>> Yes, you are right. That is the remnant of my original patch that allows
>>> only one isolated_cpus at root. Thanks for spotting that.
>> I am sorry. I would like to take it back my previous comment. The code
>> above looks for inconsistency in the state of the effective_cpus mask to
>> find out if it is racing with a hotplug event. If it is, we can skip the
>> domain generation as the hotplug event will do that too. The checks are
>> still valid with the current patchset. So I don't think we need to make
>> any change here.
> Yes, these checks are valid, but don't we also need to check for hotplug
> races w.r.t. isolated CPUs (of some other sub domain)?
It is not actually a race. Both the hotplug event and any changes to cpu
lists or flags are serialized by the cpuset_mutex. It is just that we
may be doing the same work twice, wasting CPU cycles. So we are doing a
quick check to avoid this. The check isn't exhaustive and we can
certainly miss some cases. Doing a more thorough check may take as much
time as doing the sched domain generation itself, so you would actually
waste more CPU cycles on average, as the chance of a hotplug event is
very low.
Cheers,
Longman
On 29/05/18 08:40, Waiman Long wrote:
> On 05/29/2018 02:27 AM, Juri Lelli wrote:
> > On 28/05/18 21:24, Waiman Long wrote:
> >> On 05/28/2018 09:12 PM, Waiman Long wrote:
> >>> On 05/24/2018 06:28 AM, Juri Lelli wrote:
> >>>> On 17/05/18 16:55, Waiman Long wrote:
> >>>>
> >>>> [...]
> >>>>
> >>>>> @@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
> >>>>> * passing doms with offlined cpu to partition_sched_domains().
> >>>>> * Anyways, hotplug work item will rebuild sched domains.
> >>>>> */
> >>>>> - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> >>>>> + if (!top_cpuset.isolation_count &&
> >>>>> + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> >>>>> + goto out;
> >>>>> +
> >>>>> + if (top_cpuset.isolation_count &&
> >>>>> + !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
> >>>>> goto out;
> >>>> Do we cover the case in which hotplug removed one of the isolated cpus
> >>>> from cpu_active_mask?
> >>> Yes, you are right. That is the remnant of my original patch that allows
> >>> only one isolated_cpus at root. Thanks for spotting that.
> >> I am sorry. I would like to take it back my previous comment. The code
> >> above looks for inconsistency in the state of the effective_cpus mask to
> >> find out if it is racing with a hotplug event. If it is, we can skip the
> >> domain generation as the hotplug event will do that too. The checks are
> >> still valid with the current patchset. So I don't think we need to make
> >> any change here.
> > Yes, these checks are valid, but don't we also need to check for hotplug
> > races w.r.t. isolated CPUs (of some other sub domain)?
>
> It is not actually a race. Both the hotplug event and any changes to cpu
> lists or flags are serialized by the cpuset_mutex. It is just that we
> may be doing the same work twice that we are wasting cpu cycles. So we
> are doing a quick check to avoid this. The check isn't exhaustive and we
> can certainly miss some cases. Doing a more throughout check may need as
> much time as doing the sched domain generation itself and so you are
> actually wasting more CPU cycles on average as the chance of a hotplug
> event is very low.
Fair enough.
Thanks,
- Juri