2017-08-16 21:20:48

by Mathieu Poirier

[permalink] [raw]
Subject: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
operations. When CPUhotplug and some CUPset manipulation take place root
domains are destroyed and new ones created, loosing at the same time DL
accounting pertaining to utilisation.

An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
rq_offline() methods, something that highlighted a problem with sleeping
DL tasks. The email thread that followed envisioned creating a list of
sleeping tasks to circle through when recomputing DL accounting.

In this set the problem is addressed by relying on existing list of tasks
(sleeping or not) already maintained by CPUsets. When CPUset or
CPUhotplug operations have completed we circle through the list of tasks
maintained by each CPUset looking for DL tasks. When a DL task is found
its utilisation is added to the root domain it pertains to by way of its
runqueue.

The advantage of proceeding this way is that recomputing of DL accounting
is done the same way for both active and inactive tasks, along with
guaranteeing that DL accounting for tasks end up in the correct root
domain regardless of the CPUset topology. The disadvantage is that
circling through all the tasks in a CPUset can be time consuming. The
counter argument is that both CPUset and CPUhotplug operations are time
consuming in the first place.

OPEN ISSUE:

Regardless of how we proceed (using existing CPUset list or new ones) we
need to deal with DL tasks that span more than one root domain, something
that will typically happen after a CPUset operation. For example, if we
split the number of available CPUs on a system in two CPUsets and then turn
off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
parent CPUset will end up spanning two root domains.

One way to deal with this is to prevent CPUset operations from happening
when such condition is detected, as enacted in this set. Although simple
this approach feels brittle and akin to a "whack-a-mole" game. A better
and more reliable approach would be to teach the DL scheduler to deal with
tasks that span multiple root domains, a serious and substantial
undertaking.

I am sending this as a starting point for discussion. I would be grateful
if you could take the time to comment on the approach and most importantly
provide input on how to deal with the open issue underlined above.

Many thanks,
Mathieu

[1]. https://lkml.org/lkml/2016/2/3/966
[2]. https://marc.info/?l=linux-kernel&m=145493552607388&w=2

Mathieu Poirier (7):
sched/topology: Adding function partition_sched_domains_locked()
cpuset: Rebuild root domain deadline accounting information
sched/deadline: Keep new DL task within root domain's boundary
cgroup: Constrain 'sched_load_balance' flag when DL tasks are present
cgroup: Concentrate DL related validation code in one place
cgroup: Constrain the addition of CPUs to a new CPUset
sched/core: Don't change the affinity of DL tasks

include/linux/sched.h | 3 +
include/linux/sched/deadline.h | 8 ++
include/linux/sched/topology.h | 9 ++
kernel/cgroup/cpuset.c | 186 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/core.c | 10 +--
kernel/sched/deadline.c | 47 ++++++++++-
kernel/sched/sched.h | 3 -
kernel/sched/topology.c | 31 +++++--
8 files changed, 272 insertions(+), 25 deletions(-)

--
2.7.4


2017-08-16 21:20:57

by Mathieu Poirier

[permalink] [raw]
Subject: [PATCH 3/7] sched/deadline: Keep new DL task within root domain's boundary

When considering to move a task to the DL policy we need to make sure
the CPUs it is allowed to run on matches the CPUs of the root domains of
the runqueue it is currently assigned to. Otherwise the task will be
allowed to roam on CPUs outside of this root domain, something that will
skew system deadline statistics and potentially lead to over selling DL
bandwidth.

For example say we have a 4 core system split in 2 cpuset: set1 has CPU 0
and 1 while set2 has CPU 2 and 3. This results in 3 cpuset - the default
set that has all 4 CPUs along with set1 and set2 as just depicted. We also
have task A that hasn't been assigned to any CPUset and as such, is part of
the default CPUset.

At the time we want to move task A to a DL policy it has been assigned to
CPU1. Since CPU1 is part of set1 the root domain will have 2 CPUs in it
and the bandwidth constraint checked against the current DL bandwidth
allotment of those 2 CPUs.

If task A is promoted to a DL policy it's 'cpus_allowed' mask is still
equal to the CPUs in the default CPUset, making it possible for the
scheduler to move it to CPU2 and CPU3, which could also be running DL tasks
of their own.

This patch makes sure that a task's cpus_allowed mask matches the CPUs
in the root domain associated to the runqueue it has been assigned to.

Signed-off-by: Mathieu Poirier <[email protected]>
---
kernel/sched/deadline.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ba64a5b8f40b..2c0405d74367 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2474,6 +2474,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr)
{
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+ struct root_domain *rd = cpu_rq(task_cpu(p))->rd;
u64 period = attr->sched_period ?: attr->sched_deadline;
u64 runtime = attr->sched_runtime;
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
@@ -2484,6 +2485,19 @@ int sched_dl_overflow(struct task_struct *p, int policy,
return 0;

/*
+ * By default a task is set to run on all the CPUs the system
+ * knows about. This is fine for as long as we don't deal with cpusets
+ * where runqueues are split between root domains. The computation
+ * below is based on root domain information, as such the task must be
+ * constrained to run within that root domain. It is the user's
+ * responsability to constrain the task to specific CPUs by either
+ * assigning the task to a cpuset or run the taskset utility. Here we
+ * simply make sure things are coherent.
+ */
+ if (!cpumask_equal(&p->cpus_allowed, rd->span))
+ goto out;
+
+ /*
* Either if a task, enters, leave, or stays -deadline but changes
* its parameters, we may need to update accordingly the total
* allocated bandwidth of the container.
@@ -2518,7 +2532,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
err = 0;
}
raw_spin_unlock(&dl_b->lock);
-
+out:
return err;
}

--
2.7.4

2017-08-16 21:21:02

by Mathieu Poirier

[permalink] [raw]
Subject: [PATCH 5/7] cgroup: Concentrate DL related validation code in one place

Having two different places to validate DL operations makes no sense. As
such move everything in the same function.

Signed-off-by: Mathieu Poirier <[email protected]>
---
kernel/cgroup/cpuset.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 18df143b4013..8d2ba5591dfb 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -477,6 +477,15 @@ static int validate_dl_change(struct cpuset *cur, struct cpuset *trial)
struct cpuset *cs;

/*
+ * We can't shrink if we won't have enough room for SCHED_DEADLINE
+ * tasks.
+ */
+ if (is_cpu_exclusive(cur) &&
+ !cpuset_cpumask_can_shrink(cur->cpus_allowed,
+ trial->cpus_allowed))
+ goto out;
+
+ /*
* The cpuset.sched_load_balance flag is flipped off on
* the current cpuset.
*/
@@ -606,16 +615,6 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
goto out;
}

- /*
- * We can't shrink if we won't have enough room for SCHED_DEADLINE
- * tasks.
- */
- ret = -EBUSY;
- if (is_cpu_exclusive(cur) &&
- !cpuset_cpumask_can_shrink(cur->cpus_allowed,
- trial->cpus_allowed))
- goto out;
-
ret = 0;
out:
rcu_read_unlock();
--
2.7.4

2017-08-16 21:21:06

by Mathieu Poirier

[permalink] [raw]
Subject: [PATCH 7/7] sched/core: Don't change the affinity of DL tasks

Now that we mandate that on creation, the ->cpus_allowed mask of a future
DL task has to be equal to the rd->span of the root domain it will be
associated with, changing the affinity of a DL task doesn't make sense
anymore.

Indeed, if we set the task to a smaller affinity set then we may be running
out of CPU cycle. If we try to increase the size of the affinity set the
new CPUs are necessarily part of another root domain where DL utilation for
the task won't be accounted for.

As such simply refuse to change the affinity set of a DL task.

Signed-off-by: Mathieu Poirier <[email protected]>
---
kernel/sched/core.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20fba81..a1948d127d8f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4629,15 +4629,15 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
cpumask_and(new_mask, in_mask, cpus_allowed);

/*
- * Since bandwidth control happens on root_domain basis,
- * if admission test is enabled, we only admit -deadline
- * tasks allowed to run on all the CPUs in the task's
- * root_domain.
+ * Since bandwidth control happens on root_domain basis, if admission
+ * test is enabled, we don't allow the task' CPUs to change. If
+ * @new_mask is smaller than we risk running out of cycles. If it is
+ * bigger than we may be using DL bandwidth allocated to other CPUs.
*/
#ifdef CONFIG_SMP
if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
rcu_read_lock();
- if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
+ if (!cpumask_equal(task_rq(p)->rd->span, new_mask)) {
retval = -EBUSY;
rcu_read_unlock();
goto out_free_new_mask;
--
2.7.4

2017-08-16 21:21:30

by Mathieu Poirier

[permalink] [raw]
Subject: [PATCH 6/7] cgroup: Constrain the addition of CPUs to a new CPUset

Care must be taken when CPUs are added to a new CPUset. If an ancestor
of that set has its sched_load_balance flag switch off then the CPUs in
the new CPUset will be added to a new root domain. If the ancestor also
had DL tasks those will end up covering more than one root domain,
breaking at the same time the DL integrity model.

This patch prevents adding CPUs to a new CPUset if one of its ancestor
had its sched_load_balance flag off and had DL tasks assigned to it.

Signed-off-by: Mathieu Poirier <[email protected]>
---
kernel/cgroup/cpuset.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8d2ba5591dfb..fb38feaf3fda 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -530,6 +530,28 @@ static int validate_dl_change(struct cpuset *cur, struct cpuset *trial)
goto out;
}

+ /*
+ * CPUs are being added to a CPUset. If any parent of @trial has its
+ * sched_load_balance flag switched off this operation will create a
+ * new root domain spanning trailcs->cpus_allowed. At the same time
+ * if any parent of @trial has a DL task, that task will end up
+ * spanning more than one root domain.
+ */
+ if (cpumask_empty(cur->cpus_allowed) &&
+ !cpumask_empty(trial->cpus_allowed)) {
+ struct cpuset *parent;
+
+ parent = parent_cs(trial);
+ /* Go up until we reach the top_cpuset */
+ while (parent) {
+ if (cpuset_has_dl_tasks(parent) &&
+ !is_sched_load_balance(parent))
+ goto out;
+
+ parent = parent_cs(parent);
+ }
+ }
+
ret = 0;
out:
return ret;
--
2.7.4

2017-08-16 21:20:54

by Mathieu Poirier

[permalink] [raw]
Subject: [PATCH 1/7] sched/topology: Adding function partition_sched_domains_locked()

Introducing function partition_sched_domains_locked() by taking
the mutex locking code out of the original function. That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

This patch doesn't change the functionality provided by the
original code.

Signed-off-by: Mathieu Poirier <[email protected]>
---
include/linux/sched/topology.h | 9 +++++++++
kernel/sched/topology.c | 17 +++++++++++++----
2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7d065abc7a47..59b568079e05 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -161,6 +161,9 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
return to_cpumask(sd->span);
}

+extern void partition_sched_domains_locked(int ndoms_new,
+ cpumask_var_t doms_new[],
+ struct sched_domain_attr *dattr_new);
extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new);

@@ -206,6 +209,12 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
struct sched_domain_attr;

static inline void
+partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+ struct sched_domain_attr *dattr_new)
+{
+}
+
+static inline void
partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new)
{
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9b4f2701ba3c..58133f2ce889 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1840,15 +1840,15 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
* ndoms_new == 0 is a special case for destroying existing domains,
* and it will not create the default domain.
*
- * Call with hotplug lock held
+ * Call with hotplug lock and sched_domains_mutex held
*/
-void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
- struct sched_domain_attr *dattr_new)
+void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
+ struct sched_domain_attr *dattr_new)
{
int i, j, n;
int new_topology;

- mutex_lock(&sched_domains_mutex);
+ lockdep_assert_held(&sched_domains_mutex);

/* Always unregister in case we don't destroy any domains: */
unregister_sched_domain_sysctl();
@@ -1902,7 +1902,16 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
ndoms_cur = ndoms_new;

register_sched_domain_sysctl();
+}

+/*
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+ struct sched_domain_attr *dattr_new)
+{
+ mutex_lock(&sched_domains_mutex);
+ partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
mutex_unlock(&sched_domains_mutex);
}

--
2.7.4

2017-08-16 21:21:49

by Mathieu Poirier

[permalink] [raw]
Subject: [PATCH 4/7] cgroup: Constrain 'sched_load_balance' flag when DL tasks are present

This patch prevents the 'sched_load_balance' flag from being fippled off
when DL tasks are present in a CPUset. Otherwise we end up with the
DL tasks using CPUs belonging to different root domains, something that
breaks the mathematical model behind DL bandwidth management.

For example on a 4 core system CPUset "set1" has been created and CPUs
0 and 1 assigned to it. A DL task has also been spun off. By default
the DL task can use all the CPUs in the default CPUset.

If we set the base CPUset's cpuset.sched_load_balance to '0', CPU 0 and 1
are added to a newly created root domain while CPU 2 and 3 endup in the
default root domain. But the DL task is still part of the base CPUset and
as such can use CPUs 0 to 3, spanning at the same time more than one root
domain.

Signed-off-by: Mathieu Poirier <[email protected]>
---
kernel/cgroup/cpuset.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 82 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f6d1e485dc2d..18df143b4013 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -447,6 +447,85 @@ static void free_trial_cpuset(struct cpuset *trial)
kfree(trial);
}

+static bool cpuset_has_dl_tasks(struct cpuset *cs)
+{
+ bool dl_tasks = false;
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ /* Go through each task in @cs looking for a DL task */
+ css_task_iter_start(&cs->css, &it);
+
+ while (!dl_tasks && (task = css_task_iter_next(&it))) {
+ if (dl_task(task))
+ dl_tasks = true;
+ }
+
+ css_task_iter_end(&it);
+
+ return dl_tasks;
+}
+
+/*
+ * Assumes RCU read lock and cpuset_mutex are held.
+ */
+static int validate_dl_change(struct cpuset *cur, struct cpuset *trial)
+{
+ bool populated = false, dl_tasks = false;
+ int ret = -EBUSY;
+ struct cgroup_subsys_state *pos_css;
+ struct cpuset *cs;
+
+ /*
+ * The cpuset.sched_load_balance flag is flipped off on
+ * the current cpuset.
+ */
+ if (is_sched_load_balance(cur) &&
+ !is_sched_load_balance(trial)) {
+ /* See if at least one descendant cpuset is populated */
+ cpuset_for_each_descendant_pre(cs, pos_css, cur) {
+ /* Skip over ourselve */
+ if (cs == cur)
+ continue;
+
+ /* Empty cpusets are of no interest */
+ if (cpumask_empty(cs->cpus_allowed)) {
+ pos_css = css_rightmost_descendant(pos_css);
+ continue;
+ }
+
+ /*
+ * @cur has at least one children and CPUs have been
+ * assigned to it - there is no need to go further.
+ */
+ populated = true;
+ break;
+ }
+
+ dl_tasks = cpuset_has_dl_tasks(cur);
+
+ /*
+ * This CPUset has a children that is populated by (at least)
+ * one CPU. When the sched_load_balance flag gets flipped off
+ * it will create a new root domain for the children CPUset,
+ * and that new root domain will include the CPUs assigned to
+ * the CPUset.
+ *
+ * Since the tasks in the current CPUset have not been assigned
+ * to the children CPUset they will simply stay here and use
+ * all the CPUs available in this set. For DL tasks this can't
+ * be allowed since they will be executing on CPUs associated to
+ * more than one root domains.
+ */
+ if (populated && dl_tasks)
+ goto out;
+ }
+
+ ret = 0;
+out:
+ return ret;
+}
+
/*
* validate_change() - Used to validate that any proposed cpuset change
* follows the structural rules for cpusets.
@@ -481,6 +560,9 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
if (!is_cpuset_subset(c, trial))
goto out;

+ if (validate_dl_change(cur, trial))
+ goto out;
+
/* Remaining checks don't apply to root cpuset */
ret = 0;
if (cur == &top_cpuset)
--
2.7.4

2017-08-16 21:22:42

by Mathieu Poirier

[permalink] [raw]
Subject: [PATCH 2/7] cpuset: Rebuild root domain deadline accounting information

When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch address the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Signed-off-by: Mathieu Poirier <[email protected]>
---
include/linux/sched.h | 3 ++
include/linux/sched/deadline.h | 8 ++++++
kernel/cgroup/cpuset.c | 63 +++++++++++++++++++++++++++++++++++++++++-
kernel/sched/deadline.c | 31 +++++++++++++++++++++
kernel/sched/sched.h | 3 --
kernel/sched/topology.c | 14 +++++++++-
6 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8337e2db0bb2..bb07b194d951 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -241,6 +241,9 @@ struct vtime {
u64 gtime;
};

+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 975be862e083..fcef5565dc94 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -28,4 +28,12 @@ static inline bool dl_time_before(u64 a, u64 b)
return (s64)(a - b) < 0;
}

+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
+
#endif /* _LINUX_SCHED_DEADLINE_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 63ce8bbb476e..f6d1e485dc2d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -44,6 +44,7 @@
#include <linux/proc_fs.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
+#include <linux/sched/deadline.h>
#include <linux/sched/mm.h>
#include <linux/sched/task.h>
#include <linux/seq_file.h>
@@ -800,6 +801,66 @@ static int generate_sched_domains(cpumask_var_t **domains,
return ndoms;
}

+static void update_tasks_root_domain(struct cpuset *cs)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ css_task_iter_start(&cs->css, &it);
+
+ while ((task = css_task_iter_next(&it)))
+ dl_add_task_root_domain(task);
+
+ css_task_iter_end(&it);
+}
+
+/*
+ * Called with cpuset_mutex held (rebuild_sched_domains())
+ * Called with hotplug lock held (rebuild_sched_domains_locked())
+ * Called with sched_domains_mutex held (partition_and_rebuild_domains())
+ */
+static void rebuild_root_domains(void)
+{
+ struct cpuset *cs = NULL;
+ struct cgroup_subsys_state *pos_css;
+
+ rcu_read_lock();
+
+ /*
+ * Clear default root domain DL accounting, it will be computed again
+ * if a task belongs to it.
+ */
+ dl_clear_root_domain(&def_root_domain);
+
+ cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+ if (cpumask_empty(cs->effective_cpus)) {
+ pos_css = css_rightmost_descendant(pos_css);
+ continue;
+ }
+
+ css_get(&cs->css);
+
+ rcu_read_unlock();
+
+ update_tasks_root_domain(cs);
+
+ rcu_read_lock();
+ css_put(&cs->css);
+ }
+ rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_domains(int ndoms_new, cpumask_var_t doms_new[],
+ struct sched_domain_attr *dattr_new)
+{
+ mutex_lock(&sched_domains_mutex);
+ partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+ rebuild_root_domains();
+ mutex_unlock(&sched_domains_mutex);
+}
+
/*
* Rebuild scheduler domains.
*
@@ -832,7 +893,7 @@ static void rebuild_sched_domains_locked(void)
ndoms = generate_sched_domains(&doms, &attr);

/* Have scheduler rebuild the domains */
- partition_sched_domains(ndoms, doms, attr);
+ partition_and_rebuild_domains(ndoms, doms, attr);
out:
put_online_cpus();
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 755bd3f1a1a9..ba64a5b8f40b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2214,6 +2214,37 @@ void __init init_sched_dl_class(void)
GFP_KERNEL, cpu_to_node(i));
}

+void dl_add_task_root_domain(struct task_struct *p)
+{
+ unsigned long flags;
+ struct rq_flags rf;
+ struct rq *rq;
+ struct dl_bw *dl_b;
+
+ rq = task_rq_lock(p, &rf);
+ if (!dl_task(p))
+ goto unlock;
+
+ dl_b = &rq->rd->dl_bw;
+ raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+ dl_b->total_bw += p->dl.dl_bw;
+
+ raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+unlock:
+ task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+ rd->dl_bw.total_bw = 0;
+ raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
#endif /* CONFIG_SMP */

static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3086d1..cfd3b5b5fe7f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -650,9 +650,6 @@ struct root_domain {
unsigned long max_cpu_capacity;
};

-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
extern void init_defrootdomain(void);
extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 58133f2ce889..3522dfc2566f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2,6 +2,7 @@
* Scheduler topology setup/handling methods
*/
#include <linux/sched.h>
+#include <linux/sched/deadline.h>
#include <linux/mutex.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
@@ -1862,8 +1863,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
for (i = 0; i < ndoms_cur; i++) {
for (j = 0; j < n && !new_topology; j++) {
if (cpumask_equal(doms_cur[i], doms_new[j])
- && dattrs_equal(dattr_cur, i, dattr_new, j))
+ && dattrs_equal(dattr_cur, i, dattr_new, j)) {
+ struct root_domain *rd;
+
+ /*
+ * This domain won't be destroyed and as such
+ * its dl_bw->total_bw needs to be cleared. It
+ * will be recomputed in function
+ * update_tasks_root_domain().
+ */
+ rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+ dl_clear_root_domain(rd);
goto match1;
+ }
}
/* No match - a current sched domain not in new doms_new[] */
detach_destroy_domains(doms_cur[i]);
--
2.7.4

2017-08-22 12:21:50

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

Hi Mathieu,

On Wed, 16 Aug 2017 15:20:36 -0600
Mathieu Poirier <[email protected]> wrote:

> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> operations. When CPUhotplug and some CUPset manipulation take place root
> domains are destroyed and new ones created, loosing at the same time DL
> accounting pertaining to utilisation.

Thanks for looking at this longstanding issue! I am just back from
vacations; in the next days I'll try your patches.
Do you have some kind of scripts for reproducing the issue
automatically? (I see that in the original email Steven described how
to reproduce it manually; I just wonder if anyone already scripted the
test).

> An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
> rq_offline() methods, something that highlighted a problem with sleeping
> DL tasks. The email thread that followed envisioned creating a list of
> sleeping tasks to circle through when recomputing DL accounting.
>
> In this set the problem is addressed by relying on existing list of tasks
> (sleeping or not) already maintained by CPUsets. When CPUset or
> CPUhotplug operations have completed we circle through the list of tasks
> maintained by each CPUset looking for DL tasks. When a DL task is found
> its utilisation is added to the root domain it pertains to by way of its
> runqueue.
>
> The advantage of proceeding this way is that recomputing of DL accounting
> is done the same way for both active and inactive tasks, along with
> guaranteeing that DL accounting for tasks end up in the correct root
> domain regardless of the CPUset topology. The disadvantage is that
> circling through all the tasks in a CPUset can be time consuming. The
> counter argument is that both CPUset and CPUhotplug operations are time
> consuming in the first place.

I do not know the cpuset code too much, but I agree that your approach
looks better than creating an additional list for blocked deadline
tasks.


> OPEN ISSUE:
>
> Regardless of how we proceed (using existing CPUset list or new ones) we
> need to deal with DL tasks that span more than one root domain, something
> that will typically happen after a CPUset operation. For example, if we
> split the number of available CPUs on a system in two CPUsets and then turn
> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
> parent CPUset will end up spanning two root domains.
>
> One way to deal with this is to prevent CPUset operations from happening
> when such condition is detected, as enacted in this set.

I think this is the simplest (if not only?) solution if we want to use
gEDF in each root domain.

> Although simple
> this approach feels brittle and akin to a "whack-a-mole" game. A better
> and more reliable approach would be to teach the DL scheduler to deal with
> tasks that span multiple root domains, a serious and substantial
> undertaking.
>
> I am sending this as a starting point for discussion. I would be grateful
> if you could take the time to comment on the approach and most importantly
> provide input on how to deal with the open issue underlined above.

I suspect that if we want to guarantee bounded tardiness then we have to
go for a solution similar to the one suggested by Tommaso some time ago
(if I remember well):

if we want to create some "second level cpusets" inside a "parent
cpuset", allowing deadline tasks to be placed inside both the "parent
cpuset" and the "second level cpusets", then we have to subtract the
"second level cpusets" maximum utilizations from the "parent cpuset"
utilization.

I am not sure how difficult it can be to implement this...


If, instead, we want to allow to guarantee the respect of all the
deadlines, then we need to have a look at Brandenburg's paper on
arbitrary affinities:
https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf


Thanks,
Luca

2017-08-23 19:47:17

by Mathieu Poirier

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

On 22 August 2017 at 06:21, Luca Abeni <[email protected]> wrote:
> Hi Mathieu,

Good day to you,

>
> On Wed, 16 Aug 2017 15:20:36 -0600
> Mathieu Poirier <[email protected]> wrote:
>
>> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
>> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
>> operations. When CPUhotplug and some CUPset manipulation take place root
>> domains are destroyed and new ones created, loosing at the same time DL
>> accounting pertaining to utilisation.
>
> Thanks for looking at this longstanding issue! I am just back from
> vacations; in the next days I'll try your patches.
> Do you have some kind of scripts for reproducing the issue
> automatically? (I see that in the original email Steven described how
> to reproduce it manually; I just wonder if anyone already scripted the
> test).

I didn't bother scripting it since it is so easy to do. I'm eager to
see how things work out on your end.

>
>> An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
>> rq_offline() methods, something that highlighted a problem with sleeping
>> DL tasks. The email thread that followed envisioned creating a list of
>> sleeping tasks to circle through when recomputing DL accounting.
>>
>> In this set the problem is addressed by relying on existing list of tasks
>> (sleeping or not) already maintained by CPUsets. When CPUset or
>> CPUhotplug operations have completed we circle through the list of tasks
>> maintained by each CPUset looking for DL tasks. When a DL task is found
>> its utilisation is added to the root domain it pertains to by way of its
>> runqueue.
>>
>> The advantage of proceeding this way is that recomputing of DL accounting
>> is done the same way for both active and inactive tasks, along with
>> guaranteeing that DL accounting for tasks end up in the correct root
>> domain regardless of the CPUset topology. The disadvantage is that
>> circling through all the tasks in a CPUset can be time consuming. The
>> counter argument is that both CPUset and CPUhotplug operations are time
>> consuming in the first place.
>
> I do not know the cpuset code too much, but I agree that your approach
> looks better than creating an additional list for blocked deadline
> tasks.
>
>
>> OPEN ISSUE:
>>
>> Regardless of how we proceed (using existing CPUset list or new ones) we
>> need to deal with DL tasks that span more than one root domain, something
>> that will typically happen after a CPUset operation. For example, if we
>> split the number of available CPUs on a system in two CPUsets and then turn
>> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
>> parent CPUset will end up spanning two root domains.
>>
>> One way to deal with this is to prevent CPUset operations from happening
>> when such condition is detected, as enacted in this set.
>
> I think this is the simplest (if not only?) solution if we want to use
> gEDF in each root domain.

Global Earliest Deadline First? Is my interpretation correct?

>
>> Although simple
>> this approach feels brittle and akin to a "whack-a-mole" game. A better
>> and more reliable approach would be to teach the DL scheduler to deal with
>> tasks that span multiple root domains, a serious and substantial
>> undertaking.
>>
>> I am sending this as a starting point for discussion. I would be grateful
>> if you could take the time to comment on the approach and most importantly
>> provide input on how to deal with the open issue underlined above.
>
> I suspect that if we want to guarantee bounded tardiness then we have to
> go for a solution similar to the one suggested by Tommaso some time ago
> (if I remember well):
>
> if we want to create some "second level cpusets" inside a "parent
> cpuset", allowing deadline tasks to be placed inside both the "parent
> cpuset" and the "second level cpusets", then we have to subtract the
> "second level cpusets" maximum utilizations from the "parent cpuset"
> utilization.
>
> I am not sure how difficult it can be to implement this...

Humm... I am missing some context here. Nonetheless the approach I
was contemplating was to repeat the current mathematics to all the
root domains accessible from a p->cpus_allowed's flag. As such we'd
have the same acceptance test but repeated to more than one root
domain. To do that time can be an issue but the real problem I see is
related to the current DL code. It is geared around a single root
domain and changing that means meddling in a lot of places. I had a
prototype that was beginning to address that but decided to gather
people's opinion before getting in too deep.

>
>
> If, instead, we want to allow to guarantee the respect of all the
> deadlines, then we need to have a look at Brandenburg's paper on
> arbitrary affinities:
> https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf
>

Ouch, that's an extended read...

>
> Thanks,
> Luca

2017-08-24 07:53:36

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

On Wed, 23 Aug 2017 13:47:13 -0600
Mathieu Poirier <[email protected]> wrote:
> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> >> operations. When CPUhotplug and some CUPset manipulation take place root
> >> domains are destroyed and new ones created, loosing at the same time DL
> >> accounting pertaining to utilisation.
> >
> > Thanks for looking at this longstanding issue! I am just back from
> > vacations; in the next days I'll try your patches.
> > Do you have some kind of scripts for reproducing the issue
> > automatically? (I see that in the original email Steven described how
> > to reproduce it manually; I just wonder if anyone already scripted the
> > test).
>
> I didn't bother scripting it since it is so easy to do. I'm eager to
> see how things work out on your end.

Ok, so I'll try to reproduce the issue manually as described in Steven's
original email; I'll run some tests as soon as I finish with some stuff
that accumulated during vacations.

[...]
> >> OPEN ISSUE:
> >>
> >> Regardless of how we proceed (using existing CPUset list or new ones) we
> >> need to deal with DL tasks that span more than one root domain, something
> >> that will typically happen after a CPUset operation. For example, if we
> >> split the number of available CPUs on a system in two CPUsets and then turn
> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
> >> parent CPUset will end up spanning two root domains.
> >>
> >> One way to deal with this is to prevent CPUset operations from happening
> >> when such condition is detected, as enacted in this set.
> >
> > I think this is the simplest (if not only?) solution if we want to use
> > gEDF in each root domain.
>
> Global Earliest Deadline First? Is my interpretation correct?

Right. As far as I understand, the original SCHED_DEADLINE design is to
partition the CPUs in disjoint sets, and then use global EDF scheduling
on each one of those sets (this guarantees bounded tardiness, and if
you run some additional admission tests in user space you can also
guarantee the hard respect of every deadline).


> >> Although simple
> >> this approach feels brittle and akin to a "whack-a-mole" game. A better
> >> and more reliable approach would be to teach the DL scheduler to deal with
> >> tasks that span multiple root domains, a serious and substantial
> >> undertaking.
> >>
> >> I am sending this as a starting point for discussion. I would be grateful
> >> if you could take the time to comment on the approach and most importantly
> >> provide input on how to deal with the open issue underlined above.
> >
> > I suspect that if we want to guarantee bounded tardiness then we have to
> > go for a solution similar to the one suggested by Tommaso some time ago
> > (if I remember well):
> >
> > if we want to create some "second level cpusets" inside a "parent
> > cpuset", allowing deadline tasks to be placed inside both the "parent
> > cpuset" and the "second level cpusets", then we have to subtract the
> > "second level cpusets" maximum utilizations from the "parent cpuset"
> > utilization.
> >
> > I am not sure how difficult it can be to implement this...
>
> Humm... I am missing some context here.

Or maybe I misunderstood the issue you were seeing (I am no expert on
cpusets). Is it related to hierarchies of cpusets (with one cpuset
contained inside another one)?
Can you describe how to reproduce the problematic situation?

> Nonetheless the approach I
> was contemplating was to repeat the current mathematics to all the
> root domains accessible from a p->cpus_allowed's flag.

I think in the original SCHED_DEADLINE design there should be only one
root domain compatible with the task's affinity... If this does not
happen, I suspect it is a bug (Juri, can you confirm?).

My understanding is that with SCHED_DEADLINE cpusets should be used to
partition the system's CPUs in disjoint sets (and I think there is one
root domain for each one of those disjoint sets). And the task affinity
mask should correspond with the CPUs composing the set in which the
task is executing.


> As such we'd
> have the same acceptance test but repeated to more than one root
> domain. To do that time can be an issue but the real problem I see is
> related to the current DL code. It is geared around a single root
> domain and changing that means meddling in a lot of places. I had a
> prototype that was beginning to address that but decided to gather
> people's opinion before getting in too deep.

I still do not fully understand this (I got the impression that this is
related to hierarchies of cpusets, but I am not sure if this
understanding is correct). Maybe an example would help me to understand.



Thanks,
Luca

2017-08-24 08:31:03

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

Hi,

On 24/08/17 09:53, Luca Abeni wrote:
> On Wed, 23 Aug 2017 13:47:13 -0600
> Mathieu Poirier <[email protected]> wrote:
> > >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> > >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> > >> operations. When CPUhotplug and some CUPset manipulation take place root
> > >> domains are destroyed and new ones created, loosing at the same time DL
> > >> accounting pertaining to utilisation.
> > >
> > > Thanks for looking at this longstanding issue! I am just back from
> > > vacations; in the next days I'll try your patches.
> > > Do you have some kind of scripts for reproducing the issue
> > > automatically? (I see that in the original email Steven described how
> > > to reproduce it manually; I just wonder if anyone already scripted the
> > > test).
> >
> > I didn't bother scripting it since it is so easy to do. I'm eager to
> > see how things work out on your end.
>
> Ok, so I'll try to reproduce the issue manually as described in Steven's
> original email; I'll run some tests as soon as I finish with some stuff
> that accumulated during vacations.
>

I have to apologize myself, as I suspect I won't have much time to
properly review this set before LPC. :(
I'll try my best to have a look though.

[...]

> > Nonetheless the approach I
> > was contemplating was to repeat the current mathematics to all the
> > root domains accessible from a p->cpus_allowed's flag.
>
> I think in the original SCHED_DEADLINE design there should be only one
> root domain compatible with the task's affinity... If this does not
> happen, I suspect it is a bug (Juri, can you confirm?).
>
> My understanding is that with SCHED_DEADLINE cpusets should be used to
> partition the system's CPUs in disjoint sets (and I think there is one
> root domain for each one of those disjoint sets). And the task affinity
> mask should correspond with the CPUs composing the set in which the
> task is executing.
>

Correct. No overlapping cpusets are allowed, and a task's affinity can't
be restricted to a subset of the cpuset's root domain cpus.

[...]

Thanks,

- Juri

2017-08-24 20:32:25

by Mathieu Poirier

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

On 24 August 2017 at 01:53, Luca Abeni <[email protected]> wrote:
> On Wed, 23 Aug 2017 13:47:13 -0600
> Mathieu Poirier <[email protected]> wrote:
>> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
>> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
>> >> operations. When CPUhotplug and some CUPset manipulation take place root
>> >> domains are destroyed and new ones created, loosing at the same time DL
>> >> accounting pertaining to utilisation.
>> >
>> > Thanks for looking at this longstanding issue! I am just back from
>> > vacations; in the next days I'll try your patches.
>> > Do you have some kind of scripts for reproducing the issue
>> > automatically? (I see that in the original email Steven described how
>> > to reproduce it manually; I just wonder if anyone already scripted the
>> > test).
>>
>> I didn't bother scripting it since it is so easy to do. I'm eager to
>> see how things work out on your end.
>
> Ok, so I'll try to reproduce the issue manually as described in Steven's
> original email; I'll run some tests as soon as I finish with some stuff
> that accumulated during vacations.
>
> [...]
>> >> OPEN ISSUE:
>> >>
>> >> Regardless of how we proceed (using existing CPUset list or new ones) we
>> >> need to deal with DL tasks that span more than one root domain, something
>> >> that will typically happen after a CPUset operation. For example, if we
>> >> split the number of available CPUs on a system in two CPUsets and then turn
>> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
>> >> parent CPUset will end up spanning two root domains.
>> >>
>> >> One way to deal with this is to prevent CPUset operations from happening
>> >> when such condition is detected, as enacted in this set.
>> >
>> > I think this is the simplest (if not only?) solution if we want to use
>> > gEDF in each root domain.
>>
>> Global Earliest Deadline First? Is my interpretation correct?
>
> Right. As far as I understand, the original SCHED_DEADLINE design is to
> partition the CPUs in disjoint sets, and then use global EDF scheduling
> on each one of those sets (this guarantees bounded tardiness, and if
> you run some additional admission tests in user space you can also
> guarantee the hard respect of every deadline).
>
>
>> >> Although simple
>> >> this approach feels brittle and akin to a "whack-a-mole" game. A better
>> >> and more reliable approach would be to teach the DL scheduler to deal with
>> >> tasks that span multiple root domains, a serious and substantial
>> >> undertaking.
>> >>
>> >> I am sending this as a starting point for discussion. I would be grateful
>> >> if you could take the time to comment on the approach and most importantly
>> >> provide input on how to deal with the open issue underlined above.
>> >
>> > I suspect that if we want to guarantee bounded tardiness then we have to
>> > go for a solution similar to the one suggested by Tommaso some time ago
>> > (if I remember well):
>> >
>> > if we want to create some "second level cpusets" inside a "parent
>> > cpuset", allowing deadline tasks to be placed inside both the "parent
>> > cpuset" and the "second level cpusets", then we have to subtract the
>> > "second level cpusets" maximum utilizations from the "parent cpuset"
>> > utilization.
>> >
>> > I am not sure how difficult it can be to implement this...
>>
>> Humm... I am missing some context here.
>
> Or maybe I misunderstood the issue you were seeing (I am no expert on
> cpusets). Is it related to hierarchies of cpusets (with one cpuset
> contained inside another one)?

Having spent a lot of time in the CPUset code, I can understand the confusion.

CPUset allows to create a hierarchy of sets, _seemingly_ creating
overlapping root domains. Fortunately that isn't the case -
overlapping CPUsets are morphed together to create non-overlapping
root domains. The magic happens in rebuild_sched_domains_locked() [1]
where generate_sched_domains() [2] transforms any CPUset topology into
disjoint domains.

> Can you describe how to reproduce the problematic situation?

Let's start with a 4 CPU system (in this case the Q401c Dragon board)
where patches 1/7 and 2/7 have been applied to a vanilla kernel. I'm
also using Juri's tools [3,4] as describe in Steve's email [5].

root@linaro-developer:/home/linaro# uname -a
Linux linaro-developer 4.13.0-rc5-00012-g98bf1310205e #149 SMP PREEMPT
Thu Aug 24 13:12:39 MDT 2017 aarch64 GNU/Linux
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# cat /sys/devices/system/cpu/online
0-3
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_nr_migratory : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[1]:
.dl_nr_running : 0
.dl_nr_migratory : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[2]:
.dl_nr_running : 0
.dl_nr_migratory : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
dl_rq[3]:
.dl_nr_running : 0
.dl_nr_migratory : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0
root@linaro-developer:/home/linaro#

This checks out as expected. Now let's create 2 CPUsets and make sure
new root domains are created by setting the 'sched_load_balance' flag
to '0' on the default CPUset.

root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mem
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@linaro-developer:/sys/fs/cgroup/cpuset#

At this time runqueue0 and runqueue1 point to root domain A while
runqueue2 and runqueue3 point to root domain B (something that can't
be seen without adding more instrumentation). Newly created tasks can
roam on all the CPUs available:


root@linaro-developer:/home/linaro# ./burn &
[1] 3973
root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
Cpus_allowed: f
root@linaro-developer:/home/linaro#

The above demonstrate that even if we have two CPUsets new task belong
to the "default" CPUset and as such can use all the available CPUs.
Now let's make task 3973 a DL task:

root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
dl_rq[0]:
.dl_nr_running : 0
.dl_nr_migratory : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0 <------ Problem
dl_rq[1]:
.dl_nr_running : 0
.dl_nr_migratory : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 0 <------ Problem
dl_rq[2]:
.dl_nr_running : 1
.dl_nr_migratory : 1
.dl_bw->bw : 996147
.dl_bw->total_bw : 943718 <------ As expected
dl_rq[3]:
.dl_nr_running : 0
.dl_nr_migratory : 0
.dl_bw->bw : 996147
.dl_bw->total_bw : 943718 <------ As expected
root@linaro-developer:/home/linaro/jlelli#

When task 3973 was promoted to a DL task it was running on either CPU2
or CPU3. The acceptance test was done on root domain B and the task
utilisation added as expected. But as pointed out above task 3973 can
still be scheduled on CPU0 and CPU1 and that is a problem since the
utilisation hasn't been added there as well. The task is now spread
over two root domains rather than a single one, as currently expected
by the DL code (note that there are many ways to reproduce this
situation).

In its current form the patchset prevents specific operations from
being carried out if we recognise that a task could end up spanning
more than a single root domain. But that will break as soon as we
find a new way to create a DL task that spans multiple domains (and I
may not have caught them all either).

Another way to fix this is to do an acceptance test on all the root
domain of a task. So above we'd run the acceptance test on root
domain A and B before promoting the task. Of course we'd also have to
add the utilisation of that task to both root domain. Although simple
it goes at the core of the DL scheduler and touches pretty much every
aspect of it, something I'm reluctant to embark on.

[1]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
[2]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
[3]. https://github.com/jlelli/tests.git
[4]. https://github.com/jlelli/schedtool-dl.git
[5]. https://lkml.org/lkml/2016/2/3/966

>
>> Nonetheless the approach I
>> was contemplating was to repeat the current mathematics to all the
>> root domains accessible from a p->cpus_allowed's flag.
>
> I think in the original SCHED_DEADLINE design there should be only one
> root domain compatible with the task's affinity... If this does not
> happen, I suspect it is a bug (Juri, can you confirm?).
>
> My understanding is that with SCHED_DEADLINE cpusets should be used to
> partition the system's CPUs in disjoint sets (and I think there is one
> root domain for each one of those disjoint sets). And the task affinity
> mask should correspond with the CPUs composing the set in which the
> task is executing.
>
>
>> As such we'd
>> have the same acceptance test but repeated to more than one root
>> domain. To do that time can be an issue but the real problem I see is
>> related to the current DL code. It is geared around a single root
>> domain and changing that means meddling in a lot of places. I had a
>> prototype that was beginning to address that but decided to gather
>> people's opinion before getting in too deep.
>
> I still do not fully understand this (I got the impression that this is
> related to hierarchies of cpusets, but I am not sure if this
> understanding is correct). Maybe an example would help me to understand.

The above should say it all - please get back to me if I haven't
expressed myself clearly.

>
>
>
> Thanks,
> Luca

2017-08-25 06:03:59

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

Hi Mathieu,

On Thu, 24 Aug 2017 14:32:20 -0600
Mathieu Poirier <[email protected]> wrote:
[...]
> >> > if we want to create some "second level cpusets" inside a "parent
> >> > cpuset", allowing deadline tasks to be placed inside both the
> >> > "parent cpuset" and the "second level cpusets", then we have to
> >> > subtract the "second level cpusets" maximum utilizations from
> >> > the "parent cpuset" utilization.
> >> >
> >> > I am not sure how difficult it can be to implement this...
> >>
> >> Humm... I am missing some context here.
> >
> > Or maybe I misunderstood the issue you were seeing (I am no expert
> > on cpusets). Is it related to hierarchies of cpusets (with one
> > cpuset contained inside another one)?
>
> Having spent a lot of time in the CPUset code, I can understand the
> confusion.
>
> CPUset allows to create a hierarchy of sets, _seemingly_ creating
> overlapping root domains. Fortunately that isn't the case -
> overlapping CPUsets are morphed together to create non-overlapping
> root domains. The magic happens in rebuild_sched_domains_locked() [1]
> where generate_sched_domains() [2] transforms any CPUset topology into
> disjoint domains.

Ok; thanks for explaining

[...]
> root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mem
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 >
> set1/cpuset.cpus root@linaro-developer:/sys/fs/cgroup/cpuset# echo
> 2,3 > set2/cpuset.cpus root@linaro-developer:/sys/fs/cgroup/cpuset#
> echo 0 > cpuset.sched_load_balance
> root@linaro-developer:/sys/fs/cgroup/cpuset#
>
> At this time runqueue0 and runqueue1 point to root domain A while
> runqueue2 and runqueue3 point to root domain B (something that can't
> be seen without adding more instrumentation).

Ok; up to here, everything is clear to me ;-)

> Newly created tasks can roam on all the CPUs available:
>
>
> root@linaro-developer:/home/linaro# ./burn &
> [1] 3973
> root@linaro-developer:/home/linaro# grep
> Cpus_allowed: /proc/3973/status Cpus_allowed: f
> root@linaro-developer:/home/linaro#

This happens because the task is not in set1 nor in set2, right? I
_think_ (but I am not sure; I did not design this part of
SCHED_DEADLINE) that the original idea was that in this situation
SCHED_DEADLINE tasks can be only in set1 or in set2 (SCHED_DEADLINE
tasks are not allowed to be in the "default" CPUset, in this setup).
Is this what one of your later patches enforces?


> The above demonstrate that even if we have two CPUsets new task belong
> to the "default" CPUset and as such can use all the available CPUs.

I still have a doubt (probably showing all my ignorance about
CPUsets :)... In this situation, we have 3 CPUsets: "default",
set1, and set2... Is everyone of these CPUsets associated to a
root domain (so, we have 3 root domains)? Or only set1 and set2 are
associated to a root domain?


> Now let's make task 3973 a DL task:
>
> root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000
> 3973 root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_nr_migratory : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0 <------ Problem

Ok; I think I understand the problem, now...


> dl_rq[3]:
> .dl_nr_running : 0
> .dl_nr_migratory : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 943718 <------ As expected
> root@linaro-developer:/home/linaro/jlelli#
>
> When task 3973 was promoted to a DL task it was running on either CPU2
> or CPU3. The acceptance test was done on root domain B and the task
> utilisation added as expected. But as pointed out above task 3973 can
> still be scheduled on CPU0 and CPU1 and that is a problem since the
> utilisation hasn't been added there as well. The task is now spread
> over two root domains rather than a single one, as currently expected
> by the DL code (note that there are many ways to reproduce this
> situation).

I think this is a bug, and the only reasonable solution is to allow the
task to become SCHED_DEADLINE if it is in set1 or set2 (so, if its
affinity mask coincides exactly with all of the CPUs of the root domain
where the task utilization is added).


> In its current form the patchset prevents specific operations from
> being carried out if we recognise that a task could end up spanning
> more than a single root domain.

Good. I think this is the right way to go.


> But that will break as soon as we
> find a new way to create a DL task that spans multiple domains (and I
> may not have caught them all either).

So, we need to fix that too ;-)


> Another way to fix this is to do an acceptance test on all the root
> domain of a task.

I think we need to undestand what's the inteded behaviour of
SCHED_DEADLINE in this situation... My understanding is that
SCHED_DEADLINE is designed to do global EDF scheduling inside an
"isolated" CPUset; a SCHED_DEADLINE task spanning multiple domains would
break some SCHED_DEADLINE properties (from the scheduling theory
point of view) in some interesting ways...

I am not saying we should not do this, but I believe that allowing
tasks to span multiple domains require some redesign of the admission
test and migration mechanisms in SCHED_DEADLINE.

I think this is related to the "generic affinities" issue that Peter
mentioned some time ago.


> So above we'd run the acceptance test on root
> domain A and B before promoting the task. Of course we'd also have to
> add the utilisation of that task to both root domain. Although simple
> it goes at the core of the DL scheduler and touches pretty much every
> aspect of it, something I'm reluctant to embark on.

I see... So, the "default" CPUset does not have any root domain
associated to it? If it had, we could just subtract the maximum
utilizations of set1 and set2 to it when creating the root domains of
set1 and set2.



Thanks,
Luca

>
> [1].
> http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
> [2].
> http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
> [3]. https://github.com/jlelli/tests.git [4].
> https://github.com/jlelli/schedtool-dl.git [5].
> https://lkml.org/lkml/2016/2/3/966
>
> >
> >> Nonetheless the approach I
> >> was contemplating was to repeat the current mathematics to all the
> >> root domains accessible from a p->cpus_allowed's flag.
> >
> > I think in the original SCHED_DEADLINE design there should be only
> > one root domain compatible with the task's affinity... If this does
> > not happen, I suspect it is a bug (Juri, can you confirm?).
> >
> > My understanding is that with SCHED_DEADLINE cpusets should be used
> > to partition the system's CPUs in disjoint sets (and I think there
> > is one root domain for each one of those disjoint sets). And the
> > task affinity mask should correspond with the CPUs composing the
> > set in which the task is executing.
> >
> >
> >> As such we'd
> >> have the same acceptance test but repeated to more than one root
> >> domain. To do that time can be an issue but the real problem I
> >> see is related to the current DL code. It is geared around a
> >> single root domain and changing that means meddling in a lot of
> >> places. I had a prototype that was beginning to address that but
> >> decided to gather people's opinion before getting in too deep.
> >
> > I still do not fully understand this (I got the impression that
> > this is related to hierarchies of cpusets, but I am not sure if this
> > understanding is correct). Maybe an example would help me to
> > understand.
>
> The above should say it all - please get back to me if I haven't
> expressed myself clearly.
>
> >
> >
> >
> > Thanks,
> > Luca

2017-08-25 09:52:18

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

On Fri, 25 Aug 2017 08:02:43 +0200
luca abeni <[email protected]> wrote:
[...]
> > The above demonstrate that even if we have two CPUsets new task belong
> > to the "default" CPUset and as such can use all the available CPUs.
>
> I still have a doubt (probably showing all my ignorance about
> CPUsets :)... In this situation, we have 3 CPUsets: "default",
> set1, and set2... Is everyone of these CPUsets associated to a
> root domain (so, we have 3 root domains)? Or only set1 and set2 are
> associated to a root domain?

Ok, after reading (and hopefully understanding better :) the code, I
think this question was kind of silly... There are only 2 root domains,
corresponding to set1 and set2 (right?).

[...]

> > So above we'd run the acceptance test on root
> > domain A and B before promoting the task. Of course we'd also have to
> > add the utilisation of that task to both root domain. Although simple
> > it goes at the core of the DL scheduler and touches pretty much every
> > aspect of it, something I'm reluctant to embark on.
>
> I see... So, the "default" CPUset does not have any root domain
> associated to it? If it had, we could just subtract the maximum
> utilizations of set1 and set2 to it when creating the root domains of
> set1 and set2.
...
So, this idea of mine had no sense.

I think the correct solution is what you implemented in your patchset
(if I understand it correctly).

If we want to have task spanning multiple root domains, many more
changes in the code are needed... I am wondering if it would make more
sense to track utilizations per runqueue (instead of per root domain):
- when a task tries to become SCHED_DEADLINE, we count how many CPUs are
in its affinity mask. Let's call "n" this number
- then, we sum u / n (where "u" is the task's utilization) to the
utilization of every runqueue that is in its affinity mask, and we
check if all the sums are below the schedulability bound

For tasks spanning one single root domain, this should be equivalent to
the current admission test. Moreover, this check should ensure that no
root domain can be ever overloaded (even if tasks span multiple
domains).
But I do not know the locking implications for this idea... I suspect
it will not scale :(



Luca

2017-08-25 14:38:01

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

Hi Mathieu,

On Wed, 23 Aug 2017 13:47:13 -0600
Mathieu Poirier <[email protected]> wrote:

> On 22 August 2017 at 06:21, Luca Abeni <[email protected]> wrote:
> > Hi Mathieu,
>
> Good day to you,
>
> >
> > On Wed, 16 Aug 2017 15:20:36 -0600
> > Mathieu Poirier <[email protected]> wrote:
> >
> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> >> operations. When CPUhotplug and some CUPset manipulation take place root
> >> domains are destroyed and new ones created, loosing at the same time DL
> >> accounting pertaining to utilisation.
> >
> > Thanks for looking at this longstanding issue! I am just back from
> > vacations; in the next days I'll try your patches.
> > Do you have some kind of scripts for reproducing the issue
> > automatically? (I see that in the original email Steven described how
> > to reproduce it manually; I just wonder if anyone already scripted the
> > test).
>
> I didn't bother scripting it since it is so easy to do. I'm eager to
> see how things work out on your end.

I ran some tests with your patchset, and I confirm that it fixes the
issue originally pointed out by Steven.

But I still need to run some more tests (I'll continue on Monday).

I think I found an issue by:
1) creating two disjoint cpusets (CPUs 0 and 1 in the first cpuset,
CPUs 2 and 3 in the second one) and setting sched_load_balance to 0
2) starting a task in one of the two cpusets, and making it
SCHED_DEADLINE <--- up to here, everything looks fine
3) setting sched_load_balance to 1 <--- At this point, I think there is
a bug: the system has only one root domain, and the task utilization
is summed to it... But the task affinity mask is still the one of
the "old root domain" that was associated with the cpuset where the
task is executing.

I still need to run some experiments about this.



Thanks,
Luca

2017-08-25 19:53:43

by Mathieu Poirier

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

On 25 August 2017 at 03:52, Luca Abeni <[email protected]> wrote:
> On Fri, 25 Aug 2017 08:02:43 +0200
> luca abeni <[email protected]> wrote:
> [...]
>> > The above demonstrate that even if we have two CPUsets new task belong
>> > to the "default" CPUset and as such can use all the available CPUs.
>>
>> I still have a doubt (probably showing all my ignorance about
>> CPUsets :)... In this situation, we have 3 CPUsets: "default",
>> set1, and set2... Is everyone of these CPUsets associated to a
>> root domain (so, we have 3 root domains)? Or only set1 and set2 are
>> associated to a root domain?
>
> Ok, after reading (and hopefully understanding better :) the code, I
> think this question was kind of silly... There are only 2 root domains,
> corresponding to set1 and set2 (right?).

Correct - although there is a default CPUset there isn't a default root domain.

>
> [...]
>
>> > So above we'd run the acceptance test on root
>> > domain A and B before promoting the task. Of course we'd also have to
>> > add the utilisation of that task to both root domain. Although simple
>> > it goes at the core of the DL scheduler and touches pretty much every
>> > aspect of it, something I'm reluctant to embark on.
>>
>> I see... So, the "default" CPUset does not have any root domain
>> associated to it? If it had, we could just subtract the maximum
>> utilizations of set1 and set2 to it when creating the root domains of
>> set1 and set2.
> ...
> So, this idea of mine had no sense.
>
> I think the correct solution is what you implemented in your patchset
> (if I understand it correctly).
>
> If we want to have task spanning multiple root domains, many more
> changes in the code are needed... I am wondering if it would make more
> sense to track utilizations per runqueue (instead of per root domain):
> - when a task tries to become SCHED_DEADLINE, we count how many CPUs are
> in its affinity mask. Let's call "n" this number
> - then, we sum u / n (where "u" is the task's utilization) to the
> utilization of every runqueue that is in its affinity mask, and we
> check if all the sums are below the schedulability bound
>
> For tasks spanning one single root domain, this should be equivalent to
> the current admission test. Moreover, this check should ensure that no
> root domain can be ever overloaded (even if tasks span multiple
> domains).

This is an idea worth exploring.

> But I do not know the locking implications for this idea... I suspect
> it will not scale :(

Right, scaling could be a problem - we'd have to prototype it and see
how bad things get. We _may_ be able to figure something out with RCU
trickery.

As I mention in a previous email I toyed with the idea of extending
the DL code to support more than one root domain. Maybe it is time to
go back to it, finish the admission test and publish just that part...
At least we would have code to comment on.

Regardless of the avenue we choose to go with I think we could use my
current solution as a stepping stone while we figure out what we
really want to do. At least it would be an improvement on the current
situation.

>
>
>
> Luca

2017-08-25 20:29:30

by Mathieu Poirier

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

On 25 August 2017 at 08:37, Luca Abeni <[email protected]> wrote:
> Hi Mathieu,
>
> On Wed, 23 Aug 2017 13:47:13 -0600
> Mathieu Poirier <[email protected]> wrote:
>
>> On 22 August 2017 at 06:21, Luca Abeni <[email protected]> wrote:
>> > Hi Mathieu,
>>
>> Good day to you,
>>
>> >
>> > On Wed, 16 Aug 2017 15:20:36 -0600
>> > Mathieu Poirier <[email protected]> wrote:
>> >
>> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
>> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
>> >> operations. When CPUhotplug and some CUPset manipulation take place root
>> >> domains are destroyed and new ones created, loosing at the same time DL
>> >> accounting pertaining to utilisation.
>> >
>> > Thanks for looking at this longstanding issue! I am just back from
>> > vacations; in the next days I'll try your patches.
>> > Do you have some kind of scripts for reproducing the issue
>> > automatically? (I see that in the original email Steven described how
>> > to reproduce it manually; I just wonder if anyone already scripted the
>> > test).
>>
>> I didn't bother scripting it since it is so easy to do. I'm eager to
>> see how things work out on your end.
>
> I ran some tests with your patchset, and I confirm that it fixes the
> issue originally pointed out by Steven.
>

Good, at least it's a start.

> But I still need to run some more tests (I'll continue on Monday).
>
> I think I found an issue by:
> 1) creating two disjoint cpusets (CPUs 0 and 1 in the first cpuset,
> CPUs 2 and 3 in the second one) and setting sched_load_balance to 0
> 2) starting a task in one of the two cpusets, and making it
> SCHED_DEADLINE <--- up to here, everything looks fine
> 3) setting sched_load_balance to 1 <--- At this point, I think there is
> a bug: the system has only one root domain, and the task utilization
> is summed to it... But the task affinity mask is still the one of
> the "old root domain" that was associated with the cpuset where the
> task is executing.

I can reproduce the problem on my side as well.

This is how CPUset works and the expected behaviour. For normal tasks
it isn't a problem but I agree with you that for DL tasks, we need to
address this.

>
> I still need to run some experiments about this.

Thanks for the time,
Mathieu

>
>
>
> Thanks,
> Luca

2017-08-25 20:36:01

by Mathieu Poirier

[permalink] [raw]
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

On 25 August 2017 at 03:52, Luca Abeni <[email protected]> wrote:
> On Fri, 25 Aug 2017 08:02:43 +0200
> luca abeni <[email protected]> wrote:
> [...]
>> > The above demonstrate that even if we have two CPUsets new task belong
>> > to the "default" CPUset and as such can use all the available CPUs.
>>
>> I still have a doubt (probably showing all my ignorance about
>> CPUsets :)... In this situation, we have 3 CPUsets: "default",
>> set1, and set2... Is everyone of these CPUsets associated to a
>> root domain (so, we have 3 root domains)? Or only set1 and set2 are
>> associated to a root domain?
>
> Ok, after reading (and hopefully understanding better :) the code, I
> think this question was kind of silly... There are only 2 root domains,
> corresponding to set1 and set2 (right?).

For this scenario yes, you are correct.

>
> [...]
>
>> > So above we'd run the acceptance test on root
>> > domain A and B before promoting the task. Of course we'd also have to
>> > add the utilisation of that task to both root domain. Although simple
>> > it goes at the core of the DL scheduler and touches pretty much every
>> > aspect of it, something I'm reluctant to embark on.
>>
>> I see... So, the "default" CPUset does not have any root domain
>> associated to it? If it had, we could just subtract the maximum
>> utilizations of set1 and set2 to it when creating the root domains of
>> set1 and set2.
> ...
> So, this idea of mine had no sense.
>
> I think the correct solution is what you implemented in your patchset
> (if I understand it correctly).
>
> If we want to have task spanning multiple root domains, many more
> changes in the code are needed... I am wondering if it would make more
> sense to track utilizations per runqueue (instead of per root domain):
> - when a task tries to become SCHED_DEADLINE, we count how many CPUs are
> in its affinity mask. Let's call "n" this number
> - then, we sum u / n (where "u" is the task's utilization) to the
> utilization of every runqueue that is in its affinity mask, and we
> check if all the sums are below the schedulability bound
>
> For tasks spanning one single root domain, this should be equivalent to
> the current admission test. Moreover, this check should ensure that no
> root domain can be ever overloaded (even if tasks span multiple
> domains).
> But I do not know the locking implications for this idea... I suspect
> it will not scale :(
>
>
>
> Luca