2019-07-19 14:03:11

by Juri Lelli

Subject: [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting

Hi,

v9 of a series of patches, originally authored by Mathieu, with the intent
of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
As originally reported by Steve [1], when hotplug and/or (certain)
cpuset reconfiguration operations take place, DEADLINE bandwidth
accounting information is lost since root domains are destroyed and
recreated.

Mathieu's approach is based on restoring bandwidth accounting info on
the newly created root domains by iterating through the (DEADLINE) tasks
belonging to the configured cpuset(s).
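
In a nutshell, the approach looks like this (rough sketch only, with locking
and css refcounting elided; see 3/8 for the actual implementation):

static void rebuild_root_domains(void)
{
        struct cgroup_subsys_state *pos_css;
        struct cpuset *cs = NULL;

        /*
         * The default root domain is not attached to any cpuset; clear its
         * accounting explicitly, it gets repopulated if tasks land in it.
         */
        dl_clear_root_domain(&def_root_domain);

        cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
                struct css_task_iter it;
                struct task_struct *task;

                /*
                 * dl_add_task_root_domain() re-adds the bandwidth of each
                 * DEADLINE task to the root domain it now belongs to
                 * (non-DEADLINE tasks are simply skipped).
                 */
                css_task_iter_start(&cs->css, 0, &it);
                while ((task = css_task_iter_next(&it)))
                        dl_add_task_root_domain(task);
                css_task_iter_end(&it);
        }
}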

Apart from the usual rebase on top of cgroup/for-next, this version

- make cpuset_{can,cancel}_attach grab cpuset_rwsem for write (5/8 - Peter)
- moves v8 8/8 to 7/8 for bisectability (Peter)
- adds comment in changelog regarding normalize_rt_tasks() (8/8 - Peter)

Set also available at

https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v9

Thanks,

- Juri

[1] https://lkml.org/lkml/2016/2/3/966

Juri Lelli (6):
cpuset: Rebuild root domain deadline accounting information
sched/deadline: Fix bandwidth accounting at all levels after offline
migration
cgroup/cpuset: convert cpuset_mutex to percpu_rwsem
cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region
sched/core: Prevent race condition between cpuset and
__sched_setscheduler()

Mathieu Poirier (2):
sched/topology: Adding function partition_sched_domains_locked()
sched/core: Streamlining calls to task_rq_unlock()

include/linux/cgroup.h | 1 +
include/linux/cpuset.h | 13 ++-
include/linux/sched.h | 5 +
include/linux/sched/deadline.h | 8 ++
include/linux/sched/topology.h | 10 ++
kernel/cgroup/cgroup.c | 2 +-
kernel/cgroup/cpuset.c | 163 +++++++++++++++++++++++++--------
kernel/rcu/tree.c | 6 +-
kernel/sched/core.c | 57 ++++++++----
kernel/sched/deadline.c | 63 +++++++++++++
kernel/sched/sched.h | 3 -
kernel/sched/topology.c | 30 +++++-
12 files changed, 290 insertions(+), 71 deletions(-)

--
2.17.2


2019-07-19 14:03:36

by Juri Lelli

Subject: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Signed-off-by: Mathieu Poirier <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/cgroup.h | 1 +
include/linux/sched.h | 5 +++
include/linux/sched/deadline.h | 8 +++++
kernel/cgroup/cgroup.c | 2 +-
kernel/cgroup/cpuset.c | 64 +++++++++++++++++++++++++++++++++-
kernel/sched/deadline.c | 30 ++++++++++++++++
kernel/sched/sched.h | 3 --
kernel/sched/topology.c | 13 ++++++-
8 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3745ecdad925..107b8d5943bc 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -150,6 +150,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
struct cgroup_subsys_state **dst_cssp);

+void cgroup_enable_task_cg_lists(void);
void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
struct css_task_iter *it);
struct task_struct *css_task_iter_next(struct css_task_iter *it);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11837410690f..f74738953e70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -281,6 +281,11 @@ struct vtime {
u64 gtime;
};

+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
{
return (s64)(a - b) < 0;
}
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f582414e15ba..d356905044a2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1879,7 +1879,7 @@ static int cgroup_reconfigure(struct fs_context *fc)
*/
static bool use_task_css_set_links __read_mostly;

-static void cgroup_enable_task_cg_lists(void)
+void cgroup_enable_task_cg_lists(void)
{
struct task_struct *p, *g;

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 95da64cc8732..48d29a6112cb 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -45,6 +45,7 @@
#include <linux/proc_fs.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
+#include <linux/sched/deadline.h>
#include <linux/sched/mm.h>
#include <linux/sched/task.h>
#include <linux/seq_file.h>
@@ -947,6 +948,67 @@ static int generate_sched_domains(cpumask_var_t **domains,
return ndoms;
}

+static void update_tasks_root_domain(struct cpuset *cs)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ css_task_iter_start(&cs->css, 0, &it);
+
+ while ((task = css_task_iter_next(&it)))
+ dl_add_task_root_domain(task);
+
+ css_task_iter_end(&it);
+}
+
+static void rebuild_root_domains(void)
+{
+ struct cpuset *cs = NULL;
+ struct cgroup_subsys_state *pos_css;
+
+ lockdep_assert_held(&cpuset_mutex);
+ lockdep_assert_cpus_held();
+ lockdep_assert_held(&sched_domains_mutex);
+
+ cgroup_enable_task_cg_lists();
+
+ rcu_read_lock();
+
+ /*
+ * Clear default root domain DL accounting, it will be computed again
+ * if a task belongs to it.
+ */
+ dl_clear_root_domain(&def_root_domain);
+
+ cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+ if (cpumask_empty(cs->effective_cpus)) {
+ pos_css = css_rightmost_descendant(pos_css);
+ continue;
+ }
+
+ css_get(&cs->css);
+
+ rcu_read_unlock();
+
+ update_tasks_root_domain(cs);
+
+ rcu_read_lock();
+ css_put(&cs->css);
+ }
+ rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+ struct sched_domain_attr *dattr_new)
+{
+ mutex_lock(&sched_domains_mutex);
+ partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+ rebuild_root_domains();
+ mutex_unlock(&sched_domains_mutex);
+}
+
/*
* Rebuild scheduler domains.
*
@@ -984,7 +1046,7 @@ static void rebuild_sched_domains_locked(void)
ndoms = generate_sched_domains(&doms, &attr);

/* Have scheduler rebuild the domains */
- partition_sched_domains(ndoms, doms, attr);
+ partition_and_rebuild_sched_domains(ndoms, doms, attr);
out:
put_online_cpus();
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43901fa3f269..4cedcf8d6b03 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2283,6 +2283,36 @@ void __init init_sched_dl_class(void)
GFP_KERNEL, cpu_to_node(i));
}

+void dl_add_task_root_domain(struct task_struct *p)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+ struct dl_bw *dl_b;
+
+ rq = task_rq_lock(p, &rf);
+ if (!dl_task(p))
+ goto unlock;
+
+ dl_b = &rq->rd->dl_bw;
+ raw_spin_lock(&dl_b->lock);
+
+ __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+
+ raw_spin_unlock(&dl_b->lock);
+
+unlock:
+ task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+ rd->dl_bw.total_bw = 0;
+ raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
#endif /* CONFIG_SMP */

static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1ada0be..8607ceb11e8a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,9 +783,6 @@ struct root_domain {
struct perf_domain __rcu *pd;
};

-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
extern void init_defrootdomain(void);
extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 362c383ec4bd..9fc6ad3c341f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2193,8 +2193,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
for (i = 0; i < ndoms_cur; i++) {
for (j = 0; j < n && !new_topology; j++) {
if (cpumask_equal(doms_cur[i], doms_new[j]) &&
- dattrs_equal(dattr_cur, i, dattr_new, j))
+ dattrs_equal(dattr_cur, i, dattr_new, j)) {
+ struct root_domain *rd;
+
+ /*
+ * This domain won't be destroyed and as such
+ * its dl_bw->total_bw needs to be cleared. It
+ * will be recomputed in function
+ * update_tasks_root_domain().
+ */
+ rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+ dl_clear_root_domain(rd);
goto match1;
+ }
}
/* No match - a current sched domain not in new doms_new[] */
detach_destroy_domains(doms_cur[i]);
--
2.17.2

2019-07-19 14:03:37

by Juri Lelli

Subject: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()

From: Mathieu Poirier <[email protected]>

Calls to task_rq_unlock() are done several times in function
__sched_setscheduler(). This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location need to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Signed-off-by: Mathieu Poirier <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..acd6a9fe85bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4222,8 +4222,8 @@ static int __sched_setscheduler(struct task_struct *p,
* Changing the policy of the stop threads its a very bad idea:
*/
if (p == rq->stop) {
- task_rq_unlock(rq, p, &rf);
- return -EINVAL;
+ retval = -EINVAL;
+ goto unlock;
}

/*
@@ -4239,8 +4239,8 @@ static int __sched_setscheduler(struct task_struct *p,
goto change;

p->sched_reset_on_fork = reset_on_fork;
- task_rq_unlock(rq, p, &rf);
- return 0;
+ retval = 0;
+ goto unlock;
}
change:

@@ -4253,8 +4253,8 @@ static int __sched_setscheduler(struct task_struct *p,
if (rt_bandwidth_enabled() && rt_policy(policy) &&
task_group(p)->rt_bandwidth.rt_runtime == 0 &&
!task_group_is_autogroup(task_group(p))) {
- task_rq_unlock(rq, p, &rf);
- return -EPERM;
+ retval = -EPERM;
+ goto unlock;
}
#endif
#ifdef CONFIG_SMP
@@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
*/
if (!cpumask_subset(span, &p->cpus_allowed) ||
rq->rd->dl_bw.bw == 0) {
- task_rq_unlock(rq, p, &rf);
- return -EPERM;
+ retval = -EPERM;
+ goto unlock;
}
}
#endif
@@ -4289,8 +4289,8 @@ static int __sched_setscheduler(struct task_struct *p,
* is available.
*/
if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
- task_rq_unlock(rq, p, &rf);
- return -EBUSY;
+ retval = -EBUSY;
+ goto unlock;
}

p->sched_reset_on_fork = reset_on_fork;
@@ -4346,6 +4346,10 @@ static int __sched_setscheduler(struct task_struct *p,
preempt_enable();

return 0;
+
+unlock:
+ task_rq_unlock(rq, p, &rf);
+ return retval;
}

static int _sched_setscheduler(struct task_struct *p, int policy,
--
2.17.2

2019-07-19 14:03:46

by Juri Lelli

[permalink] [raw]
Subject: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration

If a task happens to be throttled while the CPU it was running on gets
hotplugged off, the bandwidth associated with the task is not correctly
migrated with it when the replenishment timer fires (offline_migration).

Fix things up, for this_bw, running_bw and total_bw, when replenishment
timer fires and task is migrated (dl_task_offline_migration()).

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4cedcf8d6b03..f0166ab8c6b4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -529,6 +529,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
{
struct rq *later_rq = NULL;
+ struct dl_bw *dl_b;

later_rq = find_lock_later_rq(p, rq);
if (!later_rq) {
@@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
double_lock_balance(rq, later_rq);
}

+ if (p->dl.dl_non_contending || p->dl.dl_throttled) {
+ /*
+ * Inactive timer is armed (or callback is running, but
+ * waiting for us to release rq locks). In any case, when it
+ * will file (or continue), it will see running_bw of this
+ * task migrated to later_rq (and correctly handle it).
+ */
+ sub_running_bw(&p->dl, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
+
+ add_rq_bw(&p->dl, &later_rq->dl);
+ add_running_bw(&p->dl, &later_rq->dl);
+ } else {
+ sub_rq_bw(&p->dl, &rq->dl);
+ add_rq_bw(&p->dl, &later_rq->dl);
+ }
+
+ /*
+ * And we finally need to fixup root_domain(s) bandwidth accounting,
+ * since p is still hanging out in the old (now moved to default) root
+ * domain.
+ */
+ dl_b = &rq->rd->dl_bw;
+ raw_spin_lock(&dl_b->lock);
+ __dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+ raw_spin_unlock(&dl_b->lock);
+
+ dl_b = &later_rq->rd->dl_bw;
+ raw_spin_lock(&dl_b->lock);
+ __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
+ raw_spin_unlock(&dl_b->lock);
+
set_task_cpu(p, later_rq->cpu);
double_unlock_balance(later_rq, rq);

--
2.17.2

2019-07-19 16:42:55

by Juri Lelli

Subject: [PATCH v9 7/8] rcu/tree: Setschedule gp ktread to SCHED_FIFO outside of atomic region

sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
called from an invalid (atomic) context by rcu_spawn_gp_kthread().

Fix that by simply moving sched_setscheduler_nocheck() call outside of
the atomic region, as it doesn't actually require to be guarded by
rcu_node lock.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/rcu/tree.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 980ca3ca643f..32ea75acba14 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3123,13 +3123,13 @@ static int __init rcu_spawn_gp_kthread(void)
t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name);
if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread, OOM is now expected behavior\n", __func__))
return 0;
+ if (kthread_prio)
+ sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
rnp = rcu_get_root();
raw_spin_lock_irqsave_rcu_node(rnp, flags);
rcu_state.gp_kthread = t;
- if (kthread_prio) {
+ if (kthread_prio)
sp.sched_priority = kthread_prio;
- sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
- }
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
wake_up_process(t);
rcu_spawn_nocb_kthreads();
--
2.17.2

2019-07-22 09:10:09

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()

On 7/19/19 3:59 PM, Juri Lelli wrote:
> From: Mathieu Poirier <[email protected]>

[...]

> @@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
> */
> if (!cpumask_subset(span, &p->cpus_allowed) ||

This doesn't apply cleanly on v5.3-rc1 anymore due to commit
3bd3706251ee ("sched/core: Provide a pointer to the valid CPU mask").

> rq->rd->dl_bw.bw == 0) {
> - task_rq_unlock(rq, p, &rf);
> - return -EPERM;
> + retval = -EPERM;
> + goto unlock;
> }
> }
> #endif

[...]

2019-07-22 09:32:38

by Juri Lelli

Subject: Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()

On 22/07/19 10:21, Dietmar Eggemann wrote:
> On 7/19/19 3:59 PM, Juri Lelli wrote:
> > From: Mathieu Poirier <[email protected]>
>
> [...]
>
> > @@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
> > */
> > if (!cpumask_subset(span, &p->cpus_allowed) ||
>
> This doesn't apply cleanly on v5.3-rc1 anymore due to commit
> 3bd3706251ee ("sched/core: Provide a pointer to the valid CPU mask").
>
> > rq->rd->dl_bw.bw == 0) {
> > - task_rq_unlock(rq, p, &rf);
> > - return -EPERM;
> > + retval = -EPERM;
> > + goto unlock;
> > }
> > }
> > #endif

Thanks for reporting. The set is based on cgroup/for-next (as of last
week), though. I can of course rebase on tip/sched/core or mainline if
needed.

Best,

Juri

2019-07-22 09:45:28

by Dietmar Eggemann

Subject: Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()

On 7/22/19 10:32 AM, Juri Lelli wrote:
> On 22/07/19 10:21, Dietmar Eggemann wrote:
>> On 7/19/19 3:59 PM, Juri Lelli wrote:
>>> From: Mathieu Poirier <[email protected]>
>>
>> [...]
>>
>>> @@ -4269,8 +4269,8 @@ static int __sched_setscheduler(struct task_struct *p,
>>> */
>>> if (!cpumask_subset(span, &p->cpus_allowed) ||
>>
>> This doesn't apply cleanly on v5.3-rc1 anymore due to commit
>> 3bd3706251ee ("sched/core: Provide a pointer to the valid CPU mask").
>>
>>> rq->rd->dl_bw.bw == 0) {
>>> - task_rq_unlock(rq, p, &rf);
>>> - return -EPERM;
>>> + retval = -EPERM;
>>> + goto unlock;
>>> }
>>> }
>>> #endif
>
> Thanks for reporting. The set is based on cgroup/for-next (as of last
> week), though. I can of course rebase on tip/sched/core or mainline if
> needed.

Not sure, there is another little issue on 3/8 since uclamp is in
v5.3-rc1 as well commit 69842cba9ace8 ("sched/uclamp: Add CPU's clamp
buckets refcounting").

2019-07-22 13:34:26

by Dietmar Eggemann

Subject: Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration

On 7/22/19 2:28 PM, Juri Lelli wrote:
> On 22/07/19 13:07, Dietmar Eggemann wrote:
>> On 7/19/19 3:59 PM, Juri Lelli wrote:
>>
>> [...]
>>
>>> @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
>>> double_lock_balance(rq, later_rq);
>>> }
>>>
>>> + if (p->dl.dl_non_contending || p->dl.dl_throttled) {
>>> + /*
>>> + * Inactive timer is armed (or callback is running, but
>>> + * waiting for us to release rq locks). In any case, when it
>>> + * will file (or continue), it will see running_bw of this
>>
>> s/file/fire ?
>
> Yep.
>
>>> + * task migrated to later_rq (and correctly handle it).
>>
>> Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
>> setting dl_se->dl_non_contending = 0 ?
>
> No, this is related to inactive_task_timer() callback. Since the task is
> migrated (by this function calling set_task_cpu()) because a CPU hotplug
> operation happened, we need to reflect this w.r.t. running_bw, or
> inactive_task_timer() might sub from the new CPU and cause running_bw to
> underflow.

I was more referring to the '... it will see running_bw of thus task
migrated to later_rq ...) and specifically to the HOW the timer
callback can detect this. I should have made this clearer.
inactive_task_timer() checks if (dl_se->dl_non_contending == 0) so I
thought I have to find the place where dl_se->dl_non_contending is set 0?



2019-07-22 13:37:35

by Juri Lelli

Subject: Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration

On 22/07/19 15:21, Dietmar Eggemann wrote:
> On 7/22/19 2:28 PM, Juri Lelli wrote:
> > On 22/07/19 13:07, Dietmar Eggemann wrote:
> >> On 7/19/19 3:59 PM, Juri Lelli wrote:
> >>
> >> [...]
> >>
> >>> @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
> >>> double_lock_balance(rq, later_rq);
> >>> }
> >>>
> >>> + if (p->dl.dl_non_contending || p->dl.dl_throttled) {
> >>> + /*
> >>> + * Inactive timer is armed (or callback is running, but
> >>> + * waiting for us to release rq locks). In any case, when it
> >>> + * will file (or continue), it will see running_bw of this
> >>
> >> s/file/fire ?
> >
> > Yep.
> >
> >>> + * task migrated to later_rq (and correctly handle it).
> >>
> >> Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
> >> setting dl_se->dl_non_contending = 0 ?
> >
> > No, this is related to inactive_task_timer() callback. Since the task is
> > migrated (by this function calling set_task_cpu()) because a CPU hotplug
> > operation happened, we need to reflect this w.r.t. running_bw, or
> > inactive_task_timer() might sub from the new CPU and cause running_bw to
> > underflow.
>
> I was more referring to the '... it will see running_bw of thus task
> migrated to later_rq ...) and specifically to the HOW the timer
> callback can detect this.

Oh, it actually doesn't "actively" detect this condition. The problem is
that if it still sees dl_non_contending == 1, it will sub (from the
"new" rq to which task's running_bw hasn't been added - w/o this fix)
and cause the underflow.

2019-07-22 14:44:43

by Dietmar Eggemann

Subject: Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration

On 7/19/19 3:59 PM, Juri Lelli wrote:

[...]

> @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
> double_lock_balance(rq, later_rq);
> }
>
> + if (p->dl.dl_non_contending || p->dl.dl_throttled) {
> + /*
> + * Inactive timer is armed (or callback is running, but
> + * waiting for us to release rq locks). In any case, when it
> + * will file (or continue), it will see running_bw of this

s/file/fire ?

> + * task migrated to later_rq (and correctly handle it).

Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
setting dl_se->dl_non_contending = 0 ?

[...]

2019-07-22 15:38:35

by Juri Lelli

Subject: Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration

On 22/07/19 13:07, Dietmar Eggemann wrote:
> On 7/19/19 3:59 PM, Juri Lelli wrote:
>
> [...]
>
> > @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
> > double_lock_balance(rq, later_rq);
> > }
> >
> > + if (p->dl.dl_non_contending || p->dl.dl_throttled) {
> > + /*
> > + * Inactive timer is armed (or callback is running, but
> > + * waiting for us to release rq locks). In any case, when it
> > + * will file (or continue), it will see running_bw of this
>
> s/file/fire ?

Yep.

> > + * task migrated to later_rq (and correctly handle it).
>
> Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
> setting dl_se->dl_non_contending = 0 ?

No, this is related to inactive_task_timer() callback. Since the task is
migrated (by this function calling set_task_cpu()) because a CPU hotplug
operation happened, we need to reflect this w.r.t. running_bw, or
inactive_task_timer() might sub from the new CPU and cause running_bw to
underflow.
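
FWIW, the path I'm worried about is roughly the following (heavily trimmed
sketch of inactive_task_timer(), not the exact code):

static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
{
        struct sched_dl_entity *dl_se = container_of(timer,
                                                     struct sched_dl_entity,
                                                     inactive_timer);
        struct task_struct *p = dl_task_of(dl_se);
        struct rq_flags rf;
        struct rq *rq;

        /*
         * If dl_task_offline_migration() moved the task while we were
         * spinning on the locks, this is already the *new* rq.
         */
        rq = task_rq_lock(p, &rf);

        /* ... TASK_DEAD / !dl_task() handling elided ... */

        if (dl_se->dl_non_contending == 0)
                goto unlock;

        /*
         * Subtracts from the rq the task sits on *now*; without this fix
         * its running_bw was never added there, hence the underflow.
         */
        sub_running_bw(dl_se, &rq->dl);
        dl_se->dl_non_contending = 0;
unlock:
        task_rq_unlock(rq, p, &rf);
        put_task_struct(p);

        return HRTIMER_NORESTART;
}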

Thanks,

Juri

2019-07-22 17:27:53

by Dietmar Eggemann

Subject: Re: [PATCH v9 4/8] sched/deadline: Fix bandwidth accounting at all levels after offline migration

On 7/22/19 3:35 PM, Juri Lelli wrote:
> On 22/07/19 15:21, Dietmar Eggemann wrote:
>> On 7/22/19 2:28 PM, Juri Lelli wrote:
>>> On 22/07/19 13:07, Dietmar Eggemann wrote:
>>>> On 7/19/19 3:59 PM, Juri Lelli wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
>>>>> double_lock_balance(rq, later_rq);
>>>>> }
>>>>>
>>>>> + if (p->dl.dl_non_contending || p->dl.dl_throttled) {
>>>>> + /*
>>>>> + * Inactive timer is armed (or callback is running, but
>>>>> + * waiting for us to release rq locks). In any case, when it
>>>>> + * will file (or continue), it will see running_bw of this
>>>>
>>>> s/file/fire ?
>>>
>>> Yep.
>>>
>>>>> + * task migrated to later_rq (and correctly handle it).
>>>>
>>>> Is this because of dl_task_timer()->enqueue_task_dl()->task_contending()
>>>> setting dl_se->dl_non_contending = 0 ?
>>>
>>> No, this is related to inactive_task_timer() callback. Since the task is
>>> migrated (by this function calling set_task_cpu()) because a CPU hotplug
>>> operation happened, we need to reflect this w.r.t. running_bw, or
>>> inactive_task_timer() might sub from the new CPU and cause running_bw to
>>> underflow.
>>
>> I was more referring to the '... it will see running_bw of thus task
>> migrated to later_rq ...) and specifically to the HOW the timer
>> callback can detect this.
>
> Oh, it actually doesn't "actively" detect this condition. The problem is
> that if it still sees dl_non_contending == 1, it will sub (from the
> "new" rq to which task's running_bw hasn't been added - w/o this fix)
> and cause the underflow.

I was wrong ... enqueue_task_dl() is called with ENQUEUE_REPLENISH which
doesn't call task_contending(). The comment makes sense to me now.
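
For the archive, the bit I was missing (trimmed sketch of enqueue_dl_entity(),
other branches elided):

static void
enqueue_dl_entity(struct sched_dl_entity *dl_se,
                  struct sched_dl_entity *pi_se, int flags)
{
        if (flags & ENQUEUE_WAKEUP) {
                /*
                 * Only this path calls task_contending(), which is what
                 * clears dl_non_contending (if it was set).
                 */
                task_contending(dl_se, flags);
                update_dl_entity(dl_se, pi_se);
        } else if (flags & ENQUEUE_REPLENISH) {
                /*
                 * dl_task_timer() enqueues with ENQUEUE_REPLENISH, so
                 * dl_non_contending is left untouched here.
                 */
                replenish_dl_entity(dl_se, pi_se);
        }

        __enqueue_dl_entity(dl_se);
}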


2019-07-23 20:07:32

by Peter Zijlstra

Subject: Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()

On Mon, Jul 22, 2019 at 10:32:14AM +0200, Juri Lelli wrote:

> Thanks for reporting. The set is based on cgroup/for-next (as of last
> week), though. I can of course rebase on tip/sched/core or mainline if
> needed.

TJ; I would like to take these patches through the scheduler tree if you
don't mind. Afaict there's no real conflict vs cgroup/for-next (I
applied the patches and then did a pull of cgroup/for-next which
finished without complaints).

2019-07-23 20:26:43

by Peter Zijlstra

Subject: Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()

On Mon, Jul 22, 2019 at 11:08:17AM +0200, Dietmar Eggemann wrote:

> Not sure, there is another little issue on 3/8 since uclamp is in
> v5.3-rc1 as well commit 69842cba9ace8 ("sched/uclamp: Add CPU's clamp
> buckets refcounting").

Also, 8/8, but all conflicts are trivial and I've fixed them up.

2019-07-24 00:21:02

by Juri Lelli

Subject: Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()

On 23/07/19 06:11, Tejun Heo wrote:
> On Tue, Jul 23, 2019 at 12:31:31PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 22, 2019 at 10:32:14AM +0200, Juri Lelli wrote:
> >
> > > Thanks for reporting. The set is based on cgroup/for-next (as of last
> > > week), though. I can of course rebase on tip/sched/core or mainline if
> > > needed.
> >
> > TJ; I would like to take these patches through the scheduler tree if you
> > don't mind. Afaict there's no real conflict vs cgroup/for-next (I
> > applied the patches and then did a pull of cgroup/for-next which
> > finished without complaints).
>
> Yeah, for sure, please go ahead.
>
> Thanks.

Thanks!

2019-07-24 02:15:26

by Tejun Heo

Subject: Re: [PATCH v9 2/8] sched/core: Streamlining calls to task_rq_unlock()

On Tue, Jul 23, 2019 at 12:31:31PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 22, 2019 at 10:32:14AM +0200, Juri Lelli wrote:
>
> > Thanks for reporting. The set is based on cgroup/for-next (as of last
> > week), though. I can of course rebase on tip/sched/core or mainline if
> > needed.
>
> TJ; I would like to take these patches through the scheduler tree if you
> don't mind. Afaict there's no real conflict vs cgroup/for-next (I
> applied the patches and then did a pull of cgroup/for-next which
> finished without complaints).

Yeah, for sure, please go ahead.

Thanks.

--
tejun

2019-07-24 08:46:39

by Dietmar Eggemann

Subject: Re: [PATCH v9 0/8] sched/deadline: fix cpusets bandwidth accounting

On 7/19/19 3:59 PM, Juri Lelli wrote:
> Hi,
>
> v9 of a series of patches, originally authored by Mathieu, with the intent
> of fixing a long standing issue of SCHED_DEADLINE bandwidth accounting.
> As originally reported by Steve [1], when hotplug and/or (certain)
> cpuset reconfiguration operations take place, DEADLINE bandwidth
> accounting information is lost since root domains are destroyed and
> recreated.
>
> Mathieu's approach is based on restoring bandwidth accounting info on
> the newly created root domains by iterating through the (DEADLINE) tasks
> belonging to the configured cpuset(s).
>
> Apart from the usual rebase on top of cgroup/for-next, this version
>
> - make cpuset_{can,cancel}_attach grab cpuset_rwsem for write (5/8 - Peter)
> - moves v8 8/8 to 7/8 for bisectability (Peter)
> - adds comment in changelog regarding normalize_rt_tasks() (8/8 - Peter)
>
> Set also available at
>
> https://github.com/jlelli/linux.git fixes/deadline/root-domain-accounting-v9

Tested-by: Dietmar Eggemann <[email protected]>

Test description:

Juno-r0 (Arm64 big/Little [L b b L L L]) with 6 DL tasks
(12000/100000/100000).

Rt-app runs DL workload for 10min.

After rt-app is launched, start a CPU hotplug stress test (random CPU hp
in/out except for CPU1 (CPU orig capacity 1024 (big))) during the entire
rt-app run.

Tests ran with 1 and 2 (exclusive cpusets ([0,3-5], [1-2])) root domains.

2019-07-25 14:24:47

by Ingo Molnar

Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information


* Juri Lelli <[email protected]> wrote:

> When the topology of root domains is modified by CPUset or CPUhotplug
> operations information about the current deadline bandwidth held in the
> root domain is lost.
>
> This patch addresses the issue by recalculating the lost deadline
> bandwidth information by circling through the deadline tasks held in
> CPUsets and adding their current load to the root domain they are
> associated with.
>
> Signed-off-by: Mathieu Poirier <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>

Was this commit written by Mathieu? If yes then it's missing a From line.
If not then the Signed-off-by should probably be changed to Acked-by or
Reviewed-by?

Thanks,

Ingo

2019-07-25 14:27:49

by Juri Lelli

Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

Hi,

On 25/07/19 15:56, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Juri Lelli <[email protected]> wrote:
> >
> > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > operations information about the current deadline bandwidth held in the
> > > root domain is lost.
> > >
> > > This patch addresses the issue by recalculating the lost deadline
> > > bandwidth information by circling through the deadline tasks held in
> > > CPUsets and adding their current load to the root domain they are
> > > associated with.
> > >
> > > Signed-off-by: Mathieu Poirier <[email protected]>
> > > Signed-off-by: Juri Lelli <[email protected]>
> >
> > Was this commit written by Mathieu? If yes then it's missing a From line.
> > If not then the Signed-off-by should probably be changed to Acked-by or
> > Reviewed-by?
>
> So for now I'm assuming that the original patch was written by Mathieu,
> with modifications by you. So I added his From line and extended the SOB
> chain with the additional information that you modified the patch:
>
> Tested-by: Dietmar Eggemann <[email protected]>
> Signed-off-by: Mathieu Poirier <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> [ Various additional modifications. ]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> Let me know if that's not accurate!

Looks good to me, thanks. And sorry for the confusion.

Best,

Juri

2019-07-25 17:54:39

by Ingo Molnar

Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information


* Ingo Molnar <[email protected]> wrote:

>
> * Juri Lelli <[email protected]> wrote:
>
> > When the topology of root domains is modified by CPUset or CPUhotplug
> > operations information about the current deadline bandwidth held in the
> > root domain is lost.
> >
> > This patch addresses the issue by recalculating the lost deadline
> > bandwidth information by circling through the deadline tasks held in
> > CPUsets and adding their current load to the root domain they are
> > associated with.
> >
> > Signed-off-by: Mathieu Poirier <[email protected]>
> > Signed-off-by: Juri Lelli <[email protected]>
>
> Was this commit written by Mathieu? If yes then it's missing a From line.
> If not then the Signed-off-by should probably be changed to Acked-by or
> Reviewed-by?

So for now I'm assuming that the original patch was written by Mathieu,
with modifications by you. So I added his From line and extended the SOB
chain with the additional information that you modified the patch:

Tested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Mathieu Poirier <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
[ Various additional modifications. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

Let me know if that's not accurate!

Thanks,

Ingo

2019-07-25 18:05:18

by Mathieu Poirier

Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

On Thu, 25 Jul 2019 at 07:56, Ingo Molnar <[email protected]> wrote:
>
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Juri Lelli <[email protected]> wrote:
> >
> > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > operations information about the current deadline bandwidth held in the
> > > root domain is lost.
> > >
> > > This patch addresses the issue by recalculating the lost deadline
> > > bandwidth information by circling through the deadline tasks held in
> > > CPUsets and adding their current load to the root domain they are
> > > associated with.
> > >
> > > Signed-off-by: Mathieu Poirier <[email protected]>
> > > Signed-off-by: Juri Lelli <[email protected]>
> >
> > Was this commit written by Mathieu? If yes then it's missing a From line.
> > If not then the Signed-off-by should probably be changed to Acked-by or
> > Reviewed-by?
>
> So for now I'm assuming that the original patch was written by Mathieu,
> with modifications by you. So I added his From line and extended the SOB
> chain with the additional information that you modified the patch:
>
> Tested-by: Dietmar Eggemann <[email protected]>
> Signed-off-by: Mathieu Poirier <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> [ Various additional modifications. ]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> Let me know if that's not accurate!

You are correct - thanks,
Mathieu

>
> Thanks,
>
> Ingo

Subject: [tip:sched/core] sched/deadline: Fix bandwidth accounting at all levels after offline migration

Commit-ID: 59d06cea1198d665ba11f7e8c5f45b00ff2e4812
Gitweb: https://git.kernel.org/tip/59d06cea1198d665ba11f7e8c5f45b00ff2e4812
Author: Juri Lelli <[email protected]>
AuthorDate: Fri, 19 Jul 2019 15:59:56 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 25 Jul 2019 15:55:02 +0200

sched/deadline: Fix bandwidth accounting at all levels after offline migration

If a task happens to be throttled while the CPU it was running on gets
hotplugged off, the bandwidth associated with the task is not correctly
migrated with it when the replenishment timer fires (offline_migration).

Fix things up, for this_bw, running_bw and total_bw, when replenishment
timer fires and task is migrated (dl_task_offline_migration()).

Tested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f9d2180be23..039dde2b1dac 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -529,6 +529,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
{
struct rq *later_rq = NULL;
+ struct dl_bw *dl_b;

later_rq = find_lock_later_rq(p, rq);
if (!later_rq) {
@@ -557,6 +558,38 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
double_lock_balance(rq, later_rq);
}

+ if (p->dl.dl_non_contending || p->dl.dl_throttled) {
+ /*
+ * Inactive timer is armed (or callback is running, but
+ * waiting for us to release rq locks). In any case, when it
+ * will fire (or continue), it will see running_bw of this
+ * task migrated to later_rq (and correctly handle it).
+ */
+ sub_running_bw(&p->dl, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
+
+ add_rq_bw(&p->dl, &later_rq->dl);
+ add_running_bw(&p->dl, &later_rq->dl);
+ } else {
+ sub_rq_bw(&p->dl, &rq->dl);
+ add_rq_bw(&p->dl, &later_rq->dl);
+ }
+
+ /*
+ * And we finally need to fixup root_domain(s) bandwidth accounting,
+ * since p is still hanging out in the old (now moved to default) root
+ * domain.
+ */
+ dl_b = &rq->rd->dl_bw;
+ raw_spin_lock(&dl_b->lock);
+ __dl_sub(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+ raw_spin_unlock(&dl_b->lock);
+
+ dl_b = &later_rq->rd->dl_bw;
+ raw_spin_lock(&dl_b->lock);
+ __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
+ raw_spin_unlock(&dl_b->lock);
+
set_task_cpu(p, later_rq->cpu);
double_unlock_balance(later_rq, rq);


Subject: [tip:sched/core] rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region

Commit-ID: 1a763fd7c6335e3122c1cc09576ef6c99ada4267
Gitweb: https://git.kernel.org/tip/1a763fd7c6335e3122c1cc09576ef6c99ada4267
Author: Juri Lelli <[email protected]>
AuthorDate: Fri, 19 Jul 2019 15:59:59 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 25 Jul 2019 15:55:03 +0200

rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region

sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
called from an invalid (atomic) context by rcu_spawn_gp_kthread().

Fix that by simply moving sched_setscheduler_nocheck() call outside of
the atomic region, as it doesn't actually require to be guarded by
rcu_node lock.

Suggested-by: Peter Zijlstra <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/rcu/tree.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a14e5fbbea46..eb764c24bc4d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3234,13 +3234,13 @@ static int __init rcu_spawn_gp_kthread(void)
t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name);
if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread, OOM is now expected behavior\n", __func__))
return 0;
+ if (kthread_prio)
+ sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
rnp = rcu_get_root();
raw_spin_lock_irqsave_rcu_node(rnp, flags);
rcu_state.gp_kthread = t;
- if (kthread_prio) {
+ if (kthread_prio)
sp.sched_priority = kthread_prio;
- sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
- }
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
wake_up_process(t);
rcu_spawn_nocb_kthreads();

Subject: [tip:sched/core] cpusets: Rebuild root domain deadline accounting information

Commit-ID: f9a25f776d780bfa3279f0b6e5f5cf3224997976
Gitweb: https://git.kernel.org/tip/f9a25f776d780bfa3279f0b6e5f5cf3224997976
Author: Mathieu Poirier <[email protected]>
AuthorDate: Fri, 19 Jul 2019 15:59:55 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 25 Jul 2019 15:55:01 +0200

cpusets: Rebuild root domain deadline accounting information

When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Tested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Mathieu Poirier <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
[ Various additional modifications. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/cgroup.h | 1 +
include/linux/sched.h | 5 ++++
include/linux/sched/deadline.h | 8 ++++++
kernel/cgroup/cgroup.c | 2 +-
kernel/cgroup/cpuset.c | 64 +++++++++++++++++++++++++++++++++++++++++-
kernel/sched/deadline.c | 30 ++++++++++++++++++++
kernel/sched/sched.h | 3 --
kernel/sched/topology.c | 13 ++++++++-
8 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f6b048902d6c..3ba3e6da13a6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -150,6 +150,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
struct cgroup_subsys_state **dst_cssp);

+void cgroup_enable_task_cg_lists(void);
void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
struct css_task_iter *it);
struct task_struct *css_task_iter_next(struct css_task_iter *it);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..b94ad92dfbe6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -295,6 +295,11 @@ enum uclamp_id {
UCLAMP_CNT
};

+#ifdef CONFIG_SMP
+extern struct root_domain def_root_domain;
+extern struct mutex sched_domains_mutex;
+#endif
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 0cb034331cbb..1aff00b65f3c 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -24,3 +24,11 @@ static inline bool dl_time_before(u64 a, u64 b)
{
return (s64)(a - b) < 0;
}
+
+#ifdef CONFIG_SMP
+
+struct root_domain;
+extern void dl_add_task_root_domain(struct task_struct *p);
+extern void dl_clear_root_domain(struct root_domain *rd);
+
+#endif /* CONFIG_SMP */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 753afbca549f..4b5bc452176c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1891,7 +1891,7 @@ static int cgroup_reconfigure(struct fs_context *fc)
*/
static bool use_task_css_set_links __read_mostly;

-static void cgroup_enable_task_cg_lists(void)
+void cgroup_enable_task_cg_lists(void)
{
struct task_struct *p, *g;

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5aa37531ce76..846cbdb68566 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -45,6 +45,7 @@
#include <linux/proc_fs.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
+#include <linux/sched/deadline.h>
#include <linux/sched/mm.h>
#include <linux/sched/task.h>
#include <linux/seq_file.h>
@@ -894,6 +895,67 @@ done:
return ndoms;
}

+static void update_tasks_root_domain(struct cpuset *cs)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ css_task_iter_start(&cs->css, 0, &it);
+
+ while ((task = css_task_iter_next(&it)))
+ dl_add_task_root_domain(task);
+
+ css_task_iter_end(&it);
+}
+
+static void rebuild_root_domains(void)
+{
+ struct cpuset *cs = NULL;
+ struct cgroup_subsys_state *pos_css;
+
+ lockdep_assert_held(&cpuset_mutex);
+ lockdep_assert_cpus_held();
+ lockdep_assert_held(&sched_domains_mutex);
+
+ cgroup_enable_task_cg_lists();
+
+ rcu_read_lock();
+
+ /*
+ * Clear default root domain DL accounting, it will be computed again
+ * if a task belongs to it.
+ */
+ dl_clear_root_domain(&def_root_domain);
+
+ cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+
+ if (cpumask_empty(cs->effective_cpus)) {
+ pos_css = css_rightmost_descendant(pos_css);
+ continue;
+ }
+
+ css_get(&cs->css);
+
+ rcu_read_unlock();
+
+ update_tasks_root_domain(cs);
+
+ rcu_read_lock();
+ css_put(&cs->css);
+ }
+ rcu_read_unlock();
+}
+
+static void
+partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
+ struct sched_domain_attr *dattr_new)
+{
+ mutex_lock(&sched_domains_mutex);
+ partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
+ rebuild_root_domains();
+ mutex_unlock(&sched_domains_mutex);
+}
+
/*
* Rebuild scheduler domains.
*
@@ -931,7 +993,7 @@ static void rebuild_sched_domains_locked(void)
ndoms = generate_sched_domains(&doms, &attr);

/* Have scheduler rebuild the domains */
- partition_sched_domains(ndoms, doms, attr);
+ partition_and_rebuild_sched_domains(ndoms, doms, attr);
out:
put_online_cpus();
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ef5b9f6b1d42..0f9d2180be23 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2283,6 +2283,36 @@ void __init init_sched_dl_class(void)
GFP_KERNEL, cpu_to_node(i));
}

+void dl_add_task_root_domain(struct task_struct *p)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+ struct dl_bw *dl_b;
+
+ rq = task_rq_lock(p, &rf);
+ if (!dl_task(p))
+ goto unlock;
+
+ dl_b = &rq->rd->dl_bw;
+ raw_spin_lock(&dl_b->lock);
+
+ __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
+
+ raw_spin_unlock(&dl_b->lock);
+
+unlock:
+ task_rq_unlock(rq, p, &rf);
+}
+
+void dl_clear_root_domain(struct root_domain *rd)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
+ rd->dl_bw.total_bw = 0;
+ raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
+}
+
#endif /* CONFIG_SMP */

static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 16126efd14ed..7583faddba33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -778,9 +778,6 @@ struct root_domain {
struct perf_domain __rcu *pd;
};

-extern struct root_domain def_root_domain;
-extern struct mutex sched_domains_mutex;
-
extern void init_defrootdomain(void);
extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5a174ae6ecf3..8f83e8e3ea9a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2203,8 +2203,19 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
for (i = 0; i < ndoms_cur; i++) {
for (j = 0; j < n && !new_topology; j++) {
if (cpumask_equal(doms_cur[i], doms_new[j]) &&
- dattrs_equal(dattr_cur, i, dattr_new, j))
+ dattrs_equal(dattr_cur, i, dattr_new, j)) {
+ struct root_domain *rd;
+
+ /*
+ * This domain won't be destroyed and as such
+ * its dl_bw->total_bw needs to be cleared. It
+ * will be recomputed in function
+ * update_tasks_root_domain().
+ */
+ rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
+ dl_clear_root_domain(rd);
goto match1;
+ }
}
/* No match - a current sched domain not in new doms_new[] */
detach_destroy_domains(doms_cur[i]);

Subject: [tip:sched/core] sched/core: Streamle calls to task_rq_unlock()

Commit-ID: 4b211f2b129dd1f6a6956bbc76e2f232c1ec3ad8
Gitweb: https://git.kernel.org/tip/4b211f2b129dd1f6a6956bbc76e2f232c1ec3ad8
Author: Mathieu Poirier <[email protected]>
AuthorDate: Fri, 19 Jul 2019 15:59:54 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 25 Jul 2019 15:51:57 +0200

sched/core: Streamle calls to task_rq_unlock()

Calls to task_rq_unlock() are done several times in the
__sched_setscheduler() function. This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location need to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Mathieu Poirier <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b22e55cebe8..1af3d2dc6b29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4712,8 +4712,8 @@ recheck:
* Changing the policy of the stop threads its a very bad idea:
*/
if (p == rq->stop) {
- task_rq_unlock(rq, p, &rf);
- return -EINVAL;
+ retval = -EINVAL;
+ goto unlock;
}

/*
@@ -4731,8 +4731,8 @@ recheck:
goto change;

p->sched_reset_on_fork = reset_on_fork;
- task_rq_unlock(rq, p, &rf);
- return 0;
+ retval = 0;
+ goto unlock;
}
change:

@@ -4745,8 +4745,8 @@ change:
if (rt_bandwidth_enabled() && rt_policy(policy) &&
task_group(p)->rt_bandwidth.rt_runtime == 0 &&
!task_group_is_autogroup(task_group(p))) {
- task_rq_unlock(rq, p, &rf);
- return -EPERM;
+ retval = -EPERM;
+ goto unlock;
}
#endif
#ifdef CONFIG_SMP
@@ -4761,8 +4761,8 @@ change:
*/
if (!cpumask_subset(span, p->cpus_ptr) ||
rq->rd->dl_bw.bw == 0) {
- task_rq_unlock(rq, p, &rf);
- return -EPERM;
+ retval = -EPERM;
+ goto unlock;
}
}
#endif
@@ -4781,8 +4781,8 @@ change:
* is available.
*/
if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) {
- task_rq_unlock(rq, p, &rf);
- return -EBUSY;
+ retval = -EBUSY;
+ goto unlock;
}

p->sched_reset_on_fork = reset_on_fork;
@@ -4840,6 +4840,10 @@ change:
preempt_enable();

return 0;
+
+unlock:
+ task_rq_unlock(rq, p, &rf);
+ return retval;
}

static int _sched_setscheduler(struct task_struct *p, int policy,

2019-11-04 18:58:57

by Michal Koutný

Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

Hello.

I came across a cgroup_enable_task_cg_lists() call from the cpuset
controller...

On Fri, Jul 19, 2019 at 03:59:55PM +0200, Juri Lelli <[email protected]> wrote:
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> [...]
> +static void rebuild_root_domains(void)
> +{
> + struct cpuset *cs = NULL;
> + struct cgroup_subsys_state *pos_css;
> +
> + lockdep_assert_held(&cpuset_mutex);
> + lockdep_assert_cpus_held();
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + cgroup_enable_task_cg_lists();
...and I wonder why is it necessary to call at this place?

(IIUC, before cpuset hierarchy is anywhere mounted it's mere top_cpuset,
i.e. processing the top_cpuset alone is enough. And if anyone wants to
create any non-root cpusets, they have to mount the hierarchy first, i.e.
no need to call cgroup_enable_task_cg_lists() manually. Also if I'm not
overlooking anything, the race between hotplug and mount (more precisely
new cpuset creation) should be synchronized by cpuset_mutex.)

Thanks,
Michal



2019-11-05 12:07:36

by Michal Koutný

Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

On Mon, Nov 04, 2019 at 07:57:42PM +0100, Michal Koutný <[email protected]> wrote:
> I came across a cgroup_enable_task_cg_lists() call from the cpuset
> controller...
>[...]
> > + cgroup_enable_task_cg_lists();
> ...and I wonder why is it necessary to call at this place?
Consider this resolved.

I realized the on-demand linking is being removed [1].

Michal

[1] https://lore.kernel.org/lkml/[email protected]/



2022-12-16 23:45:10

by Qais Yousef

Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

Hi

On 07/19/19 15:59, Juri Lelli wrote:
> When the topology of root domains is modified by CPUset or CPUhotplug
> operations information about the current deadline bandwidth held in the
> root domain is lost.
>
> This patch addresses the issue by recalculating the lost deadline
> bandwidth information by circling through the deadline tasks held in
> CPUsets and adding their current load to the root domain they are
> associated with.
>
> Signed-off-by: Mathieu Poirier <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> ---

We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
consistently) on suspend/resume.

Do we actually need to rebuild_root_domain() if we're going through
a suspend/resume cycle?

ie: would something like the below make sense? We'd skip this logic if
cpuhp_tasks_frozen is set which indicates it's not a real hotplug operation but
we're suspending/resuming.


Cheers

--
Qais Yousef


--->8---


From 4cfd50960ad872c5eb810ad3038eaf840bab5182 Mon Sep 17 00:00:00 2001
From: Qais Yousef <[email protected]>
Date: Tue, 29 Nov 2022 19:01:52 +0000
Subject: [PATCH] sched: cpuset: Don't rebuild sched domains on suspend-resume

Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
enabled rebuilding sched domain on cpuset and hotplug operations to
correct deadline accounting.

Rebuilding sched domain is a slow operation and we see 10+ ms delays
on suspend-resume because of that.

Since nothing is expected to change on suspend-resume operation; skip
rebuilding the sched domains to regain some of the time lost.

Debugged-by: Rick Yiu <[email protected]>
Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/cgroup/cpuset.c | 6 ++++++
kernel/sched/deadline.c | 3 +++
2 files changed, 9 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b474289c15b8..2ff68d625b7b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1067,6 +1067,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
struct css_task_iter it;
struct task_struct *task;

+ if (cpuhp_tasks_frozen)
+ return;
+
css_task_iter_start(&cs->css, 0, &it);

while ((task = css_task_iter_next(&it)))
@@ -1084,6 +1087,9 @@ static void rebuild_root_domains(void)
lockdep_assert_cpus_held();
lockdep_assert_held(&sched_domains_mutex);

+ if (cpuhp_tasks_frozen)
+ return;
+
rcu_read_lock();

/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0d97d54276cc..42c1143a3956 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2575,6 +2575,9 @@ void dl_clear_root_domain(struct root_domain *rd)
{
unsigned long flags;

+ if (cpuhp_tasks_frozen)
+ return;
+
raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
rd->dl_bw.total_bw = 0;
raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
--
2.25.1

2022-12-17 02:48:49

by Waiman Long

Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

On 12/16/22 18:35, Qais Yousef wrote:
> Hi
>
> On 07/19/19 15:59, Juri Lelli wrote:
>> When the topology of root domains is modified by CPUset or CPUhotplug
>> operations information about the current deadline bandwidth held in the
>> root domain is lost.
>>
>> This patch addresses the issue by recalculating the lost deadline
>> bandwidth information by circling through the deadline tasks held in
>> CPUsets and adding their current load to the root domain they are
>> associated with.
>>
>> Signed-off-by: Mathieu Poirier <[email protected]>
>> Signed-off-by: Juri Lelli <[email protected]>
>> ---
> We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> consistently) on suspend/resume.
>
> Do we actually need to rebuild_root_domain() if we're going through
> a suspend/resume cycle?
>
> ie: would something like the below make sense? We'd skip this logic if
> cpuhp_tasks_frozen is set which indicates it's not a real hotplug operation but
> we're suspending/resuming.
>
>
> Cheers
>
> --
> Qais Yousef
>
>
> --->8---
>
>
> From 4cfd50960ad872c5eb810ad3038eaf840bab5182 Mon Sep 17 00:00:00 2001
> From: Qais Yousef <[email protected]>
> Date: Tue, 29 Nov 2022 19:01:52 +0000
> Subject: [PATCH] sched: cpuset: Don't rebuild sched domains on suspend-resume
>
> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
> enabled rebuilding sched domain on cpuset and hotplug operations to
> correct deadline accounting.
>
> Rebuilding sched domain is a slow operation and we see 10+ ms delays
> on suspend-resume because of that.
>
> Since nothing is expected to change on suspend-resume operation; skip
> rebuilding the sched domains to regain some of the time lost.
>
> Debugged-by: Rick Yiu <[email protected]>
> Signed-off-by: Qais Yousef (Google) <[email protected]>
> ---
> kernel/cgroup/cpuset.c | 6 ++++++
> kernel/sched/deadline.c | 3 +++
> 2 files changed, 9 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b474289c15b8..2ff68d625b7b 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1067,6 +1067,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
> struct css_task_iter it;
> struct task_struct *task;
>
> + if (cpuhp_tasks_frozen)
> + return;
> +
> css_task_iter_start(&cs->css, 0, &it);
>
> while ((task = css_task_iter_next(&it)))
> @@ -1084,6 +1087,9 @@ static void rebuild_root_domains(void)
> lockdep_assert_cpus_held();
> lockdep_assert_held(&sched_domains_mutex);
>
> + if (cpuhp_tasks_frozen)
> + return;
> +
> rcu_read_lock();
>
> /*

rebuild_root_domains() is the only caller of update_tasks_root_domain().
So the first hunk is redundant as update_tasks_root_domain() won't be
called when cpuhp_tasks_frozen is set.
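
IOW, a minimal sketch of what the simplified cpuset.c change could look
like - illustrative only, it simply keeps the rebuild_root_domains() hunk
from the patch above plus a comment, and the hunk offsets are approximate:

--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1084,6 +1084,10 @@ static void rebuild_root_domains(void)
 lockdep_assert_cpus_held();
 lockdep_assert_held(&sched_domains_mutex);

+ /* Suspend/resume doesn't change topology; keep the old accounting. */
+ if (cpuhp_tasks_frozen)
+ return;
+
 rcu_read_lock();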

Cheers,
Longman

2022-12-17 22:54:12

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

On 12/16/22 21:26, Waiman Long wrote:
> On 12/16/22 18:35, Qais Yousef wrote:
> > Hi
> >
> > On 07/19/19 15:59, Juri Lelli wrote:
> > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > operations information about the current deadline bandwidth held in the
> > > root domain is lost.
> > >
> > > This patch addresses the issue by recalculating the lost deadline
> > > bandwidth information by circling through the deadline tasks held in
> > > CPUsets and adding their current load to the root domain they are
> > > associated with.
> > >
> > > Signed-off-by: Mathieu Poirier <[email protected]>
> > > Signed-off-by: Juri Lelli <[email protected]>
> > > ---
> > We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> > consistently) on suspend/resume.
> >
> > Do we actually need to rebuild_root_domain() if we're going through
> > a suspend/resume cycle?
> >
> > ie: would something like the below make sense? We'd skip this logic if
> > cpuhp_tasks_frozen is set which indicates it's not a real hotplug operation but
> > we're suspending/resuming.
> >
> >
> > Cheers
> >
> > --
> > Qais Yousef
> >
> >
> > --->8---
> >
> >
> > From 4cfd50960ad872c5eb810ad3038eaf840bab5182 Mon Sep 17 00:00:00 2001
> > From: Qais Yousef <[email protected]>
> > Date: Tue, 29 Nov 2022 19:01:52 +0000
> > Subject: [PATCH] sched: cpuset: Don't rebuild sched domains on suspend-resume
> >
> > Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
> > enabled rebuilding sched domain on cpuset and hotplug operations to
> > correct deadline accounting.
> >
> > Rebuilding sched domain is a slow operation and we see 10+ ms delays
> > on suspend-resume because of that.
> >
> > Since nothing is expected to change on suspend-resume operation; skip
> > rebuilding the sched domains to regain some of the time lost.
> >
> > Debugged-by: Rick Yiu <[email protected]>
> > Signed-off-by: Qais Yousef (Google) <[email protected]>
> > ---
> > kernel/cgroup/cpuset.c | 6 ++++++
> > kernel/sched/deadline.c | 3 +++
> > 2 files changed, 9 insertions(+)
> >
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index b474289c15b8..2ff68d625b7b 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -1067,6 +1067,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
> > struct css_task_iter it;
> > struct task_struct *task;
> > + if (cpuhp_tasks_frozen)
> > + return;
> > +
> > css_task_iter_start(&cs->css, 0, &it);
> > while ((task = css_task_iter_next(&it)))
> > @@ -1084,6 +1087,9 @@ static void rebuild_root_domains(void)
> > lockdep_assert_cpus_held();
> > lockdep_assert_held(&sched_domains_mutex);
> > + if (cpuhp_tasks_frozen)
> > + return;
> > +
> > rcu_read_lock();
> > /*
>
> rebuild_root_domains() is the only caller of update_tasks_root_domain(). So
> the first hunk is redundant as update_tasks_root_domain() won't be called
> when cpuhp_tasks_frozen is set.

True. I was overzealous and protected all the functions.


Thanks!

--
Qais Yousef

2022-12-19 09:07:18

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

On Sat, 17 Dec 2022 at 00:35, Qais Yousef <[email protected]> wrote:
>
> Hi
>
> On 07/19/19 15:59, Juri Lelli wrote:
> > When the topology of root domains is modified by CPUset or CPUhotplug
> > operations information about the current deadline bandwidth held in the
> > root domain is lost.
> >
> > This patch addresses the issue by recalculating the lost deadline
> > bandwidth information by circling through the deadline tasks held in
> > CPUsets and adding their current load to the root domain they are
> > associated with.
> >
> > Signed-off-by: Mathieu Poirier <[email protected]>
> > Signed-off-by: Juri Lelli <[email protected]>
> > ---
>
> We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> consistently) on suspend/resume.
>
> Do we actually need to rebuild_root_domain() if we're going through
> a suspend/resume cycle?

During suspend to RAM, there are CPU hotplug operations, but if you use
suspend to idle, you will skip the CPU hotplug operations and their
associated rebuild.
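
For reference, a minimal userspace sketch of how suspend-to-idle can be
selected - assuming the kernel exposes /sys/power/mem_sleep and supports
"s2idle"; it is just the programmatic equivalent of
"echo s2idle > /sys/power/mem_sleep":

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Make "mem" suspend use suspend-to-idle, which does not offline
	 * the secondary CPUs and so skips the root domain rebuild.
	 */
	int fd = open("/sys/power/mem_sleep", O_WRONLY);

	if (fd < 0) {
		perror("open /sys/power/mem_sleep");
		return 1;
	}
	if (write(fd, "s2idle", strlen("s2idle")) < 0) {
		perror("write s2idle");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}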

>
> ie: would something like the below make sense? We'd skip this logic if
> cpuhp_tasks_frozen is set which indicates it's not a real hotplug operation but
> we're suspending/resuming.
>
>
> Cheers
>
> --
> Qais Yousef
>
>
> --->8---
>
>
> From 4cfd50960ad872c5eb810ad3038eaf840bab5182 Mon Sep 17 00:00:00 2001
> From: Qais Yousef <[email protected]>
> Date: Tue, 29 Nov 2022 19:01:52 +0000
> Subject: [PATCH] sched: cpuset: Don't rebuild sched domains on suspend-resume
>
> Commit f9a25f776d78 ("cpusets: Rebuild root domain deadline accounting information")
> enabled rebuilding sched domain on cpuset and hotplug operations to
> correct deadline accounting.
>
> Rebuilding sched domain is a slow operation and we see 10+ ms delays
> on suspend-resume because of that.
>
> Since nothing is expected to change on suspend-resume operation; skip
> rebuilding the sched domains to regain some of the time lost.
>
> Debugged-by: Rick Yiu <[email protected]>
> Signed-off-by: Qais Yousef (Google) <[email protected]>
> ---
> kernel/cgroup/cpuset.c | 6 ++++++
> kernel/sched/deadline.c | 3 +++
> 2 files changed, 9 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b474289c15b8..2ff68d625b7b 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1067,6 +1067,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
> struct css_task_iter it;
> struct task_struct *task;
>
> + if (cpuhp_tasks_frozen)
> + return;
> +
> css_task_iter_start(&cs->css, 0, &it);
>
> while ((task = css_task_iter_next(&it)))
> @@ -1084,6 +1087,9 @@ static void rebuild_root_domains(void)
> lockdep_assert_cpus_held();
> lockdep_assert_held(&sched_domains_mutex);
>
> + if (cpuhp_tasks_frozen)
> + return;
> +
> rcu_read_lock();
>
> /*
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0d97d54276cc..42c1143a3956 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2575,6 +2575,9 @@ void dl_clear_root_domain(struct root_domain *rd)
> {
> unsigned long flags;
>
> + if (cpuhp_tasks_frozen)
> + return;
> +
> raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
> rd->dl_bw.total_bw = 0;
> raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
> --
> 2.25.1
>

2022-12-20 11:57:32

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

On 12/19/22 09:07, Vincent Guittot wrote:
> On Sat, 17 Dec 2022 at 00:35, Qais Yousef <[email protected]> wrote:
> >
> > Hi
> >
> > On 07/19/19 15:59, Juri Lelli wrote:
> > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > operations information about the current deadline bandwidth held in the
> > > root domain is lost.
> > >
> > > This patch addresses the issue by recalculating the lost deadline
> > > bandwidth information by circling through the deadline tasks held in
> > > CPUsets and adding their current load to the root domain they are
> > > associated with.
> > >
> > > Signed-off-by: Mathieu Poirier <[email protected]>
> > > Signed-off-by: Juri Lelli <[email protected]>
> > > ---
> >
> > We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> > consistently) on suspend/resume.
> >
> > Do we actually need to rebuild_root_domain() if we're going through
> > a suspend/resume cycle?
>
> During suspend to RAM, there are CPU hotplug operations, but if you use
> suspend to idle, you will skip the CPU hotplug operations and their
> associated rebuild.

Thanks Vincent. I'll check on that - but what if we want to keep suspend to
RAM? Do we really need to incur this hit?


Thanks

--
Qais Yousef

2023-01-15 17:20:30

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v9 3/8] cpuset: Rebuild root domain deadline accounting information

On 12/20/22 11:43, Qais Yousef wrote:
> On 12/19/22 09:07, Vincent Guittot wrote:
> > On Sat, 17 Dec 2022 at 00:35, Qais Yousef <[email protected]> wrote:
> > >
> > > Hi
> > >
> > > On 07/19/19 15:59, Juri Lelli wrote:
> > > > When the topology of root domains is modified by CPUset or CPUhotplug
> > > > operations information about the current deadline bandwidth held in the
> > > > root domain is lost.
> > > >
> > > > This patch addresses the issue by recalculating the lost deadline
> > > > bandwidth information by circling through the deadline tasks held in
> > > > CPUsets and adding their current load to the root domain they are
> > > > associated with.
> > > >
> > > > Signed-off-by: Mathieu Poirier <[email protected]>
> > > > Signed-off-by: Juri Lelli <[email protected]>
> > > > ---
> > >
> > > We see that rebuild_root_domain() can take 10+ ms (I get a max of 20ms quite
> > > consistently) on suspend/resume.
> > >
> > > Do we actually need to rebuild_root_domain() if we're going through
> > > a suspend/resume cycle?
> >
> > During suspend to RAM, there are CPU hotplug operations, but if you use
> > suspend to idle, you will skip the CPU hotplug operations and their
> > associated rebuild.
>
> Thanks Vincent. I'll check on that - but what if we want to keep suspend to
> RAM? Do we really need to incur this hit?

Using s2idle is not an option, actually. I'll prepare v2 to address
Waiman's comment if I don't get more feedback in the next few days.


Thanks!

--
Qais Yousef