2023-03-29 13:01:11

by Juri Lelli

Subject: [PATCH 0/6] sched/deadline: cpuset: Rework DEADLINE bandwidth restoration

Qais reported [1] that iterating over all tasks when rebuilding root
domains, to find out which ones are DEADLINE and need their bandwidth
correctly restored on such root domains, can be a costly operation (10+
ms delays on suspend-resume). He proposed we skip rebuilding root
domains for certain operations, but that approach seemed arch-specific
and possibly prone to errors, as paths that ultimately trigger a rebuild
might be quite convoluted (thanks Qais for spending time on this!).

To fix the problem:

01/06 - Rename functions deadline with DEADLINE accounting (cleanup
suggested by Qais) - no functional change
02/06 - Bring back cpuset_mutex (so that we have write access to cpusets
from scheduler operations - and we also fix some problems
associated with percpu_cpuset_rwsem)
03/06 - Keep track of the number of DEADLINE tasks belonging to each cpuset
04/06 - Create DL BW alloc, free & check overflow interface for bulk
bandwidth allocation/removal - no functional change
05/06 - Fix bandwidth allocation handling for cgroup operation
involving multiple tasks
06/06 - Use this information to only perform the costly iteration if
DEADLINE tasks are actually present in the cpuset for which a
corresponding root domain is being rebuilt

Changes with respect to the RFC posting [2]:

1 - rename DEADLINE bandwidth accounting functions - Qais
2 - call inc/dec_dl_tasks_cs from switched_{to,from}_dl - Qais
3 - fix DEADLINE bandwidth allocation with multiple tasks - Waiman,
contributed by Dietmar

This set is also available from

https://github.com/jlelli/linux.git deadline/rework-cpusets

Best,
Juri

1 - https://lore.kernel.org/lkml/[email protected]/
2 - https://lore.kernel.org/lkml/[email protected]/

Dietmar Eggemann (2):
sched/deadline: Create DL BW alloc, free & check overflow interface
cgroup/cpuset: Free DL BW in case can_attach() fails

Juri Lelli (4):
cgroup/cpuset: Rename functions dealing with DEADLINE accounting
sched/cpuset: Bring back cpuset_mutex
sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets
cgroup/cpuset: Iterate only if DEADLINE tasks are present

include/linux/cpuset.h | 12 ++-
include/linux/sched.h | 4 +-
kernel/cgroup/cgroup.c | 4 +
kernel/cgroup/cpuset.c | 232 ++++++++++++++++++++++++++--------------
kernel/sched/core.c | 41 ++++---
kernel/sched/deadline.c | 67 +++++++++---
kernel/sched/sched.h | 2 +-
7 files changed, 240 insertions(+), 122 deletions(-)

--
2.39.2


2023-03-29 13:01:47

by Juri Lelli

Subject: [PATCH 1/6] cgroup/cpuset: Rename functions dealing with DEADLINE accounting

rebuild_root_domains() and update_tasks_root_domain() have neutral
names, but actually deal with DEADLINE bandwidth accounting.

Rename them to use 'dl_' prefix so that intent is more clear.

No functional change.

Suggested-by: Qais Yousef <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/cgroup/cpuset.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 636f1c682ac0..501913bc2805 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1066,7 +1066,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
return ndoms;
}

-static void update_tasks_root_domain(struct cpuset *cs)
+static void dl_update_tasks_root_domain(struct cpuset *cs)
{
struct css_task_iter it;
struct task_struct *task;
@@ -1079,7 +1079,7 @@ static void update_tasks_root_domain(struct cpuset *cs)
css_task_iter_end(&it);
}

-static void rebuild_root_domains(void)
+static void dl_rebuild_rd_accounting(void)
{
struct cpuset *cs = NULL;
struct cgroup_subsys_state *pos_css;
@@ -1107,7 +1107,7 @@ static void rebuild_root_domains(void)

rcu_read_unlock();

- update_tasks_root_domain(cs);
+ dl_update_tasks_root_domain(cs);

rcu_read_lock();
css_put(&cs->css);
@@ -1121,7 +1121,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
{
mutex_lock(&sched_domains_mutex);
partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
- rebuild_root_domains();
+ dl_rebuild_rd_accounting();
mutex_unlock(&sched_domains_mutex);
}

--
2.39.2

2023-03-29 13:02:39

by Juri Lelli

Subject: [PATCH 3/6] sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets

Qais reported that iterating over all tasks when rebuilding root
domains, to find out which ones are DEADLINE and need their bandwidth
correctly restored on such root domains, can be a costly operation (10+
ms delays on suspend-resume).

To fix the problem, keep track of the number of DEADLINE tasks belonging
to each cpuset and then use this information (followup patch) to only
perform the above iteration if DEADLINE tasks are actually present in
the cpuset for which a corresponding root domain is being rebuilt.

Reported-by: Qais Yousef <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/cpuset.h | 4 ++++
kernel/cgroup/cgroup.c | 4 ++++
kernel/cgroup/cpuset.c | 25 +++++++++++++++++++++++++
kernel/sched/deadline.c | 14 ++++++++++++++
4 files changed, 47 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 355f796c5f07..0348dba5680e 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -71,6 +71,8 @@ extern void cpuset_init_smp(void);
extern void cpuset_force_rebuild(void);
extern void cpuset_update_active_cpus(void);
extern void cpuset_wait_for_hotplug(void);
+extern void inc_dl_tasks_cs(struct task_struct *task);
+extern void dec_dl_tasks_cs(struct task_struct *task);
extern void cpuset_lock(void);
extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
@@ -196,6 +198,8 @@ static inline void cpuset_update_active_cpus(void)

static inline void cpuset_wait_for_hotplug(void) { }

+static inline void inc_dl_tasks_cs(struct task_struct *task) { }
+static inline void dec_dl_tasks_cs(struct task_struct *task) { }
static inline void cpuset_lock(void) { }
static inline void cpuset_unlock(void) { }

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 935e8121b21e..ff27b2d2bf0b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -57,6 +57,7 @@
#include <linux/file.h>
#include <linux/fs_parser.h>
#include <linux/sched/cputime.h>
+#include <linux/sched/deadline.h>
#include <linux/psi.h>
#include <net/sock.h>

@@ -6673,6 +6674,9 @@ void cgroup_exit(struct task_struct *tsk)
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
cset->nr_tasks--;

+ if (dl_task(tsk))
+ dec_dl_tasks_cs(tsk);
+
WARN_ON_ONCE(cgroup_task_frozen(tsk));
if (unlikely(!(tsk->flags & PF_KTHREAD) &&
test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fbc10b494292..eb0854ef9757 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -193,6 +193,12 @@ struct cpuset {
int use_parent_ecpus;
int child_ecpus_count;

+ /*
+ * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
+ * know when to rebuild associated root domain bandwidth information.
+ */
+ int nr_deadline_tasks;
+
/* Invalid partition error code, not lock protected */
enum prs_errcode prs_err;

@@ -245,6 +251,20 @@ static inline struct cpuset *parent_cs(struct cpuset *cs)
return css_cs(cs->css.parent);
}

+void inc_dl_tasks_cs(struct task_struct *p)
+{
+ struct cpuset *cs = task_cs(p);
+
+ cs->nr_deadline_tasks++;
+}
+
+void dec_dl_tasks_cs(struct task_struct *p)
+{
+ struct cpuset *cs = task_cs(p);
+
+ cs->nr_deadline_tasks--;
+}
+
/* bits in struct cpuset flags field */
typedef enum {
CS_ONLINE,
@@ -2477,6 +2497,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
ret = security_task_setscheduler(task);
if (ret)
goto out_unlock;
+
+ if (dl_task(task)) {
+ cs->nr_deadline_tasks++;
+ cpuset_attach_old_cs->nr_deadline_tasks--;
+ }
}

/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4cc7e1ca066d..8f92f0f87383 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,8 @@
* Fabio Checconi <[email protected]>
*/

+#include <linux/cpuset.h>
+
/*
* Default limits for DL period; on the top end we guard against small util
* tasks still getting ridiculously long effective runtimes, on the bottom end we
@@ -2595,6 +2597,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
if (task_on_rq_queued(p) && p->dl.dl_runtime)
task_non_contending(p);

+ /*
+ * In case a task is setscheduled out from SCHED_DEADLINE we need to
+ * keep track of that on its cpuset (for correct bandwidth tracking).
+ */
+ dec_dl_tasks_cs(p);
+
if (!task_on_rq_queued(p)) {
/*
* Inactive timer is armed. However, p is leaving DEADLINE and
@@ -2635,6 +2643,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
put_task_struct(p);

+ /*
+ * In case a task is setscheduled to SCHED_DEADLINE we need to keep
+ * track of that on its cpuset (for correct bandwidth tracking).
+ */
+ inc_dl_tasks_cs(p);
+
/* If p is not queued we will update its parameters at next wakeup. */
if (!task_on_rq_queued(p)) {
add_rq_bw(&p->dl, &rq->dl);
--
2.39.2

2023-03-29 13:06:38

by Juri Lelli

Subject: [PATCH 5/6] cgroup/cpuset: Free DL BW in case can_attach() fails

From: Dietmar Eggemann <[email protected]>

cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks
have been checked. DL BW is not allocated per-task but as a sum over
all DL tasks migrating.

If multiple controllers are attached to the cgroup alongside the cpuset
controller, a non-cpuset can_attach() can fail. In this case, free DL BW
in cpuset_cancel_attach().

Finally, update cpuset DL task count (nr_deadline_tasks) only in
cpuset_attach().

Suggested-by: Waiman Long <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/cgroup/cpuset.c | 55 ++++++++++++++++++++++++++++++++++++++----
kernel/sched/core.c | 17 ++-----------
3 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6f3d84e0ed08..50cbbfefbe11 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1847,7 +1847,7 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
}

extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
-extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_effective_cpus);
+extern int task_can_attach(struct task_struct *p);
extern int dl_bw_alloc(int cpu, u64 dl_bw);
extern void dl_bw_free(int cpu, u64 dl_bw);
#ifdef CONFIG_SMP
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index eb0854ef9757..f8ebec66da51 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -198,6 +198,8 @@ struct cpuset {
* know when to rebuild associated root domain bandwidth information.
*/
int nr_deadline_tasks;
+ int nr_migrate_dl_tasks;
+ u64 sum_migrate_dl_bw;

/* Invalid partition error code, not lock protected */
enum prs_errcode prs_err;
@@ -2464,16 +2466,23 @@ static int fmeter_getrate(struct fmeter *fmp)

static struct cpuset *cpuset_attach_old_cs;

+static void reset_migrate_dl_data(struct cpuset *cs)
+{
+ cs->nr_migrate_dl_tasks = 0;
+ cs->sum_migrate_dl_bw = 0;
+}
+
/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
- struct cpuset *cs;
+ struct cpuset *cs, *oldcs;
struct task_struct *task;
int ret;

/* used later by cpuset_attach() */
cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
+ oldcs = cpuset_attach_old_cs;
cs = css_cs(css);

mutex_lock(&cpuset_mutex);
@@ -2491,7 +2500,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
goto out_unlock;

cgroup_taskset_for_each(task, css, tset) {
- ret = task_can_attach(task, cs->effective_cpus);
+ ret = task_can_attach(task);
if (ret)
goto out_unlock;
ret = security_task_setscheduler(task);
@@ -2499,11 +2508,31 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
goto out_unlock;

if (dl_task(task)) {
- cs->nr_deadline_tasks++;
- cpuset_attach_old_cs->nr_deadline_tasks--;
+ cs->nr_migrate_dl_tasks++;
+ cs->sum_migrate_dl_bw += task->dl.dl_bw;
+ }
+ }
+
+ if (!cs->nr_migrate_dl_tasks)
+ goto out_success;
+
+ if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
+ int cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+
+ if (unlikely(cpu >= nr_cpu_ids)) {
+ reset_migrate_dl_data(cs);
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+ if (ret) {
+ reset_migrate_dl_data(cs);
+ goto out_unlock;
 }
 }
 
+out_success:
/*
* Mark attach is in progress. This makes validate_change() fail
* changes which zero cpus/mems_allowed.
@@ -2518,11 +2547,21 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
static void cpuset_cancel_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
+ struct cpuset *cs;

cgroup_taskset_first(tset, &css);
+ cs = css_cs(css);

mutex_lock(&cpuset_mutex);
- css_cs(css)->attach_in_progress--;
+ cs->attach_in_progress--;
+
+ if (cs->nr_migrate_dl_tasks) {
+ int cpu = cpumask_any(cs->effective_cpus);
+
+ dl_bw_free(cpu, cs->sum_migrate_dl_bw);
+ reset_migrate_dl_data(cs);
+ }
+
mutex_unlock(&cpuset_mutex);
}

@@ -2617,6 +2656,12 @@ static void cpuset_attach(struct cgroup_taskset *tset)
out:
cs->old_mems_allowed = cpuset_attach_nodemask_to;

+ if (cs->nr_migrate_dl_tasks) {
+ cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+ oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
+ reset_migrate_dl_data(cs);
+ }
+
cs->attach_in_progress--;
if (!cs->attach_in_progress)
wake_up(&cpuset_attach_wq);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c83dae6b8586..10454980e830 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9269,8 +9269,7 @@ int cpuset_cpumask_can_shrink(const struct cpumask *cur,
return ret;
}

-int task_can_attach(struct task_struct *p,
- const struct cpumask *cs_effective_cpus)
+int task_can_attach(struct task_struct *p)
{
int ret = 0;

@@ -9283,21 +9282,9 @@ int task_can_attach(struct task_struct *p,
* success of set_cpus_allowed_ptr() on all attached tasks
* before cpus_mask may be changed.
*/
- if (p->flags & PF_NO_SETAFFINITY) {
+ if (p->flags & PF_NO_SETAFFINITY)
ret = -EINVAL;
- goto out;
- }
-
- if (dl_task(p) && !cpumask_intersects(task_rq(p)->rd->span,
- cs_effective_cpus)) {
- int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);

- if (unlikely(cpu >= nr_cpu_ids))
- return -EINVAL;
- ret = dl_bw_alloc(cpu, p->dl.dl_bw);
- }
-
-out:
return ret;
}

--
2.39.2

2023-03-29 13:07:37

by Juri Lelli

Subject: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset:
Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
as it has been reported to cause slowdowns in workloads that need to
change cpuset configuration frequently; it also does not implement
priority inheritance, which causes trouble with realtime workloads.

Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
only for SCHED_DEADLINE tasks (other policies don't care about stable
cpusets anyway).

Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/cpuset.h | 8 +--
kernel/cgroup/cpuset.c | 145 ++++++++++++++++++++---------------------
kernel/sched/core.c | 22 +++++--
3 files changed, 91 insertions(+), 84 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..355f796c5f07 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -71,8 +71,8 @@ extern void cpuset_init_smp(void);
extern void cpuset_force_rebuild(void);
extern void cpuset_update_active_cpus(void);
extern void cpuset_wait_for_hotplug(void);
-extern void cpuset_read_lock(void);
-extern void cpuset_read_unlock(void);
+extern void cpuset_lock(void);
+extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -196,8 +196,8 @@ static inline void cpuset_update_active_cpus(void)

static inline void cpuset_wait_for_hotplug(void) { }

-static inline void cpuset_read_lock(void) { }
-static inline void cpuset_read_unlock(void) { }
+static inline void cpuset_lock(void) { }
+static inline void cpuset_unlock(void) { }

static inline void cpuset_cpus_allowed(struct task_struct *p,
struct cpumask *mask)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 501913bc2805..fbc10b494292 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -366,22 +366,21 @@ static struct cpuset top_cpuset = {
if (is_cpuset_online(((des_cs) = css_cs((pos_css)))))

/*
- * There are two global locks guarding cpuset structures - cpuset_rwsem and
+ * There are two global locks guarding cpuset structures - cpuset_mutex and
* callback_lock. We also require taking task_lock() when dereferencing a
* task's cpuset pointer. See "The task_lock() exception", at the end of this
- * comment. The cpuset code uses only cpuset_rwsem write lock. Other
- * kernel subsystems can use cpuset_read_lock()/cpuset_read_unlock() to
- * prevent change to cpuset structures.
+ * comment. The cpuset code uses only cpuset_mutex. Other kernel subsystems
+ * can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
+ * structures.
*
* A task must hold both locks to modify cpusets. If a task holds
- * cpuset_rwsem, it blocks others wanting that rwsem, ensuring that it
- * is the only task able to also acquire callback_lock and be able to
- * modify cpusets. It can perform various checks on the cpuset structure
- * first, knowing nothing will change. It can also allocate memory while
- * just holding cpuset_rwsem. While it is performing these checks, various
- * callback routines can briefly acquire callback_lock to query cpusets.
- * Once it is ready to make the changes, it takes callback_lock, blocking
- * everyone else.
+ * cpuset_mutex, it blocks others, ensuring that it is the only task able to
+ * also acquire callback_lock and be able to modify cpusets. It can perform
+ * various checks on the cpuset structure first, knowing nothing will change.
+ * It can also allocate memory while just holding cpuset_mutex. While it is
+ * performing these checks, various callback routines can briefly acquire
+ * callback_lock to query cpusets. Once it is ready to make the changes, it
+ * takes callback_lock, blocking everyone else.
*
* Calls to the kernel memory allocator can not be made while holding
* callback_lock, as that would risk double tripping on callback_lock
@@ -403,16 +402,16 @@ static struct cpuset top_cpuset = {
* guidelines for accessing subsystem state in kernel/cgroup.c
*/

-DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
+static DEFINE_MUTEX(cpuset_mutex);

-void cpuset_read_lock(void)
+void cpuset_lock(void)
{
- percpu_down_read(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
}

-void cpuset_read_unlock(void)
+void cpuset_unlock(void)
{
- percpu_up_read(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
}

static DEFINE_SPINLOCK(callback_lock);
@@ -496,7 +495,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
* One way or another, we guarantee to return some non-empty subset
* of cpu_online_mask.
*
- * Call with callback_lock or cpuset_rwsem held.
+ * Call with callback_lock or cpuset_mutex held.
*/
static void guarantee_online_cpus(struct task_struct *tsk,
struct cpumask *pmask)
@@ -538,7 +537,7 @@ static void guarantee_online_cpus(struct task_struct *tsk,
* One way or another, we guarantee to return some non-empty subset
* of node_states[N_MEMORY].
*
- * Call with callback_lock or cpuset_rwsem held.
+ * Call with callback_lock or cpuset_mutex held.
*/
static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
{
@@ -550,7 +549,7 @@ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
/*
* update task's spread flag if cpuset's page/slab spread flag is set
*
- * Call with callback_lock or cpuset_rwsem held. The check can be skipped
+ * Call with callback_lock or cpuset_mutex held. The check can be skipped
* if on default hierarchy.
*/
static void cpuset_update_task_spread_flags(struct cpuset *cs,
@@ -575,7 +574,7 @@ static void cpuset_update_task_spread_flags(struct cpuset *cs,
*
* One cpuset is a subset of another if all its allowed CPUs and
* Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set. Call holding cpuset_rwsem.
+ * are only set if the other's are set. Call holding cpuset_mutex.
*/

static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
@@ -713,7 +712,7 @@ static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial)
* If we replaced the flag and mask values of the current cpuset
* (cur) with those values in the trial cpuset (trial), would
* our various subset and exclusive rules still be valid? Presumes
- * cpuset_rwsem held.
+ * cpuset_mutex held.
*
* 'cur' is the address of an actual, in-use cpuset. Operations
* such as list traversal that depend on the actual address of the
@@ -829,7 +828,7 @@ static void update_domain_attr_tree(struct sched_domain_attr *dattr,
rcu_read_unlock();
}

-/* Must be called with cpuset_rwsem held. */
+/* Must be called with cpuset_mutex held. */
static inline int nr_cpusets(void)
{
/* jump label reference count + the top-level cpuset */
@@ -855,7 +854,7 @@ static inline int nr_cpusets(void)
* domains when operating in the severe memory shortage situations
* that could cause allocation failures below.
*
- * Must be called with cpuset_rwsem held.
+ * Must be called with cpuset_mutex held.
*
* The three key local variables below are:
* cp - cpuset pointer, used (together with pos_css) to perform a
@@ -1084,7 +1083,7 @@ static void dl_rebuild_rd_accounting(void)
struct cpuset *cs = NULL;
struct cgroup_subsys_state *pos_css;

- percpu_rwsem_assert_held(&cpuset_rwsem);
+ lockdep_assert_held(&cpuset_mutex);
lockdep_assert_cpus_held();
lockdep_assert_held(&sched_domains_mutex);

@@ -1134,7 +1133,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
* 'cpus' is removed, then call this routine to rebuild the
* scheduler's dynamic sched domains.
*
- * Call with cpuset_rwsem held. Takes cpus_read_lock().
+ * Call with cpuset_mutex held. Takes cpus_read_lock().
*/
static void rebuild_sched_domains_locked(void)
{
@@ -1145,7 +1144,7 @@ static void rebuild_sched_domains_locked(void)
int ndoms;

lockdep_assert_cpus_held();
- percpu_rwsem_assert_held(&cpuset_rwsem);
+ lockdep_assert_held(&cpuset_mutex);

/*
* If we have raced with CPU hotplug, return early to avoid
@@ -1196,9 +1195,9 @@ static void rebuild_sched_domains_locked(void)
void rebuild_sched_domains(void)
{
cpus_read_lock();
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
rebuild_sched_domains_locked();
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
}

@@ -1208,7 +1207,7 @@ void rebuild_sched_domains(void)
* @new_cpus: the temp variable for the new effective_cpus mask
*
* Iterate through each task of @cs updating its cpus_allowed to the
- * effective cpuset's. As this function is called with cpuset_rwsem held,
+ * effective cpuset's. As this function is called with cpuset_mutex held,
* cpuset membership stays stable.
*/
static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
@@ -1317,7 +1316,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
int old_prs, new_prs;
int part_error = PERR_NONE; /* Partition error? */

- percpu_rwsem_assert_held(&cpuset_rwsem);
+ lockdep_assert_held(&cpuset_mutex);

/*
* The parent must be a partition root.
@@ -1540,7 +1539,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
*
* On legacy hierarchy, effective_cpus will be the same with cpu_allowed.
*
- * Called with cpuset_rwsem held
+ * Called with cpuset_mutex held
*/
static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
bool force)
@@ -1700,7 +1699,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
struct cpuset *sibling;
struct cgroup_subsys_state *pos_css;

- percpu_rwsem_assert_held(&cpuset_rwsem);
+ lockdep_assert_held(&cpuset_mutex);

/*
* Check all its siblings and call update_cpumasks_hier()
@@ -1942,12 +1941,12 @@ static void *cpuset_being_rebound;
* @cs: the cpuset in which each task's mems_allowed mask needs to be changed
*
* Iterate through each task of @cs updating its mems_allowed to the
- * effective cpuset's. As this function is called with cpuset_rwsem held,
+ * effective cpuset's. As this function is called with cpuset_mutex held,
* cpuset membership stays stable.
*/
static void update_tasks_nodemask(struct cpuset *cs)
{
- static nodemask_t newmems; /* protected by cpuset_rwsem */
+ static nodemask_t newmems; /* protected by cpuset_mutex */
struct css_task_iter it;
struct task_struct *task;

@@ -1960,7 +1959,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
* take while holding tasklist_lock. Forks can happen - the
* mpol_dup() cpuset_being_rebound check will catch such forks,
* and rebind their vma mempolicies too. Because we still hold
- * the global cpuset_rwsem, we know that no other rebind effort
+ * the global cpuset_mutex, we know that no other rebind effort
* will be contending for the global variable cpuset_being_rebound.
* It's ok if we rebind the same mm twice; mpol_rebind_mm()
* is idempotent. Also migrate pages in each mm to new nodes.
@@ -2006,7 +2005,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
*
* On legacy hierarchy, effective_mems will be the same with mems_allowed.
*
- * Called with cpuset_rwsem held
+ * Called with cpuset_mutex held
*/
static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
{
@@ -2059,7 +2058,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
* mempolicies and if the cpuset is marked 'memory_migrate',
* migrate the tasks pages to the new memory.
*
- * Call with cpuset_rwsem held. May take callback_lock during call.
+ * Call with cpuset_mutex held. May take callback_lock during call.
* Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
* lock each such tasks mm->mmap_lock, scan its vma's and rebind
* their mempolicies to the cpusets new mems_allowed.
@@ -2151,7 +2150,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
* @cs: the cpuset in which each task's spread flags needs to be changed
*
* Iterate through each task of @cs updating its spread flags. As this
- * function is called with cpuset_rwsem held, cpuset membership stays
+ * function is called with cpuset_mutex held, cpuset membership stays
* stable.
*/
static void update_tasks_flags(struct cpuset *cs)
@@ -2171,7 +2170,7 @@ static void update_tasks_flags(struct cpuset *cs)
* cs: the cpuset to update
* turning_on: whether the flag is being set or cleared
*
- * Call with cpuset_rwsem held.
+ * Call with cpuset_mutex held.
*/

static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
@@ -2221,7 +2220,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
* @new_prs: new partition root state
* Return: 0 if successful, != 0 if error
*
- * Call with cpuset_rwsem held.
+ * Call with cpuset_mutex held.
*/
static int update_prstate(struct cpuset *cs, int new_prs)
{
@@ -2445,7 +2444,7 @@ static int fmeter_getrate(struct fmeter *fmp)

static struct cpuset *cpuset_attach_old_cs;

-/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
+/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
@@ -2457,7 +2456,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
cs = css_cs(css);

- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);

/* allow moving tasks into an empty cpuset if on default hierarchy */
ret = -ENOSPC;
@@ -2487,7 +2486,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
cs->attach_in_progress++;
ret = 0;
out_unlock:
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
return ret;
}

@@ -2497,13 +2496,13 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)

cgroup_taskset_first(tset, &css);

- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
css_cs(css)->attach_in_progress--;
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
}

/*
- * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach()
+ * Protected by cpuset_mutex. cpus_attach is used only by cpuset_attach()
* but we can't allocate it dynamically there. Define it global and
* allocate from cpuset_init().
*/
@@ -2511,7 +2510,7 @@ static cpumask_var_t cpus_attach;

static void cpuset_attach(struct cgroup_taskset *tset)
{
- /* static buf protected by cpuset_rwsem */
+ /* static buf protected by cpuset_mutex */
static nodemask_t cpuset_attach_nodemask_to;
struct task_struct *task;
struct task_struct *leader;
@@ -2524,7 +2523,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
cs = css_cs(css);

lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
cpus_updated = !cpumask_equal(cs->effective_cpus,
oldcs->effective_cpus);
mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
@@ -2597,7 +2596,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
if (!cs->attach_in_progress)
wake_up(&cpuset_attach_wq);

- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
}

/* The various types of files and directories in a cpuset file system */
@@ -2629,7 +2628,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
int retval = 0;

cpus_read_lock();
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
if (!is_cpuset_online(cs)) {
retval = -ENODEV;
goto out_unlock;
@@ -2665,7 +2664,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
break;
}
out_unlock:
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
return retval;
}
@@ -2678,7 +2677,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
int retval = -ENODEV;

cpus_read_lock();
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
if (!is_cpuset_online(cs))
goto out_unlock;

@@ -2691,7 +2690,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
break;
}
out_unlock:
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
return retval;
}
@@ -2724,7 +2723,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
* operation like this one can lead to a deadlock through kernfs
* active_ref protection. Let's break the protection. Losing the
* protection is okay as we check whether @cs is online after
- * grabbing cpuset_rwsem anyway. This only happens on the legacy
+ * grabbing cpuset_mutex anyway. This only happens on the legacy
* hierarchies.
*/
css_get(&cs->css);
@@ -2732,7 +2731,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
flush_work(&cpuset_hotplug_work);

cpus_read_lock();
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
if (!is_cpuset_online(cs))
goto out_unlock;

@@ -2756,7 +2755,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,

free_cpuset(trialcs);
out_unlock:
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
kernfs_unbreak_active_protection(of->kn);
css_put(&cs->css);
@@ -2904,13 +2903,13 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,

css_get(&cs->css);
cpus_read_lock();
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
if (!is_cpuset_online(cs))
goto out_unlock;

retval = update_prstate(cs, val);
out_unlock:
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
css_put(&cs->css);
return retval ?: nbytes;
@@ -3127,7 +3126,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
return 0;

cpus_read_lock();
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);

set_bit(CS_ONLINE, &cs->flags);
if (is_spread_page(parent))
@@ -3178,7 +3177,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
spin_unlock_irq(&callback_lock);
out_unlock:
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
return 0;
}
@@ -3199,7 +3198,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
struct cpuset *cs = css_cs(css);

cpus_read_lock();
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);

if (is_partition_valid(cs))
update_prstate(cs, 0);
@@ -3218,7 +3217,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
cpuset_dec();
clear_bit(CS_ONLINE, &cs->flags);

- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
}

@@ -3231,7 +3230,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)

static void cpuset_bind(struct cgroup_subsys_state *root_css)
{
- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
spin_lock_irq(&callback_lock);

if (is_in_v2_mode()) {
@@ -3244,7 +3243,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
}

spin_unlock_irq(&callback_lock);
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
}

/*
@@ -3357,7 +3356,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
is_empty = cpumask_empty(cs->cpus_allowed) ||
nodes_empty(cs->mems_allowed);

- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);

/*
* Move tasks to the nearest ancestor with execution resources,
@@ -3367,7 +3366,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
if (is_empty)
remove_tasks_in_empty_cpuset(cs);

- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);
}

static void
@@ -3418,14 +3417,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
retry:
wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);

- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);

/*
* We have raced with task attaching. We wait until attaching
* is finished, so we won't attach a task to an empty cpuset.
*/
if (cs->attach_in_progress) {
- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
goto retry;
}

@@ -3519,7 +3518,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
cpus_updated, mems_updated);

- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);
}

/**
@@ -3549,7 +3548,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
if (on_dfl && !alloc_cpumasks(NULL, &tmp))
ptmp = &tmp;

- percpu_down_write(&cpuset_rwsem);
+ mutex_lock(&cpuset_mutex);

/* fetch the available cpus/mems and find out which changed how */
cpumask_copy(&new_cpus, cpu_active_mask);
@@ -3606,7 +3605,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
update_tasks_nodemask(&top_cpuset);
}

- percpu_up_write(&cpuset_rwsem);
+ mutex_unlock(&cpuset_mutex);

/* if cpus or mems changed, we need to propagate to descendants */
if (cpus_updated || mems_updated) {
@@ -4037,7 +4036,7 @@ void __cpuset_memory_pressure_bump(void)
* - Used for /proc/<pid>/cpuset.
* - No need to task_lock(tsk) on this tsk->cpuset reference, as it
* doesn't really matter if tsk->cpuset changes after we read it,
- * and we take cpuset_rwsem, keeping cpuset_attach() from changing it
+ * and we take cpuset_mutex, keeping cpuset_attach() from changing it
* anyway.
*/
int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b9616f153946..179266ff653f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7565,6 +7565,7 @@ static int __sched_setscheduler(struct task_struct *p,
int reset_on_fork;
int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
struct rq *rq;
+ bool cpuset_locked = false;

/* The pi code expects interrupts enabled */
BUG_ON(pi && in_interrupt());
@@ -7614,8 +7615,14 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

- if (pi)
- cpuset_read_lock();
+ /*
+ * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
+ * information.
+ */
+ if (dl_policy(policy) || dl_policy(p->policy)) {
+ cpuset_locked = true;
+ cpuset_lock();
+ }

/*
* Make sure no PI-waiters arrive (or leave) while we are
@@ -7691,8 +7698,8 @@ static int __sched_setscheduler(struct task_struct *p,
if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
policy = oldpolicy = -1;
task_rq_unlock(rq, p, &rf);
- if (pi)
- cpuset_read_unlock();
+ if (cpuset_locked)
+ cpuset_unlock();
goto recheck;
}

@@ -7759,7 +7766,8 @@ static int __sched_setscheduler(struct task_struct *p,
task_rq_unlock(rq, p, &rf);

if (pi) {
- cpuset_read_unlock();
+ if (cpuset_locked)
+ cpuset_unlock();
rt_mutex_adjust_pi(p);
}

@@ -7771,8 +7779,8 @@ static int __sched_setscheduler(struct task_struct *p,

unlock:
task_rq_unlock(rq, p, &rf);
- if (pi)
- cpuset_read_unlock();
+ if (cpuset_locked)
+ cpuset_unlock();
return retval;
}

--
2.39.2
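The cpuset_locked handling in the __sched_setscheduler() hunk above follows a general pattern: take a lock only when the operation actually needs it, record that fact in a local flag, and test the flag on every exit path so lock/unlock stay balanced. A minimal userspace sketch with illustrative names only (a depth counter stands in for the real lock):

```c
#include <assert.h>
#include <stdbool.h>

static int lock_depth;

static void fake_lock(void)   { lock_depth++; }
static void fake_unlock(void) { lock_depth--; }

/* Lock only when the caller needs stable configuration, the way
 * __sched_setscheduler() takes cpuset_lock() only for DEADLINE policy. */
int do_request(bool needs_stable_cfg)
{
	bool locked = false;

	if (needs_stable_cfg) {
		fake_lock();
		locked = true;
	}

	/* ... read configuration, possibly bail out early ... */

	if (locked)		/* every exit path checks the flag */
		fake_unlock();

	return lock_depth;	/* 0 again: lock/unlock stayed balanced */
}
```

Returning lock_depth lets the sketch assert that every path that locked also unlocked, which is what the cpuset_locked flag guarantees across the goto-based exit paths of __sched_setscheduler().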

2023-03-29 13:08:00

by Juri Lelli

Subject: [PATCH 6/6] cgroup/cpuset: Iterate only if DEADLINE tasks are present

update_tasks_root_domain() currently iterates over all tasks even if no
DEADLINE task is present on the cpuset/root domain for which bandwidth
accounting is being rebuilt. This has been reported to introduce 10+ ms
delays on suspend-resume operations.

Skip the costly iteration for cpusets that don't contain DEADLINE tasks.

Reported-by: Qais Yousef <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/cgroup/cpuset.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f8ebec66da51..05c0a1255218 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1092,6 +1092,9 @@ static void dl_update_tasks_root_domain(struct cpuset *cs)
struct css_task_iter it;
struct task_struct *task;

+ if (cs->nr_deadline_tasks == 0)
+ return;
+
css_task_iter_start(&cs->css, 0, &it);

while ((task = css_task_iter_next(&it)))
--
2.39.2
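The three-line change above turns the rebuild into a counter-gated scan. A standalone sketch of the idea with hypothetical names (the real kernel maintains nr_deadline_tasks from the attach/detach and sched-class switch paths established earlier in the series):

```c
#include <assert.h>
#include <stddef.h>

struct group {
	int nr_deadline_tasks;	/* maintained on attach/detach/switch */
	int tasks[64];
	size_t nr_tasks;
};

int scans_performed;

void rebuild_accounting(struct group *g)
{
	if (g->nr_deadline_tasks == 0)
		return;			/* fast path: nothing to restore */

	scans_performed++;
	for (size_t i = 0; i < g->nr_tasks; i++) {
		/* ... restore bandwidth for each deadline task ... */
	}
}
```

The counter must be updated on every path that changes a task's membership or scheduling class, or the fast path would wrongly skip groups that do contain DEADLINE tasks.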

2023-03-29 13:08:25

by Juri Lelli

Subject: [PATCH 4/6] sched/deadline: Create DL BW alloc, free & check overflow interface

From: Dietmar Eggemann <[email protected]>

Rework the existing dl_cpu_busy() interface which offers DL BW overflow
checking and per-task DL BW allocation.

Add dl_bw_free() as an interface to free DL BW.
It will be used to free the DL BW reserved during
cpuset_can_attach() in case multiple controllers are attached to the
cgroup next to the cpuset controller and one of the non-cpuset
can_attach() calls fails.

dl_bw_alloc() (and dl_bw_free()) now take a `u64 dl_bw` parameter
instead of the `struct task_struct *p` used in dl_cpu_busy(). This
allows allocating DL BW for a set of tasks rather than only for a
single task.

Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 4 ++--
kernel/sched/deadline.c | 53 +++++++++++++++++++++++++++++++----------
kernel/sched/sched.h | 2 +-
4 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d654eb4cabd..6f3d84e0ed08 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1848,6 +1848,8 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)

extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_effective_cpus);
+extern int dl_bw_alloc(int cpu, u64 dl_bw);
+extern void dl_bw_free(int cpu, u64 dl_bw);
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 179266ff653f..c83dae6b8586 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9294,7 +9294,7 @@ int task_can_attach(struct task_struct *p,

if (unlikely(cpu >= nr_cpu_ids))
return -EINVAL;
- ret = dl_cpu_busy(cpu, p);
+ ret = dl_bw_alloc(cpu, p->dl.dl_bw);
}

out:
@@ -9579,7 +9579,7 @@ static void cpuset_cpu_active(void)
static int cpuset_cpu_inactive(unsigned int cpu)
{
if (!cpuhp_tasks_frozen) {
- int ret = dl_cpu_busy(cpu, NULL);
+ int ret = dl_bw_check_overflow(cpu);

if (ret)
return ret;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8f92f0f87383..5b6965e0e537 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3057,26 +3057,38 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
return ret;
}

-int dl_cpu_busy(int cpu, struct task_struct *p)
+enum dl_bw_request {
+ dl_bw_req_check_overflow = 0,
+ dl_bw_req_alloc,
+ dl_bw_req_free
+};
+
+static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
{
- unsigned long flags, cap;
+ unsigned long flags;
struct dl_bw *dl_b;
- bool overflow;
+ bool overflow = 0;

rcu_read_lock_sched();
dl_b = dl_bw_of(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cap = dl_bw_capacity(cpu);
- overflow = __dl_overflow(dl_b, cap, 0, p ? p->dl.dl_bw : 0);

- if (!overflow && p) {
- /*
- * We reserve space for this task in the destination
- * root_domain, as we can't fail after this point.
- * We will free resources in the source root_domain
- * later on (see set_cpus_allowed_dl()).
- */
- __dl_add(dl_b, p->dl.dl_bw, dl_bw_cpus(cpu));
+ if (req == dl_bw_req_free) {
+ __dl_sub(dl_b, dl_bw, dl_bw_cpus(cpu));
+ } else {
+ unsigned long cap = dl_bw_capacity(cpu);
+
+ overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
+
+ if (req == dl_bw_req_alloc && !overflow) {
+ /*
+ * We reserve space in the destination
+ * root_domain, as we can't fail after this point.
+ * We will free resources in the source root_domain
+ * later on (see set_cpus_allowed_dl()).
+ */
+ __dl_add(dl_b, dl_bw, dl_bw_cpus(cpu));
+ }
}

raw_spin_unlock_irqrestore(&dl_b->lock, flags);
@@ -3084,6 +3096,21 @@ int dl_cpu_busy(int cpu, struct task_struct *p)

return overflow ? -EBUSY : 0;
}
+
+int dl_bw_check_overflow(int cpu)
+{
+ return dl_bw_manage(dl_bw_req_check_overflow, cpu, 0);
+}
+
+int dl_bw_alloc(int cpu, u64 dl_bw)
+{
+ return dl_bw_manage(dl_bw_req_alloc, cpu, dl_bw);
+}
+
+void dl_bw_free(int cpu, u64 dl_bw)
+{
+ dl_bw_manage(dl_bw_req_free, cpu, dl_bw);
+}
#endif

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 060616944d7a..81ecfd1a1a48 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -330,7 +330,7 @@ extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
extern bool __checkparam_dl(const struct sched_attr *attr);
extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
-extern int dl_cpu_busy(int cpu, struct task_struct *p);
+extern int dl_bw_check_overflow(int cpu);

#ifdef CONFIG_CGROUP_SCHED

--
2.39.2
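The shape of the new interface, one internal dl_bw_manage() that dispatches on a request enum under the bandwidth lock, plus thin public wrappers, can be modeled in userspace. This is a simplified sketch (single capacity value, no locking, -1 standing in for -EBUSY), not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

enum bw_request { bw_req_check = 0, bw_req_alloc, bw_req_free };

struct bw_pool {
	uint64_t cap;	/* total capacity */
	uint64_t used;	/* currently allocated */
};

static int bw_manage(enum bw_request req, struct bw_pool *p, uint64_t bw)
{
	int overflow = 0;

	if (req == bw_req_free) {
		p->used -= bw;			/* caller must own this BW */
	} else {
		overflow = (p->used + bw > p->cap);
		if (req == bw_req_alloc && !overflow)
			p->used += bw;		/* reserve; can't fail later */
	}
	return overflow ? -1 : 0;		/* -1 stands in for -EBUSY */
}

int bw_check_overflow(struct bw_pool *p)	{ return bw_manage(bw_req_check, p, 0); }
int bw_alloc(struct bw_pool *p, uint64_t bw)	{ return bw_manage(bw_req_alloc, p, bw); }
void bw_free(struct bw_pool *p, uint64_t bw)	{ bw_manage(bw_req_free, p, bw); }
```

Keeping check, alloc, and free in one helper means the overflow test and the reservation happen under the same critical section, which is the point of the kernel's dl_b->lock around dl_bw_manage().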

2023-03-29 14:43:24

by Waiman Long

Subject: Re: [PATCH 5/6] cgroup/cpuset: Free DL BW in case can_attach() fails


On 3/29/23 08:55, Juri Lelli wrote:
> From: Dietmar Eggemann <[email protected]>
>
> cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks
> have been checked. DL BW is not allocated per-task but as a sum over
> all DL tasks migrating.
>
> If multiple controllers are attached to the cgroup next to the cuset
Typo: "cuset" => "cpuset"
> controller a non-cpuset can_attach() can fail. In this case free DL BW
> in cpuset_cancel_attach().
>
> Finally, update cpuset DL task count (nr_deadline_tasks) only in
> cpuset_attach().
>
> Suggested-by: Waiman Long <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> include/linux/sched.h | 2 +-
> kernel/cgroup/cpuset.c | 55 ++++++++++++++++++++++++++++++++++++++----
> kernel/sched/core.c | 17 ++-----------
> 3 files changed, 53 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6f3d84e0ed08..50cbbfefbe11 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1847,7 +1847,7 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
> }
>
> extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
> -extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_effective_cpus);
> +extern int task_can_attach(struct task_struct *p);
> extern int dl_bw_alloc(int cpu, u64 dl_bw);
> extern void dl_bw_free(int cpu, u64 dl_bw);
> #ifdef CONFIG_SMP
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index eb0854ef9757..f8ebec66da51 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -198,6 +198,8 @@ struct cpuset {
> * know when to rebuild associated root domain bandwidth information.
> */
> int nr_deadline_tasks;
> + int nr_migrate_dl_tasks;
> + u64 sum_migrate_dl_bw;
>
> /* Invalid partition error code, not lock protected */
> enum prs_errcode prs_err;
> @@ -2464,16 +2466,23 @@ static int fmeter_getrate(struct fmeter *fmp)
>
> static struct cpuset *cpuset_attach_old_cs;
>
> +static void reset_migrate_dl_data(struct cpuset *cs)
> +{
> + cs->nr_migrate_dl_tasks = 0;
> + cs->sum_migrate_dl_bw = 0;
> +}
> +
> /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
> static int cpuset_can_attach(struct cgroup_taskset *tset)
> {
> struct cgroup_subsys_state *css;
> - struct cpuset *cs;
> + struct cpuset *cs, *oldcs;
> struct task_struct *task;
> int ret;
>
> /* used later by cpuset_attach() */
> cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> + oldcs = cpuset_attach_old_cs;
> cs = css_cs(css);
>
> mutex_lock(&cpuset_mutex);
> @@ -2491,7 +2500,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> goto out_unlock;
>
> cgroup_taskset_for_each(task, css, tset) {
> - ret = task_can_attach(task, cs->effective_cpus);
> + ret = task_can_attach(task);
> if (ret)
> goto out_unlock;
> ret = security_task_setscheduler(task);
> @@ -2499,11 +2508,31 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> goto out_unlock;
>
> if (dl_task(task)) {
> - cs->nr_deadline_tasks++;
> - cpuset_attach_old_cs->nr_deadline_tasks--;
> + cs->nr_migrate_dl_tasks++;
> + cs->sum_migrate_dl_bw += task->dl.dl_bw;
> + }
> + }
> +
> + if (!cs->nr_migrate_dl_tasks)
> + goto out_succes;
> +
> + if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
> + int cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> +
> + if (unlikely(cpu >= nr_cpu_ids)) {
> + reset_migrate_dl_data(cs);
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
> + ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> + if (ret) {
> + reset_migrate_dl_data(cs);
> + goto out_unlock;
> }
> }
>
> +out_succes:
> /*
> * Mark attach is in progress. This makes validate_change() fail
> * changes which zero cpus/mems_allowed.
> @@ -2518,11 +2547,21 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> static void cpuset_cancel_attach(struct cgroup_taskset *tset)
> {
> struct cgroup_subsys_state *css;
> + struct cpuset *cs;
>
> cgroup_taskset_first(tset, &css);
> + cs = css_cs(css);
>
> mutex_lock(&cpuset_mutex);
> - css_cs(css)->attach_in_progress--;
> + cs->attach_in_progress--;
> +
> + if (cs->nr_migrate_dl_tasks) {
> + int cpu = cpumask_any(cs->effective_cpus);
> +
> + dl_bw_free(cpu, cs->sum_migrate_dl_bw);
> + reset_migrate_dl_data(cs);
> + }
> +
> mutex_unlock(&cpuset_mutex);
> }
>
> @@ -2617,6 +2656,12 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> out:
> cs->old_mems_allowed = cpuset_attach_nodemask_to;
>
> + if (cs->nr_migrate_dl_tasks) {
> + cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
> + oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
> + reset_migrate_dl_data(cs);
> + }
> +
> cs->attach_in_progress--;
> if (!cs->attach_in_progress)
> wake_up(&cpuset_attach_wq);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c83dae6b8586..10454980e830 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9269,8 +9269,7 @@ int cpuset_cpumask_can_shrink(const struct cpumask *cur,
> return ret;
> }
>
> -int task_can_attach(struct task_struct *p,
> - const struct cpumask *cs_effective_cpus)
> +int task_can_attach(struct task_struct *p)
> {
> int ret = 0;
>
> @@ -9283,21 +9282,9 @@ int task_can_attach(struct task_struct *p,
> * success of set_cpus_allowed_ptr() on all attached tasks
> * before cpus_mask may be changed.
> */
> - if (p->flags & PF_NO_SETAFFINITY) {
> + if (p->flags & PF_NO_SETAFFINITY)
> ret = -EINVAL;
> - goto out;
> - }
> -
> - if (dl_task(p) && !cpumask_intersects(task_rq(p)->rd->span,
> - cs_effective_cpus)) {
> - int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);
>
> - if (unlikely(cpu >= nr_cpu_ids))
> - return -EINVAL;
> - ret = dl_bw_alloc(cpu, p->dl.dl_bw);
> - }
> -
> -out:
> return ret;
> }
>
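The bulk accounting introduced by this patch — sum the bandwidth of all migrating DL tasks, reserve it in a single dl_bw_alloc()-style call, and remember the sum so cpuset_cancel_attach() can return it — can be sketched in isolation. Hypothetical userspace model with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

struct pool { uint64_t cap, used; };

int pool_alloc(struct pool *p, uint64_t bw)
{
	if (p->used + bw > p->cap)
		return -1;		/* stands in for -EBUSY */
	p->used += bw;
	return 0;
}

void pool_free(struct pool *p, uint64_t bw) { p->used -= bw; }

/* Reserve bandwidth for a whole task set at once, not per task. */
int can_attach(struct pool *p, const uint64_t *task_bw, int n,
	       uint64_t *sum_out)
{
	uint64_t sum = 0;

	for (int i = 0; i < n; i++)
		sum += task_bw[i];	/* one pass, one reservation */
	if (pool_alloc(p, sum))
		return -1;
	*sum_out = sum;			/* remembered for the cancel path */
	return 0;
}
```

Reserving the sum once avoids partially-applied per-task allocations that would have to be unwound individually if a later can_attach() step fails.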

2023-03-29 14:43:26

by Waiman Long

Subject: Re: [PATCH 4/6] sched/deadline: Create DL BW alloc, free & check overflow interface


On 3/29/23 08:55, Juri Lelli wrote:
> From: Dietmar Eggemann <[email protected]>
>
> Rework the existing dl_cpu_busy() interface which offers DL BW overflow
> checking and per-task DL BW allocation.
>
> Add dl_bw_free() as an interface to be able to free DL BW.
> It will be used to allow freeing of the DL BW request done during
> cpuset_can_attach() in case multiple controllers are attached to the
> cgroup next to the cpuset controller and one of the non-cpuset
> can_attach() fails.
>
> dl_bw_alloc() (and dl_bw_free()) now take a `u64 dl_bw` parameter
> instead of `struct task_struct *p` used in dl_cpu_busy(). This allows
> to allocate DL BW for a set of tasks too rater than only for a single
Typo: "rater" => "rather"
> task.
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> include/linux/sched.h | 2 ++
> kernel/sched/core.c | 4 ++--
> kernel/sched/deadline.c | 53 +++++++++++++++++++++++++++++++----------
> kernel/sched/sched.h | 2 +-
> 4 files changed, 45 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6d654eb4cabd..6f3d84e0ed08 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1848,6 +1848,8 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
>
> extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
> extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_effective_cpus);
> +extern int dl_bw_alloc(int cpu, u64 dl_bw);
> +extern void dl_bw_free(int cpu, u64 dl_bw);
> #ifdef CONFIG_SMP
> extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
> extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 179266ff653f..c83dae6b8586 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9294,7 +9294,7 @@ int task_can_attach(struct task_struct *p,
>
> if (unlikely(cpu >= nr_cpu_ids))
> return -EINVAL;
> - ret = dl_cpu_busy(cpu, p);
> + ret = dl_bw_alloc(cpu, p->dl.dl_bw);
> }
>
> out:
> @@ -9579,7 +9579,7 @@ static void cpuset_cpu_active(void)
> static int cpuset_cpu_inactive(unsigned int cpu)
> {
> if (!cpuhp_tasks_frozen) {
> - int ret = dl_cpu_busy(cpu, NULL);
> + int ret = dl_bw_check_overflow(cpu);
>
> if (ret)
> return ret;
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 8f92f0f87383..5b6965e0e537 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -3057,26 +3057,38 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
> return ret;
> }
>
> -int dl_cpu_busy(int cpu, struct task_struct *p)
> +enum dl_bw_request {
> + dl_bw_req_check_overflow = 0,
> + dl_bw_req_alloc,
> + dl_bw_req_free
> +};
> +
> +static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
> {
> - unsigned long flags, cap;
> + unsigned long flags;
> struct dl_bw *dl_b;
> - bool overflow;
> + bool overflow = 0;
>
> rcu_read_lock_sched();
> dl_b = dl_bw_of(cpu);
> raw_spin_lock_irqsave(&dl_b->lock, flags);
> - cap = dl_bw_capacity(cpu);
> - overflow = __dl_overflow(dl_b, cap, 0, p ? p->dl.dl_bw : 0);
>
> - if (!overflow && p) {
> - /*
> - * We reserve space for this task in the destination
> - * root_domain, as we can't fail after this point.
> - * We will free resources in the source root_domain
> - * later on (see set_cpus_allowed_dl()).
> - */
> - __dl_add(dl_b, p->dl.dl_bw, dl_bw_cpus(cpu));
> + if (req == dl_bw_req_free) {
> + __dl_sub(dl_b, dl_bw, dl_bw_cpus(cpu));
> + } else {
> + unsigned long cap = dl_bw_capacity(cpu);
> +
> + overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
> +
> + if (req == dl_bw_req_alloc && !overflow) {
> + /*
> + * We reserve space in the destination
> + * root_domain, as we can't fail after this point.
> + * We will free resources in the source root_domain
> + * later on (see set_cpus_allowed_dl()).
> + */
> + __dl_add(dl_b, dl_bw, dl_bw_cpus(cpu));
> + }
> }
>
> raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> @@ -3084,6 +3096,21 @@ int dl_cpu_busy(int cpu, struct task_struct *p)
>
> return overflow ? -EBUSY : 0;
> }
> +
> +int dl_bw_check_overflow(int cpu)
> +{
> + return dl_bw_manage(dl_bw_req_check_overflow, cpu, 0);
> +}
> +
> +int dl_bw_alloc(int cpu, u64 dl_bw)
> +{
> + return dl_bw_manage(dl_bw_req_alloc, cpu, dl_bw);
> +}
> +
> +void dl_bw_free(int cpu, u64 dl_bw)
> +{
> + dl_bw_manage(dl_bw_req_free, cpu, dl_bw);
> +}
> #endif
>
> #ifdef CONFIG_SCHED_DEBUG
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 060616944d7a..81ecfd1a1a48 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -330,7 +330,7 @@ extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
> extern bool __checkparam_dl(const struct sched_attr *attr);
> extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
> extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
> -extern int dl_cpu_busy(int cpu, struct task_struct *p);
> +extern int dl_bw_check_overflow(int cpu);
>
> #ifdef CONFIG_CGROUP_SCHED
>

2023-03-29 14:43:50

by Waiman Long

Subject: Re: [PATCH 0/6] sched/deadline: cpuset: Rework DEADLINE bandwidth restoration

On 3/29/23 08:55, Juri Lelli wrote:
> Qais reported [1] that iterating over all tasks when rebuilding root
> domains for finding out which ones are DEADLINE and need their bandwidth
> correctly restored on such root domains can be a costly operation (10+
> ms delays on suspend-resume). He proposed we skip rebuilding root
> domains for certain operations, but that approach seemed arch specific
> and possibly prone to errors, as paths that ultimately trigger a rebuild
> might be quite convoluted (thanks Qais for spending time on this!).
>
> To fix the problem
>
> 01/06 - Rename functions deadline with DEADLINE accounting (cleanup
> suggested by Qais) - no functional change
> 02/06 - Bring back cpuset_mutex (so that we have write access to cpusets
> from scheduler operations - and we also fix some problems
> associated to percpu_cpuset_rwsem)
> 03/06 - Keep track of the number of DEADLINE tasks belonging to each cpuset
> 04/06 - Create DL BW alloc, free & check overflow interface for bulk
> bandwidth allocation/removal - no functional change
> 05/06 - Fix bandwidth allocation handling for cgroup operation
> involving multiple tasks
> 06/06 - Use this information to only perform the costly iteration if
> DEADLINE tasks are actually present in the cpuset for which a
> corresponding root domain is being rebuilt
>
> With respect to the RFC posting [2]
>
> 1 - rename DEADLINE bandwidth accounting functions - Qais
> 2 - call inc/dec_dl_tasks_cs from switched_{to,from}_dl - Qais
> 3 - fix DEADLINE bandwidth allocation with multiple tasks - Waiman,
> contributed by Dietmar
>
> This set is also available from
>
> https://github.com/jlelli/linux.git deadline/rework-cpusets
>
> Best,
> Juri
>
> 1 - https://lore.kernel.org/lkml/[email protected]/
> 2 - https://lore.kernel.org/lkml/[email protected]/
>
> Dietmar Eggemann (2):
> sched/deadline: Create DL BW alloc, free & check overflow interface
> cgroup/cpuset: Free DL BW in case can_attach() fails
>
> Juri Lelli (4):
> cgroup/cpuset: Rename functions dealing with DEADLINE accounting
> sched/cpuset: Bring back cpuset_mutex
> sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets
> cgroup/cpuset: Iterate only if DEADLINE tasks are present
>
> include/linux/cpuset.h | 12 ++-
> include/linux/sched.h | 4 +-
> kernel/cgroup/cgroup.c | 4 +
> kernel/cgroup/cpuset.c | 232 ++++++++++++++++++++++++++--------------
> kernel/sched/core.c | 41 ++++---
> kernel/sched/deadline.c | 67 +++++++++---
> kernel/sched/sched.h | 2 +-
> 7 files changed, 240 insertions(+), 122 deletions(-)

Other than some minor issues that I have talked in earlier emails, the
patch series looks good to me.

You can add my ack once the issues are fixed.

Acked-by: Waiman Long <[email protected]>

2023-03-29 14:44:33

by Waiman Long

Subject: Re: [PATCH 5/6] cgroup/cpuset: Free DL BW in case can_attach() fails

On 3/29/23 10:25, Waiman Long wrote:
>
> On 3/29/23 08:55, Juri Lelli wrote:
>> From: Dietmar Eggemann <[email protected]>
>>
>> cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks
>> have been checked. DL BW is not allocated per-task but as a sum over
>> all DL tasks migrating.
>>
>> If multiple controllers are attached to the cgroup next to the cuset
> Typo: : "cuset" => "cpuset"
>> controller a non-cpuset can_attach() can fail. In this case free DL BW
>> in cpuset_cancel_attach().
>>
>> Finally, update cpuset DL task count (nr_deadline_tasks) only in
>> cpuset_attach().
>>
>> Suggested-by: Waiman Long <[email protected]>
>> Signed-off-by: Dietmar Eggemann <[email protected]>
>> Signed-off-by: Juri Lelli <[email protected]>
>> ---
>>   include/linux/sched.h  |  2 +-
>>   kernel/cgroup/cpuset.c | 55 ++++++++++++++++++++++++++++++++++++++----
>>   kernel/sched/core.c    | 17 ++-----------
>>   3 files changed, 53 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 6f3d84e0ed08..50cbbfefbe11 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1847,7 +1847,7 @@ current_restore_flags(unsigned long orig_flags,
>> unsigned long flags)
>>   }
>>     extern int cpuset_cpumask_can_shrink(const struct cpumask *cur,
>> const struct cpumask *trial);
>> -extern int task_can_attach(struct task_struct *p, const struct
>> cpumask *cs_effective_cpus);
>> +extern int task_can_attach(struct task_struct *p);
>>   extern int dl_bw_alloc(int cpu, u64 dl_bw);
>>   extern void dl_bw_free(int cpu, u64 dl_bw);
>>   #ifdef CONFIG_SMP
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index eb0854ef9757..f8ebec66da51 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -198,6 +198,8 @@ struct cpuset {
>>        * know when to rebuild associated root domain bandwidth
>> information.
>>        */
>>       int nr_deadline_tasks;
>> +    int nr_migrate_dl_tasks;
>> +    u64 sum_migrate_dl_bw;
>>         /* Invalid partition error code, not lock protected */
>>       enum prs_errcode prs_err;
>> @@ -2464,16 +2466,23 @@ static int fmeter_getrate(struct fmeter *fmp)
>>     static struct cpuset *cpuset_attach_old_cs;
>>   +static void reset_migrate_dl_data(struct cpuset *cs)
>> +{
>> +    cs->nr_migrate_dl_tasks = 0;
>> +    cs->sum_migrate_dl_bw = 0;
>> +}
>> +
>>   /* Called by cgroups to determine if a cpuset is usable;
>> cpuset_mutex held */
>>   static int cpuset_can_attach(struct cgroup_taskset *tset)
>>   {
>>       struct cgroup_subsys_state *css;
>> -    struct cpuset *cs;
>> +    struct cpuset *cs, *oldcs;
>>       struct task_struct *task;
>>       int ret;
>>         /* used later by cpuset_attach() */
>>       cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
>> +    oldcs = cpuset_attach_old_cs;
>>       cs = css_cs(css);
>>         mutex_lock(&cpuset_mutex);
>> @@ -2491,7 +2500,7 @@ static int cpuset_can_attach(struct
>> cgroup_taskset *tset)
>>           goto out_unlock;
>>         cgroup_taskset_for_each(task, css, tset) {
>> -        ret = task_can_attach(task, cs->effective_cpus);
>> +        ret = task_can_attach(task);
>>           if (ret)
>>               goto out_unlock;
>>           ret = security_task_setscheduler(task);
>> @@ -2499,11 +2508,31 @@ static int cpuset_can_attach(struct
>> cgroup_taskset *tset)
>>               goto out_unlock;
>>             if (dl_task(task)) {
>> -            cs->nr_deadline_tasks++;
>> -            cpuset_attach_old_cs->nr_deadline_tasks--;
>> +            cs->nr_migrate_dl_tasks++;
>> +            cs->sum_migrate_dl_bw += task->dl.dl_bw;
>> +        }
>> +    }
>> +
>> +    if (!cs->nr_migrate_dl_tasks)
>> +        goto out_succes;
>> +
>> +    if (!cpumask_intersects(oldcs->effective_cpus,
>> cs->effective_cpus)) {
>> +        int cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
>> +
>> +        if (unlikely(cpu >= nr_cpu_ids)) {
>> +            reset_migrate_dl_data(cs);
>> +            ret = -EINVAL;
>> +            goto out_unlock;
>> +        }
>> +
>> +        ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
>> +        if (ret) {
>> +            reset_migrate_dl_data(cs);
>> +            goto out_unlock;
>>           }
>>       }
>>   +out_succes:
>>       /*
>>        * Mark attach is in progress.  This makes validate_change() fail
>>        * changes which zero cpus/mems_allowed.
>> @@ -2518,11 +2547,21 @@ static int cpuset_can_attach(struct
>> cgroup_taskset *tset)
>>   static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>>   {
>>       struct cgroup_subsys_state *css;
>> +    struct cpuset *cs;
>>         cgroup_taskset_first(tset, &css);
>> +    cs = css_cs(css);
>>         mutex_lock(&cpuset_mutex);
>> -    css_cs(css)->attach_in_progress--;
>> +    cs->attach_in_progress--;
>> +
>> +    if (cs->nr_migrate_dl_tasks) {
>> +        int cpu = cpumask_any(cs->effective_cpus);
>> +
>> +        dl_bw_free(cpu, cs->sum_migrate_dl_bw);
>> +        reset_migrate_dl_data(cs);
>> +    }
>> +

Another nit that I have is that you may also have to record the CPU
where the DL bandwidth is allocated in cpuset_can_attach() and free the
bandwidth back to that same CPU, or there can be an underflow if another
CPU is chosen.

Cheers,
Longman
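The underflow scenario described here is easy to model: bandwidth reserved against one root domain's accounting but released against another leaves the first domain leaking while the second goes negative (with the kernel's unsigned totals, that wraps to a huge bogus value). A hypothetical sketch, with a signed counter so the bug is visible:

```c
#include <assert.h>
#include <stdint.h>

/* Signed here purely to expose the bug; the kernel uses unsigned totals. */
struct rd_acct { int64_t total_bw; };

void bw_add(struct rd_acct *rd, int64_t bw) { rd->total_bw += bw; }
void bw_sub(struct rd_acct *rd, int64_t bw) { rd->total_bw -= bw; }
```

Freeing against the CPU recorded at allocation time guarantees add and sub hit the same accounting structure.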

2023-03-29 16:30:37

by Waiman Long

Subject: [PATCH 6/7] cgroup/cpuset: Protect DL BW data against parallel cpuset_attach()

It is possible to have parallel attach operations to the same cpuset in
progress. To avoid possible corruption of the single set of DL BW data
in the cpuset structure, we have to disallow parallel attach operations
if DL tasks are present. Attach operations can still proceed in parallel
as long as no DL tasks are involved.

This patch also stores the CPU where DL BW is allocated and frees that
BW back to the same CPU in case cpuset_cancel_attach() is called.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 05c0a1255218..555a6b1a2b76 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -199,6 +199,7 @@ struct cpuset {
*/
int nr_deadline_tasks;
int nr_migrate_dl_tasks;
+ int dl_bw_cpu;
u64 sum_migrate_dl_bw;

/* Invalid partition error code, not lock protected */
@@ -2502,6 +2503,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
if (cpumask_empty(cs->effective_cpus))
goto out_unlock;

+ /*
+ * If there is another parallel attach operation in progress for
+ * the same cpuset, the single set of DL data there may get
+ * incorrectly overwritten. So parallel operations are not allowed
+ * if DL tasks are present.
+ */
+ ret = -EBUSY;
+ if (cs->nr_migrate_dl_tasks)
+ goto out_unlock;
+
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task);
if (ret)
@@ -2511,6 +2522,9 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
goto out_unlock;

if (dl_task(task)) {
+ if (cs->attach_in_progress)
+ goto out_unlock;
+
cs->nr_migrate_dl_tasks++;
cs->sum_migrate_dl_bw += task->dl.dl_bw;
}
@@ -2533,6 +2547,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
reset_migrate_dl_data(cs);
goto out_unlock;
}
+ cs->dl_bw_cpu = cpu;
}

out_succes:
@@ -2559,9 +2574,7 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
cs->attach_in_progress--;

if (cs->nr_migrate_dl_tasks) {
- int cpu = cpumask_any(cs->effective_cpus);
-
- dl_bw_free(cpu, cs->sum_migrate_dl_bw);
+ dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
reset_migrate_dl_data(cs);
}

--
2.31.1

2023-03-29 16:52:43

by Dietmar Eggemann

Subject: Re: [PATCH 5/6] cgroup/cpuset: Free DL BW in case can_attach() fails

On 29/03/2023 16:31, Waiman Long wrote:
> On 3/29/23 10:25, Waiman Long wrote:
>>
>> On 3/29/23 08:55, Juri Lelli wrote:
>>> From: Dietmar Eggemann <[email protected]>

[...]

>>> @@ -2518,11 +2547,21 @@ static int cpuset_can_attach(struct
>>> cgroup_taskset *tset)
>>>   static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>>>   {
>>>       struct cgroup_subsys_state *css;
>>> +    struct cpuset *cs;
>>>         cgroup_taskset_first(tset, &css);
>>> +    cs = css_cs(css);
>>>         mutex_lock(&cpuset_mutex);
>>> -    css_cs(css)->attach_in_progress--;
>>> +    cs->attach_in_progress--;
>>> +
>>> +    if (cs->nr_migrate_dl_tasks) {
>>> +        int cpu = cpumask_any(cs->effective_cpus);
>>> +
>>> +        dl_bw_free(cpu, cs->sum_migrate_dl_bw);
>>> +        reset_migrate_dl_data(cs);
>>> +    }
>>> +
>
> Another nit that I have: you may also have to record the cpu where the
> DL bandwidth is allocated in cpuset_can_attach() and free the bandwidth
> back to that same cpu, or there can be an underflow if another cpu is
> chosen.

Many thanks for the review!

But isn't the DL BW control `struct dl_bw` per `struct root_domain`,
which is per exclusive cpuset? So as long as the cpu is from
`cs->effective_cpus`, shouldn't this be fine?


2023-03-29 18:13:40

by Waiman Long

Subject: Re: [PATCH 5/6] cgroup/cpuset: Free DL BW in case can_attach() fails

On 3/29/23 12:39, Dietmar Eggemann wrote:
> On 29/03/2023 16:31, Waiman Long wrote:
>> On 3/29/23 10:25, Waiman Long wrote:
>>> On 3/29/23 08:55, Juri Lelli wrote:
>>>> From: Dietmar Eggemann <[email protected]>
> [...]
>
>>>> @@ -2518,11 +2547,21 @@ static int cpuset_can_attach(struct
>>>> cgroup_taskset *tset)
>>>>   static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>>>>   {
>>>>       struct cgroup_subsys_state *css;
>>>> +    struct cpuset *cs;
>>>>         cgroup_taskset_first(tset, &css);
>>>> +    cs = css_cs(css);
>>>>         mutex_lock(&cpuset_mutex);
>>>> -    css_cs(css)->attach_in_progress--;
>>>> +    cs->attach_in_progress--;
>>>> +
>>>> +    if (cs->nr_migrate_dl_tasks) {
>>>> +        int cpu = cpumask_any(cs->effective_cpus);
>>>> +
>>>> +        dl_bw_free(cpu, cs->sum_migrate_dl_bw);
>>>> +        reset_migrate_dl_data(cs);
>>>> +    }
>>>> +
>> Another nit that I have: you may also have to record the cpu where the
>> DL bandwidth is allocated in cpuset_can_attach() and free the bandwidth
>> back to that same cpu, or there can be an underflow if another cpu is
>> chosen.
> Many thanks for the review!
>
> But isn't the DL BW control `struct dl_bw` per `struct root_domain`,
> which is per exclusive cpuset? So as long as the cpu is from
> `cs->effective_cpus`, shouldn't this be fine?

Sorry for my ignorance on how the deadline bandwidth operations work. I
checked the bandwidth code and found that we are storing the bandwidth
information in the root domain, not on the cpu. That shouldn't be a
concern then.

However, I still have some questions on how that works when dealing with
cpusets. First of all, not all the CPUs in a given root domain are in
the cpuset. So there may be enough bandwidth on the root domain, but it
doesn't mean there will be enough bandwidth in the set of CPUs in a
particular cpuset. Secondly, how do you deal with isolated CPUs that do
not have a corresponding root domain? It is now possible to create a
cpuset with isolated CPUs.

Cheers,
Longman

2023-03-30 13:39:09

by Dietmar Eggemann

Subject: Re: [PATCH 6/7] cgroup/cpuset: Protect DL BW data against parallel cpuset_attach()

On 29/03/2023 18:02, Waiman Long wrote:
> It is possible to have parallel attach operations to the same cpuset in
> progress. To avoid possible corruption of the single set of DL BW data in
> the cpuset structure, we have to disallow parallel attach operations if
> DL tasks are present. Attach operations can still proceed in parallel
> as long as no DL tasks are involved.
>
> This patch also stores the CPU where DL BW is allocated and frees that BW
> back to the same CPU in case cpuset_cancel_attach() is called.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/cgroup/cpuset.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 05c0a1255218..555a6b1a2b76 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -199,6 +199,7 @@ struct cpuset {
> */
> int nr_deadline_tasks;
> int nr_migrate_dl_tasks;
> + int dl_bw_cpu;

Like I mentioned in
https://lkml.kernel.org/r/[email protected]
IMHO any CPU of the cpuset is fine, since an exclusive cpuset and its
related root_domain (as the container for the DL BW accounting data) are
congruent in terms of cpumask.

> u64 sum_migrate_dl_bw;
>
> /* Invalid partition error code, not lock protected */
> @@ -2502,6 +2503,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> if (cpumask_empty(cs->effective_cpus))
> goto out_unlock;
>
> + /*
> + * If there is another parallel attach operation in progress for
> + * the same cpuset, the single set of DL data there may get
> + * incorrectly overwritten. So parallel operations are not allowed
> + * if DL tasks are present.
> + */
> + ret = -EBUSY;
> + if (cs->nr_migrate_dl_tasks)
> + goto out_unlock;

(1)

> cgroup_taskset_for_each(task, css, tset) {
> ret = task_can_attach(task);
> if (ret)
> @@ -2511,6 +2522,9 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> goto out_unlock;
>
> if (dl_task(task)) {
> + if (cs->attach_in_progress)
> + goto out_unlock;

(2)

Just to check if I get this right, two bail-out conditions are necessary
because:

(1) is to prevent any new cs attach if there is already a DL cs attach
and (2) is to prevent a new DL cs attach if there is already a non-DL cs
attach.

> cs->nr_migrate_dl_tasks++;
> cs->sum_migrate_dl_bw += task->dl.dl_bw;
> }

[...]

2023-03-30 15:19:49

by Dietmar Eggemann

Subject: Re: [PATCH 5/6] cgroup/cpuset: Free DL BW in case can_attach() fails

On 29/03/2023 20:09, Waiman Long wrote:
> On 3/29/23 12:39, Dietmar Eggemann wrote:
>> On 29/03/2023 16:31, Waiman Long wrote:
>>> On 3/29/23 10:25, Waiman Long wrote:
>>>> On 3/29/23 08:55, Juri Lelli wrote:
>>>>> From: Dietmar Eggemann <[email protected]>
>> [...]
>>
>>>>> @@ -2518,11 +2547,21 @@ static int cpuset_can_attach(struct
>>>>> cgroup_taskset *tset)
>>>>>    static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>>>>>    {
>>>>>        struct cgroup_subsys_state *css;
>>>>> +    struct cpuset *cs;
>>>>>          cgroup_taskset_first(tset, &css);
>>>>> +    cs = css_cs(css);
>>>>>          mutex_lock(&cpuset_mutex);
>>>>> -    css_cs(css)->attach_in_progress--;
>>>>> +    cs->attach_in_progress--;
>>>>> +
>>>>> +    if (cs->nr_migrate_dl_tasks) {
>>>>> +        int cpu = cpumask_any(cs->effective_cpus);
>>>>> +
>>>>> +        dl_bw_free(cpu, cs->sum_migrate_dl_bw);
>>>>> +        reset_migrate_dl_data(cs);
>>>>> +    }
>>>>> +
>>> Another nit that I have: you may also have to record the cpu where the
>>> DL bandwidth is allocated in cpuset_can_attach() and free the bandwidth
>>> back to that same cpu, or there can be an underflow if another cpu is
>>> chosen.
>> Many thanks for the review!
>>
>> But isn't the DL BW control `struct dl_bw` per `struct root_domain`,
>> which is per exclusive cpuset? So as long as the cpu is from
>> `cs->effective_cpus`, shouldn't this be fine?
>
> Sorry for my ignorance on how the deadline bandwidth operations work. I
> checked the bandwidth code and found that we are storing the bandwidth
> information in the root domain, not on the cpu. That shouldn't be a
> concern then.
>
> However, I still have some questions on how that works when dealing with
> cpusets. First of all, not all the CPUs in a given root domain are in
> the cpuset. So there may be enough bandwidth on the root domain, but it
> doesn't mean there will be enough bandwidth in the set of CPUs in a
> particular cpuset. Secondly, how do you deal with isolated CPUs that do
> not have a corresponding root domain? It is now possible to create a
> cpuset with isolated CPUs.

Sorry, I overlooked this email somehow.

IMHO, this is only done for exclusive cpusets:

cpuset_can_attach()

if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus))

So they should have their own root_domain congruent to their cpumask.

2023-03-30 16:29:10

by Waiman Long

Subject: Re: [PATCH 5/6] cgroup/cpuset: Free DL BW in case can_attach() fails

On 3/30/23 11:14, Dietmar Eggemann wrote:
> On 29/03/2023 20:09, Waiman Long wrote:
>> On 3/29/23 12:39, Dietmar Eggemann wrote:
>>> On 29/03/2023 16:31, Waiman Long wrote:
>>>> On 3/29/23 10:25, Waiman Long wrote:
>>>>> On 3/29/23 08:55, Juri Lelli wrote:
>>>>>> From: Dietmar Eggemann <[email protected]>
>>> [...]
>>>
>>>>>> @@ -2518,11 +2547,21 @@ static int cpuset_can_attach(struct
>>>>>> cgroup_taskset *tset)
>>>>>>    static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>>>>>>    {
>>>>>>        struct cgroup_subsys_state *css;
>>>>>> +    struct cpuset *cs;
>>>>>>          cgroup_taskset_first(tset, &css);
>>>>>> +    cs = css_cs(css);
>>>>>>          mutex_lock(&cpuset_mutex);
>>>>>> -    css_cs(css)->attach_in_progress--;
>>>>>> +    cs->attach_in_progress--;
>>>>>> +
>>>>>> +    if (cs->nr_migrate_dl_tasks) {
>>>>>> +        int cpu = cpumask_any(cs->effective_cpus);
>>>>>> +
>>>>>> +        dl_bw_free(cpu, cs->sum_migrate_dl_bw);
>>>>>> +        reset_migrate_dl_data(cs);
>>>>>> +    }
>>>>>> +
>>>> Another nit that I have: you may also have to record the cpu where the
>>>> DL bandwidth is allocated in cpuset_can_attach() and free the bandwidth
>>>> back to that same cpu, or there can be an underflow if another cpu is
>>>> chosen.
>>> Many thanks for the review!
>>>
>>> But isn't the DL BW control `struct dl_bw` per `struct root_domain`,
>>> which is per exclusive cpuset? So as long as the cpu is from
>>> `cs->effective_cpus`, shouldn't this be fine?
>> Sorry for my ignorance on how the deadline bandwidth operations work. I
>> checked the bandwidth code and found that we are storing the bandwidth
>> information in the root domain, not on the cpu. That shouldn't be a
>> concern then.
>>
>> However, I still have some questions on how that works when dealing with
>> cpusets. First of all, not all the CPUs in a given root domain are in
>> the cpuset. So there may be enough bandwidth on the root domain, but it
>> doesn't mean there will be enough bandwidth in the set of CPUs in a
>> particular cpuset. Secondly, how do you deal with isolated CPUs that do
>> not have a corresponding root domain? It is now possible to create a
>> cpuset with isolated CPUs.
> Sorry, I overlooked this email somehow.
>
> IMHO, this is only done for exclusive cpusets:
>
> cpuset_can_attach()
>
> if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus))
>
> So they should have their own root_domain congruent to their cpumask.

I am sorry that I missed that check.

Parallel attach is actually an existing problem in cpuset, as there is a
shared cpuset_attach_old_cs variable used between cpuset_can_attach()
and cpuset_attach(). Any parallel attach can therefore corrupt this
common data and produce incorrect results, so the problem is not
specific to this patch series. Please ignore this patch for now; it has
to be addressed separately.

Cheers,
Longman

2023-04-04 17:57:00

by Waiman Long

Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

On 3/29/23 08:55, Juri Lelli wrote:
> Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset:
> Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
> as it has been reported to cause slowdowns in workloads that need to
> change cpuset configuration frequently and it is also not implementing
> priority inheritance (which causes troubles with realtime workloads).
>
> Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
> only for SCHED_DEADLINE tasks (other policies don't care about stable
> cpusets anyway).
>
> Signed-off-by: Juri Lelli <[email protected]>

I am thinking that maybe we should switch the percpu rwsem to a regular
rwsem, as there are cases where a read lock is sufficient. This would
also avoid the potential PREEMPT_RT problem with PI and reduce the time
needed to take a write lock.

Cheers,
Longman

2023-04-04 20:09:20

by Qais Yousef

Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

On 03/29/23 14:55, Juri Lelli wrote:
> Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset:
> Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
> as it has been reported to cause slowdowns in workloads that need to
> change cpuset configuration frequently and it is also not implementing
> priority inheritance (which causes troubles with realtime workloads).
>
> Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
> only for SCHED_DEADLINE tasks (other policies don't care about stable
> cpusets anyway).
>
> Signed-off-by: Juri Lelli <[email protected]>
> ---

Reviewed-by: Qais Yousef <[email protected]>
Tested-by: Qais Yousef <[email protected]>


Thanks!

--
Qais Yousef

> include/linux/cpuset.h | 8 +--
> kernel/cgroup/cpuset.c | 145 ++++++++++++++++++++---------------------
> kernel/sched/core.c | 22 +++++--
> 3 files changed, 91 insertions(+), 84 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..355f796c5f07 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -71,8 +71,8 @@ extern void cpuset_init_smp(void);
> extern void cpuset_force_rebuild(void);
> extern void cpuset_update_active_cpus(void);
> extern void cpuset_wait_for_hotplug(void);
> -extern void cpuset_read_lock(void);
> -extern void cpuset_read_unlock(void);
> +extern void cpuset_lock(void);
> +extern void cpuset_unlock(void);
> extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
> extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
> extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
> @@ -196,8 +196,8 @@ static inline void cpuset_update_active_cpus(void)
>
> static inline void cpuset_wait_for_hotplug(void) { }
>
> -static inline void cpuset_read_lock(void) { }
> -static inline void cpuset_read_unlock(void) { }
> +static inline void cpuset_lock(void) { }
> +static inline void cpuset_unlock(void) { }
>
> static inline void cpuset_cpus_allowed(struct task_struct *p,
> struct cpumask *mask)
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 501913bc2805..fbc10b494292 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -366,22 +366,21 @@ static struct cpuset top_cpuset = {
> if (is_cpuset_online(((des_cs) = css_cs((pos_css)))))
>
> /*
> - * There are two global locks guarding cpuset structures - cpuset_rwsem and
> + * There are two global locks guarding cpuset structures - cpuset_mutex and
> * callback_lock. We also require taking task_lock() when dereferencing a
> * task's cpuset pointer. See "The task_lock() exception", at the end of this
> - * comment. The cpuset code uses only cpuset_rwsem write lock. Other
> - * kernel subsystems can use cpuset_read_lock()/cpuset_read_unlock() to
> - * prevent change to cpuset structures.
> + * comment. The cpuset code uses only cpuset_mutex. Other kernel subsystems
> + * can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
> + * structures.
> *
> * A task must hold both locks to modify cpusets. If a task holds
> - * cpuset_rwsem, it blocks others wanting that rwsem, ensuring that it
> - * is the only task able to also acquire callback_lock and be able to
> - * modify cpusets. It can perform various checks on the cpuset structure
> - * first, knowing nothing will change. It can also allocate memory while
> - * just holding cpuset_rwsem. While it is performing these checks, various
> - * callback routines can briefly acquire callback_lock to query cpusets.
> - * Once it is ready to make the changes, it takes callback_lock, blocking
> - * everyone else.
> + * cpuset_mutex, it blocks others, ensuring that it is the only task able to
> + * also acquire callback_lock and be able to modify cpusets. It can perform
> + * various checks on the cpuset structure first, knowing nothing will change.
> + * It can also allocate memory while just holding cpuset_mutex. While it is
> + * performing these checks, various callback routines can briefly acquire
> + * callback_lock to query cpusets. Once it is ready to make the changes, it
> + * takes callback_lock, blocking everyone else.
> *
> * Calls to the kernel memory allocator can not be made while holding
> * callback_lock, as that would risk double tripping on callback_lock
> @@ -403,16 +402,16 @@ static struct cpuset top_cpuset = {
> * guidelines for accessing subsystem state in kernel/cgroup.c
> */
>
> -DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
> +static DEFINE_MUTEX(cpuset_mutex);
>
> -void cpuset_read_lock(void)
> +void cpuset_lock(void)
> {
> - percpu_down_read(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> }
>
> -void cpuset_read_unlock(void)
> +void cpuset_unlock(void)
> {
> - percpu_up_read(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> }
>
> static DEFINE_SPINLOCK(callback_lock);
> @@ -496,7 +495,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
> * One way or another, we guarantee to return some non-empty subset
> * of cpu_online_mask.
> *
> - * Call with callback_lock or cpuset_rwsem held.
> + * Call with callback_lock or cpuset_mutex held.
> */
> static void guarantee_online_cpus(struct task_struct *tsk,
> struct cpumask *pmask)
> @@ -538,7 +537,7 @@ static void guarantee_online_cpus(struct task_struct *tsk,
> * One way or another, we guarantee to return some non-empty subset
> * of node_states[N_MEMORY].
> *
> - * Call with callback_lock or cpuset_rwsem held.
> + * Call with callback_lock or cpuset_mutex held.
> */
> static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
> {
> @@ -550,7 +549,7 @@ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
> /*
> * update task's spread flag if cpuset's page/slab spread flag is set
> *
> - * Call with callback_lock or cpuset_rwsem held. The check can be skipped
> + * Call with callback_lock or cpuset_mutex held. The check can be skipped
> * if on default hierarchy.
> */
> static void cpuset_update_task_spread_flags(struct cpuset *cs,
> @@ -575,7 +574,7 @@ static void cpuset_update_task_spread_flags(struct cpuset *cs,
> *
> * One cpuset is a subset of another if all its allowed CPUs and
> * Memory Nodes are a subset of the other, and its exclusive flags
> - * are only set if the other's are set. Call holding cpuset_rwsem.
> + * are only set if the other's are set. Call holding cpuset_mutex.
> */
>
> static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
> @@ -713,7 +712,7 @@ static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial)
> * If we replaced the flag and mask values of the current cpuset
> * (cur) with those values in the trial cpuset (trial), would
> * our various subset and exclusive rules still be valid? Presumes
> - * cpuset_rwsem held.
> + * cpuset_mutex held.
> *
> * 'cur' is the address of an actual, in-use cpuset. Operations
> * such as list traversal that depend on the actual address of the
> @@ -829,7 +828,7 @@ static void update_domain_attr_tree(struct sched_domain_attr *dattr,
> rcu_read_unlock();
> }
>
> -/* Must be called with cpuset_rwsem held. */
> +/* Must be called with cpuset_mutex held. */
> static inline int nr_cpusets(void)
> {
> /* jump label reference count + the top-level cpuset */
> @@ -855,7 +854,7 @@ static inline int nr_cpusets(void)
> * domains when operating in the severe memory shortage situations
> * that could cause allocation failures below.
> *
> - * Must be called with cpuset_rwsem held.
> + * Must be called with cpuset_mutex held.
> *
> * The three key local variables below are:
> * cp - cpuset pointer, used (together with pos_css) to perform a
> @@ -1084,7 +1083,7 @@ static void dl_rebuild_rd_accounting(void)
> struct cpuset *cs = NULL;
> struct cgroup_subsys_state *pos_css;
>
> - percpu_rwsem_assert_held(&cpuset_rwsem);
> + lockdep_assert_held(&cpuset_mutex);
> lockdep_assert_cpus_held();
> lockdep_assert_held(&sched_domains_mutex);
>
> @@ -1134,7 +1133,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> * 'cpus' is removed, then call this routine to rebuild the
> * scheduler's dynamic sched domains.
> *
> - * Call with cpuset_rwsem held. Takes cpus_read_lock().
> + * Call with cpuset_mutex held. Takes cpus_read_lock().
> */
> static void rebuild_sched_domains_locked(void)
> {
> @@ -1145,7 +1144,7 @@ static void rebuild_sched_domains_locked(void)
> int ndoms;
>
> lockdep_assert_cpus_held();
> - percpu_rwsem_assert_held(&cpuset_rwsem);
> + lockdep_assert_held(&cpuset_mutex);
>
> /*
> * If we have raced with CPU hotplug, return early to avoid
> @@ -1196,9 +1195,9 @@ static void rebuild_sched_domains_locked(void)
> void rebuild_sched_domains(void)
> {
> cpus_read_lock();
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> rebuild_sched_domains_locked();
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> cpus_read_unlock();
> }
>
> @@ -1208,7 +1207,7 @@ void rebuild_sched_domains(void)
> * @new_cpus: the temp variable for the new effective_cpus mask
> *
> * Iterate through each task of @cs updating its cpus_allowed to the
> - * effective cpuset's. As this function is called with cpuset_rwsem held,
> + * effective cpuset's. As this function is called with cpuset_mutex held,
> * cpuset membership stays stable.
> */
> static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
> @@ -1317,7 +1316,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
> int old_prs, new_prs;
> int part_error = PERR_NONE; /* Partition error? */
>
> - percpu_rwsem_assert_held(&cpuset_rwsem);
> + lockdep_assert_held(&cpuset_mutex);
>
> /*
> * The parent must be a partition root.
> @@ -1540,7 +1539,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
> *
> * On legacy hierarchy, effective_cpus will be the same with cpu_allowed.
> *
> - * Called with cpuset_rwsem held
> + * Called with cpuset_mutex held
> */
> static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
> bool force)
> @@ -1700,7 +1699,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
> struct cpuset *sibling;
> struct cgroup_subsys_state *pos_css;
>
> - percpu_rwsem_assert_held(&cpuset_rwsem);
> + lockdep_assert_held(&cpuset_mutex);
>
> /*
> * Check all its siblings and call update_cpumasks_hier()
> @@ -1942,12 +1941,12 @@ static void *cpuset_being_rebound;
> * @cs: the cpuset in which each task's mems_allowed mask needs to be changed
> *
> * Iterate through each task of @cs updating its mems_allowed to the
> - * effective cpuset's. As this function is called with cpuset_rwsem held,
> + * effective cpuset's. As this function is called with cpuset_mutex held,
> * cpuset membership stays stable.
> */
> static void update_tasks_nodemask(struct cpuset *cs)
> {
> - static nodemask_t newmems; /* protected by cpuset_rwsem */
> + static nodemask_t newmems; /* protected by cpuset_mutex */
> struct css_task_iter it;
> struct task_struct *task;
>
> @@ -1960,7 +1959,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
> * take while holding tasklist_lock. Forks can happen - the
> * mpol_dup() cpuset_being_rebound check will catch such forks,
> * and rebind their vma mempolicies too. Because we still hold
> - * the global cpuset_rwsem, we know that no other rebind effort
> + * the global cpuset_mutex, we know that no other rebind effort
> * will be contending for the global variable cpuset_being_rebound.
> * It's ok if we rebind the same mm twice; mpol_rebind_mm()
> * is idempotent. Also migrate pages in each mm to new nodes.
> @@ -2006,7 +2005,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
> *
> * On legacy hierarchy, effective_mems will be the same with mems_allowed.
> *
> - * Called with cpuset_rwsem held
> + * Called with cpuset_mutex held
> */
> static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
> {
> @@ -2059,7 +2058,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
> * mempolicies and if the cpuset is marked 'memory_migrate',
> * migrate the tasks pages to the new memory.
> *
> - * Call with cpuset_rwsem held. May take callback_lock during call.
> + * Call with cpuset_mutex held. May take callback_lock during call.
> * Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
> * lock each such tasks mm->mmap_lock, scan its vma's and rebind
> * their mempolicies to the cpusets new mems_allowed.
> @@ -2151,7 +2150,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
> * @cs: the cpuset in which each task's spread flags needs to be changed
> *
> * Iterate through each task of @cs updating its spread flags. As this
> - * function is called with cpuset_rwsem held, cpuset membership stays
> + * function is called with cpuset_mutex held, cpuset membership stays
> * stable.
> */
> static void update_tasks_flags(struct cpuset *cs)
> @@ -2171,7 +2170,7 @@ static void update_tasks_flags(struct cpuset *cs)
> * cs: the cpuset to update
> * turning_on: whether the flag is being set or cleared
> *
> - * Call with cpuset_rwsem held.
> + * Call with cpuset_mutex held.
> */
>
> static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
> @@ -2221,7 +2220,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
> * @new_prs: new partition root state
> * Return: 0 if successful, != 0 if error
> *
> - * Call with cpuset_rwsem held.
> + * Call with cpuset_mutex held.
> */
> static int update_prstate(struct cpuset *cs, int new_prs)
> {
> @@ -2445,7 +2444,7 @@ static int fmeter_getrate(struct fmeter *fmp)
>
> static struct cpuset *cpuset_attach_old_cs;
>
> -/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
> +/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
> static int cpuset_can_attach(struct cgroup_taskset *tset)
> {
> struct cgroup_subsys_state *css;
> @@ -2457,7 +2456,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> cs = css_cs(css);
>
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
>
> /* allow moving tasks into an empty cpuset if on default hierarchy */
> ret = -ENOSPC;
> @@ -2487,7 +2486,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> cs->attach_in_progress++;
> ret = 0;
> out_unlock:
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> return ret;
> }
>
> @@ -2497,13 +2496,13 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>
> cgroup_taskset_first(tset, &css);
>
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> css_cs(css)->attach_in_progress--;
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> }
>
> /*
> - * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach()
> + * Protected by cpuset_mutex. cpus_attach is used only by cpuset_attach()
> * but we can't allocate it dynamically there. Define it global and
> * allocate from cpuset_init().
> */
> @@ -2511,7 +2510,7 @@ static cpumask_var_t cpus_attach;
>
> static void cpuset_attach(struct cgroup_taskset *tset)
> {
> - /* static buf protected by cpuset_rwsem */
> + /* static buf protected by cpuset_mutex */
> static nodemask_t cpuset_attach_nodemask_to;
> struct task_struct *task;
> struct task_struct *leader;
> @@ -2524,7 +2523,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> cs = css_cs(css);
>
> lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> cpus_updated = !cpumask_equal(cs->effective_cpus,
> oldcs->effective_cpus);
> mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
> @@ -2597,7 +2596,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
> if (!cs->attach_in_progress)
> wake_up(&cpuset_attach_wq);
>
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> }
>
> /* The various types of files and directories in a cpuset file system */
> @@ -2629,7 +2628,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> int retval = 0;
>
> cpus_read_lock();
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> if (!is_cpuset_online(cs)) {
> retval = -ENODEV;
> goto out_unlock;
> @@ -2665,7 +2664,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> break;
> }
> out_unlock:
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> cpus_read_unlock();
> return retval;
> }
> @@ -2678,7 +2677,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
> int retval = -ENODEV;
>
> cpus_read_lock();
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> if (!is_cpuset_online(cs))
> goto out_unlock;
>
> @@ -2691,7 +2690,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
> break;
> }
> out_unlock:
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> cpus_read_unlock();
> return retval;
> }
> @@ -2724,7 +2723,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> * operation like this one can lead to a deadlock through kernfs
> * active_ref protection. Let's break the protection. Losing the
> * protection is okay as we check whether @cs is online after
> - * grabbing cpuset_rwsem anyway. This only happens on the legacy
> + * grabbing cpuset_mutex anyway. This only happens on the legacy
> * hierarchies.
> */
> css_get(&cs->css);
> @@ -2732,7 +2731,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> flush_work(&cpuset_hotplug_work);
>
> cpus_read_lock();
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> if (!is_cpuset_online(cs))
> goto out_unlock;
>
> @@ -2756,7 +2755,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>
> free_cpuset(trialcs);
> out_unlock:
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> cpus_read_unlock();
> kernfs_unbreak_active_protection(of->kn);
> css_put(&cs->css);
> @@ -2904,13 +2903,13 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
>
> css_get(&cs->css);
> cpus_read_lock();
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> if (!is_cpuset_online(cs))
> goto out_unlock;
>
> retval = update_prstate(cs, val);
> out_unlock:
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> cpus_read_unlock();
> css_put(&cs->css);
> return retval ?: nbytes;
> @@ -3127,7 +3126,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
> return 0;
>
> cpus_read_lock();
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
>
> set_bit(CS_ONLINE, &cs->flags);
> if (is_spread_page(parent))
> @@ -3178,7 +3177,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
> cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
> spin_unlock_irq(&callback_lock);
> out_unlock:
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> cpus_read_unlock();
> return 0;
> }
> @@ -3199,7 +3198,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
> struct cpuset *cs = css_cs(css);
>
> cpus_read_lock();
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
>
> if (is_partition_valid(cs))
> update_prstate(cs, 0);
> @@ -3218,7 +3217,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
> cpuset_dec();
> clear_bit(CS_ONLINE, &cs->flags);
>
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> cpus_read_unlock();
> }
>
> @@ -3231,7 +3230,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
>
> static void cpuset_bind(struct cgroup_subsys_state *root_css)
> {
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> spin_lock_irq(&callback_lock);
>
> if (is_in_v2_mode()) {
> @@ -3244,7 +3243,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
> }
>
> spin_unlock_irq(&callback_lock);
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> }
>
> /*
> @@ -3357,7 +3356,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
> is_empty = cpumask_empty(cs->cpus_allowed) ||
> nodes_empty(cs->mems_allowed);
>
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
>
> /*
> * Move tasks to the nearest ancestor with execution resources,
> @@ -3367,7 +3366,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
> if (is_empty)
> remove_tasks_in_empty_cpuset(cs);
>
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
> }
>
> static void
> @@ -3418,14 +3417,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> retry:
> wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
>
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
>
> /*
> * We have raced with task attaching. We wait until attaching
> * is finished, so we won't attach a task to an empty cpuset.
> */
> if (cs->attach_in_progress) {
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> goto retry;
> }
>
> @@ -3519,7 +3518,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
> cpus_updated, mems_updated);
>
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
> }
>
> /**
> @@ -3549,7 +3548,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
> if (on_dfl && !alloc_cpumasks(NULL, &tmp))
> ptmp = &tmp;
>
> - percpu_down_write(&cpuset_rwsem);
> + mutex_lock(&cpuset_mutex);
>
> /* fetch the available cpus/mems and find out which changed how */
> cpumask_copy(&new_cpus, cpu_active_mask);
> @@ -3606,7 +3605,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
> update_tasks_nodemask(&top_cpuset);
> }
>
> - percpu_up_write(&cpuset_rwsem);
> + mutex_unlock(&cpuset_mutex);
>
> /* if cpus or mems changed, we need to propagate to descendants */
> if (cpus_updated || mems_updated) {
> @@ -4037,7 +4036,7 @@ void __cpuset_memory_pressure_bump(void)
> * - Used for /proc/<pid>/cpuset.
> * - No need to task_lock(tsk) on this tsk->cpuset reference, as it
> * doesn't really matter if tsk->cpuset changes after we read it,
> - * and we take cpuset_rwsem, keeping cpuset_attach() from changing it
> + * and we take cpuset_mutex, keeping cpuset_attach() from changing it
> * anyway.
> */
> int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b9616f153946..179266ff653f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7565,6 +7565,7 @@ static int __sched_setscheduler(struct task_struct *p,
> int reset_on_fork;
> int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
> struct rq *rq;
> + bool cpuset_locked = false;
>
> /* The pi code expects interrupts enabled */
> BUG_ON(pi && in_interrupt());
> @@ -7614,8 +7615,14 @@ static int __sched_setscheduler(struct task_struct *p,
> return retval;
> }
>
> - if (pi)
> - cpuset_read_lock();
> + /*
> + * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
> + * information.
> + */
> + if (dl_policy(policy) || dl_policy(p->policy)) {
> + cpuset_locked = true;
> + cpuset_lock();
> + }
>
> /*
> * Make sure no PI-waiters arrive (or leave) while we are
> @@ -7691,8 +7698,8 @@ static int __sched_setscheduler(struct task_struct *p,
> if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
> policy = oldpolicy = -1;
> task_rq_unlock(rq, p, &rf);
> - if (pi)
> - cpuset_read_unlock();
> + if (cpuset_locked)
> + cpuset_unlock();
> goto recheck;
> }
>
> @@ -7759,7 +7766,8 @@ static int __sched_setscheduler(struct task_struct *p,
> task_rq_unlock(rq, p, &rf);
>
> if (pi) {
> - cpuset_read_unlock();
> + if (cpuset_locked)
> + cpuset_unlock();
> rt_mutex_adjust_pi(p);
> }
>
> @@ -7771,8 +7779,8 @@ static int __sched_setscheduler(struct task_struct *p,
>
> unlock:
> task_rq_unlock(rq, p, &rf);
> - if (pi)
> - cpuset_read_unlock();
> + if (cpuset_locked)
> + cpuset_unlock();
> return retval;
> }
>
> --
> 2.39.2
>

2023-04-04 20:09:29

by Qais Yousef

Subject: Re: [PATCH 1/6] cgroup/cpuset: Rename functions dealing with DEADLINE accounting

On 03/29/23 14:55, Juri Lelli wrote:
> rebuild_root_domains() and update_tasks_root_domain() have neutral
> names, but actually deal with DEADLINE bandwidth accounting.
>
> Rename them to use 'dl_' prefix so that intent is more clear.
>
> No functional change.
>
> Suggested-by: Qais Yousef <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> ---

Reviewed-by: Qais Yousef <[email protected]>
Tested-by: Qais Yousef <[email protected]>


Thanks!

--
Qais Yousef

> kernel/cgroup/cpuset.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 636f1c682ac0..501913bc2805 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1066,7 +1066,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
> return ndoms;
> }
>
> -static void update_tasks_root_domain(struct cpuset *cs)
> +static void dl_update_tasks_root_domain(struct cpuset *cs)
> {
> struct css_task_iter it;
> struct task_struct *task;
> @@ -1079,7 +1079,7 @@ static void update_tasks_root_domain(struct cpuset *cs)
> css_task_iter_end(&it);
> }
>
> -static void rebuild_root_domains(void)
> +static void dl_rebuild_rd_accounting(void)
> {
> struct cpuset *cs = NULL;
> struct cgroup_subsys_state *pos_css;
> @@ -1107,7 +1107,7 @@ static void rebuild_root_domains(void)
>
> rcu_read_unlock();
>
> - update_tasks_root_domain(cs);
> + dl_update_tasks_root_domain(cs);
>
> rcu_read_lock();
> css_put(&cs->css);
> @@ -1121,7 +1121,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> {
> mutex_lock(&sched_domains_mutex);
> partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
> - rebuild_root_domains();
> + dl_rebuild_rd_accounting();
> mutex_unlock(&sched_domains_mutex);
> }
>
> --
> 2.39.2
>

2023-04-04 20:09:43

by Qais Yousef

Subject: Re: [PATCH 3/6] sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets

On 03/29/23 14:55, Juri Lelli wrote:
> Qais reported that iterating over all tasks when rebuilding root domains
> for finding out which ones are DEADLINE and need their bandwidth
> correctly restored on such root domains can be a costly operation (10+
> ms delays on suspend-resume).
>
> To fix the problem keep track of the number of DEADLINE tasks belonging
> to each cpuset and then use this information (followup patch) to only
> perform the above iteration if DEADLINE tasks are actually present in
> the cpuset for which a corresponding root domain is being rebuilt.
>
> Reported-by: Qais Yousef <[email protected]>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Juri Lelli <[email protected]>
> ---

Reviewed-by: Qais Yousef <[email protected]>
Tested-by: Qais Yousef <[email protected]>


Thanks!

--
Qais Yousef

> include/linux/cpuset.h | 4 ++++
> kernel/cgroup/cgroup.c | 4 ++++
> kernel/cgroup/cpuset.c | 25 +++++++++++++++++++++++++
> kernel/sched/deadline.c | 14 ++++++++++++++
> 4 files changed, 47 insertions(+)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 355f796c5f07..0348dba5680e 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -71,6 +71,8 @@ extern void cpuset_init_smp(void);
> extern void cpuset_force_rebuild(void);
> extern void cpuset_update_active_cpus(void);
> extern void cpuset_wait_for_hotplug(void);
> +extern void inc_dl_tasks_cs(struct task_struct *task);
> +extern void dec_dl_tasks_cs(struct task_struct *task);
> extern void cpuset_lock(void);
> extern void cpuset_unlock(void);
> extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
> @@ -196,6 +198,8 @@ static inline void cpuset_update_active_cpus(void)
>
> static inline void cpuset_wait_for_hotplug(void) { }
>
> +static inline void inc_dl_tasks_cs(struct task_struct *task) { }
> +static inline void dec_dl_tasks_cs(struct task_struct *task) { }
> static inline void cpuset_lock(void) { }
> static inline void cpuset_unlock(void) { }
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 935e8121b21e..ff27b2d2bf0b 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -57,6 +57,7 @@
> #include <linux/file.h>
> #include <linux/fs_parser.h>
> #include <linux/sched/cputime.h>
> +#include <linux/sched/deadline.h>
> #include <linux/psi.h>
> #include <net/sock.h>
>
> @@ -6673,6 +6674,9 @@ void cgroup_exit(struct task_struct *tsk)
> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
> cset->nr_tasks--;
>
> + if (dl_task(tsk))
> + dec_dl_tasks_cs(tsk);
> +
> WARN_ON_ONCE(cgroup_task_frozen(tsk));
> if (unlikely(!(tsk->flags & PF_KTHREAD) &&
> test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index fbc10b494292..eb0854ef9757 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -193,6 +193,12 @@ struct cpuset {
> int use_parent_ecpus;
> int child_ecpus_count;
>
> + /*
> + * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
> + * know when to rebuild associated root domain bandwidth information.
> + */
> + int nr_deadline_tasks;
> +
> /* Invalid partition error code, not lock protected */
> enum prs_errcode prs_err;
>
> @@ -245,6 +251,20 @@ static inline struct cpuset *parent_cs(struct cpuset *cs)
> return css_cs(cs->css.parent);
> }
>
> +void inc_dl_tasks_cs(struct task_struct *p)
> +{
> + struct cpuset *cs = task_cs(p);
> +
> + cs->nr_deadline_tasks++;
> +}
> +
> +void dec_dl_tasks_cs(struct task_struct *p)
> +{
> + struct cpuset *cs = task_cs(p);
> +
> + cs->nr_deadline_tasks--;
> +}
> +
> /* bits in struct cpuset flags field */
> typedef enum {
> CS_ONLINE,
> @@ -2477,6 +2497,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> ret = security_task_setscheduler(task);
> if (ret)
> goto out_unlock;
> +
> + if (dl_task(task)) {
> + cs->nr_deadline_tasks++;
> + cpuset_attach_old_cs->nr_deadline_tasks--;
> + }
> }
>
> /*
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 4cc7e1ca066d..8f92f0f87383 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -16,6 +16,8 @@
> * Fabio Checconi <[email protected]>
> */
>
> +#include <linux/cpuset.h>
> +
> /*
> * Default limits for DL period; on the top end we guard against small util
> * tasks still getting ridiculously long effective runtimes, on the bottom end we
> @@ -2595,6 +2597,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> if (task_on_rq_queued(p) && p->dl.dl_runtime)
> task_non_contending(p);
>
> + /*
> + * In case a task is setscheduled out from SCHED_DEADLINE we need to
> + * keep track of that on its cpuset (for correct bandwidth tracking).
> + */
> + dec_dl_tasks_cs(p);
> +
> if (!task_on_rq_queued(p)) {
> /*
> * Inactive timer is armed. However, p is leaving DEADLINE and
> @@ -2635,6 +2643,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
> if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> put_task_struct(p);
>
> + /*
> + * In case a task is setscheduled to SCHED_DEADLINE we need to keep
> + * track of that on its cpuset (for correct bandwidth tracking).
> + */
> + inc_dl_tasks_cs(p);
> +
> /* If p is not queued we will update its parameters at next wakeup. */
> if (!task_on_rq_queued(p)) {
> add_rq_bw(&p->dl, &rq->dl);
> --
> 2.39.2
>

2023-04-04 20:09:56

by Qais Yousef

Subject: Re: [PATCH 6/6] cgroup/cpuset: Iterate only if DEADLINE tasks are present

On 03/29/23 14:55, Juri Lelli wrote:
> update_tasks_root_domain currently iterates over all tasks even if no
> DEADLINE task is present on the cpuset/root domain for which bandwidth
> accounting is being rebuilt. This has been reported to introduce 10+ ms
> delays on suspend-resume operations.
>
> Skip the costly iteration for cpusets that don't contain DEADLINE tasks.
>
> Reported-by: Qais Yousef <[email protected]>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Juri Lelli <[email protected]>
> ---

Wouldn't this be better placed as patch 4? The two fixes from Dietmar look
orthogonal to the accounting problem to me. But it seems the whole lot needs to
go to stable anyway, so it's good to keep them together. Should Dietmar's fixes
be at the end instead of this one?

Anyways.

Reviewed-by: Qais Yousef <[email protected]>
Tested-by: Qais Yousef <[email protected]>


Thanks

--
Qais Yousef

> kernel/cgroup/cpuset.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index f8ebec66da51..05c0a1255218 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1092,6 +1092,9 @@ static void dl_update_tasks_root_domain(struct cpuset *cs)
> struct css_task_iter it;
> struct task_struct *task;
>
> + if (cs->nr_deadline_tasks == 0)
> + return;
> +
> css_task_iter_start(&cs->css, 0, &it);
>
> while ((task = css_task_iter_next(&it)))
> --
> 2.39.2
>

2023-04-04 20:11:41

by Qais Yousef

Subject: Re: [PATCH 0/6] sched/deadline: cpuset: Rework DEADLINE bandwidth restoration

On 03/29/23 14:55, Juri Lelli wrote:
> Qais reported [1] that iterating over all tasks when rebuilding root
> domains for finding out which ones are DEADLINE and need their bandwidth
> correctly restored on such root domains can be a costly operation (10+
> ms delays on suspend-resume). He proposed we skip rebuilding root
> domains for certain operations, but that approach seemed arch specific
> and possibly prone to errors, as paths that ultimately trigger a rebuild
> might be quite convoluted (thanks Qais for spending time on this!).
>
> To fix the problem
>
> 01/06 - Rename functions deadline with DEADLINE accounting (cleanup
> suggested by Qais) - no functional change
> 02/06 - Bring back cpuset_mutex (so that we have write access to cpusets
> from scheduler operations - and we also fix some problems
> associated to percpu_cpuset_rwsem)
> 03/06 - Keep track of the number of DEADLINE tasks belonging to each cpuset
> 04/06 - Create DL BW alloc, free & check overflow interface for bulk
> bandwidth allocation/removal - no functional change
> 05/06 - Fix bandwidth allocation handling for cgroup operation
> involving multiple tasks
> 06/06 - Use this information to only perform the costly iteration if
> DEADLINE tasks are actually present in the cpuset for which a
> corresponding root domain is being rebuilt
>
> With respect to the RFC posting [2]
>
> 1 - rename DEADLINE bandwidth accounting functions - Qais
> 2 - call inc/dec_dl_tasks_cs from switched_{to,from}_dl - Qais
> 3 - fix DEADLINE bandwidth allocation with multiple tasks - Waiman,
> contributed by Dietmar
>
> This set is also available from
>
> https://github.com/jlelli/linux.git deadline/rework-cpusets

Thanks a lot Juri!

I picked up the updated series, applied it to a 5.10 kernel and verified that
the issue is fixed. I have already replied with my reviewed-and-tested-bys to
some of the patches.

I haven't looked much at Dietmar's patches; while they were part of the test,
there are no dl tasks on the system, so I felt hesitant to say I tested that
part.


Cheers

--
Qais Yousef

2023-04-18 14:14:31

by Qais Yousef

Subject: Re: [PATCH 0/6] sched/deadline: cpuset: Rework DEADLINE bandwidth restoration

On 03/29/23 14:55, Juri Lelli wrote:
> Qais reported [1] that iterating over all tasks when rebuilding root
> domains for finding out which ones are DEADLINE and need their bandwidth
> correctly restored on such root domains can be a costly operation (10+
> ms delays on suspend-resume). He proposed we skip rebuilding root
> domains for certain operations, but that approach seemed arch specific
> and possibly prone to errors, as paths that ultimately trigger a rebuild
> might be quite convoluted (thanks Qais for spending time on this!).
>
> To fix the problem
>
> 01/06 - Rename functions deadline with DEADLINE accounting (cleanup
> suggested by Qais) - no functional change
> 02/06 - Bring back cpuset_mutex (so that we have write access to cpusets
> from scheduler operations - and we also fix some problems
> associated to percpu_cpuset_rwsem)
> 03/06 - Keep track of the number of DEADLINE tasks belonging to each cpuset
> 04/06 - Create DL BW alloc, free & check overflow interface for bulk
> bandwidth allocation/removal - no functional change
> 05/06 - Fix bandwidth allocation handling for cgroup operation
> involving multiple tasks
> 06/06 - Use this information to only perform the costly iteration if
> DEADLINE tasks are actually present in the cpuset for which a
> corresponding root domain is being rebuilt
>
> With respect to the RFC posting [2]
>
> 1 - rename DEADLINE bandwidth accounting functions - Qais
> 2 - call inc/dec_dl_tasks_cs from switched_{to,from}_dl - Qais
> 3 - fix DEADLINE bandwidth allocation with multiple tasks - Waiman,
> contributed by Dietmar
>
> This set is also available from
>
> https://github.com/jlelli/linux.git deadline/rework-cpusets

Is this just waiting to be picked up, or is there still something to be
addressed?


Thanks!

--
Qais Yousef

2023-04-18 14:44:52

by Waiman Long

Subject: Re: [PATCH 0/6] sched/deadline: cpuset: Rework DEADLINE bandwidth restoration

On 4/18/23 10:11, Qais Yousef wrote:
> On 03/29/23 14:55, Juri Lelli wrote:
>> Qais reported [1] that iterating over all tasks when rebuilding root
>> domains for finding out which ones are DEADLINE and need their bandwidth
>> correctly restored on such root domains can be a costly operation (10+
>> ms delays on suspend-resume). He proposed we skip rebuilding root
>> domains for certain operations, but that approach seemed arch specific
>> and possibly prone to errors, as paths that ultimately trigger a rebuild
>> might be quite convoluted (thanks Qais for spending time on this!).
>>
>> To fix the problem
>>
>> 01/06 - Rename functions deadline with DEADLINE accounting (cleanup
>> suggested by Qais) - no functional change
>> 02/06 - Bring back cpuset_mutex (so that we have write access to cpusets
>> from scheduler operations - and we also fix some problems
>> associated to percpu_cpuset_rwsem)
>> 03/06 - Keep track of the number of DEADLINE tasks belonging to each cpuset
>> 04/06 - Create DL BW alloc, free & check overflow interface for bulk
>> bandwidth allocation/removal - no functional change
>> 05/06 - Fix bandwidth allocation handling for cgroup operation
>> involving multiple tasks
>> 06/06 - Use this information to only perform the costly iteration if
>> DEADLINE tasks are actually present in the cpuset for which a
>> corresponding root domain is being rebuilt
>>
>> With respect to the RFC posting [2]
>>
>> 1 - rename DEADLINE bandwidth accounting functions - Qais
>> 2 - call inc/dec_dl_tasks_cs from switched_{to,from}_dl - Qais
>> 3 - fix DEADLINE bandwidth allocation with multiple tasks - Waiman,
>> contributed by Dietmar
>>
>> This set is also available from
>>
>> https://github.com/jlelli/linux.git deadline/rework-cpusets
> Is this just waiting to be picked up or still there's something to be addressed
> still?

There have been some changes to the cpuset code recently, so I believe this
patch series may need to be refreshed to reconcile them.

Thanks,
Longman

2023-04-20 14:20:21

by Juri Lelli

Subject: Re: [PATCH 0/6] sched/deadline: cpuset: Rework DEADLINE bandwidth restoration

On 18/04/23 10:31, Waiman Long wrote:
> On 4/18/23 10:11, Qais Yousef wrote:
> > On 03/29/23 14:55, Juri Lelli wrote:
> > > Qais reported [1] that iterating over all tasks when rebuilding root
> > > domains for finding out which ones are DEADLINE and need their bandwidth
> > > correctly restored on such root domains can be a costly operation (10+
> > > ms delays on suspend-resume). He proposed we skip rebuilding root
> > > domains for certain operations, but that approach seemed arch specific
> > > and possibly prone to errors, as paths that ultimately trigger a rebuild
> > > might be quite convoluted (thanks Qais for spending time on this!).
> > >
> > > To fix the problem
> > >
> > > 01/06 - Rename functions deadline with DEADLINE accounting (cleanup
> > > suggested by Qais) - no functional change
> > > 02/06 - Bring back cpuset_mutex (so that we have write access to cpusets
> > > from scheduler operations - and we also fix some problems
> > > associated to percpu_cpuset_rwsem)
> > > 03/06 - Keep track of the number of DEADLINE tasks belonging to each cpuset
> > > 04/06 - Create DL BW alloc, free & check overflow interface for bulk
> > > bandwidth allocation/removal - no functional change
> > > 05/06 - Fix bandwidth allocation handling for cgroup operation
> > > involving multiple tasks
> > > 06/06 - Use this information to only perform the costly iteration if
> > > DEADLINE tasks are actually present in the cpuset for which a
> > > corresponding root domain is being rebuilt
> > >
> > > With respect to the RFC posting [2]
> > >
> > > 1 - rename DEADLINE bandwidth accounting functions - Qais
> > > 2 - call inc/dec_dl_tasks_cs from switched_{to,from}_dl - Qais
> > > 3 - fix DEADLINE bandwidth allocation with multiple tasks - Waiman,
> > > contributed by Dietmar
> > >
> > > This set is also available from
> > >
> > > https://github.com/jlelli/linux.git deadline/rework-cpusets
> > Is this just waiting to be picked up or still there's something to be addressed
> > still?
>
> There are some changes to cpuset code recently and so I believe that this
> patch series may need to be refreshed to reconcile the changes.

Yeah, will soon take a look.

Thanks!
Juri

2023-04-26 12:04:50

by Juri Lelli

Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

On 04/04/23 13:31, Waiman Long wrote:
> On 3/29/23 08:55, Juri Lelli wrote:
> > Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset:
> > Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
> > as it has been reported to cause slowdowns in workloads that need to
> > change cpuset configuration frequently and it is also not implementing
> > priority inheritance (which causes troubles with realtime workloads).
> >
> > Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
> > only for SCHED_DEADLINE tasks (other policies don't care about stable
> > cpusets anyway).
> >
> > Signed-off-by: Juri Lelli <[email protected]>
>
> I am thinking that maybe we should switch the percpu rwsem to a regular
> rwsem as there are cases where a read lock is sufficient. This will also
> avoid the potential PREEMPT_RT problem with PI and reduce the time it needs
> to take a write lock.

I'm not a big fan of rwsems for reasons like
https://lore.kernel.org/lkml/[email protected]/, so
I'd vote for a standard mutex unless we have a strong argument and/or
numbers.

Thanks!
Juri

2023-04-26 12:06:38

by Juri Lelli

Subject: Re: [PATCH 6/6] cgroup/cpuset: Iterate only if DEADLINE tasks are present

On 04/04/23 21:06, Qais Yousef wrote:
> On 03/29/23 14:55, Juri Lelli wrote:
> > update_tasks_root_domain currently iterates over all tasks even if no
> > DEADLINE task is present on the cpuset/root domain for which bandwidth
> > accounting is being rebuilt. This has been reported to introduce 10+ ms
> > delays on suspend-resume operations.
> >
> > Skip the costly iteration for cpusets that don't contain DEADLINE tasks.
> >
> > Reported-by: Qais Yousef <[email protected]>
> > Link: https://lore.kernel.org/lkml/[email protected]/
> > Signed-off-by: Juri Lelli <[email protected]>
> > ---
>
> Wouldn't this be better placed as patch 4? The two fixes from Dietmar look
> orthogonal to me to the accounting problem. But it seems the whole lot needs to
> go to stable anyway, so good to keep them together. Should Dietmar fixes be at
> the end instead of this?

That should be an easy fix indeed. Dietmar is working on refreshing
his bits.

Thanks for your reviews and tests Qais!

Best,
Juri

2023-04-26 14:19:09

by Waiman Long

Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

On 4/26/23 07:57, Juri Lelli wrote:
> On 04/04/23 13:31, Waiman Long wrote:
>> On 3/29/23 08:55, Juri Lelli wrote:
>>> Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset:
>>> Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
>>> as it has been reported to cause slowdowns in workloads that need to
>>> change cpuset configuration frequently and it is also not implementing
>>> priority inheritance (which causes troubles with realtime workloads).
>>>
>>> Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
>>> only for SCHED_DEADLINE tasks (other policies don't care about stable
>>> cpusets anyway).
>>>
>>> Signed-off-by: Juri Lelli <[email protected]>
>> I am thinking that maybe we should switch the percpu rwsem to a regular
>> rwsem as there are cases where a read lock is sufficient. This will also
>> avoid the potential PREEMPT_RT problem with PI and reduce the time it needs
>> to take a write lock.
> I'm not a big fan of rwsems for reasons like
> https://lore.kernel.org/lkml/[email protected]/, so
> I'd vote for a standard mutex unless we have a strong argument and/or
> numbers.

That is fine for me too.

Cheers,
Longman

Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

On 4/26/23 13:57, Juri Lelli wrote:
> On 04/04/23 13:31, Waiman Long wrote:
>> On 3/29/23 08:55, Juri Lelli wrote:
>>> Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset:
>>> Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
>>> as it has been reported to cause slowdowns in workloads that need to
>>> change cpuset configuration frequently and it is also not implementing
>>> priority inheritance (which causes troubles with realtime workloads).
>>>
>>> Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
>>> only for SCHED_DEADLINE tasks (other policies don't care about stable
>>> cpusets anyway).
>>>
>>> Signed-off-by: Juri Lelli <[email protected]>
>> I am thinking that maybe we should switch the percpu rwsem to a regular
>> rwsem as there are cases where a read lock is sufficient. This will also
>> avoid the potential PREEMPT_RT problem with PI and reduce the time it needs
>> to take a write lock.
> I'm not a big fan of rwsems for reasons like
> https://lore.kernel.org/lkml/[email protected]/, so
> I'd vote for a standard mutex unless we have a strong argument and/or
> numbers.

+1

-- Daniel

2023-04-27 03:05:47

by Xuewen Yan

Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

HI Juri,

Would this patch be merged into stable-rc? On kernel 5.15, we also find that
the rwsem can be blocked for a long time when we change a task's cpuset
cgroup. When we revert to the mutex, the delay disappears.

BR
Thanks!

On Wed, Apr 26, 2023 at 10:50 PM Waiman Long <[email protected]> wrote:
>
> On 4/26/23 07:57, Juri Lelli wrote:
> > On 04/04/23 13:31, Waiman Long wrote:
> >> On 3/29/23 08:55, Juri Lelli wrote:
> >>> Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset:
> >>> Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
> >>> as it has been reported to cause slowdowns in workloads that need to
> >>> change cpuset configuration frequently and it is also not implementing
> >>> priority inheritance (which causes troubles with realtime workloads).
> >>>
> >>> Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
> >>> only for SCHED_DEADLINE tasks (other policies don't care about stable
> >>> cpusets anyway).
> >>>
> >>> Signed-off-by: Juri Lelli <[email protected]>
> >> I am thinking that maybe we should switch the percpu rwsem to a regular
> >> rwsem as there are cases where a read lock is sufficient. This will also
> >> avoid the potential PREEMPT_RT problem with PI and reduce the time it needs
> >> to take a write lock.
> > I'm not a big fan of rwsems for reasons like
> > https://lore.kernel.org/lkml/[email protected]/, so
> > I'd vote for a standard mutex unless we have a strong argument and/or
> > numbers.
>
> That is fine for me too.
>
> Cheers,
> Longman
>

2023-04-27 05:57:27

by Juri Lelli

Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

Hi,

On 27/04/23 10:58, Xuewen Yan wrote:
> HI Juri,
>
> Would this patch be merged tobe stable-rc? In kernel5.15, we also find
> that the rwsem would be blocked for a long time, when we change the
> task's cpuset cgroup.
> And when we revert to the mutex, the delay would disappear.

Honestly, I'm not sure. This change is mostly improving performance, but
it is also true that it's fixing some priority inheritance corner cases.
So, I'm not sure it qualifies for stable, but it would probably be good to
have it there.

Thanks,
Juri

2023-04-27 12:10:49

by Xuewen Yan

Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

On Thu, Apr 27, 2023 at 1:53 PM Juri Lelli <[email protected]> wrote:
>
> Hi,
>
> On 27/04/23 10:58, Xuewen Yan wrote:
> > Hi Juri,
> >
> > Would this patch be merged into stable-rc? In kernel 5.15, we also
> > found that the rwsem can be blocked for a long time when we change a
> > task's cpuset cgroup.
> > When we revert to the mutex, the delay disappears.
>
> Honestly, I'm not sure. This change is mostly improving performance, but
> it is also true that it's fixing some priority inheritance corner cases.
> So, I'm not sure it qualifies for stable, but it would probably be good to
> have it there.
>
Dear Juri,

Thanks for the reply! We will test more on kernel 5.15.

Thanks!
BR

> Thanks,
> Juri
>

2023-04-28 11:28:18

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 2/6] sched/cpuset: Bring back cpuset_mutex

On 04/27/23 07:53, Juri Lelli wrote:
> Hi,
>
> On 27/04/23 10:58, Xuewen Yan wrote:
> > Hi Juri,
> >
> > Would this patch be merged into stable-rc? In kernel 5.15, we also
> > found that the rwsem can be blocked for a long time when we change a
> > task's cpuset cgroup.
> > When we revert to the mutex, the delay disappears.
>
> Honestly, I'm not sure. This change is mostly improving performance, but
> it is also true that it's fixing some priority inheritance corner cases.
> So, I'm not sure it qualifies for stable, but it would probably be good to
> have it there.

I'm under the impression we need the whole lot back to stable, no? I'm not sure
if we can decouple this patch from the rest.

FWIW, I did my testing on 5.15 - so we can definitely help with the backport
and testing for 5.15 and 5.10.


Thanks!

--
Qais Yousef

2023-10-09 11:43:40

by Xia Fukun

[permalink] [raw]
Subject: Re: [PATCH 3/6] sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets

On 2023/3/29 20:55, Juri Lelli wrote:

> To fix the problem keep track of the number of DEADLINE tasks belonging
> to each cpuset and then use this information (followup patch) to only
> perform the above iteration if DEADLINE tasks are actually present in
> the cpuset for which a corresponding root domain is being rebuilt.
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 935e8121b21e..ff27b2d2bf0b 100644

> @@ -6673,6 +6674,9 @@ void cgroup_exit(struct task_struct *tsk)
> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
> cset->nr_tasks--;
>
> + if (dl_task(tsk))
> + dec_dl_tasks_cs(tsk);
> +
> WARN_ON_ONCE(cgroup_task_frozen(tsk));
> if (unlikely(!(tsk->flags & PF_KTHREAD) &&
> test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))


The cgroup_exit() function decrements nr_deadline_tasks by one.


> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index fbc10b494292..eb0854ef9757 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -193,6 +193,12 @@ struct cpuset {
> + /*
> + * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
> + * know when to rebuild associated root domain bandwidth information.
> + */
> + int nr_deadline_tasks;
> +

> +void inc_dl_tasks_cs(struct task_struct *p)
> +{
> + struct cpuset *cs = task_cs(p);
> +
> + cs->nr_deadline_tasks++;
> +}
> +
> +void dec_dl_tasks_cs(struct task_struct *p)
> +{
> + struct cpuset *cs = task_cs(p);
> +
> + cs->nr_deadline_tasks--;
> +}
> +

> @@ -2477,6 +2497,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> ret = security_task_setscheduler(task);
> if (ret)
> goto out_unlock;
> +
> + if (dl_task(task)) {
> + cs->nr_deadline_tasks++;
> + cpuset_attach_old_cs->nr_deadline_tasks--;
> + }
> }


The cpuset_can_attach() function increments nr_deadline_tasks by one.


> /*
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 4cc7e1ca066d..8f92f0f87383 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -16,6 +16,8 @@
> * Fabio Checconi <[email protected]>
> */
>
> +#include <linux/cpuset.h>
> +
> /*
> * Default limits for DL period; on the top end we guard against small util
> * tasks still getting ridiculously long effective runtimes, on the bottom end we
> @@ -2595,6 +2597,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> if (task_on_rq_queued(p) && p->dl.dl_runtime)
> task_non_contending(p);
>
> + /*
> + * In case a task is setscheduled out from SCHED_DEADLINE we need to
> + * keep track of that on its cpuset (for correct bandwidth tracking).
> + */
> + dec_dl_tasks_cs(p);
> +
> if (!task_on_rq_queued(p)) {
> /*
> * Inactive timer is armed. However, p is leaving DEADLINE and
> @@ -2635,6 +2643,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
> if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> put_task_struct(p);
>
> + /*
> + * In case a task is setscheduled to SCHED_DEADLINE we need to keep
> + * track of that on its cpuset (for correct bandwidth tracking).
> + */
> + inc_dl_tasks_cs(p);
> +
> /* If p is not queued we will update its parameters at next wakeup. */
> if (!task_on_rq_queued(p)) {
> add_rq_bw(&p->dl, &rq->dl);


And both switched_from_dl() and switched_to_dl() can change the value of
nr_deadline_tasks.

I suspect that updating nr_deadline_tasks in these four paths can cause
data races.

And this patch ([PATCH 6/6] cgroup/cpuset: Iterate only if DEADLINE tasks are present)
has the following judgment:

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f8ebec66da51..05c0a1255218 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1092,6 +1092,9 @@ static void dl_update_tasks_root_domain(struct cpuset *cs)
struct css_task_iter it;
struct task_struct *task;

+ if (cs->nr_deadline_tasks == 0)
+ return;
+
css_task_iter_start(&cs->css, 0, &it);

while ((task = css_task_iter_next(&it)))
--


An inconsistent nr_deadline_tasks value can lead to logical problems
with this check.

May I ask what the experts think of this data race problem?

If there is indeed a problem, is it necessary to use atomic operations
to avoid it?

Thank you all.

2023-10-09 15:27:51

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 3/6] sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets


On 10/9/23 07:43, Xia Fukun wrote:
> On 2023/3/29 20:55, Juri Lelli wrote:
>
>> To fix the problem keep track of the number of DEADLINE tasks belonging
>> to each cpuset and then use this information (followup patch) to only
>> perform the above iteration if DEADLINE tasks are actually present in
>> the cpuset for which a corresponding root domain is being rebuilt.
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 935e8121b21e..ff27b2d2bf0b 100644
>> @@ -6673,6 +6674,9 @@ void cgroup_exit(struct task_struct *tsk)
>> list_add_tail(&tsk->cg_list, &cset->dying_tasks);
>> cset->nr_tasks--;
>>
>> + if (dl_task(tsk))
>> + dec_dl_tasks_cs(tsk);
>> +
>> WARN_ON_ONCE(cgroup_task_frozen(tsk));
>> if (unlikely(!(tsk->flags & PF_KTHREAD) &&
>> test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
>
> The cgroup_exit() function decrements nr_deadline_tasks by one.
>
>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index fbc10b494292..eb0854ef9757 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -193,6 +193,12 @@ struct cpuset {
>> + /*
>> + * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
>> + * know when to rebuild associated root domain bandwidth information.
>> + */
>> + int nr_deadline_tasks;
>> +
>> +void inc_dl_tasks_cs(struct task_struct *p)
>> +{
>> + struct cpuset *cs = task_cs(p);
>> +
>> + cs->nr_deadline_tasks++;
>> +}
>> +
>> +void dec_dl_tasks_cs(struct task_struct *p)
>> +{
>> + struct cpuset *cs = task_cs(p);
>> +
>> + cs->nr_deadline_tasks--;
>> +}
>> +
>> @@ -2477,6 +2497,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>> ret = security_task_setscheduler(task);
>> if (ret)
>> goto out_unlock;
>> +
>> + if (dl_task(task)) {
>> + cs->nr_deadline_tasks++;
>> + cpuset_attach_old_cs->nr_deadline_tasks--;
>> + }
>> }
>
> The cpuset_can_attach() function increments nr_deadline_tasks by one.
>
>
>> /*
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 4cc7e1ca066d..8f92f0f87383 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -16,6 +16,8 @@
>> * Fabio Checconi <[email protected]>
>> */
>>
>> +#include <linux/cpuset.h>
>> +
>> /*
>> * Default limits for DL period; on the top end we guard against small util
>> * tasks still getting ridiculously long effective runtimes, on the bottom end we
>> @@ -2595,6 +2597,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>> if (task_on_rq_queued(p) && p->dl.dl_runtime)
>> task_non_contending(p);
>>
>> + /*
>> + * In case a task is setscheduled out from SCHED_DEADLINE we need to
>> + * keep track of that on its cpuset (for correct bandwidth tracking).
>> + */
>> + dec_dl_tasks_cs(p);
>> +
>> if (!task_on_rq_queued(p)) {
>> /*
>> * Inactive timer is armed. However, p is leaving DEADLINE and
>> @@ -2635,6 +2643,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
>> if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
>> put_task_struct(p);
>>
>> + /*
>> + * In case a task is setscheduled to SCHED_DEADLINE we need to keep
>> + * track of that on its cpuset (for correct bandwidth tracking).
>> + */
>> + inc_dl_tasks_cs(p);
>> +
>> /* If p is not queued we will update its parameters at next wakeup. */
>> if (!task_on_rq_queued(p)) {
>> add_rq_bw(&p->dl, &rq->dl);
>
> And both switched_from_dl() and switched_to_dl() can change the value of
> nr_deadline_tasks.
>
> I suspect that updating nr_deadline_tasks in these four paths can cause
> data races.
>
> And this patch ([PATCH 6/6] cgroup/cpuset: Iterate only if DEADLINE tasks are present)
> has the following judgment:
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index f8ebec66da51..05c0a1255218 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1092,6 +1092,9 @@ static void dl_update_tasks_root_domain(struct cpuset *cs)
> struct css_task_iter it;
> struct task_struct *task;
>
> + if (cs->nr_deadline_tasks == 0)
> + return;
> +
> css_task_iter_start(&cs->css, 0, &it);
>
> while ((task = css_task_iter_next(&it)))
> --
>
>
> An inconsistent nr_deadline_tasks value can lead to logical problems
> with this check.
>
> May I ask what the experts think of this data race problem?
>
> If there is indeed a problem, is it necessary to use atomic operations
> to avoid it?

It does look like nr_deadline_tasks can be subject to data races leading
to an incorrect value. Changing it to atomic_t should avoid that at the
expense of a bit higher overhead.

Cheers,
Longman