2018-09-07 21:55:41

by Jan H. Schönherr

Subject: [RFC 00/60] Coscheduling for Linux

This patch series extends CFS with support for coscheduling. The
implementation is versatile enough to cover many different coscheduling
use cases, while at the same time being non-intrusive, so that the
behavior of legacy workloads does not change.

Peter Zijlstra once called coscheduling a "scalability nightmare waiting to
happen". Well, with this patch series, coscheduling certainly happened.
However, I disagree on the scalability nightmare. :)

In the remainder of this email, you will find:

A) Quickstart guide for the impatient.
B) Why would I want this?
C) How does it work?
D) What can I *not* do with this?
E) What's the overhead?
F) High-level overview of the patches in this series.

Regards
Jan


A) Quickstart guide for the impatient.
--------------------------------------

Here is a quickstart guide to set up coscheduling at core-level for
selected tasks on an SMT-capable system:

1. Apply the patch series to v4.19-rc2.
2. Compile with "CONFIG_COSCHEDULING=y".
3. Boot into the newly built kernel with an additional kernel command line
argument "cosched_max_level=1" to enable coscheduling up to core-level.
4. Create one or more cgroups and set their "cpu.scheduled" to "1".
5. Put tasks into the created cgroups and set their affinity explicitly.
6. Enjoy tasks of the same group on the same core executing
simultaneously whenever they execute.

You are not restricted to coscheduling at core level. Just select higher
numbers in steps 3 and 4. See further below for more information,
especially if you want to try higher numbers on larger systems.

Setting affinity explicitly for tasks within coscheduled cgroups is
currently necessary, as the load balancing portion is still missing in this
series.


B) Why would I want this?
-------------------------

Coscheduling can be useful for many different use cases. Here is an
incomplete (very condensed) list:

1. Execute parallel applications that rely on active waiting or synchronous
execution concurrently with other applications.

The prime example in this class is probably virtual machines. Here,
coscheduling is an alternative to paravirtualized spinlocks, pause-loop
exiting, and other techniques, with its own set of advantages and
disadvantages compared to those approaches.

2. Execute parallel applications with architecture-specific optimizations
concurrently with other applications.

For example, a coscheduled application has the (usually) shared cache to
itself while it is executing. This keeps various cache-optimization
techniques effective in the face of other load, making coscheduling an
alternative to other cache partitioning techniques.

3. Reduce resource contention between independent applications.

This is probably one of the most researched use cases in recent years:
if we can derive subsets of tasks whose members don't interfere much
with each other when executed in parallel, then coscheduling can be used
to realize this more efficient schedule. And "resource" is a really
loose term here, ranging from execution units in an SMT system, through
cache pressure and memory bandwidth, to a processor's power budget and
the resulting frequency selection.

4. Support the management of (multiple) (parallel) applications.

Coscheduling not only enables simultaneous execution, it also provides
a form of concurrency control, which can be used for various effects.
The currently most relevant example in this category is that
coscheduling can be used to close certain side channels, or at least
make their exploitation harder, by isolating applications in time.

In the L1TF context, it prevents other applications from loading
additional data into the L1 cache while one application tries to leak
data.


C) How does it work?
--------------------

This patch series introduces hierarchical runqueues that represent larger
and larger fractions of the system. By default, there is one runqueue per
scheduling domain. These additional levels of runqueues are activated via
the "cosched_max_level=" kernel command line argument. The bottom level is
0.

One CPU per hierarchical runqueue is considered the leader, which is
primarily responsible for the scheduling decision at that level. Once the
leader has selected a task group to execute, it notifies the leaders of
all runqueues below it to select tasks/task groups within the selected
task group.

For each task group, the user can select the level at which it should be
scheduled. If you set "cpu.scheduled" to "1", coscheduling will typically
happen at core level on systems with SMT. That is, if one SMT sibling
executes a task from this task group, the other sibling will do so, too;
if no task is available, that sibling will idle. With "cpu.scheduled"
set to "2", this is extended to the next level, which is typically a
whole socket on many systems. And so on. If you feel that this does not
provide enough flexibility, you can specify "cosched_split_domains" on
the kernel command line to create more fine-grained scheduling domains
for your system.
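
For illustration, on a two-socket system with SMT-2 cores booted with
"cosched_max_level=2": a task group with "cpu.scheduled" set to "0" is
scheduled per CPU as usual; with "1", both SMT siblings of a core switch
to and from the group together; with "2", all SMT threads of one socket
do. (This is merely a reading of the rules above on an example machine,
not additional semantics.)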

You currently have to explicitly set affinities of tasks within coscheduled
task groups, as load balancing is not implemented for them at this point.


D) What can I *not* do with this?
---------------------------------

Besides the missing load balancing within coscheduled task groups, this
implementation has the following properties, which might be considered
shortcomings.

This particular implementation focuses on SCHED_OTHER tasks managed by CFS
and allows coscheduling them. Interrupts as well as tasks in higher
scheduling classes are currently out of scope: they are assumed to be
negligible interruptions as far as coscheduling is concerned and they do
*not* cause a preemption of a whole group. This implementation could be
extended to cover higher scheduling classes. Interrupts, however, are an
orthogonal issue.

The collective context switch from one coscheduled set of tasks to another
-- while fast -- is not atomic. If a use-case needs the absolute guarantee
that all tasks of the previous set have stopped executing before any task
of the next set starts executing, an additional hand-shake/barrier needs to
be added.

Once combined with load balancing, this implementation will gain the
ability to restrict execution of tasks within a task group to the CPUs
below a single hierarchical runqueue of a certain level. From there, it
is a short step to dynamically adjusting this level in relation to the
number of runnable tasks. This will enable wide coscheduling with a
minimum of fragmentation under dynamic load.


E) What's the overhead?
-----------------------

Each (active) hierarchy level has roughly the same effect as one additional
level of nested cgroups. In addition -- at this stage -- there may be some
additional lock contention if you coschedule larger fractions of the system
with a dynamic task set.


F) High-level overview of the patches in this series.
-----------------------------------------------------

1 to 21: Preparation patches that keep the following coscheduling patches
manageable. The following may be of general interest, even without
coscheduling:

1: Store task_group->se[] pointers as part of cfs_rq
2: Introduce set_entity_cfs() to place a SE into a certain CFS runqueue
4: Replace sd_numa_mask() hack with something sane
15: Introduce parent_cfs_rq() and use it
17: Introduce and use generic task group CFS traversal functions

As well as some simpler clean-ups in patches 8, 10, 13, and 18.


22 to 60: The actual coscheduling functionality. Highlights are:

23: Data structures used for coscheduling.
24-26: Creation of root-task-group runqueue hierarchy.
39-40: Runqueue hierarchies for normal task groups.
41-42: Locking strategies under coscheduling.
47-49, 52, 54-56: Adjust core CFS code.
57-59: Enabling/disabling of coscheduling via cpu.scheduled


Jan H. Schönherr (60):
sched: Store task_group->se[] pointers as part of cfs_rq
sched: Introduce set_entity_cfs() to place a SE into a certain CFS
runqueue
sched: Setup sched_domain_shared for all sched_domains
sched: Replace sd_numa_mask() hack with something sane
sched: Allow to retrieve the sched_domain_topology
sched: Add a lock-free variant of resched_cpu()
sched: Reduce dependencies of init_tg_cfs_entry()
sched: Move init_entity_runnable_average() into init_tg_cfs_entry()
sched: Do not require a CFS in init_tg_cfs_entry()
sched: Use parent_entity() in more places
locking/lockdep: Increase number of supported lockdep subclasses
locking/lockdep: Make cookie generator accessible
sched: Remove useless checks for root task-group
sched: Refactor sync_throttle() to accept a CFS runqueue as argument
sched: Introduce parent_cfs_rq() and use it
sched: Preparatory code movement
sched: Introduce and use generic task group CFS traversal functions
sched: Fix return value of SCHED_WARN_ON()
sched: Add entity variants of enqueue_task_fair() and
dequeue_task_fair()
sched: Let {en,de}queue_entity_fair() work with a varying amount of
tasks
sched: Add entity variants of put_prev_task_fair() and
set_curr_task_fair()
cosched: Add config option for coscheduling support
cosched: Add core data structures for coscheduling
cosched: Do minimal pre-SMP coscheduler initialization
cosched: Prepare scheduling domain topology for coscheduling
cosched: Construct runqueue hierarchy
cosched: Add some small helper functions for later use
cosched: Add is_sd_se() to distinguish SD-SEs from TG-SEs
cosched: Adjust code reflecting on the total number of CFS tasks on a
CPU
cosched: Disallow share modification on task groups for now
cosched: Don't disable idle tick for now
cosched: Specialize parent_cfs_rq() for hierarchical runqueues
cosched: Allow resched_curr() to be called for hierarchical runqueues
cosched: Add rq_of() variants for different use cases
cosched: Adjust rq_lock() functions to work with hierarchical
runqueues
cosched: Use hrq_of() for rq_clock() and rq_clock_task()
cosched: Use hrq_of() for (indirect calls to) ___update_load_sum()
cosched: Skip updates on non-CPU runqueues in cfs_rq_util_change()
cosched: Adjust task group management for hierarchical runqueues
cosched: Keep track of task group hierarchy within each SD-RQ
cosched: Introduce locking for leader activities
cosched: Introduce locking for (mostly) enqueuing and dequeuing
cosched: Add for_each_sched_entity() variant for owned entities
cosched: Perform various rq_of() adjustments in scheduler code
cosched: Continue to account all load on per-CPU runqueues
cosched: Warn on throttling attempts of non-CPU runqueues
cosched: Adjust SE traversal and locking for common leader activities
cosched: Adjust SE traversal and locking for yielding and buddies
cosched: Adjust locking for enqueuing and dequeueing
cosched: Propagate load changes across hierarchy levels
cosched: Hacky work-around to avoid observing zero weight SD-SE
cosched: Support SD-SEs in enqueuing and dequeuing
cosched: Prevent balancing related functions from crossing hierarchy
levels
cosched: Support idling in a coscheduled set
cosched: Adjust task selection for coscheduling
cosched: Adjust wakeup preemption rules for coscheduling
cosched: Add sysfs interface to configure coscheduling on cgroups
cosched: Switch runqueues between regular scheduling and coscheduling
cosched: Handle non-atomicity during switches to and from coscheduling
cosched: Add command line argument to enable coscheduling

include/linux/lockdep.h | 4 +-
include/linux/sched/topology.h | 18 +-
init/Kconfig | 11 +
kernel/locking/lockdep.c | 21 +-
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 109 +++-
kernel/sched/cosched.c | 882 +++++++++++++++++++++++++++++
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 1196 ++++++++++++++++++++++++++++++++--------
kernel/sched/idle.c | 7 +-
kernel/sched/sched.h | 461 +++++++++++++++-
kernel/sched/topology.c | 57 +-
kernel/time/tick-sched.c | 14 +
13 files changed, 2474 insertions(+), 309 deletions(-)
create mode 100644 kernel/sched/cosched.c

--
2.9.3.1.gcba166c.dirty



2018-09-07 21:42:42

by Jan H. Schönherr

Subject: [RFC 04/60] sched: Replace sd_numa_mask() hack with something sane

Get rid of the global variable sched_domains_curr_level, which is used
to pass state into sd_numa_mask(), which in turn is used as a callback
for sched_domain_topology_level->mask().

Extend the ->mask() callback instead, so that it takes the topology level
as an extra argument. Provide a backward compatible ->simple_mask()
callback, so that existing code can stay as it is.

This enables other users to do queries via ->mask() without having to
worry about the global variable. It also opens up the possibility for
more generic topologies that require a dynamic number of levels (similar
to what NUMA already does on top of the system topology).
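
As a hypothetical example (not part of this patch) of the backward
compatible path: architecture code that overrides the topology can keep
using simple per-CPU mask callbacks by filling in ->simple_mask;
set_sched_topology() then installs the sd_simple_mask() wrapper for such
entries.

static struct sched_domain_topology_level arch_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { .simple_mask = cpu_smt_mask, .sd_flags = cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
        { .simple_mask = cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

void __init arch_init_topology(void)
{
        set_sched_topology(arch_topology);
}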

Signed-off-by: Jan H. Schönherr <[email protected]>
---
include/linux/sched/topology.h | 11 ++++++++---
kernel/sched/topology.c | 40 ++++++++++++++++++++++------------------
2 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 530ad856372e..f78534f1cc1e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -165,7 +165,11 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);

bool cpus_share_cache(int this_cpu, int that_cpu);

-typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
+struct sched_domain_topology_level;
+
+typedef const struct cpumask *(*sched_domain_simple_mask_f)(int cpu);
+typedef const struct cpumask *(*sched_domain_mask_f)(struct sched_domain_topology_level *tl,
+ int cpu);
typedef int (*sched_domain_flags_f)(void);

#define SDTL_OVERLAP 0x01
@@ -178,10 +182,11 @@ struct sd_data {
};

struct sched_domain_topology_level {
- sched_domain_mask_f mask;
+ sched_domain_simple_mask_f simple_mask;
sched_domain_flags_f sd_flags;
+ sched_domain_mask_f mask;
int flags;
- int numa_level;
+ int level;
struct sd_data data;
#ifdef CONFIG_SCHED_DEBUG
char *name;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8b64f3f57d50..0f2c3aa0a097 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1043,7 +1043,6 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
enum numa_topology_type sched_numa_topology_type;

static int sched_domains_numa_levels;
-static int sched_domains_curr_level;

int sched_max_numa_distance;
static int *sched_domains_numa_distance;
@@ -1084,15 +1083,9 @@ sd_init(struct sched_domain_topology_level *tl,
struct sd_data *sdd = &tl->data;
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
int sd_id, sd_weight, sd_flags = 0;
+ const struct cpumask *mask = tl->mask(tl, cpu);

-#ifdef CONFIG_NUMA
- /*
- * Ugly hack to pass state to sd_numa_mask()...
- */
- sched_domains_curr_level = tl->numa_level;
-#endif
-
- sd_weight = cpumask_weight(tl->mask(cpu));
+ sd_weight = cpumask_weight(mask);

if (tl->sd_flags)
sd_flags = (*tl->sd_flags)();
@@ -1138,7 +1131,7 @@ sd_init(struct sched_domain_topology_level *tl,
#endif
};

- cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
+ cpumask_and(sched_domain_span(sd), cpu_map, mask);
sd_id = cpumask_first(sched_domain_span(sd));

/*
@@ -1170,7 +1163,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->idle_idx = 2;

sd->flags |= SD_SERIALIZE;
- if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+ if (sched_domains_numa_distance[tl->level] > RECLAIM_DISTANCE) {
sd->flags &= ~(SD_BALANCE_EXEC |
SD_BALANCE_FORK |
SD_WAKE_AFFINE);
@@ -1195,17 +1188,23 @@ sd_init(struct sched_domain_topology_level *tl,
return sd;
}

+static const struct cpumask *
+sd_simple_mask(struct sched_domain_topology_level *tl, int cpu)
+{
+ return tl->simple_mask(cpu);
+}
+
/*
* Topology list, bottom-up.
*/
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+ { cpu_smt_mask, cpu_smt_flags, sd_simple_mask, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+ { cpu_coregroup_mask, cpu_core_flags, sd_simple_mask, SD_INIT_NAME(MC) },
#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { cpu_cpu_mask, NULL, sd_simple_mask, SD_INIT_NAME(DIE) },
{ NULL, },
};

@@ -1221,13 +1220,18 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
return;

sched_domain_topology = tl;
+ for (; tl->mask || tl->simple_mask; tl++) {
+ if (tl->simple_mask)
+ tl->mask = sd_simple_mask;
+ }
}

#ifdef CONFIG_NUMA

-static const struct cpumask *sd_numa_mask(int cpu)
+static const struct cpumask *
+sd_numa_mask(struct sched_domain_topology_level *tl, int cpu)
{
- return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
+ return sched_domains_numa_masks[tl->level][cpu_to_node(cpu)];
}

static void sched_numa_warn(const char *str)
@@ -1446,7 +1450,7 @@ void sched_init_numa(void)
*/
tl[i++] = (struct sched_domain_topology_level){
.mask = sd_numa_mask,
- .numa_level = 0,
+ .level = 0,
SD_INIT_NAME(NODE)
};

@@ -1458,7 +1462,7 @@ void sched_init_numa(void)
.mask = sd_numa_mask,
.sd_flags = cpu_numa_flags,
.flags = SDTL_OVERLAP,
- .numa_level = j,
+ .level = j,
SD_INIT_NAME(NUMA)
};
}
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:43:11

by Jan H. Schönherr

Subject: [RFC 05/60] sched: Allow to retrieve the sched_domain_topology

While it is possible to simply overwrite the sched_domain_topology, it
is not possible to retrieve the current sched_domain_topology in order
to modify it instead. Add a function to enable that use case.

Note that this does not help with the already existing potential memory
leak when one dynamically allocated sched_domain_topology is replaced
with another one.
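
A sketch of the intended use case (illustrative only; patch 25 later in
this series does something along these lines): retrieve the current
topology, build an extended copy, and install the copy again.

void __init extend_sched_topology(void)
{
        struct sched_domain_topology_level *old = get_sched_topology();
        struct sched_domain_topology_level *tl;
        int levels = 0;

        for (tl = old; tl->mask || tl->simple_mask; tl++)
                levels++;

        /* room for one additional custom level plus the terminator */
        tl = kcalloc(levels + 2, sizeof(*tl), GFP_KERNEL);
        if (!tl)
                return;
        memcpy(tl, old, levels * sizeof(*tl));

        /* fill in tl[levels] with the custom level here */

        set_sched_topology(tl);
}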

Signed-off-by: Jan H. Schönherr <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/topology.c | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index f78534f1cc1e..28d037d0050e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -194,6 +194,7 @@ struct sched_domain_topology_level {
};

extern void set_sched_topology(struct sched_domain_topology_level *tl);
+struct sched_domain_topology_level *get_sched_topology(void);

#ifdef CONFIG_SCHED_DEBUG
# define SD_INIT_NAME(type) .name = #type
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 0f2c3aa0a097..f2db2368ac5f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1226,6 +1226,11 @@ void set_sched_topology(struct sched_domain_topology_level *tl)
}
}

+struct sched_domain_topology_level *get_sched_topology(void)
+{
+ return sched_domain_topology;
+}
+
#ifdef CONFIG_NUMA

static const struct cpumask *
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:43:14

by Jan H. Schönherr

Subject: [RFC 19/60] sched: Add entity variants of enqueue_task_fair() and dequeue_task_fair()

There is a fair amount of overlap between enqueue_task_fair() and
unthrottle_cfs_rq(), as well as between dequeue_task_fair() and
throttle_cfs_rq(). This is a first step toward having each pair use the
same basic function.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++----------------------
kernel/sched/sched.h | 3 ++
2 files changed, 51 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f63ac37f5ef..a96328c5a864 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4979,32 +4979,9 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

-/*
- * The enqueue_task method is called before nr_running is
- * increased. Here we update the fair scheduling stats and
- * then put the task into the rbtree:
- */
-static void
-enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
-
- /*
- * The code below (indirectly) updates schedutil which looks at
- * the cfs_rq utilization to select a frequency.
- * Let's add the task's estimated utilization to the cfs_rq's
- * estimated utilization, before we update schedutil.
- */
- util_est_enqueue(&rq->cfs, p);
-
- /*
- * If in_iowait is set, the code below may not trigger any cpufreq
- * utilization updates, so do it here explicitly with the IOWAIT flag
- * passed.
- */
- if (p->in_iowait)
- cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

for_each_sched_entity(se) {
if (se->on_rq)
@@ -5036,7 +5013,38 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}

- if (!se)
+ return se != NULL;
+}
+
+/*
+ * The enqueue_task method is called before nr_running is
+ * increased. Here we update the fair scheduling stats and
+ * then put the task into the rbtree:
+ */
+static void
+enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+{
+ bool throttled;
+
+ /*
+ * The code below (indirectly) updates schedutil which looks at
+ * the cfs_rq utilization to select a frequency.
+ * Let's add the task's estimated utilization to the cfs_rq's
+ * estimated utilization, before we update schedutil.
+ */
+ util_est_enqueue(&rq->cfs, p);
+
+ /*
+ * If in_iowait is set, the code below may not trigger any cpufreq
+ * utilization updates, so do it here explicitly with the IOWAIT flag
+ * passed.
+ */
+ if (p->in_iowait)
+ cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
+
+ throttled = enqueue_entity_fair(rq, &p->se, flags);
+
+ if (!throttled)
add_nr_running(rq, 1);

hrtick_update(rq);
@@ -5044,15 +5052,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

static void set_next_buddy(struct sched_entity *se);

-/*
- * The dequeue_task method is called before nr_running is
- * decreased. We remove the task from the rbtree and
- * update the fair scheduling stats:
- */
-static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
int task_sleep = flags & DEQUEUE_SLEEP;

for_each_sched_entity(se) {
@@ -5095,10 +5097,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}

- if (!se)
+ return se != NULL;
+}
+
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+{
+ bool throttled = dequeue_entity_fair(rq, &p->se, flags);
+
+ if (!throttled)
sub_nr_running(rq, 1);

- util_est_dequeue(&rq->cfs, p, task_sleep);
+ util_est_dequeue(&rq->cfs, p, flags & DEQUEUE_SLEEP);
hrtick_update(rq);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e0ad36938fb..9016049f36c3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1543,6 +1543,9 @@ extern const u32 sched_prio_to_wmult[40];

#define RETRY_TASK ((void *)-1UL)

+bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags);
+bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags);
+
struct sched_class {
const struct sched_class *next;

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:43:16

by Jan H. Schönherr

Subject: [RFC 09/60] sched: Do not require a CFS in init_tg_cfs_entry()

Just like init_tg_cfs_entry() does something useful without a scheduling
entity, let it do something useful without a CFS runqueue.

This prepares for the addition of new types of SEs.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b3fd7cd9fde..bccd7a66858e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9903,21 +9903,23 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct sched_entity *se, struct rq *rq,
struct cfs_rq *parent)
{
- cfs_rq->tg = tg;
- cfs_rq->rq = rq;
- cfs_rq->my_se = se;
- init_cfs_rq_runtime(cfs_rq);
-
- /* se could be NULL for root_task_group */
- if (!se)
- return;
+ /* cfs_rq may be NULL for certain types of SE */
+ if (cfs_rq) {
+ cfs_rq->tg = tg;
+ cfs_rq->rq = rq;
+ cfs_rq->my_se = se;
+ init_cfs_rq_runtime(cfs_rq);
+ }

- set_entity_cfs(se, parent);
- se->my_q = cfs_rq;
+ /* se is NULL for root_task_group */
+ if (se) {
+ set_entity_cfs(se, parent);
+ se->my_q = cfs_rq;

- /* guarantee group entities always have weight */
- update_load_set(&se->load, NICE_0_LOAD);
- init_entity_runnable_average(se);
+ /* guarantee group entities always have weight */
+ update_load_set(&se->load, NICE_0_LOAD);
+ init_entity_runnable_average(se);
+ }
}

static DEFINE_MUTEX(shares_mutex);
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:43:39

by Jan H. Schönherr

Subject: [RFC 20/60] sched: Let {en,de}queue_entity_fair() work with a varying amount of tasks

Make the task delta handled by enqueue_entity_fair() and
dequeue_entity_fair() variable, as required by unthrottle_cfs_rq() and
throttle_cfs_rq().
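
A rough sketch (illustrative only, not part of this patch) of how
unthrottle_cfs_rq() could eventually use the entity variant: re-enqueue
the group SE once and account all tasks below it in one go.

static void unthrottle_via_entity_variant(struct rq *rq, struct cfs_rq *cfs_rq)
{
        long task_delta = cfs_rq->h_nr_running;
        bool throttled;

        /* bandwidth bookkeeping of the real unthrottle path omitted */
        throttled = enqueue_entity_fair(rq, cfs_rq->my_se, ENQUEUE_WAKEUP,
                                        task_delta);
        if (!throttled)
                add_nr_running(rq, task_delta);
}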

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++--------
kernel/sched/sched.h | 6 ++++--
2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a96328c5a864..f13fb4460b66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4979,7 +4979,8 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

-bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags)
+bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
+ unsigned int task_delta)
{
struct cfs_rq *cfs_rq;

@@ -4997,14 +4998,14 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags)
*/
if (cfs_rq_throttled(cfs_rq))
break;
- cfs_rq->h_nr_running++;
+ cfs_rq->h_nr_running += task_delta;

flags = ENQUEUE_WAKEUP;
}

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- cfs_rq->h_nr_running++;
+ cfs_rq->h_nr_running += task_delta;

if (cfs_rq_throttled(cfs_rq))
break;
@@ -5042,7 +5043,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (p->in_iowait)
cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

- throttled = enqueue_entity_fair(rq, &p->se, flags);
+ throttled = enqueue_entity_fair(rq, &p->se, flags, 1);

if (!throttled)
add_nr_running(rq, 1);
@@ -5052,7 +5053,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

static void set_next_buddy(struct sched_entity *se);

-bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags)
+bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
+ unsigned int task_delta)
{
struct cfs_rq *cfs_rq;
int task_sleep = flags & DEQUEUE_SLEEP;
@@ -5069,7 +5071,7 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags)
*/
if (cfs_rq_throttled(cfs_rq))
break;
- cfs_rq->h_nr_running--;
+ cfs_rq->h_nr_running -= task_delta;

/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
@@ -5088,7 +5090,7 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags)

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- cfs_rq->h_nr_running--;
+ cfs_rq->h_nr_running -= task_delta;

if (cfs_rq_throttled(cfs_rq))
break;
@@ -5107,7 +5109,7 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags)
*/
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
- bool throttled = dequeue_entity_fair(rq, &p->se, flags);
+ bool throttled = dequeue_entity_fair(rq, &p->se, flags, 1);

if (!throttled)
sub_nr_running(rq, 1);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9016049f36c3..569a487ed07c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1543,8 +1543,10 @@ extern const u32 sched_prio_to_wmult[40];

#define RETRY_TASK ((void *)-1UL)

-bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags);
-bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags);
+bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
+ unsigned int task_delta);
+bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
+ unsigned int task_delta);

struct sched_class {
const struct sched_class *next;
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:43:45

by Jan H. Schönherr

Subject: [RFC 21/60] sched: Add entity variants of put_prev_task_fair() and set_curr_task_fair()

Add entity variants of put_prev_task_fair() and set_curr_task_fair()
that will be later used by coscheduling.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
kernel/sched/sched.h | 2 ++
2 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f13fb4460b66..18b1d81951f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6651,12 +6651,8 @@ done: __maybe_unused;
return NULL;
}

-/*
- * Account for a descheduled task:
- */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+void put_prev_entity_fair(struct rq *rq, struct sched_entity *se)
{
- struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;

for_each_sched_entity(se) {
@@ -6666,6 +6662,14 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
}

/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+{
+ put_prev_entity_fair(rq, &prev->se);
+}
+
+/*
* sched_yield() is very simple
*
* The magic of dealing with the ->skip buddy is in pick_next_entity.
@@ -9758,15 +9762,8 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
}
}

-/* Account for a task changing its policy or group.
- *
- * This routine is mostly called to set cfs_rq->curr field when a task
- * migrates between groups/classes.
- */
-static void set_curr_task_fair(struct rq *rq)
+void set_curr_entity_fair(struct rq *rq, struct sched_entity *se)
{
- struct sched_entity *se = &rq->curr->se;
-
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);

@@ -9776,6 +9773,17 @@ static void set_curr_task_fair(struct rq *rq)
}
}

+/*
+ * Account for a task changing its policy or group.
+ *
+ * This routine is mostly called to set cfs_rq->curr field when a task
+ * migrates between groups/classes.
+ */
+static void set_curr_task_fair(struct rq *rq)
+{
+ set_curr_entity_fair(rq, &rq->curr->se);
+}
+
void init_cfs_rq(struct cfs_rq *cfs_rq)
{
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 569a487ed07c..b36e61914a42 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1547,6 +1547,8 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
unsigned int task_delta);
bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
unsigned int task_delta);
+void put_prev_entity_fair(struct rq *rq, struct sched_entity *se);
+void set_curr_entity_fair(struct rq *rq, struct sched_entity *se);

struct sched_class {
const struct sched_class *next;
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:43:47

by Jan H. Schönherr

Subject: [RFC 35/60] cosched: Adjust rq_lock() functions to work with hierarchical runqueues

Locks within the runqueue hierarchy are always taken from bottom to top
to avoid deadlocks. Let the lock validator know about this by declaring
different runqueue levels as distinct lock classes.
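
For illustration (not taken from this patch), the order that these
subclasses encode: locks are taken bottom-up, so a CPU that already holds
its own level-0 runqueue lock may acquire the lock of a higher-level
runqueue, while a second lock on the same level would use the _NESTED
subclass instead.

static void lock_bottom_up(struct rq *bottom, struct rq *parent)
{
        raw_spin_lock_nested(&bottom->lock, RQ_LOCK_SUBCLASS(bottom));
        raw_spin_lock_nested(&parent->lock, RQ_LOCK_SUBCLASS(parent));
}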

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/sched.h | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 594eb9489f3d..bc3631b8b955 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2083,11 +2083,26 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}

+#ifdef CONFIG_COSCHEDULING
+/*
+ * The hierarchical runqueues have locks which are taken from bottom to
+ * top. For lock validation, we use the level to calculate the subclass.
+ * As it is sometimes necessary to take two locks on the same level, we
+ * leave some space in the subclass values for that purpose.
+ */
+#define RQ_LOCK_SUBCLASS(rq) (2 * (rq)->sdrq_data.level)
+#define RQ_LOCK_SUBCLASS_NESTED(rq) (2 * (rq)->sdrq_data.level + 1)
+#else
+#define RQ_LOCK_SUBCLASS(rq) 0
+#define RQ_LOCK_SUBCLASS_NESTED(rq) SINGLE_DEPTH_NESTING
+#endif
+
static inline void
rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock_irqsave(&rq->lock, rf->flags);
+ raw_spin_lock_irqsave_nested(&rq->lock, rf->flags,
+ RQ_LOCK_SUBCLASS(rq));
rq_pin_lock(rq, rf);
}

@@ -2095,6 +2110,14 @@ static inline void
rq_lock_irq(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
+ /*
+ * There's no raw_spin_lock_irq_nested(). This is probably fine, as at
+ * most the first lock should be acquired this way. There might be some
+ * false negatives, though, if we start with a non-bottom lock and
+ * classify it incorrectly.
+ */
+ SCHED_WARN_ON(RQ_LOCK_SUBCLASS(rq));
+
raw_spin_lock_irq(&rq->lock);
rq_pin_lock(rq, rf);
}
@@ -2103,7 +2126,7 @@ static inline void
rq_lock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock(&rq->lock);
+ raw_spin_lock_nested(&rq->lock, RQ_LOCK_SUBCLASS(rq));
rq_pin_lock(rq, rf);
}

@@ -2111,7 +2134,7 @@ static inline void
rq_relock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock(&rq->lock);
+ raw_spin_lock_nested(&rq->lock, RQ_LOCK_SUBCLASS(rq));
rq_repin_lock(rq, rf);
}

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:44:08

by Jan H. Schönherr

Subject: [RFC 44/60] cosched: Perform various rq_of() adjustments in scheduler code

The functions check_preempt_tick() and entity_tick() are executed by
the leader of the group. As such, we already hold the lock for the
per-CPU runqueue. Thus, we can use the quick path to resched_curr().
Also, hrtimers are only used/active on per-CPU runqueues, so use the
per-CPU runqueue there as well.

The function __account_cfs_rq_runtime() is called via the enqueue
path, where we don't necessarily hold the per-CPU runqueue lock.
Take the long route through resched_curr().

The function list_add_leaf_cfs_rq() manages a supposedly depth-ordered
list of CFS runqueues that contribute to the load on a certain
runqueue. This is used during load balancing. We keep these lists per
hierarchy level, which corresponds to the lock we hold and also
keeps the per-CPU logic compatible with what is there.
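
The rule of thumb applied here, as an illustrative sketch (the helpers
are the rq_of() variants introduced earlier in this series; the boolean
condition is only for illustration):

static void resched_for(struct cfs_rq *cfs_rq, bool leader_holds_cpu_lock)
{
        if (leader_holds_cpu_lock)
                /* tick path: the per-CPU runqueue lock is already held */
                resched_curr(cpu_rq_of(cfs_rq));
        else
                /* enqueue path: take the long route */
                resched_curr(hrq_of(cfs_rq));
}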

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f55954e7cedc..fff88694560c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -342,7 +342,7 @@ static inline struct cfs_rq *parent_cfs_rq(struct cfs_rq *cfs_rq)
static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
if (!cfs_rq->on_list) {
- struct rq *rq = rq_of(cfs_rq);
+ struct rq *rq = hrq_of(cfs_rq);
struct cfs_rq *pcfs_rq = parent_cfs_rq(cfs_rq);
/*
* Ensure we either appear before our parent (if already
@@ -4072,7 +4072,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr(cpu_rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.
@@ -4096,7 +4096,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
return;

if (delta > ideal_runtime)
- resched_curr(rq_of(cfs_rq));
+ resched_curr(cpu_rq_of(cfs_rq));
}

static void
@@ -4238,14 +4238,14 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
* validating it and just reschedule.
*/
if (queued) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr(cpu_rq_of(cfs_rq));
return;
}
/*
* don't let the period tick interfere with the hrtick preemption
*/
if (!sched_feat(DOUBLE_TICK) &&
- hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
+ hrtimer_active(&cpu_rq_of(cfs_rq)->hrtick_timer))
return;
#endif

@@ -4422,7 +4422,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
* hierarchy can be throttled
*/
if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
- resched_curr(rq_of(cfs_rq));
+ resched_curr(hrq_of(cfs_rq));
}

static __always_inline
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:44:22

by Jan H. Schönherr

Subject: [RFC 40/60] cosched: Keep track of task group hierarchy within each SD-RQ

At a later point (load balancing and throttling at non-CPU levels), we
will have to iterate through parts of the task group hierarchy, visiting
all SD-RQs at the same position within the SD-hierarchy.

Keep track of the task group hierarchy within each SD-RQ to make that
use case efficient.
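
A sketch of the intended later use (illustrative only; the actual users
are added by follow-up patches): walk all SD-RQs that are attached to
the same SD-RQ in the parent task group.

static void visit_tg_children(struct sdrq *sdrq)
{
        struct sdrq *child;

        rcu_read_lock();
        list_for_each_entry_rcu(child, &sdrq->tg_children, tg_siblings) {
                /* e.g. accumulate load or propagate throttling here */
        }
        rcu_read_unlock();
}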

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 2 ++
kernel/sched/cosched.c | 19 +++++++++++++++++++
kernel/sched/sched.h | 4 ++++
3 files changed, 25 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a9f5339d58cb..b3ff885a88d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6310,6 +6310,7 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
tg->parent = parent;
INIT_LIST_HEAD(&tg->children);
list_add_rcu(&tg->siblings, &parent->children);
+ cosched_online_group(tg);
spin_unlock_irqrestore(&task_group_lock, flags);

online_fair_sched_group(tg);
@@ -6338,6 +6339,7 @@ void sched_offline_group(struct task_group *tg)
spin_lock_irqsave(&task_group_lock, flags);
list_del_rcu(&tg->list);
list_del_rcu(&tg->siblings);
+ cosched_offline_group(tg);
spin_unlock_irqrestore(&task_group_lock, flags);
}

diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index b897319d046c..1b442e20faad 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -495,3 +495,22 @@ void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs_rq,
init_sdrq(tg, &cfs_rq->sdrq, sd_parent ? &sd_parent->sdrq : NULL,
&tg_parent->sdrq, tg_parent->sdrq.data);
}
+
+void cosched_online_group(struct task_group *tg)
+{
+ struct cfs_rq *cfs;
+
+ /* Track each SD-RQ within the same SD-RQ in the TG parent */
+ taskgroup_for_each_cfsrq(tg, cfs)
+ list_add_tail_rcu(&cfs->sdrq.tg_siblings,
+ &cfs->sdrq.tg_parent->tg_children);
+}
+
+void cosched_offline_group(struct task_group *tg)
+{
+ struct cfs_rq *cfs;
+
+ /* Remove each SD-RQ from the children list in its TG parent */
+ taskgroup_for_each_cfsrq(tg, cfs)
+ list_del_rcu(&cfs->sdrq.tg_siblings);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38b4500095ca..0dfefa31704e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1195,6 +1195,8 @@ void cosched_init_topology(void);
void cosched_init_hierarchy(void);
void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs,
struct cfs_rq *sd_parent, struct cfs_rq *tg_parent);
+void cosched_online_group(struct task_group *tg);
+void cosched_offline_group(struct task_group *tg);
#else /* !CONFIG_COSCHEDULING */
static inline void cosched_init_bottom(void) { }
static inline void cosched_init_topology(void) { }
@@ -1202,6 +1204,8 @@ static inline void cosched_init_hierarchy(void) { }
static inline void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs,
struct cfs_rq *sd_parent,
struct cfs_rq *tg_parent) { }
+static inline void cosched_online_group(struct task_group *tg) { }
+static inline void cosched_offline_group(struct task_group *tg) { }
#endif /* !CONFIG_COSCHEDULING */

#ifdef CONFIG_SCHED_SMT
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:44:27

by Jan H. Schönherr

Subject: [RFC 52/60] cosched: Support SD-SEs in enqueuing and dequeuing

SD-SEs require some attention during enqueuing and dequeuing. In some
aspects they behave similarly to TG-SEs; for example, we must not dequeue
an SD-SE if it still represents other load. But SD-SEs are also different
due to the concurrent load updates by multiple CPUs, and we need to be
careful about when to access them, as an SD-SE belongs to the next
hierarchy level, which is protected by a different lock.

Make sure to propagate enqueues and dequeues correctly, and to notify
the leader when needed.

Additionally, we define cfs_rq->h_nr_running to refer to the number of
tasks and SD-SEs below the CFS runqueue, without drilling down into
SD-SEs. (Phrased differently, h_nr_running counts non-TG-SEs along the
task group hierarchy.) This makes later adjustments for load balancing
more natural, as SD-SEs now appear similar to tasks, allowing
coscheduled sets to be balanced individually.
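
For example, a CFS runqueue with three runnable tasks below it (directly
or via nested TG-SEs) plus one enqueued SD-SE has h_nr_running == 4,
regardless of how many tasks are runnable inside the coscheduled set
that the SD-SE represents.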

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 102 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 483db54ee20a..bc219c9c3097 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4600,17 +4600,40 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
/* throttled entity or throttle-on-deactivate */
if (!se->on_rq)
break;
+ if (is_sd_se(se)) {
+ /*
+ * don't dequeue sd_se if it represents other
+ * children besides the dequeued one
+ */
+ if (se->load.weight)
+ dequeue = 0;
+
+ task_delta = 1;
+ }

if (dequeue)
dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+ if (dequeue && is_sd_se(se)) {
+ /*
+ * If we dequeued an SD-SE and we are not the leader,
+ * the leader might want to select another task group
+ * right now.
+ *
+ * FIXME: Change leadership instead?
+ */
+ if (leader_of(se) != cpu_of(rq))
+ resched_cpu_locked(leader_of(se));
+ }
+ if (!dequeue && is_sd_se(se))
+ break;
qcfs_rq->h_nr_running -= task_delta;

if (qcfs_rq->load.weight)
dequeue = 0;
}

- if (!se)
- sub_nr_running(rq, task_delta);
+ if (!se || !is_cpu_rq(hrq_of(cfs_rq_of(se))))
+ sub_nr_running(rq, cfs_rq->h_nr_running);

rq_chain_unlock(&rc);

@@ -4641,8 +4664,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
int enqueue = 1;
- long task_delta;
+ long task_delta, orig_task_delta;
struct rq_chain rc;
+#ifdef CONFIG_COSCHEDULING
+ int lcpu = rq->sdrq_data.leader;
+#endif

SCHED_WARN_ON(!is_cpu_rq(rq));

@@ -4669,24 +4695,40 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
return;

task_delta = cfs_rq->h_nr_running;
+ orig_task_delta = task_delta;
rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
rq_chain_lock(&rc, se);
update_sdse_load(se);
if (se->on_rq)
enqueue = 0;
+ if (is_sd_se(se))
+ task_delta = 1;

cfs_rq = cfs_rq_of(se);
if (enqueue)
enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+ if (!enqueue && is_sd_se(se))
+ break;
cfs_rq->h_nr_running += task_delta;

if (cfs_rq_throttled(cfs_rq))
break;
+
+#ifdef CONFIG_COSCHEDULING
+ /*
+ * FIXME: Pro-actively reschedule the leader, can't tell
+ * currently whether we actually have to.
+ */
+ if (lcpu != cfs_rq->sdrq.data->leader) {
+ lcpu = cfs_rq->sdrq.data->leader;
+ resched_cpu_locked(lcpu);
+ }
+#endif /* CONFIG_COSCHEDULING */
}

- if (!se)
- add_nr_running(rq, task_delta);
+ if (!se || !is_cpu_rq(hrq_of(cfs_rq_of(se))))
+ add_nr_running(rq, orig_task_delta);

rq_chain_unlock(&rc);

@@ -5213,6 +5255,9 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
{
struct cfs_rq *cfs_rq;
struct rq_chain rc;
+#ifdef CONFIG_COSCHEDULING
+ int lcpu = rq->sdrq_data.leader;
+#endif

rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
@@ -5221,6 +5266,8 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
+ if (is_sd_se(se))
+ task_delta = 1;
enqueue_entity(cfs_rq, se, flags);

/*
@@ -5234,6 +5281,22 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
cfs_rq->h_nr_running += task_delta;

flags = ENQUEUE_WAKEUP;
+
+#ifdef CONFIG_COSCHEDULING
+ /*
+ * FIXME: Pro-actively reschedule the leader, can't tell
+ * currently whether we actually have to.
+ *
+ * There are some cases that slip through
+ * check_preempt_curr(), like the leader not getting
+ * notified (and not becoming aware of the addition
+ * timely), when an RT task is running.
+ */
+ if (lcpu != cfs_rq->sdrq.data->leader) {
+ lcpu = cfs_rq->sdrq.data->leader;
+ resched_cpu_locked(lcpu);
+ }
+#endif /* CONFIG_COSCHEDULING */
}

for_each_sched_entity(se) {
@@ -5241,6 +5304,9 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
rq_chain_lock(&rc, se);
update_sdse_load(se);
cfs_rq = cfs_rq_of(se);
+
+ if (is_sd_se(se))
+ task_delta = 0;
cfs_rq->h_nr_running += task_delta;

if (cfs_rq_throttled(cfs_rq))
@@ -5304,8 +5370,36 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
rq_chain_lock(&rc, se);
update_sdse_load(se);
cfs_rq = cfs_rq_of(se);
+
+ if (is_sd_se(se)) {
+ /*
+ * don't dequeue sd_se if it represents other
+ * children besides the dequeued one
+ */
+ if (se->load.weight)
+ break;
+
+ /* someone else did our job */
+ if (!se->on_rq)
+ break;
+
+ task_delta = 1;
+ }
+
dequeue_entity(cfs_rq, se, flags);

+ if (is_sd_se(se)) {
+ /*
+ * If we dequeued an SD-SE and we are not the leader,
+ * the leader might want to select another task group
+ * right now.
+ *
+ * FIXME: Change leadership instead?
+ */
+ if (leader_of(se) != cpu_of(rq))
+ resched_cpu_locked(leader_of(se));
+ }
+
/*
* end evaluation on encountering a throttled cfs_rq
*
@@ -5339,6 +5433,9 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
rq_chain_lock(&rc, se);
update_sdse_load(se);
cfs_rq = cfs_rq_of(se);
+
+ if (is_sd_se(se))
+ task_delta = 0;
cfs_rq->h_nr_running -= task_delta;

if (cfs_rq_throttled(cfs_rq))
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:44:44

by Jan H. Schönherr

Subject: [RFC 25/60] cosched: Prepare scheduling domain topology for coscheduling

The ability to coschedule is coupled closely to the scheduling domain
topology: all CPUs within a scheduling domain will context switch
simultaneously. In other words, each scheduling domain also defines
a synchronization domain.

That means that we should have a wider selection of scheduling domains
than just the typical core, socket, and system distinction. Otherwise,
it won't be possible to, e.g., coschedule a subset of a processor. While
synchronization domains based on hardware boundaries are exactly the
right thing for most resource contention or security related use cases
for coscheduling, they are a limiting factor for coscheduling use cases
around parallel programs.

On the other hand, that means that all CPUs need to have the same view
of which groups of CPUs form scheduling domains, as synchronization
domains have to be global by definition.

Introduce a function that post-processes the scheduling domain
topology just before the scheduling domains are generated. We use this
opportunity to get rid of overlapping (aka. non-global) scheduling
domains and to introduce additional levels of scheduling domains to
keep a more reasonable fan-out.

Couple the splitting of scheduling domains to a command line argument,
which is disabled by default. The splitting is not NUMA-aware and may
generate non-optimal splits at that level, when there are four or more
NUMA nodes. Doing this right at this level would require a proper
graph partitioning algorithm operating on the NUMA distance matrix.
Also, as mentioned before, not everyone needs the finer granularity.
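
As a worked example of the splitting (derived from the code below): on a
machine with SMT-2 cores and 8 cores per socket, the fan-out from core
(2 CPUs) to socket (16 CPUs) is 8. With "cosched_split_domains", two
intermediate synchronization domains are inserted, spanning 2 cores
(4 CPUs) and 4 cores (8 CPUs) respectively, so that no remaining level
fans out by more than a factor of three.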

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 1 +
kernel/sched/cosched.c | 259 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 +
3 files changed, 262 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a235b6041cb5..cc801f84bf97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5867,6 +5867,7 @@ int sched_cpu_dying(unsigned int cpu)
void __init sched_init_smp(void)
{
sched_init_numa();
+ cosched_init_topology();

/*
* There's no userspace yet to cause hotplug operations; hence all the
diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index 03ba86676b90..7a793aa93114 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -6,6 +6,8 @@
* Author: Jan H. Schönherr <[email protected]>
*/

+#include <linux/moduleparam.h>
+
#include "sched.h"

static int mask_to_node(const struct cpumask *span)
@@ -92,3 +94,260 @@ void cosched_init_bottom(void)
init_sdrq(&root_task_group, sdrq, NULL, NULL, data);
}
}
+
+static bool __read_mostly cosched_split_domains;
+
+static int __init cosched_split_domains_setup(char *str)
+{
+ cosched_split_domains = true;
+ return 0;
+}
+
+early_param("cosched_split_domains", cosched_split_domains_setup);
+
+struct sd_sdrqmask_level {
+ int groups;
+ struct cpumask **masks;
+};
+
+static struct sd_sdrqmask_level *sd_sdrqmasks;
+
+static const struct cpumask *
+sd_cpu_mask(struct sched_domain_topology_level *tl, int cpu)
+{
+ return get_cpu_mask(cpu);
+}
+
+static const struct cpumask *
+sd_sdrqmask(struct sched_domain_topology_level *tl, int cpu)
+{
+ int i, nr = tl->level;
+
+ for (i = 0; i < sd_sdrqmasks[nr].groups; i++) {
+ if (cpumask_test_cpu(cpu, sd_sdrqmasks[nr].masks[i]))
+ return sd_sdrqmasks[nr].masks[i];
+ }
+
+ WARN(1, "CPU%d not in any group for level %d", cpu, nr);
+ return get_cpu_mask(cpu);
+}
+
+static int calc_agglomeration_factor(int upperweight, int lowerweight)
+{
+ int factor;
+
+ /* Only split domains if actually requested. */
+ if (!cosched_split_domains)
+ return 1;
+
+ /* Determine branching factor */
+ if (upperweight % lowerweight) {
+ pr_info("Non-homogeneous topology?! Not restructuring! (%d, %d)",
+ upperweight, lowerweight);
+ return 1;
+ }
+
+ factor = upperweight / lowerweight;
+ WARN_ON_ONCE(factor <= 0);
+
+ /* Determine number of lower groups to agglomerate */
+ if (factor <= 3)
+ return 1;
+ if (factor % 2 == 0)
+ return 2;
+ if (factor % 3 == 0)
+ return 3;
+
+ pr_info("Cannot find a suitable agglomeration. Not restructuring! (%d, %d)",
+ upperweight, lowerweight);
+ return 1;
+}
+
+/*
+ * Construct the needed masks for an intermediate level.
+ *
+ * There is the level above us, and the level below us.
+ *
+ * Both levels consist of disjoint groups, while the lower level contains
+ * multiple groups for each group of the higher level. The branching factor
+ * must be > 3, otherwise splitting is not useful.
+ *
+ * Thus, to facilitate a bottom up approach, that can later also add more
+ * than one level between two existing levels, we always group two (or
+ * three if possible) lower groups together, given that they are in
+ * the same upper group.
+ *
+ * To get deterministic results, we go through groups from left to right,
+ * ordered by their lowest numbered cpu.
+ *
+ * FIXME: This does not consider distances of NUMA nodes. They may end up
+ * in non-optimal groups.
+ *
+ * FIXME: This only does the right thing for homogeneous topologies,
+ * where additionally all CPUs are online.
+ */
+static int create_sdrqmask(struct sd_sdrqmask_level *current_level,
+ struct sched_domain_topology_level *upper,
+ struct sched_domain_topology_level *lower)
+{
+ int lowerweight, agg;
+ int i, g = 0;
+ const struct cpumask *next;
+ cpumask_var_t remaining_system, remaining_upper;
+
+ /* Determine number of lower groups to agglomerate */
+ lowerweight = cpumask_weight(lower->mask(lower, 0));
+ agg = calc_agglomeration_factor(cpumask_weight(upper->mask(upper, 0)),
+ lowerweight);
+ WARN_ON_ONCE(agg <= 1);
+
+ /* Determine number of agglomerated groups across the system */
+ current_level->groups = cpumask_weight(cpu_online_mask)
+ / (agg * lowerweight);
+
+ /* Allocate memory for new masks and tmp vars */
+ current_level->masks = kcalloc(current_level->groups, sizeof(void *),
+ GFP_KERNEL);
+ if (!current_level->masks)
+ return -ENOMEM;
+ if (!zalloc_cpumask_var(&remaining_system, GFP_KERNEL))
+ return -ENOMEM;
+ if (!zalloc_cpumask_var(&remaining_upper, GFP_KERNEL))
+ return -ENOMEM;
+
+ /* Go through groups in upper level, creating all agglomerated masks */
+ cpumask_copy(remaining_system, cpu_online_mask);
+
+ /* While there is an unprocessed upper group */
+ while (!cpumask_empty(remaining_system)) {
+ /* Get that group */
+ next = upper->mask(upper, cpumask_first(remaining_system));
+ cpumask_andnot(remaining_system, remaining_system, next);
+
+ cpumask_copy(remaining_upper, next);
+
+ /* While there are unprocessed lower groups */
+ while (!cpumask_empty(remaining_upper)) {
+ struct cpumask *mask = kzalloc(cpumask_size(),
+ GFP_KERNEL);
+
+ if (!mask)
+ return -ENOMEM;
+
+ if (WARN_ON_ONCE(g == current_level->groups))
+ return -EINVAL;
+ current_level->masks[g] = mask;
+ g++;
+
+ /* Create agglomerated mask */
+ for (i = 0; i < agg; i++) {
+ WARN_ON_ONCE(cpumask_empty(remaining_upper));
+
+ next = lower->mask(lower, cpumask_first(remaining_upper));
+ cpumask_andnot(remaining_upper, remaining_upper, next);
+
+ cpumask_or(mask, mask, next);
+ }
+ }
+ }
+
+ if (WARN_ON_ONCE(g != current_level->groups))
+ return -EINVAL;
+
+ free_cpumask_var(remaining_system);
+ free_cpumask_var(remaining_upper);
+
+ return 0;
+}
+
+void cosched_init_topology(void)
+{
+ struct sched_domain_topology_level *sched_domain_topology = get_sched_topology();
+ struct sched_domain_topology_level *tl;
+ int i, agg, orig_level, levels = 0, extra_levels = 0;
+ int span, prev_span = 1;
+
+ /* Only one CPU in the system, we are finished here */
+ if (cpumask_weight(cpu_possible_mask) == 1)
+ return;
+
+ /* Determine number of additional levels */
+ for (tl = sched_domain_topology; tl->mask; tl++) {
+ /* Skip overlap levels, except for the last one */
+ if (tl->flags & SDTL_OVERLAP && tl[1].mask)
+ continue;
+
+ levels++;
+
+ /* FIXME: this assumes a homogeneous topology */
+ span = cpumask_weight(tl->mask(tl, 0));
+ for (;;) {
+ agg = calc_agglomeration_factor(span, prev_span);
+ if (agg <= 1)
+ break;
+ levels++;
+ extra_levels++;
+ prev_span *= agg;
+ }
+ prev_span = span;
+ }
+
+ /* Allocate memory for all levels plus terminators on both ends */
+ tl = kcalloc((levels + 2), sizeof(*tl), GFP_KERNEL);
+ sd_sdrqmasks = kcalloc(extra_levels, sizeof(*sd_sdrqmasks), GFP_KERNEL);
+ if (!tl || !sd_sdrqmasks)
+ return;
+
+ /* Fill in start terminator and forget about it */
+ tl->mask = sd_cpu_mask;
+ tl++;
+
+ /* Copy existing levels and add new ones */
+ prev_span = 1;
+ orig_level = 0;
+ extra_levels = 0;
+ for (i = 0; i < levels; i++) {
+ BUG_ON(!sched_domain_topology[orig_level].mask);
+
+ /* Skip overlap levels, except for the last one */
+ while (sched_domain_topology[orig_level].flags & SDTL_OVERLAP &&
+ sched_domain_topology[orig_level + 1].mask) {
+ orig_level++;
+ }
+
+ /* Copy existing */
+ tl[i] = sched_domain_topology[orig_level];
+
+ /* Check if we must add a level */
+ /* FIXME: this assumes a homogeneous topology */
+ span = cpumask_weight(tl[i].mask(&tl[i], 0));
+ agg = calc_agglomeration_factor(span, prev_span);
+ if (agg <= 1) {
+ orig_level++;
+ prev_span = span;
+ continue;
+ }
+
+ /*
+ * For the new level, we take the same setting as the level
+ * above us (the one we already copied). We just give it
+ * different set of masks.
+ */
+ if (create_sdrqmask(&sd_sdrqmasks[extra_levels],
+ &sched_domain_topology[orig_level],
+ &tl[i - 1]))
+ return;
+
+ tl[i].mask = sd_sdrqmask;
+ tl[i].level = extra_levels;
+ tl[i].flags &= ~SDTL_OVERLAP;
+ tl[i].simple_mask = NULL;
+
+ extra_levels++;
+ prev_span = cpumask_weight(tl[i].mask(&tl[i], 0));
+ }
+ BUG_ON(sched_domain_topology[orig_level].mask);
+
+ /* Make permanent */
+ set_sched_topology(tl);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 21b7c6cf8b87..ed9c526b74ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1131,8 +1131,10 @@ static inline struct cfs_rq *taskgroup_next_cfsrq(struct task_group *tg,

#ifdef CONFIG_COSCHEDULING
void cosched_init_bottom(void);
+void cosched_init_topology(void);
#else /* !CONFIG_COSCHEDULING */
static inline void cosched_init_bottom(void) { }
+static inline void cosched_init_topology(void) { }
#endif /* !CONFIG_COSCHEDULING */

#ifdef CONFIG_SCHED_SMT
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:44:52

by Jan H. Schönherr

Subject: [RFC 57/60] cosched: Add sysfs interface to configure coscheduling on cgroups

Add the sysfs interface to configure the scheduling domain hierarchy
level at which coscheduling should happen for a cgroup. By default,
task groups are created with a value of zero corresponding to regular
task groups without any coscheduling.

Note that you cannot specify a value that exceeds that of the root
task group. The value of the root task group itself cannot be configured
via this interface; it has to be set with a kernel command line
argument, which will be added later.

The function sdrq_update_root() will be filled in a follow-up commit.
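For illustration, here is a minimal user-space sketch (not part of this
patch) of how the interface is meant to be used. The cgroup path
"/sys/fs/cgroup/cpu/vm1" is an assumption and stands for any non-root
task group:

  /*
   * Minimal user-space sketch: set a cgroup's coscheduling level via
   * its cpu.scheduled file. The cgroup path is an example only.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  static int set_scheduled(const char *cgroup, int level)
  {
          char path[256], buf[16];
          int fd, len, ret = 0;

          snprintf(path, sizeof(path), "%s/cpu.scheduled", cgroup);
          fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;
          len = snprintf(buf, sizeof(buf), "%d", level);
          if (write(fd, buf, len) != len)
                  ret = -1;       /* e.g. level exceeds the root's level */
          close(fd);
          return ret;
  }

  int main(void)
  {
          /* 1 == coschedule at the first level above single CPUs */
          if (set_scheduled("/sys/fs/cgroup/cpu/vm1", 1))
                  perror("cpu.scheduled");
          return 0;
  }

The kernel side rejects writes to the root group (-EACCES) and values
above the root group's level (-EINVAL), as implemented in
cpu_scheduled_write_u64() below.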

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 44 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/cosched.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 4 ++++
3 files changed, 101 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 75de3b83a8c6..ad2ff9bc535c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6336,6 +6336,9 @@ void sched_offline_group(struct task_group *tg)
{
unsigned long flags;

+ /* Don't let offlining/destruction worry about coscheduling aspects */
+ cosched_set_scheduled(tg, 0);
+
/* End participation in shares distribution: */
unregister_fair_sched_group(tg);

@@ -6529,7 +6532,33 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,

return (u64) scale_load_down(tg->shares);
}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#ifdef CONFIG_COSCHEDULING
+static int cpu_scheduled_write_u64(struct cgroup_subsys_state *css, struct cftype *cftype,
+ u64 val)
+{
+ struct task_group *tg = css_tg(css);
+
+ if (tg == &root_task_group)
+ return -EACCES;
+
+ if (val > root_task_group.scheduled)
+ return -EINVAL;
+
+ cosched_set_scheduled(tg, val);
+ return 0;
+}

+static u64 cpu_scheduled_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+
+ return cosched_get_scheduled(tg);
+}
+#endif /* !CONFIG_COSCHEDULING */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_CFS_BANDWIDTH
static DEFINE_MUTEX(cfs_constraints_mutex);

@@ -6825,6 +6854,13 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_shares_write_u64,
},
#endif
+#ifdef CONFIG_COSCHEDULING
+ {
+ .name = "scheduled",
+ .read_u64 = cpu_scheduled_read_u64,
+ .write_u64 = cpu_scheduled_write_u64,
+ },
+#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
.name = "cfs_quota_us",
@@ -7012,6 +7048,14 @@ static struct cftype cpu_files[] = {
.write_s64 = cpu_weight_nice_write_s64,
},
#endif
+#ifdef CONFIG_COSCHEDULING
+ /* FIXME: This does not conform to cgroup-v2 conventions. */
+ {
+ .name = "scheduled",
+ .read_u64 = cpu_scheduled_read_u64,
+ .write_u64 = cpu_scheduled_write_u64,
+ },
+#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
.name = "max",
diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index f2d51079b3db..7c8b8c8d2814 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -515,6 +515,59 @@ void cosched_offline_group(struct task_group *tg)
list_del_rcu(&cfs->sdrq.tg_siblings);
}

+static void sdrq_update_root(struct sdrq *sdrq)
+{
+ /* TBD */
+}
+
+void cosched_set_scheduled(struct task_group *tg, int level)
+{
+ struct cfs_rq *cfs_rq;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&tg->lock, flags);
+
+ /*
+ * Update the is_root-fields of all hierarchical CFS runqueues in this
+ * task group. To avoid repetitive enqueues and dequeues on every level
+ * change, we chose pre- or post-order traversal.
+ */
+ if (level > tg->scheduled) {
+ /*
+ * roots move upwards: start reconfiguration at the top, so
+ * that everything is dequeued/enqueued only when we reach
+ * the previous scheduling level.
+ */
+ tg->scheduled = level;
+ taskgroup_for_each_cfsrq_topdown(tg, cfs_rq)
+ sdrq_update_root(&cfs_rq->sdrq);
+ }
+ if (level < tg->scheduled) {
+ /*
+ * roots move downwards: start reconfiguration at the bottom,
+ * so that we do the dequeuing/enqueuing immediately, when we
+ * reach the new scheduling level.
+ */
+ tg->scheduled = level;
+ taskgroup_for_each_cfsrq(tg, cfs_rq)
+ sdrq_update_root(&cfs_rq->sdrq);
+ }
+
+ raw_spin_unlock_irqrestore(&tg->lock, flags);
+}
+
+int cosched_get_scheduled(struct task_group *tg)
+{
+ unsigned long flags;
+ int level;
+
+ raw_spin_lock_irqsave(&tg->lock, flags);
+ level = tg->scheduled;
+ raw_spin_unlock_irqrestore(&tg->lock, flags);
+
+ return level;
+}
+
/*****************************************************************************
* Locking related functions
*****************************************************************************/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f6146feb7e55..e257451e05a5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1212,6 +1212,8 @@ void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs,
struct cfs_rq *sd_parent, struct cfs_rq *tg_parent);
void cosched_online_group(struct task_group *tg);
void cosched_offline_group(struct task_group *tg);
+void cosched_set_scheduled(struct task_group *tg, int level);
+int cosched_get_scheduled(struct task_group *tg);
struct rq *rq_lock_owned(struct rq *rq, struct rq_owner_flags *orf);
void rq_unlock_owned(struct rq *rq, struct rq_owner_flags *orf);
void rq_chain_init(struct rq_chain *rc, struct rq *rq);
@@ -1226,6 +1228,8 @@ static inline void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs,
struct cfs_rq *tg_parent) { }
static inline void cosched_online_group(struct task_group *tg) { }
static inline void cosched_offline_group(struct task_group *tg) { }
+static inline void cosched_set_scheduled(struct task_group *tg, int level) { }
+static inline int cosched_get_scheduled(struct task_group *tg) { return 0; }
static inline struct rq *rq_lock_owned(struct rq *rq, struct rq_owner_flags *orf) { return rq; }
static inline void rq_unlock_owned(struct rq *rq, struct rq_owner_flags *orf) { }
static inline void rq_chain_init(struct rq_chain *rc, struct rq *rq) { }
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:46:00

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 54/60] cosched: Support idling in a coscheduled set

If a coscheduled set is partly idle, some CPUs *must* do nothing, even
if they have other tasks (in other coscheduled sets). This forced idle
mode must work similarly to normal task execution, e.g., not just any
task is allowed to replace the forced idle task.

Lay the groundwork for this by introducing the general helper functions
to enter and leave the forced idle mode.

Whenever we are in forced idle, we execute the normal idle task, but we
forward many decisions to the fair scheduling class. The functions in
the fair scheduling class are made aware of the forced idle mode and
base their actual decisions on the (SD-)SE, under which there were no
tasks.
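
As a rough illustration of this delegation pattern, here is a
stand-alone toy model (not kernel code; all names are made up): a
non-NULL idle_se marks forced idle, and the idle path forwards its tick
to the fair path on behalf of the SE that had no runnable tasks:

  #include <stddef.h>
  #include <stdio.h>

  struct toy_se { const char *name; };

  struct toy_rq {
          struct toy_se *idle_se;         /* non-NULL => forced idle */
  };

  static int cosched_is_idle(const struct toy_rq *rq)
  {
          return rq->idle_se != NULL;
  }

  static void fair_task_tick(const struct toy_se *se)
  {
          printf("fair tick on behalf of %s\n", se->name);
  }

  static void idle_task_tick(const struct toy_rq *rq)
  {
          if (cosched_is_idle(rq)) {
                  /* Forced idle behaves like a fair task, not real idle */
                  fair_task_tick(rq->idle_se);
                  return;
          }
          printf("plain idle tick\n");
  }

  int main(void)
  {
          struct toy_se sd_se = { "the empty SD-SE" };
          struct toy_rq rq = { &sd_se };

          idle_task_tick(&rq);    /* forwarded to the fair path */
          rq.idle_se = NULL;
          idle_task_tick(&rq);    /* ordinary idle */
          return 0;
  }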

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 11 +++++++----
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++-------
kernel/sched/idle.c | 7 ++++++-
kernel/sched/sched.h | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 104 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b3ff885a88d4..75de3b83a8c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -856,13 +856,16 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
- const struct sched_class *class;
+ const struct sched_class *class, *curr_class = rq->curr->sched_class;
+
+ if (cosched_is_idle(rq, rq->curr))
+ curr_class = &fair_sched_class;

- if (p->sched_class == rq->curr->sched_class) {
- rq->curr->sched_class->check_preempt_curr(rq, p, flags);
+ if (p->sched_class == curr_class) {
+ curr_class->check_preempt_curr(rq, p, flags);
} else {
for_each_class(class) {
- if (class == rq->curr->sched_class)
+ if (class == curr_class)
break;
if (class == p->sched_class) {
resched_curr(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 210fcd534917..9e8b8119cdea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5206,12 +5206,14 @@ static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
- struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ if (cosched_is_idle(rq, p))
+ se = cosched_get_idle_se(rq);

SCHED_WARN_ON(task_rq(p) != rq);

if (nr_cfs_tasks(rq) > 1) {
- u64 slice = sched_slice(cfs_rq, se);
+ u64 slice = sched_slice(cfs_rq_of(se), se);
u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
s64 delta = slice - ran;

@@ -5232,11 +5234,17 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
static void hrtick_update(struct rq *rq)
{
struct task_struct *curr = rq->curr;
+ struct sched_entity *se = &curr->se;
+
+ if (!hrtick_enabled(rq))
+ return;

- if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
+ if (cosched_is_idle(rq, curr))
+ se = cosched_get_idle_se(rq);
+ else if (curr->sched_class != &fair_sched_class)
return;

- if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
+ if (cfs_rq_of(se)->nr_running < sched_nr_latency)
hrtick_start_fair(rq, curr);
}
#else /* !CONFIG_SCHED_HRTICK */
@@ -6802,13 +6810,20 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se;
- struct cfs_rq *cfs_rq = task_cfs_rq(curr);
- int scale = cfs_rq->nr_running >= sched_nr_latency;
int next_buddy_marked = 0;
+ struct cfs_rq *cfs_rq;
+ int scale;
+
+ /* FIXME: locking may be off after fetching the idle_se */
+ if (cosched_is_idle(rq, curr))
+ se = cosched_get_idle_se(rq);

if (unlikely(se == pse))
return;

+ cfs_rq = cfs_rq_of(se);
+ scale = cfs_rq->nr_running >= sched_nr_latency;
+
/*
* This is possible from callers such as attach_tasks(), in which we
* unconditionally check_prempt_curr() after an enqueue (which may have
@@ -7038,7 +7053,15 @@ void put_prev_entity_fair(struct rq *rq, struct sched_entity *se)
*/
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
- put_prev_entity_fair(rq, &prev->se);
+ struct sched_entity *se = &prev->se;
+
+ if (cosched_is_idle(rq, prev)) {
+ se = cosched_get_and_clear_idle_se(rq);
+ if (__leader_of(se) != cpu_of(rq))
+ return;
+ }
+
+ put_prev_entity_fair(rq, se);
}

/*
@@ -9952,6 +9975,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
struct sched_entity *se = &curr->se;
struct rq_owner_flags orf;

+ if (cosched_is_idle(rq, curr)) {
+ se = cosched_get_idle_se(rq);
+ if (__leader_of(se) != cpu_of(rq))
+ return;
+ }
+
rq_lock_owned(rq, &orf);
for_each_owned_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 16f84142f2f4..4df136ef1aeb 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -391,7 +391,8 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
- put_prev_task(rq, prev);
+ if (prev)
+ put_prev_task(rq, prev);
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);

@@ -413,6 +414,8 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+ if (cosched_is_idle(rq, prev))
+ fair_sched_class.put_prev_task(rq, prev);
}

/*
@@ -425,6 +428,8 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
*/
static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
{
+ if (cosched_is_idle(rq, curr))
+ fair_sched_class.task_tick(rq, curr, queued);
}

static void set_curr_task_idle(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 48939c8e539d..f6146feb7e55 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1914,6 +1914,61 @@ extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

+#ifdef CONFIG_COSCHEDULING
+static inline bool cosched_is_idle(struct rq *rq, struct task_struct *p)
+{
+ if (!rq->sdrq_data.idle_se)
+ return false;
+ if (SCHED_WARN_ON(p != rq->idle))
+ return false;
+ return true;
+}
+
+static inline struct sched_entity *cosched_get_idle_se(struct rq *rq)
+{
+ return rq->sdrq_data.idle_se;
+}
+
+static inline struct sched_entity *cosched_get_and_clear_idle_se(struct rq *rq)
+{
+ struct sched_entity *se = rq->sdrq_data.idle_se;
+
+ rq->sdrq_data.idle_se = NULL;
+
+ return se;
+}
+
+static inline struct sched_entity *cosched_set_idle(struct rq *rq,
+ struct sched_entity *se)
+{
+ rq->sdrq_data.idle_se = se;
+ return &idle_sched_class.pick_next_task(rq, NULL, NULL)->se;
+}
+#else /* !CONFIG_COSCHEDULING */
+static inline bool cosched_is_idle(struct rq *rq, struct task_struct *p)
+{
+ return false;
+}
+
+static inline struct sched_entity *cosched_get_idle_se(struct rq *rq)
+{
+ BUILD_BUG();
+ return NULL;
+}
+
+static inline struct sched_entity *cosched_get_and_clear_idle_se(struct rq *rq)
+{
+ BUILD_BUG();
+ return NULL;
+}
+
+static inline struct sched_entity *cosched_set_idle(struct rq *rq,
+ struct sched_entity *se)
+{
+ BUILD_BUG();
+ return NULL;
+}
+#endif /* !CONFIG_COSCHEDULING */

#ifdef CONFIG_SMP

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:46:23

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 41/60] cosched: Introduce locking for leader activities

With hierarchical runqueues and locks at each level, it is often
necessary to get multiple locks. Introduce the first of two locking
strategies, which is suitable for typical leader activities.

To avoid deadlocks, the general rule is that multiple locks have to be
taken from bottom to top. Leaders make scheduling decisions and the
necessary maintenance for their part of the runqueue hierarchy. Hence,
they need to gather locks for all runqueues they own to operate freely
on them.

Provide two functions that do that: rq_lock_owned() and
rq_unlock_owned(). Typically, they walk from the already locked per-CPU
runqueue upwards, locking/unlocking runqueues as they go along, stopping
when they would leave their area of responsibility.
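
The following stand-alone sketch (plain pthread mutexes, illustrative
names only, not the kernel data structures) shows the same bottom-to-top
pattern: the per-CPU runqueue is already locked, parents are locked for
as long as this CPU is their leader, and a lost race on the leader field
simply ends the walk. Unlocking releases only the levels above the leaf:

  #include <pthread.h>
  #include <stddef.h>

  struct toy_rq {
          pthread_mutex_t lock;
          struct toy_rq *parent;
          int leader;                     /* CPU leading this level */
  };

  /* Returns the uppermost locked runqueue; @leaf must already be locked. */
  static struct toy_rq *lock_owned(struct toy_rq *leaf, int cpu)
  {
          struct toy_rq *top = leaf, *rq = leaf->parent;

          while (rq && rq->leader == cpu) {
                  pthread_mutex_lock(&rq->lock);
                  if (rq->leader != cpu) {        /* raced with a takeover */
                          pthread_mutex_unlock(&rq->lock);
                          break;
                  }
                  top = rq;
                  rq = rq->parent;
          }
          return top;
  }

  static void unlock_owned(struct toy_rq *leaf, struct toy_rq *top)
  {
          struct toy_rq *rq;

          if (top == leaf)                /* nothing above the leaf locked */
                  return;
          for (rq = leaf->parent; ; rq = rq->parent) {
                  pthread_mutex_unlock(&rq->lock);
                  if (rq == top)
                          break;
          }
  }

  int main(void)
  {
          struct toy_rq sys  = { PTHREAD_MUTEX_INITIALIZER, NULL,  1 };
          struct toy_rq core = { PTHREAD_MUTEX_INITIALIZER, &sys,  0 };
          struct toy_rq cpu0 = { PTHREAD_MUTEX_INITIALIZER, &core, 0 };
          struct toy_rq *top;

          pthread_mutex_lock(&cpu0.lock); /* per-CPU lock, as usual */
          top = lock_owned(&cpu0, 0);     /* CPU 0 leads only the core */
          /* ... decisions for the owned part of the hierarchy ... */
          unlock_owned(&cpu0, top);
          pthread_mutex_unlock(&cpu0.lock);
          return 0;
  }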

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/cosched.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 11 ++++++
2 files changed, 105 insertions(+)

diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index 1b442e20faad..df62ee6d0520 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -514,3 +514,97 @@ void cosched_offline_group(struct task_group *tg)
taskgroup_for_each_cfsrq(tg, cfs)
list_del_rcu(&cfs->sdrq.tg_siblings);
}
+
+/*****************************************************************************
+ * Locking related functions
+ *****************************************************************************/
+
+/*
+ * Lock owned part of the runqueue hierarchy from the specified runqueue
+ * upwards.
+ *
+ * You may call rq_lock_owned() again in some nested code path. Currently, this
+ * is needed for put_prev_task(), which is sometimes called from within
+ * pick_next_task_fair(), and for throttle_cfs_rq(), which is sometimes called
+ * during enqueuing and dequeuing.
+ *
+ * When not called nested, returns the uppermost locked runqueue; used by
+ * pick_next_task_fair() to avoid going up the hierarchy again.
+ */
+struct rq *rq_lock_owned(struct rq *rq, struct rq_owner_flags *orf)
+{
+ int cpu = rq->sdrq_data.leader;
+ struct rq *ret = rq;
+
+ lockdep_assert_held(&rq->lock);
+
+ orf->nested = rq->sdrq_data.parent_locked;
+ if (orf->nested)
+ return NULL;
+
+ orf->cookie = lockdep_cookie();
+
+ WARN_ON_ONCE(!irqs_disabled());
+
+ /* Lowest level is already locked, begin with next level */
+ rq = parent_rq(rq);
+
+ while (rq) {
+ /*
+ * FIXME: This avoids ascending the hierarchy, if upper
+ * levels are not in use. Can we do this with leader==-1
+ * instead?
+ */
+ if (root_task_group.scheduled < rq->sdrq_data.level)
+ break;
+
+ /*
+ * Leadership is always taken, never given; if we're not
+ * already the leader, we won't be after taking the lock.
+ */
+ if (cpu != READ_ONCE(rq->sdrq_data.leader))
+ break;
+
+ rq_lock(rq, &rq->sdrq_data.rf);
+
+ /* Did we race with a leadership change? */
+ if (cpu != READ_ONCE(rq->sdrq_data.leader)) {
+ rq_unlock(rq, &rq->sdrq_data.rf);
+ break;
+ }
+
+ /* Apply the cookie that's not stored with the data structure */
+ lockdep_repin_lock(&rq->lock, orf->cookie);
+
+ ret->sdrq_data.parent_locked = true;
+ update_rq_clock(rq);
+ ret = rq;
+
+ rq = parent_rq(rq);
+ }
+
+ return ret;
+}
+
+void rq_unlock_owned(struct rq *rq, struct rq_owner_flags *orf)
+{
+ bool parent_locked = rq->sdrq_data.parent_locked;
+
+ if (orf->nested)
+ return;
+
+ /* Lowest level must stay locked, begin with next level */
+ lockdep_assert_held(&rq->lock);
+ rq->sdrq_data.parent_locked = false;
+
+ while (parent_locked) {
+ rq = parent_rq(rq);
+ lockdep_assert_held(&rq->lock);
+
+ parent_locked = rq->sdrq_data.parent_locked;
+ rq->sdrq_data.parent_locked = false;
+
+ lockdep_unpin_lock(&rq->lock, orf->cookie);
+ rq_unlock(rq, &rq->sdrq_data.rf);
+ }
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0dfefa31704e..7dba8fdc48c7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -506,6 +506,13 @@ struct rq_flags {
#endif
};

+struct rq_owner_flags {
+#ifdef CONFIG_COSCHEDULING
+ bool nested;
+ struct pin_cookie cookie;
+#endif
+};
+
#ifdef CONFIG_COSCHEDULING
struct sdrq_data {
/*
@@ -1197,6 +1204,8 @@ void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs,
struct cfs_rq *sd_parent, struct cfs_rq *tg_parent);
void cosched_online_group(struct task_group *tg);
void cosched_offline_group(struct task_group *tg);
+struct rq *rq_lock_owned(struct rq *rq, struct rq_owner_flags *orf);
+void rq_unlock_owned(struct rq *rq, struct rq_owner_flags *orf);
#else /* !CONFIG_COSCHEDULING */
static inline void cosched_init_bottom(void) { }
static inline void cosched_init_topology(void) { }
@@ -1206,6 +1215,8 @@ static inline void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs,
struct cfs_rq *tg_parent) { }
static inline void cosched_online_group(struct task_group *tg) { }
static inline void cosched_offline_group(struct task_group *tg) { }
+static inline struct rq *rq_lock_owned(struct rq *rq, struct rq_owner_flags *orf) { return rq; }
+static inline void rq_unlock_owned(struct rq *rq, struct rq_owner_flags *orf) { }
#endif /* !CONFIG_COSCHEDULING */

#ifdef CONFIG_SCHED_SMT
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:47:21

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 08/60] sched: Move init_entity_runnable_average() into init_tg_cfs_entry()

Move init_entity_runnable_average() into init_tg_cfs_entry(), where all
the other SE initialization is carried out.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0dd5825556c..2b3fd7cd9fde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9846,7 +9846,6 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
tg->cfs_rq[i] = cfs_rq;
init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, cpu_rq(i), parent->cfs_rq[i]);
- init_entity_runnable_average(se);
}

return 1;
@@ -9918,6 +9917,7 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,

/* guarantee group entities always have weight */
update_load_set(&se->load, NICE_0_LOAD);
+ init_entity_runnable_average(se);
}

static DEFINE_MUTEX(shares_mutex);
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:47:23

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 13/60] sched: Remove useless checks for root task-group

The functions sync_throttle() and unregister_fair_sched_group() are
called during the creation and destruction of cgroups. They are never
called for the root task-group. Remove checks that always yield the
same result when operating on non-root task groups.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d6225aedbfe..5cad364e3a88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4716,9 +4716,6 @@ static void sync_throttle(struct task_group *tg, int cpu)
if (!cfs_bandwidth_used())
return;

- if (!tg->parent)
- return;
-
cfs_rq = tg->cfs_rq[cpu];
pcfs_rq = tg->parent->cfs_rq[cpu];

@@ -9881,8 +9878,7 @@ void unregister_fair_sched_group(struct task_group *tg)
int cpu;

for_each_possible_cpu(cpu) {
- if (tg->cfs_rq[cpu]->my_se)
- remove_entity_load_avg(tg->cfs_rq[cpu]->my_se);
+ remove_entity_load_avg(tg->cfs_rq[cpu]->my_se);

/*
* Only empty task groups can be destroyed; so we can speculatively
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:47:49

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 07/60] sched: Reduce dependencies of init_tg_cfs_entry()

Decouple init_tg_cfs_entry() from other structures' implementation
details, so that it only updates/accesses task group related fields
of the CFS runqueue and its SE.

This prepares calling this function in slightly different contexts.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 3 ++-
kernel/sched/fair.c | 13 +++++--------
kernel/sched/sched.h | 4 ++--
3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c38a54f57e90..48e37c3baed1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6019,7 +6019,8 @@ void __init sched_init(void)
* directly in rq->cfs (i.e root_task_group->se[] = NULL).
*/
init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
- init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
+ root_task_group.cfs_rq[i] = &rq->cfs;
+ init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, rq, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */

rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3de0158729a6..c0dd5825556c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9843,8 +9843,9 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
if (!se)
goto err_free_rq;

+ tg->cfs_rq[i] = cfs_rq;
init_cfs_rq(cfs_rq);
- init_tg_cfs_entry(tg, cfs_rq, se, i, parent->cfs_rq[i]);
+ init_tg_cfs_entry(tg, cfs_rq, se, cpu_rq(i), parent->cfs_rq[i]);
init_entity_runnable_average(se);
}

@@ -9900,17 +9901,13 @@ void unregister_fair_sched_group(struct task_group *tg)
}

void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
- struct sched_entity *se, int cpu,
- struct cfs_rq *parent)
+ struct sched_entity *se, struct rq *rq,
+ struct cfs_rq *parent)
{
- struct rq *rq = cpu_rq(cpu);
-
cfs_rq->tg = tg;
cfs_rq->rq = rq;
- init_cfs_rq_runtime(cfs_rq);
-
- tg->cfs_rq[cpu] = cfs_rq;
cfs_rq->my_se = se;
+ init_cfs_rq_runtime(cfs_rq);

/* se could be NULL for root_task_group */
if (!se)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 926a26d816a2..b8c8dfd0e88d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -427,8 +427,8 @@ extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *pare
extern void online_fair_sched_group(struct task_group *tg);
extern void unregister_fair_sched_group(struct task_group *tg);
extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
- struct sched_entity *se, int cpu,
- struct cfs_rq *parent);
+ struct sched_entity *se, struct rq *rq,
+ struct cfs_rq *parent);
extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);

extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:47:49

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 27/60] cosched: Add some small helper functions for later use

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/sched.h | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d65c98c34c13..456b266b8a2c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1130,6 +1130,40 @@ static inline struct cfs_rq *taskgroup_next_cfsrq(struct task_group *tg,
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_COSCHEDULING
+static inline int node_of(struct rq *rq)
+{
+ return rq->sdrq_data.numa_node;
+}
+
+static inline bool is_cpu_rq(struct rq *rq)
+{
+ return !rq->sdrq_data.level;
+}
+
+static inline struct rq *parent_rq(struct rq *rq)
+{
+ if (!rq->sdrq_data.parent)
+ return NULL;
+ return container_of(rq->sdrq_data.parent, struct rq, sdrq_data);
+}
+#else /* !CONFIG_COSCHEDULING */
+static inline int node_of(struct rq *rq)
+{
+ return cpu_to_node(cpu_of(rq));
+}
+
+static inline bool is_cpu_rq(struct rq *rq)
+{
+ return true;
+}
+
+static inline struct rq *parent_rq(struct rq *rq)
+{
+ return NULL;
+}
+#endif /* !CONFIG_COSCHEDULING */
+
+#ifdef CONFIG_COSCHEDULING
void cosched_init_bottom(void);
void cosched_init_topology(void);
void cosched_init_hierarchy(void);
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:48:22

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 32/60] cosched: Specialize parent_cfs_rq() for hierarchical runqueues

Make parent_cfs_rq() coscheduling-aware.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8504790944bf..8cba7b8fb6bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -284,12 +284,25 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
return grp->my_q;
}

+#ifdef CONFIG_COSCHEDULING
+static inline struct cfs_rq *parent_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ /*
+ * Only return TG parent (if valid), don't switch levels.
+ * Callers stay or expect to stay within the same level.
+ */
+ if (!cfs_rq->sdrq.tg_parent || !cfs_rq->sdrq.is_root)
+ return NULL;
+ return cfs_rq->sdrq.tg_parent->cfs_rq;
+}
+#else /* !CONFIG_COSCHEDULING */
static inline struct cfs_rq *parent_cfs_rq(struct cfs_rq *cfs_rq)
{
if (!cfs_rq->tg->parent)
return NULL;
return cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))];
}
+#endif /* !CONFIG_COSCHEDULING */

static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:48:36

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 36/60] cosched: Use hrq_of() for rq_clock() and rq_clock_task()

We keep rq->clock updated on all hierarchical runqueues and use it
there. In fact, not using the hierarchical runqueue's clock would be
incorrect, as there is no guarantee that the leader's CPU runqueue
clock is updated.

Switch all obvious cases from rq_of() to hrq_of().

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 7 +++++++
kernel/sched/fair.c | 24 ++++++++++++------------
2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4358396f588..a9f5339d58cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -138,6 +138,13 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
#endif
+#ifdef CONFIG_COSCHEDULING
+ /*
+ * FIXME: We don't have IRQ and steal time aggregates on non-CPU
+ * runqueues. The following just accounts for one of the CPUs
+ * instead of all.
+ */
+#endif
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24d01bf8f796..fde1c4ba4bb4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -858,7 +858,7 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_clock_task(rq_of(cfs_rq));
+ u64 now = rq_clock_task(hrq_of(cfs_rq));
u64 delta_exec;

if (unlikely(!curr))
@@ -903,7 +903,7 @@ update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!schedstat_enabled())
return;

- wait_start = rq_clock(rq_of(cfs_rq));
+ wait_start = rq_clock(hrq_of(cfs_rq));
prev_wait_start = schedstat_val(se->statistics.wait_start);

if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
@@ -922,7 +922,7 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!schedstat_enabled())
return;

- delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(se->statistics.wait_start);
+ delta = rq_clock(hrq_of(cfs_rq)) - schedstat_val(se->statistics.wait_start);

if (entity_is_task(se)) {
p = task_of(se);
@@ -961,7 +961,7 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
tsk = task_of(se);

if (sleep_start) {
- u64 delta = rq_clock(rq_of(cfs_rq)) - sleep_start;
+ u64 delta = rq_clock(hrq_of(cfs_rq)) - sleep_start;

if ((s64)delta < 0)
delta = 0;
@@ -978,7 +978,7 @@ update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
}
if (block_start) {
- u64 delta = rq_clock(rq_of(cfs_rq)) - block_start;
+ u64 delta = rq_clock(hrq_of(cfs_rq)) - block_start;

if ((s64)delta < 0)
delta = 0;
@@ -1052,10 +1052,10 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

if (tsk->state & TASK_INTERRUPTIBLE)
__schedstat_set(se->statistics.sleep_start,
- rq_clock(rq_of(cfs_rq)));
+ rq_clock(hrq_of(cfs_rq)));
if (tsk->state & TASK_UNINTERRUPTIBLE)
__schedstat_set(se->statistics.block_start,
- rq_clock(rq_of(cfs_rq)));
+ rq_clock(hrq_of(cfs_rq)));
}
}

@@ -1068,7 +1068,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* We are starting a new run period:
*/
- se->exec_start = rq_clock_task(rq_of(cfs_rq));
+ se->exec_start = rq_clock_task(hrq_of(cfs_rq));
}

/**************************************************
@@ -4253,7 +4253,7 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
if (unlikely(cfs_rq->throttle_count))
return cfs_rq->throttled_clock_task - cfs_rq->throttled_clock_task_time;

- return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
+ return rq_clock_task(hrq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
}

/* returns 0 on failure to allocate runtime */
@@ -4306,7 +4306,7 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

/* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
+ if (likely((s64)(rq_clock(hrq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
return;

if (cfs_rq->runtime_remaining < 0)
@@ -4771,7 +4771,7 @@ static void sync_throttle(struct cfs_rq *cfs_rq)
pcfs_rq = parent_cfs_rq(cfs_rq);

cfs_rq->throttle_count = pcfs_rq->throttle_count;
- cfs_rq->throttled_clock_task = rq_clock_task(rq_of(cfs_rq));
+ cfs_rq->throttled_clock_task = rq_clock_task(hrq_of(cfs_rq));
}

/* conditionally throttle active cfs_rq's from put_prev_entity() */
@@ -4932,7 +4932,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
#else /* CONFIG_CFS_BANDWIDTH */
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
{
- return rq_clock_task(rq_of(cfs_rq));
+ return rq_clock_task(hrq_of(cfs_rq));
}

static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:48:45

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 10/60] sched: Use parent_entity() in more places

Replace open-coded cases of parent_entity() with actual parent_entity()
invocations.

This will make later checks within parent_entity() more useful.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bccd7a66858e..5d6225aedbfe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -263,7 +263,7 @@ static inline struct task_struct *task_of(struct sched_entity *se)

/* Walk up scheduling entities hierarchy */
#define for_each_sched_entity(se) \
- for (; se; se = se->parent)
+ for (; se; se = parent_entity(se))

static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
{
@@ -9653,7 +9653,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
struct cfs_rq *cfs_rq;

/* Start to propagate at parent */
- se = se->parent;
+ se = parent_entity(se);

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:48:46

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 33/60] cosched: Allow resched_curr() to be called for hierarchical runqueues

The function resched_curr() kicks the scheduler for a certain runqueue,
assuming that the runqueue is already locked.

If called for a hierarchical runqueue, the equivalent operation is to
kick the leader. Unfortunately, we don't know whether we also hold
the CPU runqueue lock at this point, which is needed for resched_curr()
to work. Therefore, fall back to resched_cpu_locked() on higher levels.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 337bae6fa836..c4358396f588 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -457,6 +457,11 @@ void resched_curr(struct rq *rq)

lockdep_assert_held(&rq->lock);

+#ifdef CONFIG_COSCHEDULING
+ if (rq->sdrq_data.level)
+ return resched_cpu_locked(rq->sdrq_data.leader);
+#endif
+
if (test_tsk_need_resched(curr))
return;

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:48:55

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 42/60] cosched: Introduce locking for (mostly) enqueuing and dequeuing

With hierarchical runqueues and locks at each level, it is often
necessary to take locks at different levels. Introduce the second of two
locking strategies, which is suitable for progressing upwards through
the hierarchy with minimal impact on lock contention.

During enqueuing and dequeuing, a scheduling entity is recursively
added or removed from the runqueue hierarchy. This is an upwards only
progression through the hierarchy, which does not care about which CPU
is responsible for a certain part of the hierarchy. Hence, it is not
necessary to hold on to a lock of a lower hierarchy level once we moved
on to a higher one.

Introduce rq_chain_init(), rq_chain_lock(), and rq_chain_unlock(), which
implement lock chaining, where the previous lock is only released after
the next one has been acquired, so that concurrent operations cannot
overtake each other. The functions can be used even when parts have
already been locked via rq_lock_owned(), as, for example, dequeuing
might happen during task selection if a runqueue is throttled.
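
The chaining itself is classic hand-over-hand locking. Here is a
stand-alone sketch with plain pthread mutexes (illustrative names, not
the kernel data structures or the exact kernel flow) of walking up a
parent chain without ever being overtaken:

  #include <pthread.h>
  #include <stddef.h>
  #include <stdio.h>

  struct toy_rq {
          pthread_mutex_t lock;
          struct toy_rq *parent;
          int nr_running;
  };

  /* Add one unit of load at every level from @rq up to the root. */
  static void enqueue_up(struct toy_rq *rq)
  {
          struct toy_rq *prev = NULL;

          while (rq) {
                  pthread_mutex_lock(&rq->lock);  /* take next level first */
                  if (prev)                       /* then drop the previous one */
                          pthread_mutex_unlock(&prev->lock);

                  rq->nr_running++;               /* work done under rq->lock */

                  prev = rq;
                  rq = rq->parent;
          }
          if (prev)
                  pthread_mutex_unlock(&prev->lock);
  }

  int main(void)
  {
          struct toy_rq sys  = { PTHREAD_MUTEX_INITIALIZER, NULL,  0 };
          struct toy_rq core = { PTHREAD_MUTEX_INITIALIZER, &sys,  0 };
          struct toy_rq cpu0 = { PTHREAD_MUTEX_INITIALIZER, &core, 0 };

          enqueue_up(&cpu0);
          printf("cpu0=%d core=%d sys=%d\n",
                 cpu0.nr_running, core.nr_running, sys.nr_running);
          return 0;
  }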

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/cosched.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 14 +++++++++++++
2 files changed, 67 insertions(+)

diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index df62ee6d0520..f2d51079b3db 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -608,3 +608,56 @@ void rq_unlock_owned(struct rq *rq, struct rq_owner_flags *orf)
rq_unlock(rq, &rq->sdrq_data.rf);
}
}
+
+void rq_chain_init(struct rq_chain *rc, struct rq *rq)
+{
+ bool parent_locked = rq->sdrq_data.parent_locked;
+
+ WARN_ON_ONCE(!irqs_disabled());
+ lockdep_assert_held(&rq->lock);
+
+ rq = parent_rq(rq);
+ while (parent_locked) {
+ lockdep_assert_held(&rq->lock);
+ parent_locked = rq->sdrq_data.parent_locked;
+ rq = parent_rq(rq);
+ }
+
+ rc->next = rq;
+ rc->curr = NULL;
+}
+
+void rq_chain_unlock(struct rq_chain *rc)
+{
+ if (rc->curr)
+ rq_unlock(rc->curr, &rc->rf);
+}
+
+void rq_chain_lock(struct rq_chain *rc, struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = se->cfs_rq;
+ struct rq *rq = cfs_rq->rq;
+
+ if (!is_sd_se(se) || rc->curr == rq) {
+ lockdep_assert_held(&rq->lock);
+ return;
+ }
+
+ if (rq == rc->next) {
+ struct rq_flags rf = rc->rf;
+
+ /* Get the new lock (and release previous lock afterwards) */
+ rq_lock(rq, &rc->rf);
+
+ if (rc->curr) {
+ lockdep_assert_held(&rc->curr->lock);
+ rq_unlock(rc->curr, &rf);
+ }
+
+ rc->curr = rq;
+ rc->next = parent_rq(rq);
+
+ /* FIXME: Only update clock, when necessary */
+ update_rq_clock(rq);
+ }
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7dba8fdc48c7..48939c8e539d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -513,6 +513,14 @@ struct rq_owner_flags {
#endif
};

+struct rq_chain {
+#ifdef CONFIG_COSCHEDULING
+ struct rq *next;
+ struct rq *curr;
+ struct rq_flags rf;
+#endif
+};
+
#ifdef CONFIG_COSCHEDULING
struct sdrq_data {
/*
@@ -1206,6 +1214,9 @@ void cosched_online_group(struct task_group *tg);
void cosched_offline_group(struct task_group *tg);
struct rq *rq_lock_owned(struct rq *rq, struct rq_owner_flags *orf);
void rq_unlock_owned(struct rq *rq, struct rq_owner_flags *orf);
+void rq_chain_init(struct rq_chain *rc, struct rq *rq);
+void rq_chain_unlock(struct rq_chain *rc);
+void rq_chain_lock(struct rq_chain *rc, struct sched_entity *se);
#else /* !CONFIG_COSCHEDULING */
static inline void cosched_init_bottom(void) { }
static inline void cosched_init_topology(void) { }
@@ -1217,6 +1228,9 @@ static inline void cosched_online_group(struct task_group *tg) { }
static inline void cosched_offline_group(struct task_group *tg) { }
static inline struct rq *rq_lock_owned(struct rq *rq, struct rq_owner_flags *orf) { return rq; }
static inline void rq_unlock_owned(struct rq *rq, struct rq_owner_flags *orf) { }
+static inline void rq_chain_init(struct rq_chain *rc, struct rq *rq) { }
+static inline void rq_chain_unlock(struct rq_chain *rc) { }
+static inline void rq_chain_lock(struct rq_chain *rc, struct sched_entity *se) { }
#endif /* !CONFIG_COSCHEDULING */

#ifdef CONFIG_SCHED_SMT
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:48:57

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 28/60] cosched: Add is_sd_se() to distinguish SD-SEs from TG-SEs

Add a function is_sd_se() to easily distinguish SD-SEs from TG-SEs.

Internally, we distinguish tasks, SD-SEs, and TG-SEs based on the my_q
field. For tasks it is empty, for TG-SEs it is a pointer, and for
SD-SEs it is a magic value.

Also modify propagate_entity_load_avg() to not page fault on SD-SEs.
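
The underlying trick is a sentinel pointer value that encodes a third
state in an existing pointer field. A stand-alone illustration (not
kernel code; names are made up):

  #include <stdint.h>
  #include <stdio.h>

  struct toy_cfs_rq { int nr_running; };

  /* Any non-NULL, never-dereferenced value works as the magic marker. */
  #define INDIRECT_GROUP ((struct toy_cfs_rq *)(uintptr_t)-1)

  struct toy_se { struct toy_cfs_rq *my_q; };

  static int is_task(const struct toy_se *se)  { return se->my_q == NULL; }
  static int is_sd_se(const struct toy_se *se) { return se->my_q == INDIRECT_GROUP; }

  /* group_cfs_rq() must hide the marker, so existing code sees "no queue" */
  static struct toy_cfs_rq *group_cfs_rq(const struct toy_se *se)
  {
          return is_sd_se(se) ? NULL : se->my_q;
  }

  int main(void)
  {
          struct toy_cfs_rq q = { 0 };
          struct toy_se task = { NULL }, tg_se = { &q }, sd_se = { INDIRECT_GROUP };

          printf("task: %d, tg_se is SD-SE: %d, sd_se is SD-SE: %d\n",
                 is_task(&task), is_sd_se(&tg_se), is_sd_se(&sd_se));
          printf("sd_se group queue: %p\n", (void *)group_cfs_rq(&sd_se));
          return 0;
  }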

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 12 +++++-------
kernel/sched/sched.h | 14 ++++++++++++++
2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18b1d81951f1..9cbdd027d449 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -279,6 +279,8 @@ static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
/* runqueue "owned" by this group */
static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
{
+ if (grp->my_q == INDIRECT_GROUP)
+ return NULL;
return grp->my_q;
}

@@ -3240,13 +3242,9 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
/* Update task and its cfs_rq load average */
static inline int propagate_entity_load_avg(struct sched_entity *se)
{
- struct cfs_rq *cfs_rq, *gcfs_rq;
-
- if (entity_is_task(se))
- return 0;
+ struct cfs_rq *cfs_rq, *gcfs_rq = group_cfs_rq(se);

- gcfs_rq = group_cfs_rq(se);
- if (!gcfs_rq->propagate)
+ if (!gcfs_rq || !gcfs_rq->propagate)
return 0;

gcfs_rq->propagate = 0;
@@ -9931,7 +9929,7 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
/* se is NULL for root_task_group */
if (se) {
set_entity_cfs(se, parent);
- se->my_q = cfs_rq;
+ se->my_q = cfs_rq ?: INDIRECT_GROUP;

/* guarantee group entities always have weight */
update_load_set(&se->load, NICE_0_LOAD);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456b266b8a2c..5e2d231b1dbf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1146,6 +1146,13 @@ static inline struct rq *parent_rq(struct rq *rq)
return NULL;
return container_of(rq->sdrq_data.parent, struct rq, sdrq_data);
}
+
+#define INDIRECT_GROUP ERR_PTR(-EREMOTE)
+
+static inline bool is_sd_se(struct sched_entity *se)
+{
+ return se->my_q == INDIRECT_GROUP;
+}
#else /* !CONFIG_COSCHEDULING */
static inline int node_of(struct rq *rq)
{
@@ -1161,6 +1168,13 @@ static inline struct rq *parent_rq(struct rq *rq)
{
return NULL;
}
+
+#define INDIRECT_GROUP NULL
+
+static inline bool is_sd_se(struct sched_entity *se)
+{
+ return false;
+}
#endif /* !CONFIG_COSCHEDULING */

#ifdef CONFIG_COSCHEDULING
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:22

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 58/60] cosched: Switch runqueues between regular scheduling and coscheduling

A regularly scheduled runqueue is enqueued via its TG-SE in its parent
task-group. When coscheduled, it is enqueued via its hierarchical
parent's SD-SE. Switching between the two means replacing one with the
other and taking care to drop all references to the no longer current
SE, which is recorded as the parent SE of various other SEs.

Essentially, this changes the SE-parent path through the task-group and
SD hierarchy, by flipping a part of this path. For example, switching
the runqueue marked with X from !is_root to is_root as part of switching
the child-TG from scheduled==2 to scheduled==1:

Before:
parent-TG
child-TG
,----------------O
System ,O´
/ O O
Core X´ O
/ O O O O
CPU O O O O

CPU0 1 2 3 0 1 2 3

After:
parent-TG
child-TG
,O
System O /
,----------------O´ O
Core X´ O
/ O O O O
CPU O O O O

CPU0 1 2 3 0 1 2 3
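
A minimal sketch of the flip itself (not kernel code; names are
illustrative): whether a runqueue is represented upwards by its own
TG-SE or by its SD parent's SD-SE depends only on how the group's
"scheduled" value compares to the runqueue's hierarchy level:

  #include <stdio.h>

  struct toy_se { const char *name; };

  struct toy_sdrq {
          int level;                      /* 0 == CPU, 1 == core, ... */
          struct toy_se *tg_se;           /* link into the parent task group */
          struct toy_se *parent_sd_se;    /* link into the SD parent's queue */
          struct toy_se *my_se;           /* currently active representation */
  };

  static void update_root(struct toy_sdrq *sdrq, int tg_scheduled)
  {
          int is_root = tg_scheduled <= sdrq->level;

          /* is_root: enqueue via TG-SE; otherwise via the SD parent's SD-SE */
          sdrq->my_se = is_root ? sdrq->tg_se : sdrq->parent_sd_se;
  }

  int main(void)
  {
          struct toy_se tg = { "TG-SE" }, sd = { "SD-SE of the SD parent" };
          struct toy_sdrq core = { .level = 1, .tg_se = &tg, .parent_sd_se = &sd };

          update_root(&core, 2);  /* scheduled==2: core rq is not a root */
          printf("scheduled=2 -> %s\n", core.my_se->name);
          update_root(&core, 1);  /* scheduled==1: core rq becomes a root */
          printf("scheduled=1 -> %s\n", core.my_se->name);
          return 0;
  }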

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/cosched.c | 138 ++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/fair.c | 14 ++++-
kernel/sched/sched.h | 2 +
3 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index 7c8b8c8d2814..eb6a6a61521e 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -515,9 +515,145 @@ void cosched_offline_group(struct task_group *tg)
list_del_rcu(&cfs->sdrq.tg_siblings);
}

+static void update_parent_entities(struct cfs_rq *cfs)
+{
+ struct sched_entity *se = __pick_first_entity(cfs);
+
+ while (se) {
+ set_entity_cfs(se, se->cfs_rq);
+ se = __pick_next_entity(se);
+ }
+
+ if (cfs->curr) {
+ /* curr is not kept within the tree */
+ set_entity_cfs(cfs->curr, cfs->curr->cfs_rq);
+ }
+}
+
+/*
+ * FIXME: We may be missing calls to attach_entity_cfs_rq() & co here
+ * and maybe elsewhere.
+ */
static void sdrq_update_root(struct sdrq *sdrq)
{
- /* TBD */
+ bool is_root, running;
+ struct sdrq *child;
+ struct rq *rq = sdrq->cfs_rq->rq;
+ struct rq *prq = parent_rq(rq);
+ struct rq_flags rf, prf;
+
+ lockdep_assert_held(&sdrq->cfs_rq->tg->lock);
+
+ if (!sdrq->sd_parent) {
+ /* If we are at the top, is_root must always be true */
+ SCHED_WARN_ON(sdrq->is_root != 1);
+ return;
+ }
+
+ is_root = sdrq->cfs_rq->tg->scheduled <= sdrq->data->level;
+
+ /* Exit early, when there is no change */
+ if (is_root == sdrq->is_root)
+ return;
+
+ /* Get proper locks */
+ rq_lock_irqsave(rq, &rf);
+
+ sdrq->is_root = is_root;
+ if (is_root)
+ sdrq->cfs_rq->my_se = sdrq->tg_se;
+ else
+ sdrq->cfs_rq->my_se = sdrq->sd_parent->sd_se;
+
+ /* Update parent entity of SD-SE */
+ if (sdrq->sd_se)
+ set_entity_cfs(sdrq->sd_se, sdrq->cfs_rq);
+
+ /* Update parent entities of TG-SEs of child task groups */
+ rcu_read_lock();
+ list_for_each_entry_rcu(child, &sdrq->tg_children, tg_siblings)
+ set_entity_cfs(child->tg_se, sdrq->cfs_rq);
+ rcu_read_unlock();
+
+ /*
+ * Update parent entities of tasks
+ *
+ * This is complicated by the fact, that there are no per-cpu lists of
+ * tasks. There is the complete list of tasks via do_each_thread/
+ * while_each_thread, but that is too much. Then, there is a list
+ * of all tasks within the current task group via cgroup_iter_start/
+ * cgroup_iter_next/cgroup_iter_end, but that would require additional
+ * filtering for the correct CPU, which is also not nice.
+ *
+ * Therefore, we only go through all currently enqueued tasks, and make
+ * sure to update all non-enqueued tasks during enqueue in
+ * enqueue_task_fair().
+ */
+ update_parent_entities(sdrq->cfs_rq);
+
+ /*
+ * FIXME: update_parent_entities() also updates non-task-SEs.
+ * So we could skip sd_se and tg_se updates, when we also update
+ * them during enqueuing. Not sure about the overhead, though.
+ */
+
+ running = sdrq->cfs_rq->nr_running > 0;
+
+ /* FIXME: Might fire on dynamic reconfigurations with throttling */
+ SCHED_WARN_ON(running && sdrq->cfs_rq->load.weight == 0);
+ SCHED_WARN_ON(!running && sdrq->cfs_rq->load.weight);
+
+ if (is_root) {
+ /* Change from 0 to 1: possibly dequeue sd_se, enqueue tg_se */
+ if (running) {
+ atomic64_sub(sdrq->cfs_rq->load.weight,
+ &sdrq->sd_parent->sdse_load);
+ dequeue_entity_fair(rq, sdrq->sd_parent->sd_se,
+ DEQUEUE_SLEEP,
+ sdrq->cfs_rq->h_nr_running);
+ }
+ if (sdrq->cfs_rq->curr) {
+ rq_lock(prq, &prf);
+ if (sdrq->data->leader == sdrq->sd_parent->data->leader)
+ put_prev_entity_fair(prq, sdrq->sd_parent->sd_se);
+ rq_unlock(prq, &prf);
+ if (sdrq->tg_se)
+ set_curr_entity_fair(rq, sdrq->tg_se);
+ }
+ /*
+ * FIXME: this is probably not enough with nested TGs, as the weights of the
+ * nested TGS could still be zero.
+ */
+ if ((sdrq->cfs_rq->curr || running) && sdrq->tg_se)
+ update_cfs_group(sdrq->tg_se);
+ if (running && sdrq->tg_se)
+ enqueue_entity_fair(rq, sdrq->tg_se,
+ ENQUEUE_WAKEUP,
+ sdrq->cfs_rq->h_nr_running);
+ } else {
+ /* Change from 1 to 0: dequeue tg_se, possibly enqueue sd_se */
+ if (running && sdrq->tg_se)
+ dequeue_entity_fair(rq, sdrq->tg_se, DEQUEUE_SLEEP,
+ sdrq->cfs_rq->h_nr_running);
+ if (sdrq->cfs_rq->curr) {
+ if (sdrq->tg_se)
+ put_prev_entity_fair(rq, sdrq->tg_se);
+ rq_lock(prq, &prf);
+ update_rq_clock(prq);
+ if (sdrq->data->leader == sdrq->sd_parent->data->leader)
+ set_curr_entity_fair(prq, sdrq->sd_parent->sd_se);
+ rq_unlock(prq, &prf);
+ }
+ if (running) {
+ atomic64_add(sdrq->cfs_rq->load.weight,
+ &sdrq->sd_parent->sdse_load);
+ enqueue_entity_fair(rq, sdrq->sd_parent->sd_se,
+ ENQUEUE_WAKEUP,
+ sdrq->cfs_rq->h_nr_running);
+ }
+ }
+
+ rq_unlock_irqrestore(rq, &rf);
}

void cosched_set_scheduled(struct task_group *tg, int level)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c1d9334ea8e..322a84ec9511 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -768,7 +768,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
return rb_entry(left, struct sched_entity, run_node);
}

-static struct sched_entity *__pick_next_entity(struct sched_entity *se)
+struct sched_entity *__pick_next_entity(struct sched_entity *se)
{
struct rb_node *next = rb_next(&se->run_node);

@@ -3145,7 +3145,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
* Recomputes the group entity based on the current state of its group
* runqueue.
*/
-static void update_cfs_group(struct sched_entity *se)
+void update_cfs_group(struct sched_entity *se)
{
struct cfs_rq *gcfs_rq = group_cfs_rq(se);
long shares, runnable;
@@ -5336,6 +5336,16 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
int lcpu = rq->sdrq_data.leader;
#endif

+#ifdef CONFIG_COSCHEDULING
+ /*
+ * Update se->parent, in case sdrq_update_root() was called while
+ * this task was sleeping.
+ *
+ * FIXME: Can this be moved into enqueue_task_fair()?
+ */
+ set_entity_cfs(se, se->cfs_rq);
+#endif
+
rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
rq_chain_lock(&rc, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e257451e05a5..310a706f0361 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -477,6 +477,7 @@ extern void sched_move_task(struct task_struct *tsk);

#ifdef CONFIG_FAIR_GROUP_SCHED
extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
+void update_cfs_group(struct sched_entity *se);

#ifdef CONFIG_SMP
extern void set_task_rq_fair(struct sched_entity *se,
@@ -2453,6 +2454,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)

extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);
+struct sched_entity *__pick_next_entity(struct sched_entity *se);

#ifdef CONFIG_SCHED_DEBUG
extern bool sched_debug_enabled;
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:25

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 18/60] sched: Fix return value of SCHED_WARN_ON()

SCHED_WARN_ON() is conditionally compiled depending on CONFIG_SCHED_DEBUG.
WARN_ON() and variants can be used in if() statements to take an action
in the unlikely case that the WARN_ON condition is true. This is supposed
to work independently of whether the warning is actually printed. However,
without CONFIG_SCHED_DEBUG, SCHED_WARN_ON() evaluates to false
unconditionally.

Change SCHED_WARN_ON() to not discard the WARN_ON condition, even without
CONFIG_SCHED_DEBUG, so that it can be used within if() statements as
expected.
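
A stand-alone illustration of the difference (user-space C relying on
GCC/clang statement expressions; not kernel code): both macros evaluate
the condition, but only the fixed one yields it, so it can drive an
error path even when nothing is printed:

  #include <stdio.h>

  #define WARN_ON_OLD(x)   ({ (void)(x), 0; })
  #define WARN_ON_FIXED(x) ({ !!(x); })

  static int check(int broken)
  {
          if (WARN_ON_OLD(broken))
                  return -1;      /* never taken, even if broken != 0 */
          if (WARN_ON_FIXED(broken))
                  return -1;      /* taken exactly when broken != 0 */
          return 0;
  }

  int main(void)
  {
          printf("broken=1 -> %d\n", check(1));   /* -1, via the fixed macro */
          printf("broken=0 -> %d\n", check(0));   /* 0 */
          return 0;
  }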

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/sched.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9ecbb57049a2..3e0ad36938fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -76,7 +76,7 @@
#ifdef CONFIG_SCHED_DEBUG
# define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
#else
-# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
+# define SCHED_WARN_ON(x) ({ unlikely(!!(x)); })
#endif

struct rq;
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:35

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 23/60] cosched: Add core data structures for coscheduling

For coscheduling, we will set up hierarchical runqueues that correspond
to larger fractions of the system. They will be organized along the
scheduling domains.

Although it is overkill at the moment, we keep a full struct rq per
scheduling domain. The existing code is so used to passing struct rq
around that it would be a large refactoring effort to concentrate the
actually needed fields of struct rq in a smaller structure. Also, we
will probably need more fields in the future.

Extend struct rq and struct cfs_rq with extra structs encapsulating
all purely coscheduling related fields: struct sdrq_data and struct
sdrq, respectively.

Extend struct task_group, so that we can keep track of the hierarchy
and how this task_group should behave. We can now distinguish between
regular task groups and scheduled task groups. The former work as usual,
while the latter actively utilize the hierarchical aspect and represent
SEs of a lower hierarchy level at a higher level within the parent task
group, causing SEs at the lower level to get coscheduled.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/sched.h | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 151 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b36e61914a42..1bce6061ac45 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -368,6 +368,27 @@ struct task_group {
#endif
#endif

+#ifdef CONFIG_COSCHEDULING
+ /*
+ * References the top of this task-group's RQ hierarchy. This is
+ * static and does not change. It is used as entry-point to traverse
+ * the structure on creation/destruction.
+ */
+ struct cfs_rq *top_cfsrq;
+
+ /* Protects .scheduled from concurrent modifications */
+ raw_spinlock_t lock;
+
+ /*
+ * Indicates the level at which this task group is scheduled:
+ * 0 == bottom level == regular task group
+ * >0 == scheduled task group
+ *
+ * Modifications are (for now) requested by the user.
+ */
+ int scheduled;
+#endif
+
#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
struct rt_rq **rt_rq;
@@ -485,6 +506,120 @@ struct rq_flags {
#endif
};

+#ifdef CONFIG_COSCHEDULING
+struct sdrq_data {
+ /*
+ * Leader for this part of the hierarchy.
+ *
+ * The leader CPU is responsible for scheduling decisions and any
+ * required maintenance.
+ *
+ * Leadership is variable and may be taken while the hierarchy level
+ * is locked.
+ */
+ int leader;
+
+ /* Height within the hierarchy: leaf == 0, parent == child + 1 */
+ int level;
+
+ /* Parent runqueue */
+ struct sdrq_data *parent;
+
+ /*
+ * SD-RQ from which SEs get selected.
+ *
+ * This is set by the parent's leader and defines the current
+ * schedulable subset of tasks within this part of the hierarchy.
+ */
+ struct sdrq *current_sdrq;
+
+ /* CPUs making up this part of the hierarchy */
+ const struct cpumask *span;
+
+ /* Number of CPUs within this part of the hierarchy */
+ unsigned int span_weight;
+
+ /*
+ * Determines if the corresponding SD-RQs are to be allocated on
+ * a specific NUMA node.
+ */
+ int numa_node;
+
+ /* Storage for rq_flags, when we need to lock multiple runqueues. */
+ struct rq_flags rf;
+
+ /* Do we have the parent runqueue locked? */
+ bool parent_locked;
+
+ /*
+ * In case the CPU has been forced into idle, the idle_se references the
+ * scheduling entity responsible for this. Only used on bottom level at
+ * the moment.
+ */
+ struct sched_entity *idle_se;
+};
+
+struct sdrq {
+ /* Common information for all SD-RQs at the same position */
+ struct sdrq_data *data;
+
+ /* SD hierarchy */
+ struct sdrq *sd_parent; /* parent of this node */
+ struct list_head children; /* children of this node */
+ struct list_head siblings; /* link to parent's children list */
+
+ /*
+ * is_root == 1 => link via tg_se into tg_parent->cfs_rq
+ * is_root == 0 => link via sd_parent->sd_se into sd_parent->cfs_rq
+ */
+ int is_root;
+
+ /*
+ * SD-SE: an SE to be enqueued in .cfs_rq to represent this
+ * node's children in order to make their members schedulable.
+ *
+ * In the bottom layer .sd_se has to be NULL for various if-conditions
+ * and loop terminations. On other layers .sd_se points to .__sd_se.
+ *
+ * .__sd_se is unused within the bottom layer.
+ */
+ struct sched_entity *sd_se;
+ struct sched_entity __sd_se;
+
+ /* Accumulated load of all SD-children */
+ atomic64_t sdse_load;
+
+ /*
+ * Reference to the SD-runqueue at the same hierarchical position
+ * in the parent task group.
+ */
+ struct sdrq *tg_parent;
+ struct list_head tg_children; /* child TGs of this node */
+ struct list_head tg_siblings; /* link to parent's children list */
+
+ /*
+ * TG-SE: a SE to be enqueued in .tg_parent->cfs_rq.
+ *
+ * In the case of a regular TG it is enqueued if .cfs_rq is not empty.
+ * In the case of a scheduled TG it is enqueued if .cfs_rq is not empty
+ * and this SD-RQ acts as a root SD within its TG.
+ *
+ * .tg_se takes over the role of .cfs_rq->my_se and points to the same
+ * SE over its life-time, while .cfs_rq->my_se now points to either the
+ * TG-SE or the SD-SE (or NULL in the parts of the root task group).
+ */
+ struct sched_entity *tg_se;
+
+ /*
+ * CFS runqueue of this SD runqueue.
+ *
+ * FIXME: Now that struct sdrq is embedded in struct cfs_rq, we could
+ * drop this.
+ */
+ struct cfs_rq *cfs_rq;
+};
+#endif /* CONFIG_COSCHEDULING */
+
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
@@ -544,6 +679,12 @@ struct cfs_rq {
u64 last_h_load_update;
struct sched_entity *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#ifdef CONFIG_COSCHEDULING
+ /* Extra info needed for hierarchical scheduling */
+ struct sdrq sdrq;
+#endif
+
#endif /* CONFIG_SMP */

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -817,6 +958,11 @@ struct rq {
struct rt_rq rt;
struct dl_rq dl;

+#ifdef CONFIG_COSCHEDULING
+ /* Extra information for hierarchical scheduling */
+ struct sdrq_data sdrq_data;
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
struct list_head leaf_cfs_rq_list;
@@ -935,6 +1081,11 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+
+#ifdef CONFIG_COSCHEDULING
+ /* Top level runqueue for this sched_domain */
+ struct rq rq;
+#endif
};

static inline int cpu_of(struct rq *rq)
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:40

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 12/60] locking/lockdep: Make cookie generator accessible

If a dynamic number of locks needs to be pinned in the same context,
it is impractical to have a cookie per lock. Make the cookie generator
accessible, so that such a group of locks can be (re-)pinned with
just one (shared) cookie.
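
The idea in a stand-alone sketch (not kernel code; simplified pin
counters instead of lockdep state): one cookie is generated once and
then used to pin and unpin an arbitrary group of locks:

  #include <stdio.h>
  #include <stdlib.h>

  struct toy_lock { unsigned int pin_count; };

  static unsigned int toy_cookie(void)
  {
          /* 16 bits of randomness, never zero -- mirrors the generator */
          return 1 + (rand() & 0xffff);
  }

  static void pin(struct toy_lock *lock, unsigned int cookie)
  {
          lock->pin_count += cookie;
  }

  static void unpin(struct toy_lock *lock, unsigned int cookie)
  {
          if (lock->pin_count < cookie)
                  fprintf(stderr, "unpin with wrong cookie\n");
          lock->pin_count -= cookie;
  }

  int main(void)
  {
          struct toy_lock locks[3] = { { 0 }, { 0 }, { 0 } };
          unsigned int cookie = toy_cookie();     /* one cookie ... */
          int i;

          for (i = 0; i < 3; i++)                 /* ... for the whole group */
                  pin(&locks[i], cookie);
          for (i = 0; i < 3; i++)
                  unpin(&locks[i], cookie);
          return 0;
  }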

Signed-off-by: Jan H. Schönherr <[email protected]>
---
include/linux/lockdep.h | 2 ++
kernel/locking/lockdep.c | 21 +++++++++++++++------
2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index d93a5ec3077f..06aee3386071 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -363,6 +363,7 @@ struct pin_cookie { unsigned int val; };

#define NIL_COOKIE (struct pin_cookie){ .val = 0U, }

+struct pin_cookie lockdep_cookie(void);
extern struct pin_cookie lock_pin_lock(struct lockdep_map *lock);
extern void lock_repin_lock(struct lockdep_map *lock, struct pin_cookie);
extern void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie);
@@ -452,6 +453,7 @@ struct pin_cookie { };

#define NIL_COOKIE (struct pin_cookie){ }

+#define lockdep_cookie() (NIL_COOKIE)
#define lockdep_pin_lock(l) ({ struct pin_cookie cookie; cookie; })
#define lockdep_repin_lock(l, c) do { (void)(l); (void)(c); } while (0)
#define lockdep_unpin_lock(l, c) do { (void)(l); (void)(c); } while (0)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e406c5fdb41e..76aae27e1736 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3729,6 +3729,20 @@ static int __lock_is_held(const struct lockdep_map *lock, int read)
return 0;
}

+struct pin_cookie lockdep_cookie(void)
+{
+ struct pin_cookie cookie;
+
+ /*
+ * Grab 16bits of randomness; this is sufficient to not
+ * be guessable and still allows some pin nesting in
+ * our u32 pin_count.
+ */
+ cookie.val = 1 + (prandom_u32() >> 16);
+
+ return cookie;
+}
+
static struct pin_cookie __lock_pin_lock(struct lockdep_map *lock)
{
struct pin_cookie cookie = NIL_COOKIE;
@@ -3742,12 +3756,7 @@ static struct pin_cookie __lock_pin_lock(struct lockdep_map *lock)
struct held_lock *hlock = curr->held_locks + i;

if (match_held_lock(hlock, lock)) {
- /*
- * Grab 16bits of randomness; this is sufficient to not
- * be guessable and still allows some pin nesting in
- * our u32 pin_count.
- */
- cookie.val = 1 + (prandom_u32() >> 16);
+ cookie = lockdep_cookie();
hlock->pin_count += cookie.val;
return cookie;
}
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:41

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 46/60] cosched: Warn on throttling attempts of non-CPU runqueues

Initially, coscheduling won't support throttling of CFS runqueues that
are not at CPU level. Print a warning to remind us of this fact, and note
down everything currently known to break if we wanted to throttle
higher-level CFS runqueues (which would totally make sense
from a coscheduling perspective).

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0bba924b40ba..2aa3a60dfca5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4493,12 +4493,25 @@ static int tg_throttle_down(struct task_group *tg, void *data)

static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
- struct rq *rq = rq_of(cfs_rq);
+ struct rq *rq = hrq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
long task_delta, dequeue = 1;
bool empty;

+ /*
+ * FIXME: We can only handle CPU runqueues at the moment.
+ *
+ * rq->nr_running adjustments are incorrect for higher levels,
+ * as well as tg_throttle_down/up() functionality. Also
+ * update_runtime_enabled(), unthrottle_offline_cfs_rqs()
+ * have not been adjusted (used for CPU hotplugging).
+ *
+ * Ideally, we would apply throttling only to is_root runqueues,
+ * instead of the bottom level.
+ */
+ SCHED_WARN_ON(!is_cpu_rq(rq));
+
se = cfs_rq->my_se;

/* freeze hierarchy runnable averages while throttled */
@@ -4547,12 +4560,14 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)

void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
- struct rq *rq = rq_of(cfs_rq);
+ struct rq *rq = hrq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
int enqueue = 1;
long task_delta;

+ SCHED_WARN_ON(!is_cpu_rq(rq));
+
se = cfs_rq->my_se;

cfs_rq->throttled = 0;
@@ -5171,6 +5186,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

throttled = enqueue_entity_fair(rq, &p->se, flags, 1);

+ /* FIXME: assumes that only bottom-level runqueues get throttled */
if (!throttled)
add_nr_running(rq, 1);

@@ -5237,6 +5253,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
bool throttled = dequeue_entity_fair(rq, &p->se, flags, 1);

+ /* FIXME: assumes that only bottom-level runqueues get throttled */
if (!throttled)
sub_nr_running(rq, 1);

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:41

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 48/60] cosched: Adjust SE traversal and locking for yielding and buddies

Buddies are not very well defined with coscheduling. Usually, they
bubble up the hierarchy on a single CPU to steer task picking either
away from a certain task (yield a task: skip buddy) or towards a certain
task (yield to a task, execute a woken task: next buddy; execute a
recently preempted task: last buddy).

If we still allow buddies to bubble up the full hierarchy with
coscheduling, then for example yielding a task would always yield the
coscheduled set of tasks it is part of. If we keep effects constrained
to a coscheduled set, then one set could never preempt another set.

For now, we limit buddy activities to the scope of the leader that
does the activity with an exception for preemption, which may operate
in the scope of a different leader. That makes yielding behavior
potentially weird and asymmetric for the time being, but it seems to
work well for preemption.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 42 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2227e4840355..6d64f4478fda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3962,7 +3962,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

static void __clear_buddies_last(struct sched_entity *se)
{
- for_each_sched_entity(se) {
+ for_each_owned_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
if (cfs_rq->last != se)
break;
@@ -3973,7 +3973,7 @@ static void __clear_buddies_last(struct sched_entity *se)

static void __clear_buddies_next(struct sched_entity *se)
{
- for_each_sched_entity(se) {
+ for_each_owned_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
if (cfs_rq->next != se)
break;
@@ -3984,7 +3984,7 @@ static void __clear_buddies_next(struct sched_entity *se)

static void __clear_buddies_skip(struct sched_entity *se)
{
- for_each_sched_entity(se) {
+ for_each_owned_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
if (cfs_rq->skip != se)
break;
@@ -4005,6 +4005,18 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
__clear_buddies_skip(se);
}

+static void clear_buddies_lock(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ struct rq_owner_flags orf;
+
+ if (cfs_rq->last != se && cfs_rq->next != se && cfs_rq->skip != se)
+ return;
+
+ rq_lock_owned(hrq_of(cfs_rq), &orf);
+ clear_buddies(cfs_rq, se);
+ rq_unlock_owned(hrq_of(cfs_rq), &orf);
+}
+
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);

static void
@@ -4028,7 +4040,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

update_stats_dequeue(cfs_rq, se, flags);

- clear_buddies(cfs_rq, se);
+ clear_buddies_lock(cfs_rq, se);

if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
@@ -6547,31 +6559,45 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)

static void set_last_buddy(struct sched_entity *se)
{
+ struct rq_owner_flags orf;
+ struct rq *rq;
+
if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
return;

- for_each_sched_entity(se) {
+ rq = hrq_of(cfs_rq_of(se));
+
+ rq_lock_owned(rq, &orf);
+ for_each_owned_sched_entity(se) {
if (SCHED_WARN_ON(!se->on_rq))
- return;
+ break;
cfs_rq_of(se)->last = se;
}
+ rq_unlock_owned(rq, &orf);
}

static void set_next_buddy(struct sched_entity *se)
{
+ struct rq_owner_flags orf;
+ struct rq *rq;
+
if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
return;

- for_each_sched_entity(se) {
+ rq = hrq_of(cfs_rq_of(se));
+
+ rq_lock_owned(rq, &orf);
+ for_each_owned_sched_entity(se) {
if (SCHED_WARN_ON(!se->on_rq))
- return;
+ break;
cfs_rq_of(se)->next = se;
}
+ rq_unlock_owned(rq, &orf);
}

static void set_skip_buddy(struct sched_entity *se)
{
- for_each_sched_entity(se)
+ for_each_owned_sched_entity(se)
cfs_rq_of(se)->skip = se;
}

@@ -6831,6 +6857,7 @@ static void yield_task_fair(struct rq *rq)
struct task_struct *curr = rq->curr;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
struct sched_entity *se = &curr->se;
+ struct rq_owner_flags orf;

/*
* Are we the only task in the tree?
@@ -6838,6 +6865,7 @@ static void yield_task_fair(struct rq *rq)
if (unlikely(rq->nr_running == 1))
return;

+ rq_lock_owned(rq, &orf);
clear_buddies(cfs_rq, se);

if (curr->policy != SCHED_BATCH) {
@@ -6855,21 +6883,26 @@ static void yield_task_fair(struct rq *rq)
}

set_skip_buddy(se);
+ rq_unlock_owned(rq, &orf);
}

static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
{
struct sched_entity *se = &p->se;
+ struct rq_owner_flags orf;

/* throttled hierarchies are not runnable */
if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
return false;

+ rq_lock_owned(rq, &orf);
+
/* Tell the scheduler that we'd really like pse to run next. */
set_next_buddy(se);

yield_task_fair(rq);

+ rq_unlock_owned(rq, &orf);
return true;
}

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:42

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 29/60] cosched: Adjust code reflecting on the total number of CFS tasks on a CPU

There are a few places that make decisions based on the total number
of CFS tasks on a certain CPU. With coscheduling, the inspected value
rq->cfs.h_nr_running does not contain all tasks anymore, as some are
accounted on higher hierarchy levels instead. This would lead to
incorrect conclusions, as the system would seem more idle than it actually is.

Adjust these code paths to use an alternative way of deriving the same
value: take the total number of tasks on the runqueue and subtract all
running tasks of the other scheduling classes. What remains are the CFS
tasks on that CPU.
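
For example (numbers purely illustrative): with rq->nr_running == 10,
one queued deadline task, two queued real-time tasks, and a queued stop
task, nr_cfs_tasks() yields 10 - 1 - 2 - 1 = 6 CFS tasks.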

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 5 ++---
kernel/sched/fair.c | 11 +++++------
kernel/sched/sched.h | 21 +++++++++++++++++++++
3 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5350cab7ac4a..337bae6fa836 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3326,12 +3326,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
/*
* Optimization: we know that if all tasks are in the fair class we can
* call that function directly, but only if the @prev task wasn't of a
- * higher scheduling class, because otherwise those loose the
+ * higher scheduling class, because otherwise those lose the
* opportunity to pull in more work from other CPUs.
*/
if (likely((prev->sched_class == &idle_sched_class ||
prev->sched_class == &fair_sched_class) &&
- rq->nr_running == rq->cfs.h_nr_running)) {
+ rq->nr_running == nr_cfs_tasks(rq))) {

p = fair_sched_class.pick_next_task(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
@@ -3343,7 +3343,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

return p;
}
-
again:
for_each_class(class) {
p = class->pick_next_task(rq, prev, rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9cbdd027d449..30e5ff30f442 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4460,7 +4460,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
add_nr_running(rq, task_delta);

/* Determine whether we need to wake up potentially idle CPU: */
- if (rq->curr == rq->idle && rq->cfs.nr_running)
+ if (rq->curr == rq->idle && nr_cfs_tasks(rq))
resched_curr(rq);
}

@@ -4937,7 +4937,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)

SCHED_WARN_ON(task_rq(p) != rq);

- if (rq->cfs.h_nr_running > 1) {
+ if (nr_cfs_tasks(rq) > 1) {
u64 slice = sched_slice(cfs_rq, se);
u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
s64 delta = slice - ran;
@@ -9067,8 +9067,7 @@ static void nohz_balancer_kick(struct rq *rq)

sd = rcu_dereference(rq->sd);
if (sd) {
- if ((rq->cfs.h_nr_running >= 1) &&
- check_cpu_capacity(rq, sd)) {
+ if ((nr_cfs_tasks(rq) >= 1) && check_cpu_capacity(rq, sd)) {
flags = NOHZ_KICK_MASK;
goto unlock;
}
@@ -9479,7 +9478,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
* have been enqueued in the meantime. Since we're not going idle,
* pretend we pulled a task.
*/
- if (this_rq->cfs.h_nr_running && !pulled_task)
+ if (nr_cfs_tasks(this_rq) && !pulled_task)
pulled_task = 1;

/* Move the next balance forward */
@@ -9487,7 +9486,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
this_rq->next_balance = next_balance;

/* Is there a task of a high priority class? */
- if (this_rq->nr_running != this_rq->cfs.h_nr_running)
+ if (this_rq->nr_running != nr_cfs_tasks(this_rq))
pulled_task = -1;

if (pulled_task)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5e2d231b1dbf..594eb9489f3d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1691,6 +1691,27 @@ static inline int task_on_rq_migrating(struct task_struct *p)
return p->on_rq == TASK_ON_RQ_MIGRATING;
}

+#ifdef CONFIG_COSCHEDULING
+static inline unsigned int nr_cfs_tasks(struct rq *rq)
+{
+ unsigned int total = rq->nr_running;
+
+ /* Deadline and real time tasks */
+ total -= rq->dl.dl_nr_running + rq->rt.rt_nr_running;
+
+ /* Stop task */
+ if (rq->stop && task_on_rq_queued(rq->stop))
+ total--;
+
+ return total;
+}
+#else
+static inline unsigned int nr_cfs_tasks(struct rq *rq)
+{
+ return rq->cfs.h_nr_running;
+}
+#endif
+
/*
* wake flags
*/
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:46

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 03/60] sched: Setup sched_domain_shared for all sched_domains

Relax the restriction of setting up a sched_domain_shared only for domains
with SD_SHARE_PKG_RESOURCES. Set it up for every domain.

This restriction was imposed since the struct was created via commit
24fc7edb92ee ("sched/core: Introduce 'struct sched_domain_shared'") for
the lack of another use case. This will change soon.

Also, move the structure definition below kernel/sched/. It is not used
outside and in the future it will carry some more internal types that
we don't want to expose.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
include/linux/sched/topology.h | 6 ------
kernel/sched/sched.h | 6 ++++++
kernel/sched/topology.c | 12 ++++--------
3 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..530ad856372e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -68,12 +68,6 @@ extern int sched_domain_level_max;

struct sched_group;

-struct sched_domain_shared {
- atomic_t ref;
- atomic_t nr_busy_cpus;
- int has_idle_cores;
-};
-
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b4d0e8a68697..f6da85447f3c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -915,6 +915,12 @@ struct rq {
#endif
};

+struct sched_domain_shared {
+ atomic_t ref;
+ atomic_t nr_busy_cpus;
+ int has_idle_cores;
+};
+
static inline int cpu_of(struct rq *rq)
{
#ifdef CONFIG_SMP
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 56a0fed30c0a..8b64f3f57d50 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1184,15 +1184,11 @@ sd_init(struct sched_domain_topology_level *tl,
sd->idle_idx = 1;
}

- /*
- * For all levels sharing cache; connect a sched_domain_shared
- * instance.
- */
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
+ /* Setup a shared sched_domain_shared instance */
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+ if (sd->flags & SD_SHARE_PKG_RESOURCES)
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }

sd->private = sdd;

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:49

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 60/60] cosched: Add command line argument to enable coscheduling

Add a new command line argument cosched_max_level=<n>, which allows
enabling coscheduling at boot. The number corresponds to the scheduling
domain up to which coscheduling can later be enabled for cgroups.

For example, to enable coscheduling of cgroups at SMT level, one would
specify cosched_max_level=1.

The use of symbolic names (like off, core, socket, system) is currently
not possible, but could be added. However, to force coscheduling up to
system level without knowing the scheduling domain topology in advance,
it is possible to simply specify an overly large number; it will be
clamped transparently to system level.
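
For example (purely illustrative numbers), on a machine whose topology
results in four coscheduling levels (0-3), booting with
cosched_max_level=100 is transparently clamped by cosched_init_topology()
to cosched_max_level=3, i.e., system level.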

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/cosched.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index eb6a6a61521e..a1f0d3a7b02a 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -162,6 +162,29 @@ static int __init cosched_split_domains_setup(char *str)

early_param("cosched_split_domains", cosched_split_domains_setup);

+static int __read_mostly cosched_max_level;
+
+static __init int cosched_max_level_setup(char *str)
+{
+ int val, ret;
+
+ ret = kstrtoint(str, 10, &val);
+ if (ret)
+ return ret;
+ if (val < 0)
+ val = 0;
+
+ /*
+ * Note that we cannot validate the upper bound here, as we do not
+ * know it yet. That will happen in cosched_init_topology().
+ */
+
+ cosched_max_level = val;
+ return 0;
+}
+
+early_param("cosched_max_level", cosched_max_level_setup);
+
struct sd_sdrqmask_level {
int groups;
struct cpumask **masks;
@@ -407,6 +430,10 @@ void cosched_init_topology(void)

/* Make permanent */
set_sched_topology(tl);
+
+ /* Adjust user preference */
+ if (cosched_max_level >= levels)
+ cosched_max_level = levels - 1;
}

/*
@@ -419,7 +446,7 @@ void cosched_init_topology(void)
*
* We can do this without any locks, as nothing will automatically traverse into
* these data structures. This requires an update of the sdrq.is_root property,
- * which will happen only later.
+ * which will happen only after everything has been set up at the very end.
*/
void cosched_init_hierarchy(void)
{
@@ -483,6 +510,9 @@ void cosched_init_hierarchy(void)
sdrq->sd_parent = &sd->shared->rq.cfs.sdrq;
list_add_tail(&sdrq->siblings, &sdrq->sd_parent->children);
}
+
+ /* Activate the hierarchy according to user preferences */
+ cosched_set_scheduled(&root_task_group, cosched_max_level);
}

/*****************************************************************************
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:49:53

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 17/60] sched: Introduce and use generic task group CFS traversal functions

Task group management has to iterate over all CFS runqueues within the
task group. Currently, this uses for_each_possible_cpu() loops and
accesses tg->cfs_rq[] directly. This does not extend well to the
upcoming addition of coscheduling, where we will have additional CFS
runqueues.

Introduce more general traversal loop constructs, which will extend
nicely to coscheduling. Rewrite task group management functions to
make use of these new loop constructs. Except for the function
alloc_fair_sched_group(), the changes are mostly cosmetic.
alloc_fair_sched_group() now iterates over the parent group to
create a new group.
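
For illustration (not part of this patch; the loop body is made up), the
new constructs are used like this:

	struct cfs_rq *cfs;

	taskgroup_for_each_cfsrq(tg, cfs)
		pr_debug("tg has a cfs_rq on CPU %d\n", cpu_of(cfs->rq));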

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 64 +++++++++++++++++++++++-----------------------------
kernel/sched/sched.h | 31 +++++++++++++++++++++++++
2 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 82cdd75e88b9..9f63ac37f5ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9805,16 +9805,13 @@ static void task_change_group_fair(struct task_struct *p, int type)

void free_fair_sched_group(struct task_group *tg)
{
- int i;
+ struct cfs_rq *cfs, *ncfs;

destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));

- if (!tg->cfs_rq)
- return;
-
- for_each_possible_cpu(i) {
- kfree(tg->cfs_rq[i]->my_se);
- kfree(tg->cfs_rq[i]);
+ taskgroup_for_each_cfsrq_safe(tg, cfs, ncfs) {
+ kfree(cfs->my_se);
+ kfree(cfs);
}

kfree(tg->cfs_rq);
@@ -9823,8 +9820,7 @@ void free_fair_sched_group(struct task_group *tg)
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
struct sched_entity *se;
- struct cfs_rq *cfs_rq;
- int i;
+ struct cfs_rq *cfs_rq, *pcfs_rq;

tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
if (!tg->cfs_rq)
@@ -9834,26 +9830,25 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

init_cfs_bandwidth(tg_cfs_bandwidth(tg));

- for_each_possible_cpu(i) {
- cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
- GFP_KERNEL, cpu_to_node(i));
- if (!cfs_rq)
- goto err;
+ taskgroup_for_each_cfsrq(parent, pcfs_rq) {
+ struct rq *rq = rq_of(pcfs_rq);
+ int node = cpu_to_node(cpu_of(rq));

- se = kzalloc_node(sizeof(struct sched_entity),
- GFP_KERNEL, cpu_to_node(i));
- if (!se)
- goto err_free_rq;
+ cfs_rq = kzalloc_node(sizeof(*cfs_rq), GFP_KERNEL, node);
+ se = kzalloc_node(sizeof(*se), GFP_KERNEL, node);
+ if (!cfs_rq || !se)
+ goto err_free;

- tg->cfs_rq[i] = cfs_rq;
+ tg->cfs_rq[cpu_of(rq)] = cfs_rq;
init_cfs_rq(cfs_rq);
- init_tg_cfs_entry(tg, cfs_rq, se, cpu_rq(i), parent->cfs_rq[i]);
+ init_tg_cfs_entry(tg, cfs_rq, se, rq, pcfs_rq);
}

return 1;

-err_free_rq:
+err_free:
kfree(cfs_rq);
+ kfree(se);
err:
return 0;
}
@@ -9861,17 +9856,17 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
void online_fair_sched_group(struct task_group *tg)
{
struct sched_entity *se;
+ struct cfs_rq *cfs;
struct rq *rq;
- int i;

- for_each_possible_cpu(i) {
- rq = cpu_rq(i);
- se = tg->cfs_rq[i]->my_se;
+ taskgroup_for_each_cfsrq(tg, cfs) {
+ rq = rq_of(cfs);
+ se = cfs->my_se;

raw_spin_lock_irq(&rq->lock);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
- sync_throttle(tg->cfs_rq[i]);
+ sync_throttle(cfs);
raw_spin_unlock_irq(&rq->lock);
}
}
@@ -9879,24 +9874,21 @@ void online_fair_sched_group(struct task_group *tg)
void unregister_fair_sched_group(struct task_group *tg)
{
unsigned long flags;
- struct rq *rq;
- int cpu;
+ struct cfs_rq *cfs;

- for_each_possible_cpu(cpu) {
- remove_entity_load_avg(tg->cfs_rq[cpu]->my_se);
+ taskgroup_for_each_cfsrq(tg, cfs) {
+ remove_entity_load_avg(cfs->my_se);

/*
* Only empty task groups can be destroyed; so we can speculatively
* check on_list without danger of it being re-added.
*/
- if (!tg->cfs_rq[cpu]->on_list)
+ if (!cfs->on_list)
continue;

- rq = cpu_rq(cpu);
-
- raw_spin_lock_irqsave(&rq->lock, flags);
- list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_lock_irqsave(&rq_of(cfs)->lock, flags);
+ list_del_leaf_cfs_rq(cfs);
+ raw_spin_unlock_irqrestore(&rq_of(cfs)->lock, flags);
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cd3a32ce8fc6..9ecbb57049a2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -946,6 +946,37 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define taskgroup_for_each_cfsrq(tg, cfs) \
+ for ((cfs) = taskgroup_first_cfsrq(tg); (cfs); \
+ (cfs) = taskgroup_next_cfsrq(tg, cfs))
+
+#define taskgroup_for_each_cfsrq_safe(tg, cfs, ncfs) \
+ for ((cfs) = taskgroup_first_cfsrq(tg), \
+ (ncfs) = (cfs) ? taskgroup_next_cfsrq(tg, cfs) : NULL; \
+ (cfs); \
+ (cfs) = (ncfs), \
+ (ncfs) = (cfs) ? taskgroup_next_cfsrq(tg, cfs) : NULL)
+
+static inline struct cfs_rq *taskgroup_first_cfsrq(struct task_group *tg)
+{
+ int cpu = cpumask_first(cpu_possible_mask);
+
+ if (!tg->cfs_rq)
+ return NULL;
+ return tg->cfs_rq[cpu];
+}
+
+static inline struct cfs_rq *taskgroup_next_cfsrq(struct task_group *tg,
+ struct cfs_rq *cfs)
+{
+ int cpu = cpumask_next(cpu_of(cfs->rq), cpu_possible_mask);
+
+ if (cpu >= nr_cpu_ids)
+ return NULL;
+ return tg->cfs_rq[cpu];
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_SCHED_SMT

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:50:03

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 56/60] cosched: Adjust wakeup preemption rules for coscheduling

Modify check_preempt_wakeup() to work correctly with coscheduled sets.

On the one hand, that means not blindly preempting when the woken
task potentially belongs to a different set and we're not allowed to
switch sets. Instead, we have to notify the correct leader to follow up.

On the other hand, we need to handle additional idle cases, as CPUs
are now idle *within* certain coscheduled sets and woken tasks may
not preempt the idle task blindly anymore.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07fd9dd5561d..0c1d9334ea8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6882,6 +6882,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
int next_buddy_marked = 0;
struct cfs_rq *cfs_rq;
int scale;
+#ifdef CONFIG_COSCHEDULING
+ struct rq_flags rf;
+#endif

/* FIXME: locking may be off after fetching the idle_se */
if (cosched_is_idle(rq, curr))
@@ -6908,6 +6911,13 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
}

/*
+ * FIXME: Check whether this can be re-enabled with coscheduling
+ *
+ * We might want to send a reschedule IPI to the leader, which is only
+ * checked further below.
+ */
+#ifndef CONFIG_COSCHEDULING
+ /*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
*
@@ -6919,11 +6929,22 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
*/
if (test_tsk_need_resched(curr))
return;
+#endif

+ /*
+ * FIXME: Check whether this can be re-enabled with coscheduling
+ *
+ * curr and p could belong to different coscheduled sets, in which
+ * case the decision is not straightforward. Additionally, the
+ * preemption code needs to know the CPU it has to send an IPI to,
+ * which is not yet known here.
+ */
+#ifndef CONFIG_COSCHEDULING
/* Idle tasks are by definition preempted by non-idle tasks. */
if (unlikely(curr->policy == SCHED_IDLE) &&
likely(p->policy != SCHED_IDLE))
goto preempt;
+#endif

/*
* Batch and idle tasks do not preempt non-idle tasks (their preemption
@@ -6932,7 +6953,55 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;

+ /*
+ * FIXME: find_matching_se() might end up at SEs where a different CPU
+ * is leader. While we do get locks *afterwards* the question is
+ * whether anything bad can happen due to the lock-free traversal.
+ */
find_matching_se(&se, &pse);
+
+#ifdef CONFIG_COSCHEDULING
+ if (se == pse) {
+ /*
+ * There is nothing to do on this CPU within the current
+ * coscheduled set and the newly woken task belongs to this
+ * coscheduled set. Hence, it is a welcome distraction.
+ *
+ * [find_matching_se() walks up the hierarchy for se and pse
+ * until they are within the same CFS runqueue. As equality
+ * was eliminated at the beginning, equality now means that
+ * se was rq->idle_se from the start and pse approached it
+ * from within a child runqueue.]
+ */
+ SCHED_WARN_ON(!cosched_is_idle(rq, curr));
+ SCHED_WARN_ON(cosched_get_idle_se(rq) != se);
+ goto preempt;
+ }
+
+ if (hrq_of(cfs_rq_of(se))->sdrq_data.level) {
+ rq_lock(hrq_of(cfs_rq_of(se)), &rf);
+ update_rq_clock(hrq_of(cfs_rq_of(se)));
+ }
+
+ if (!cfs_rq_of(se)->curr) {
+ /*
+ * There is nothing to do at a higher level within the current
+ * coscheduled set and the newly woken task belongs to a
+ * different coscheduled set. Hence, it is a welcome
+ * distraction for the leader of that higher level.
+ *
+ * [If a leader does not find a SE in its top_cfs_rq, it does
+ * not record anything as current. Still, it tells its
+ * children within which coscheduled set they are idle.
+ * find_matching_se() now ended at such an idle leader. As
+ * we checked for se==pse earlier, we cannot be this leader.]
+ */
+ SCHED_WARN_ON(leader_of(se) == cpu_of(rq));
+ resched_cpu_locked(leader_of(se));
+ goto out;
+ }
+#endif
+
update_curr(cfs_rq_of(se));
BUG_ON(!pse);
if (wakeup_preempt_entity(se, pse) == 1) {
@@ -6942,18 +7011,30 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
*/
if (!next_buddy_marked)
set_next_buddy(pse);
+#ifdef CONFIG_COSCHEDULING
+ if (leader_of(se) != cpu_of(rq)) {
+ resched_cpu_locked(leader_of(se));
+ goto out;
+ }
+ if (hrq_of(cfs_rq_of(se))->sdrq_data.level)
+ rq_unlock(hrq_of(cfs_rq_of(se)), &rf);
+#endif
goto preempt;
}

+#ifdef CONFIG_COSCHEDULING
+out:
+ if (hrq_of(cfs_rq_of(se))->sdrq_data.level)
+ rq_unlock(hrq_of(cfs_rq_of(se)), &rf);
+#endif
return;
-
preempt:
resched_curr(rq);
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
* with schedule on the ->pre_schedule() or idle_balance()
- * point, either of which can * drop the rq lock.
+ * point, either of which can drop the rq lock.
*
* Also, during early boot the idle thread is in the fair class,
* for obvious reasons its a bad idea to schedule back to it.
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:50:07

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 59/60] cosched: Handle non-atomicity during switches to and from coscheduling

We cannot switch a task group from regular scheduling to coscheduling
atomically, as it would require locking the whole system. Instead,
the switch is done runqueue by runqueue via cosched_set_scheduled().

This means that other CPUs may see an intermediate state when locking
a bunch of runqueues, where the sdrq->is_root fields do not yield
a consistent picture across a task group.

Handle these cases.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 322a84ec9511..8da2033596ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -646,6 +646,15 @@ static struct cfs_rq *current_cfs(struct rq *rq)
{
struct sdrq *sdrq = READ_ONCE(rq->sdrq_data.current_sdrq);

+ /*
+ * We might race with concurrent is_root-changes, causing
+ * current_sdrq to reference an sdrq which is no longer
+ * !is_root. Counter that by ascending the tg-hierarchy
+ * until we find an sdrq with is_root.
+ */
+ while (sdrq->is_root && sdrq->tg_parent)
+ sdrq = sdrq->tg_parent;
+
return sdrq->cfs_rq;
}
#else
@@ -7141,6 +7150,23 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf

se = pick_next_entity(cfs_rq, curr);
cfs_rq = pick_next_cfs(se);
+
+#ifdef CONFIG_COSCHEDULING
+ if (cfs_rq && is_sd_se(se) && cfs_rq->sdrq.is_root) {
+ WARN_ON_ONCE(1); /* Untested code path */
+ /*
+ * Race with is_root update.
+ *
+ * We just moved downwards in the hierarchy via an
+ * SD-SE, the CFS-RQ should have is_root set to zero.
+ * However, a reconfiguration may be in progress. We
+ * basically ignore that reconfiguration.
+ *
+ * Contrary to the case below, there is nothing to fix
+ * as all the set_next_entity() calls are done later.
+ */
+ }
+#endif
} while (cfs_rq);

if (is_sd_se(se))
@@ -7192,6 +7218,48 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
se = pick_next_entity(cfs_rq, NULL);
set_next_entity(cfs_rq, se);
cfs_rq = pick_next_cfs(se);
+
+#ifdef CONFIG_COSCHEDULING
+ if (cfs_rq && is_sd_se(se) && cfs_rq->sdrq.is_root) {
+ /*
+ * Race with is_root update.
+ *
+ * We just moved downwards in the hierarchy via an
+ * SD-SE, the CFS-RQ should have is_root set to zero.
+ * However, a reconfiguration may be in progress. We
+ * basically ignore that reconfiguration, but we need
+ * to fix the picked path to correspond to that
+ * reconfiguration.
+ *
+ * Thus, we walk the hierarchy upwards again and do two
+ * things simultaneously:
+ *
+ * 1. put back picked entities which are not on the
+ * "correct" path,
+ * 2. pick the entities along the correct path.
+ *
+ * We do this until both paths upwards converge.
+ */
+ struct sched_entity *se2 = cfs_rq->sdrq.tg_se;
+ bool top = false;
+
+ WARN_ON_ONCE(1); /* Untested code path */
+ while (se && se != se2) {
+ if (!top) {
+ put_prev_entity(cfs_rq_of(se), se);
+ if (cfs_rq_of(se) == top_cfs_rq)
+ top = true;
+ }
+ if (top)
+ se = cfs_rq_of(se)->sdrq.tg_se;
+ else
+ se = parent_entity(se);
+ set_next_entity(cfs_rq_of(se2), se2);
+ se2 = parent_entity(se2);
+ }
+ }
+#endif
+
} while (cfs_rq);

retidle: __maybe_unused;
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:50:19

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 11/60] locking/lockdep: Increase number of supported lockdep subclasses

With coscheduling, the number of required subclasses is twice the depth of
the scheduling domain hierarchy. For a 256-CPU system, there are at most
eight levels. Adjust the number of subclasses so that lockdep can
still be used on such systems.
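
As a rough sanity check (assuming a topology that roughly halves the CPUs
at each level): a 256-CPU system has at most log2(256) = 8 hierarchy
levels, and two lock classes per level gives 2 * 8 = 16, which is the new
value of MAX_LOCKDEP_SUBCLASSES.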

Signed-off-by: Jan H. Schönherr <[email protected]>
---
include/linux/lockdep.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index b0d0b51c4d85..d93a5ec3077f 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -17,7 +17,7 @@ struct lockdep_map;
extern int prove_locking;
extern int lock_stat;

-#define MAX_LOCKDEP_SUBCLASSES 8UL
+#define MAX_LOCKDEP_SUBCLASSES 16UL

#include <linux/types.h>

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:50:48

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 55/60] cosched: Adjust task selection for coscheduling

Introduce the selection and notification mechanism used to realize
coscheduling.

Every CPU starts selecting tasks from its current_sdrq, which points
into the currently active coscheduled set and which is only updated by
the leader. Whenever task selection crosses a hierarchy level, the
leaders of the next hierarchy level are notified to switch coscheduled
sets.

Contrary to the original task picking algorithm, we're no longer
guaranteed to find an executable task, as the coscheduled set may
be partly idle. If that is the case and the (not yet adjusted) idle
balancing also cannot find something to execute, then we force the
CPU into idle.
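
As an illustration (CPU numbers purely made up): if CPU 0 is the leader of
a core containing CPUs 0 and 1 and picks an SD-SE of a core-wide
coscheduled set, get_and_set_next_sdrq() descends into CPU 0's own child
SD-RQ and, at the same time, updates CPU 1's current_sdrq and sends it a
resched IPI, so that CPU 1 continues picking within the same set.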

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 150 +++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 137 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e8b8119cdea..07fd9dd5561d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -591,6 +591,75 @@ static inline int __leader_of(struct sched_entity *se)
#define for_each_owned_sched_entity(se) \
for (; se; se = owned_parent_entity(se))

+#ifdef CONFIG_COSCHEDULING
+static struct sdrq *get_and_set_next_sdrq(struct sdrq *sdrq)
+{
+ int cpu = sdrq->data->leader;
+ struct sdrq *child_sdrq;
+ struct sdrq *ret = NULL;
+
+ /* find our child, notify others */
+ list_for_each_entry(child_sdrq, &sdrq->children, siblings) {
+ /* FIXME: should protect against leader change of child_sdrq */
+ if (cpu == child_sdrq->data->leader) {
+ /* this is our child */
+ ret = child_sdrq;
+ child_sdrq->data->current_sdrq = child_sdrq;
+ } else {
+ /* not our child, notify it */
+ if (child_sdrq->data->current_sdrq != child_sdrq) {
+ WRITE_ONCE(child_sdrq->data->current_sdrq,
+ child_sdrq);
+ resched_cpu_locked(child_sdrq->data->leader);
+ }
+ }
+ }
+ return ret;
+}
+
+static void notify_remaining_sdrqs(struct cfs_rq *cfs_rq)
+{
+ struct sdrq *sdrq = &cfs_rq->sdrq;
+
+ while (sdrq)
+ sdrq = get_and_set_next_sdrq(sdrq);
+}
+
+static struct cfs_rq *pick_next_cfs(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ if (!is_sd_se(se))
+ return group_cfs_rq(se);
+
+ cfs_rq = get_and_set_next_sdrq(&cfs_rq->sdrq)->cfs_rq;
+
+ if (!cfs_rq->nr_running || cfs_rq->throttled) {
+ notify_remaining_sdrqs(cfs_rq);
+ return NULL;
+ }
+
+ return cfs_rq;
+}
+
+static struct cfs_rq *current_cfs(struct rq *rq)
+{
+ struct sdrq *sdrq = READ_ONCE(rq->sdrq_data.current_sdrq);
+
+ return sdrq->cfs_rq;
+}
+#else
+static struct cfs_rq *pick_next_cfs(struct sched_entity *se)
+{
+ return group_cfs_rq(se);
+}
+
+static struct cfs_rq *current_cfs(struct rq *rq)
+{
+ return &rq->cfs;
+}
+#endif
+
/**************************************************************
* Scheduling class tree data structure manipulation methods:
*/
@@ -6901,14 +6970,31 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
{
struct cfs_rq *cfs_rq, *top_cfs_rq;
struct rq_owner_flags orf;
- struct sched_entity *se;
- struct task_struct *p;
- int new_tasks;
+ struct sched_entity *se = NULL;
+ struct task_struct *p = NULL;
+ int new_tasks = 0;

again:
- top_cfs_rq = cfs_rq = &rq_lock_owned(rq, &orf)->cfs;
- if (!cfs_rq->nr_running)
+ top_cfs_rq = cfs_rq = current_cfs(rq_lock_owned(rq, &orf));
+ /*
+ * FIXME: There would be issues with respect to the scheduling
+ * classes if there were a class between fair and idle. In that
+ * case, greater care must be taken whether it is appropriate
+ * to return NULL to give the next class a chance to run,
+ * or to return the idle thread so that nothing else runs.
+ */
+
+ if (!cfs_rq->nr_running || cfs_rq->throttled) {
+#ifdef CONFIG_COSCHEDULING
+ notify_remaining_sdrqs(cfs_rq);
+ if (!cfs_rq->sdrq.is_root) {
+ se = cfs_rq->sdrq.sd_parent->sd_se;
+ put_prev_task(rq, prev);
+ goto retidle;
+ }
+#endif
goto idle;
+ }

#ifdef CONFIG_FAIR_GROUP_SCHED
if (prev->sched_class != &fair_sched_class)
@@ -6946,17 +7032,29 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
cfs_rq = top_cfs_rq;

- if (!cfs_rq->nr_running)
+ if (!cfs_rq->nr_running || cfs_rq->throttled) {
+#ifdef CONFIG_COSCHEDULING
+ notify_remaining_sdrqs(cfs_rq);
+ if (!cfs_rq->sdrq.is_root) {
+ se = cfs_rq->sdrq.sd_parent->sd_se;
+ put_prev_task(rq, prev);
+ goto retidle;
+ }
+#endif
goto idle;
+ }

goto simple;
}
}

se = pick_next_entity(cfs_rq, curr);
- cfs_rq = group_cfs_rq(se);
+ cfs_rq = pick_next_cfs(se);
} while (cfs_rq);

+ if (is_sd_se(se))
+ se = cosched_set_idle(rq, se);
+
p = task_of(se);

/*
@@ -6966,23 +7064,31 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
*/
if (prev != p) {
struct sched_entity *pse = &prev->se;
+ int cpu = leader_of(pse);
+
+ if (cosched_is_idle(rq, p))
+ se = cosched_get_idle_se(rq);

while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
int pse_depth = pse->depth;

- if (se_depth <= pse_depth) {
+ if (se_depth <= pse_depth && leader_of(pse) == cpu) {
put_prev_entity(cfs_rq_of(pse), pse);
pse = parent_entity(pse);
}
- if (se_depth >= pse_depth) {
+ if (se_depth >= pse_depth && leader_of(se) == cpu) {
set_next_entity(cfs_rq_of(se), se);
se = parent_entity(se);
}
+ if (leader_of(pse) != cpu && leader_of(se) != cpu)
+ break;
}

- put_prev_entity(cfs_rq, pse);
- set_next_entity(cfs_rq, se);
+ if (leader_of(pse) == cpu)
+ put_prev_entity(cfs_rq, pse);
+ if (leader_of(se) == cpu)
+ set_next_entity(cfs_rq, se);
}

goto done;
@@ -6994,9 +7100,13 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
do {
se = pick_next_entity(cfs_rq, NULL);
set_next_entity(cfs_rq, se);
- cfs_rq = group_cfs_rq(se);
+ cfs_rq = pick_next_cfs(se);
} while (cfs_rq);

+retidle: __maybe_unused;
+ if (is_sd_se(se))
+ se = cosched_set_idle(rq, se);
+
p = task_of(se);

done: __maybe_unused;
@@ -7006,7 +7116,8 @@ done: __maybe_unused;
* the list, so our cfs_tasks list becomes MRU
* one.
*/
- list_move(&p->se.group_node, &rq->cfs_tasks);
+ if (!cosched_is_idle(rq, p))
+ list_move(&p->se.group_node, &rq->cfs_tasks);
#endif

if (hrtick_enabled(rq))
@@ -7019,6 +7130,19 @@ done: __maybe_unused;
idle:
rq_unlock_owned(rq, &orf);

+#ifdef CONFIG_COSCHEDULING
+ /*
+ * With coscheduling, idle_balance() may think there are tasks when they
+ * are in fact not eligible right now. Exit if we still did not find
+ * a suitable task after one idle_balance() round.
+ *
+ * FIXME: idle_balance() should only look at tasks that can actually
+ * be executed.
+ */
+ if (new_tasks)
+ return NULL;
+#endif
+
new_tasks = idle_balance(rq, rf);

/*
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:51:03

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 24/60] cosched: Do minimal pre-SMP coscheduler initialization

The scheduler is operational before we have the necessary information
about scheduling domains, which would allow us to set up the runqueue
hierarchy. Because of that, we have to postpone the "real"
initialization a bit. We cannot totally skip all initialization,
though, because all the adapted scheduler code paths do not really
like NULL pointers within the extended coscheduling related data
structures.

Perform a minimal setup of the coscheduling data structures at the
bottom level (i.e., CPU level), so that all traversals terminate
correctly. There is no hierarchy yet.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 2 ++
kernel/sched/cosched.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 6 ++++
3 files changed, 93 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48e37c3baed1..a235b6041cb5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6058,6 +6058,8 @@ void __init sched_init(void)
atomic_set(&rq->nr_iowait, 0);
}

+ cosched_init_bottom();
+
set_load_weight(&init_task, false);

/*
diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index 3bd4557ca5b7..03ba86676b90 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -7,3 +7,88 @@
*/

#include "sched.h"
+
+static int mask_to_node(const struct cpumask *span)
+{
+ int node = cpu_to_node(cpumask_first(span));
+
+ if (cpumask_subset(span, cpumask_of_node(node)))
+ return node;
+
+ return NUMA_NO_NODE;
+}
+
+static void init_sdrq_data(struct sdrq_data *data, struct sdrq_data *parent,
+ const struct cpumask *span, int level)
+{
+ struct rq *rq = container_of(data, struct rq, sdrq_data);
+ int cpu = cpumask_first(span);
+ int weight = cpumask_weight(span);
+
+ data->numa_node = weight == 1 ? cpu_to_node(cpu) : mask_to_node(span);
+ data->leader = cpu;
+ data->level = level >= 0 ? level : parent->level - 1;
+ data->parent = parent;
+ data->span = span;
+ data->span_weight = weight;
+ data->current_sdrq = &rq->cfs.sdrq;
+
+ /*
+ * Bottom level runqueues have already been initialized and are live,
+ * do not initialize them again.
+ */
+ if (!data->level)
+ return;
+
+ /* Initialize the subset of struct rq in which we are interested. */
+ raw_spin_lock_init(&rq->lock);
+ rq->cpu = cpu;
+ INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+ rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+}
+
+static void init_sdrq(struct task_group *tg, struct sdrq *sdrq,
+ struct sdrq *sd_parent, struct sdrq *tg_parent,
+ struct sdrq_data *data)
+{
+ struct rq *rq = container_of(data, struct rq, sdrq_data);
+ struct cfs_rq *cfs_rq = container_of(sdrq, struct cfs_rq, sdrq);
+
+ /* Attention: tg->parent may not yet be set. Check tg_parent instead. */
+
+ sdrq->sd_parent = sd_parent;
+ sdrq->tg_parent = tg_parent;
+ sdrq->data = data;
+ sdrq->is_root = 1;
+ sdrq->cfs_rq = cfs_rq;
+ sdrq->tg_se = cfs_rq->my_se;
+
+ INIT_LIST_HEAD(&sdrq->tg_children);
+ INIT_LIST_HEAD(&sdrq->children);
+ if (sd_parent)
+ list_add_tail(&sdrq->siblings, &sd_parent->children);
+
+ /*
+ * If we are not at the bottom level in hierarchy, we need to setup
+ * a SD-SE, so that the level below us can be represented within this
+ * level.
+ */
+ if (data->level) {
+ sdrq->sd_se = &sdrq->__sd_se;
+ init_tg_cfs_entry(tg, NULL, sdrq->sd_se, rq, sdrq->cfs_rq);
+ }
+}
+
+void cosched_init_bottom(void)
+{
+ int cpu;
+
+ raw_spin_lock_init(&root_task_group.lock);
+ for_each_possible_cpu(cpu) {
+ struct sdrq_data *data = &cpu_rq(cpu)->sdrq_data;
+ struct sdrq *sdrq = &root_task_group.cfs_rq[cpu]->sdrq;
+
+ init_sdrq_data(data, NULL, cpumask_of(cpu), 0);
+ init_sdrq(&root_task_group, sdrq, NULL, NULL, data);
+ }
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bce6061ac45..21b7c6cf8b87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1129,6 +1129,12 @@ static inline struct cfs_rq *taskgroup_next_cfsrq(struct task_group *tg,
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_COSCHEDULING
+void cosched_init_bottom(void);
+#else /* !CONFIG_COSCHEDULING */
+static inline void cosched_init_bottom(void) { }
+#endif /* !CONFIG_COSCHEDULING */
+
#ifdef CONFIG_SCHED_SMT

extern struct static_key_false sched_smt_present;
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:51:20

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 38/60] cosched: Skip updates on non-CPU runqueues in cfs_rq_util_change()

The function cfs_rq_util_change() notifies frequency governors of
utilization changes, so that they can be scheduler-driven. This is
coupled to per-CPU runqueue statistics. So, don't do anything
when called for non-CPU runqueues.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2945355f823..33e3f759eb99 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3015,7 +3015,16 @@ static inline void update_cfs_group(struct sched_entity *se)

static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
{
- struct rq *rq = rq_of(cfs_rq);
+ struct rq *rq = hrq_of(cfs_rq);
+
+#ifdef CONFIG_COSCHEDULING
+ /*
+ * This function is currently only well defined for per-CPU
+ * runqueues. Don't execute it for anything else.
+ */
+ if (rq->sdrq_data.level)
+ return;
+#endif

if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
/*
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:51:33

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 51/60] cosched: Hacky work-around to avoid observing zero weight SD-SE

The aggregated SD-SE weight is updated lock-free to avoid contention
on the higher level. This also means that we have to be careful
with intermediate values, as another CPU could pick up the value and
perform actions based on it.

Within reweight_entity() there is such a place, where weight is removed,
locally modified, and added back. If another CPU locks the higher level
and observes a zero weight, it will make incorrect decisions when
it is dequeuing a task: it won't stop dequeuing, although there is still
load in one of the child runqueues.

Prevent this from happening by temporarily bumping the aggregated value.

(A nicer solution would be to apply only the actual difference to the
aggregate instead of doing full removal and a subsequent addition.)

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1eee262ecf88..483db54ee20a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2865,6 +2865,16 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
/* commit outstanding execution time */
if (cfs_rq->curr == se)
update_curr(cfs_rq);
+#ifdef CONFIG_COSCHEDULING
+ /*
+ * FIXME: Temporarily adjust the sdse_load to prevent a zero
+ * value from being visible over the calls to
+ * account_entity_dequeue()/account_entity_enqueue().
+ * It leads to incorrect decisions in scheduler code.
+ */
+ if (!cfs_rq->sdrq.is_root && !cfs_rq->throttled)
+ atomic64_add(NICE_0_LOAD, &cfs_rq->sdrq.sd_parent->sdse_load);
+#endif
account_entity_dequeue(cfs_rq, se);
dequeue_runnable_load_avg(cfs_rq, se);
}
@@ -2886,6 +2896,11 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
account_entity_enqueue(cfs_rq, se);
+#ifdef CONFIG_COSCHEDULING
+ /* FIXME: see above */
+ if (!cfs_rq->sdrq.is_root && !cfs_rq->throttled)
+ atomic64_sub(NICE_0_LOAD, &cfs_rq->sdrq.sd_parent->sdse_load);
+#endif
enqueue_runnable_load_avg(cfs_rq, se);
}
}
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:51:48

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 49/60] cosched: Adjust locking for enqueuing and dequeueing

Enqueuing and dequeuing of tasks (or entities) are general activities
that span across leader boundaries. They start from the bottom of the
runqueue hierarchy and bubble upwards, until they hit their terminating
condition (for example, enqueuing stops when the parent entity is already
enqueued).

We employ chain-locking in these cases to minimize lock contention.
For example, if enqueuing has moved past a hierarchy level of a different
leader, that leader can already make scheduling decisions again. Also,
this opens the possibility to combine concurrent enqueues/dequeues to
some extent, so that only one of multiple CPUs has to walk up the
hierarchy.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d64f4478fda..0dc4d289497c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4510,6 +4510,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
struct sched_entity *se;
long task_delta, dequeue = 1;
bool empty;
+ struct rq_chain rc;

/*
* FIXME: We can only handle CPU runqueues at the moment.
@@ -4532,8 +4533,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
rcu_read_unlock();

task_delta = cfs_rq->h_nr_running;
+ rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+
+ rq_chain_lock(&rc, se);
/* throttled entity or throttle-on-deactivate */
if (!se->on_rq)
break;
@@ -4549,6 +4553,8 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (!se)
sub_nr_running(rq, task_delta);

+ rq_chain_unlock(&rc);
+
cfs_rq->throttled = 1;
cfs_rq->throttled_clock = rq_clock(rq);
raw_spin_lock(&cfs_b->lock);
@@ -4577,6 +4583,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct sched_entity *se;
int enqueue = 1;
long task_delta;
+ struct rq_chain rc;

SCHED_WARN_ON(!is_cpu_rq(rq));

@@ -4598,7 +4605,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
return;

task_delta = cfs_rq->h_nr_running;
+ rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
+ rq_chain_lock(&rc, se);
if (se->on_rq)
enqueue = 0;

@@ -4614,6 +4623,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
if (!se)
add_nr_running(rq, task_delta);

+ rq_chain_unlock(&rc);
+
/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && nr_cfs_tasks(rq))
resched_curr(rq);
@@ -5136,8 +5147,11 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
unsigned int task_delta)
{
struct cfs_rq *cfs_rq;
+ struct rq_chain rc;

+ rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
+ rq_chain_lock(&rc, se);
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
@@ -5157,6 +5171,8 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
}

for_each_sched_entity(se) {
+ /* FIXME: taking locks up to the top is bad */
+ rq_chain_lock(&rc, se);
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running += task_delta;

@@ -5167,6 +5183,8 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
update_cfs_group(se);
}

+ rq_chain_unlock(&rc);
+
return se != NULL;
}

@@ -5211,9 +5229,12 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
unsigned int task_delta)
{
struct cfs_rq *cfs_rq;
+ struct rq_chain rc;
int task_sleep = flags & DEQUEUE_SLEEP;

+ rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
+ rq_chain_lock(&rc, se);
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

@@ -5231,6 +5252,9 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
if (cfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
se = parent_entity(se);
+ if (se)
+ rq_chain_lock(&rc, se);
+
/*
* Bias pick_next to pick a task from this cfs_rq, as
* p is sleeping when it is within its sched_slice.
@@ -5243,6 +5267,8 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
}

for_each_sched_entity(se) {
+ /* FIXME: taking locks up to the top is bad */
+ rq_chain_lock(&rc, se);
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running -= task_delta;

@@ -5253,6 +5279,8 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
update_cfs_group(se);
}

+ rq_chain_unlock(&rc);
+
return se != NULL;
}

@@ -9860,11 +9888,15 @@ static inline bool vruntime_normalized(struct task_struct *p)
static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq;
+ struct rq_chain rc;
+
+ rq_chain_init(&rc, hrq_of(cfs_rq_of(se)));

/* Start to propagate at parent */
se = parent_entity(se);

for_each_sched_entity(se) {
+ rq_chain_lock(&rc, se);
cfs_rq = cfs_rq_of(se);

if (cfs_rq_throttled(cfs_rq))
@@ -9872,6 +9904,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)

update_load_avg(cfs_rq, se, UPDATE_TG);
}
+ rq_chain_unlock(&rc);
}
#else
static void propagate_entity_cfs_rq(struct sched_entity *se) { }
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:51:57

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 47/60] cosched: Adjust SE traversal and locking for common leader activities

Modify some of the core scheduler paths that function as entry points
into the CFS scheduling class and that represent activities where the
leader operates on behalf of the group.

These are (a) handling the tick, (b) picking the next task from the
runqueue, (c) setting a task to be current, and (d) putting the current
task back.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 28 +++++++++++++++++++++++-----
1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2aa3a60dfca5..2227e4840355 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6664,12 +6664,14 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
- struct cfs_rq *cfs_rq = &rq->cfs;
+ struct cfs_rq *cfs_rq, *top_cfs_rq;
+ struct rq_owner_flags orf;
struct sched_entity *se;
struct task_struct *p;
int new_tasks;

again:
+ top_cfs_rq = cfs_rq = &rq_lock_owned(rq, &orf)->cfs;
if (!cfs_rq->nr_running)
goto idle;

@@ -6707,7 +6709,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
* be correct.
*/
if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
- cfs_rq = &rq->cfs;
+ cfs_rq = top_cfs_rq;

if (!cfs_rq->nr_running)
goto idle;
@@ -6775,9 +6777,13 @@ done: __maybe_unused;
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);

+ rq_unlock_owned(rq, &orf);
+
return p;

idle:
+ rq_unlock_owned(rq, &orf);
+
new_tasks = idle_balance(rq, rf);

/*
@@ -6796,12 +6802,15 @@ done: __maybe_unused;

void put_prev_entity_fair(struct rq *rq, struct sched_entity *se)
{
+ struct rq_owner_flags orf;
struct cfs_rq *cfs_rq;

- for_each_sched_entity(se) {
+ rq_lock_owned(rq, &orf);
+ for_each_owned_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
+ rq_unlock_owned(rq, &orf);
}

/*
@@ -9712,11 +9721,14 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ struct rq_owner_flags orf;

- for_each_sched_entity(se) {
+ rq_lock_owned(rq, &orf);
+ for_each_owned_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
+ rq_unlock_owned(rq, &orf);

if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
@@ -9906,13 +9918,19 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)

void set_curr_entity_fair(struct rq *rq, struct sched_entity *se)
{
- for_each_sched_entity(se) {
+ struct rq_owner_flags orf;
+
+ rq_lock_owned(rq, &orf);
+
+ for_each_owned_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);

set_next_entity(cfs_rq, se);
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
+
+ rq_unlock_owned(rq, &orf);
}

/*
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:52:10

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 50/60] cosched: Propagate load changes across hierarchy levels

The weight of an SD-SE is defined to be the average weight of all
runqueues that are represented by the SD-SE. Hence, its weight
should change whenever one of the child runqueues changes its
weight. However, as these are two different hierarchy levels,
they are protected by different locks. To reduce lock contention,
we want to avoid holding higher-level locks for prolonged periods
of time, if possible.

Therefore, we update an aggregated weight -- sdrq->sdse_load --
in a lock-free manner during enqueue and dequeue at the lower level,
and once we actually hold the higher-level lock, we perform the actual
SD-SE weight adjustment via update_sdse_load().

At some point in the future (the code isn't there yet), this will
allow software combining, where not all CPUs have to walk up the
full hierarchy on enqueue/dequeue.
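
For illustration, the weight that update_sdse_load() below assigns to the
SD-SE is just an average over the represented child runqueues. Here is a
minimal sketch of that arithmetic, assuming a homogeneous topology (as the
FIXME in the code already notes); the helper name is made up for
illustration:

	/*
	 * With N equally sized child runqueues, each child spans
	 * span_weight / N CPUs, so the average per-child weight is
	 *   sdse_load * child_span_weight / own_span_weight == sdse_load / N
	 */
	static unsigned long sdse_avg_weight(unsigned long sdse_load,
					     unsigned int child_span_weight,
					     unsigned int own_span_weight)
	{
		return sdse_load * child_span_weight / own_span_weight;
	}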

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0dc4d289497c..1eee262ecf88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2740,6 +2740,10 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
static void
account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+#ifdef CONFIG_COSCHEDULING
+ if (!cfs_rq->sdrq.is_root && !cfs_rq->throttled)
+ atomic64_add(se->load.weight, &cfs_rq->sdrq.sd_parent->sdse_load);
+#endif
update_load_add(&cfs_rq->load, se->load.weight);
if (!parent_entity(se) || is_sd_se(parent_entity(se)))
update_load_add(&hrq_of(cfs_rq)->load, se->load.weight);
@@ -2757,6 +2761,10 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
static void
account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+#ifdef CONFIG_COSCHEDULING
+ if (!cfs_rq->sdrq.is_root && !cfs_rq->throttled)
+ atomic64_sub(se->load.weight, &cfs_rq->sdrq.sd_parent->sdse_load);
+#endif
update_load_sub(&cfs_rq->load, se->load.weight);
if (!parent_entity(se) || is_sd_se(parent_entity(se)))
update_load_sub(&hrq_of(cfs_rq)->load, se->load.weight);
@@ -3083,6 +3091,35 @@ static inline void update_cfs_group(struct sched_entity *se)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_COSCHEDULING
+static void update_sdse_load(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ struct sdrq *sdrq = &cfs_rq->sdrq;
+ unsigned long load;
+
+ if (!is_sd_se(se))
+ return;
+
+ /* FIXME: the load calculation assumes a homogeneous topology */
+ load = atomic64_read(&sdrq->sdse_load);
+
+ if (!list_empty(&sdrq->children)) {
+ struct sdrq *entry;
+
+ entry = list_first_entry(&sdrq->children, struct sdrq, siblings);
+ load *= entry->data->span_weight;
+ }
+
+ load /= sdrq->data->span_weight;
+
+ /* FIXME: Use a proper runnable */
+ reweight_entity(cfs_rq, se, load, load);
+}
+#else /* !CONFIG_COSCHEDULING */
+static void update_sdse_load(struct sched_entity *se) { }
+#endif /* !CONFIG_COSCHEDULING */
+
static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
{
struct rq *rq = hrq_of(cfs_rq);
@@ -4527,6 +4564,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)

se = cfs_rq->my_se;

+#ifdef CONFIG_COSCHEDULING
+ if (!cfs_rq->sdrq.is_root && !cfs_rq->throttled)
+ atomic64_sub(cfs_rq->load.weight,
+ &cfs_rq->sdrq.sd_parent->sdse_load);
+#endif
/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
@@ -4538,6 +4580,8 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
struct cfs_rq *qcfs_rq = cfs_rq_of(se);

rq_chain_lock(&rc, se);
+ update_sdse_load(se);
+
/* throttled entity or throttle-on-deactivate */
if (!se->on_rq)
break;
@@ -4590,6 +4634,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
se = cfs_rq->my_se;

cfs_rq->throttled = 0;
+#ifdef CONFIG_COSCHEDULING
+ if (!cfs_rq->sdrq.is_root && !cfs_rq->throttled)
+ atomic64_add(cfs_rq->load.weight,
+ &cfs_rq->sdrq.sd_parent->sdse_load);
+#endif

update_rq_clock(rq);

@@ -4608,6 +4657,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
rq_chain_lock(&rc, se);
+ update_sdse_load(se);
if (se->on_rq)
enqueue = 0;

@@ -5152,6 +5202,7 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
rq_chain_lock(&rc, se);
+ update_sdse_load(se);
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
@@ -5173,6 +5224,7 @@ bool enqueue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
for_each_sched_entity(se) {
/* FIXME: taking locks up to the top is bad */
rq_chain_lock(&rc, se);
+ update_sdse_load(se);
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running += task_delta;

@@ -5235,6 +5287,7 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
rq_chain_init(&rc, rq);
for_each_sched_entity(se) {
rq_chain_lock(&rc, se);
+ update_sdse_load(se);
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

@@ -5269,6 +5322,7 @@ bool dequeue_entity_fair(struct rq *rq, struct sched_entity *se, int flags,
for_each_sched_entity(se) {
/* FIXME: taking locks up to the top is bad */
rq_chain_lock(&rc, se);
+ update_sdse_load(se);
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running -= task_delta;

@@ -9897,6 +9951,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)

for_each_sched_entity(se) {
rq_chain_lock(&rc, se);
+ update_sdse_load(se);
cfs_rq = cfs_rq_of(se);

if (cfs_rq_throttled(cfs_rq))
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:52:15

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 16/60] sched: Preparatory code movement

Move struct rq_flags around to keep future commits crisp.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/sched.h | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b8c8dfd0e88d..cd3a32ce8fc6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -472,6 +472,19 @@ struct cfs_bandwidth { };

#endif /* CONFIG_CGROUP_SCHED */

+struct rq_flags {
+ unsigned long flags;
+ struct pin_cookie cookie;
+#ifdef CONFIG_SCHED_DEBUG
+ /*
+ * A copy of (rq::clock_update_flags & RQCF_UPDATED) for the
+ * current pin context is stashed here in case it needs to be
+ * restored in rq_repin_lock().
+ */
+ unsigned int clock_update_flags;
+#endif
+};
+
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
@@ -1031,19 +1044,6 @@ static inline void rq_clock_cancel_skipupdate(struct rq *rq)
rq->clock_update_flags &= ~RQCF_REQ_SKIP;
}

-struct rq_flags {
- unsigned long flags;
- struct pin_cookie cookie;
-#ifdef CONFIG_SCHED_DEBUG
- /*
- * A copy of (rq::clock_update_flags & RQCF_UPDATED) for the
- * current pin context is stashed here in case it needs to be
- * restored in rq_repin_lock().
- */
- unsigned int clock_update_flags;
-#endif
-};
-
static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
rf->cookie = lockdep_pin_lock(&rq->lock);
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:52:27

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 43/60] cosched: Add for_each_sched_entity() variant for owned entities

Add a new loop construct for_each_owned_sched_entity(), which iterates
over all owned scheduling entities, stopping when it encounters a
leader change.

This allows relatively straightforward adaptations of existing code,
where the leader handles only the part of the hierarchy it actually
owns.

Include some lockdep goodness, so that we detect incorrect usage.
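
A rough usage sketch of the new construct (the function itself is
illustrative only; rq_lock_owned()/rq_unlock_owned() come from the
accompanying patches in this series):

	static void do_something_as_leader(struct rq *rq, struct sched_entity *se)
	{
		struct rq_owner_flags orf;

		rq_lock_owned(rq, &orf);
		for_each_owned_sched_entity(se) {
			/* operate on cfs_rq_of(se); the walk stops at a leader change */
		}
		rq_unlock_owned(rq, &orf);
	}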

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 70 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f72a72c8c3b8..f55954e7cedc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -521,6 +521,76 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);

+#ifdef CONFIG_COSCHEDULING
+struct sched_entity *owned_parent_entity(struct sched_entity *se)
+{
+ struct rq *prq, *rq = hrq_of(cfs_rq_of(se));
+
+ lockdep_assert_held(&rq->lock);
+
+ se = parent_entity(se);
+ if (!se)
+ return NULL;
+
+ prq = hrq_of(cfs_rq_of(se));
+ if (rq == prq)
+ return se;
+
+#ifdef CONFIG_SCHED_DEBUG
+ if (!rq->sdrq_data.parent_locked) {
+ int leader = rq->sdrq_data.leader;
+ int pleader = READ_ONCE(prq->sdrq_data.leader);
+
+ SCHED_WARN_ON(leader == pleader);
+ } else {
+ int leader = rq->sdrq_data.leader;
+ int pleader = prq->sdrq_data.leader;
+
+ SCHED_WARN_ON(leader != pleader);
+ }
+#endif
+
+ if (!rq->sdrq_data.parent_locked)
+ return NULL;
+
+ lockdep_assert_held(&prq->lock);
+ return se;
+}
+
+static inline int leader_of(struct sched_entity *se)
+{
+ struct rq *rq = hrq_of(cfs_rq_of(se));
+
+ lockdep_assert_held(&rq->lock);
+ return rq->sdrq_data.leader;
+}
+
+static inline int __leader_of(struct sched_entity *se)
+{
+ struct rq *rq = hrq_of(cfs_rq_of(se));
+
+ return READ_ONCE(rq->sdrq_data.leader);
+}
+#else
+struct sched_entity *owned_parent_entity(struct sched_entity *se)
+{
+ return parent_entity(se);
+}
+
+static inline int leader_of(struct sched_entity *se)
+{
+ return cpu_of(rq_of(cfs_rq_of(se)));
+}
+
+static inline int __leader_of(struct sched_entity *se)
+{
+ return leader_of(se);
+}
+#endif
+
+#define for_each_owned_sched_entity(se) \
+ for (; se; se = owned_parent_entity(se))
+
/**************************************************************
* Scheduling class tree data structure manipulation methods:
*/
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:52:36

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 37/60] cosched: Use hrq_of() for (indirect calls to) ___update_load_sum()

The cpu argument supplied by all callers of ___update_load_sum() is used
in accumulate_sum() to scale load values according to the CPU capacity.
While we should think about that at some point, it is out of scope for now.
Also, it does not matter on homogeneous system topologies.

Update all callers to use hrq_of() instead of rq_of() to derive the cpu
argument.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fde1c4ba4bb4..a2945355f823 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3114,7 +3114,7 @@ void set_task_rq_fair(struct sched_entity *se,
p_last_update_time = prev->avg.last_update_time;
n_last_update_time = next->avg.last_update_time;
#endif
- __update_load_avg_blocked_se(p_last_update_time, cpu_of(rq_of(prev)), se);
+ __update_load_avg_blocked_se(p_last_update_time, cpu_of(hrq_of(prev)), se);
se->avg.last_update_time = n_last_update_time;
}

@@ -3397,7 +3397,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
decayed = 1;
}

- decayed |= __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);
+ decayed |= __update_load_avg_cfs_rq(now, cpu_of(hrq_of(cfs_rq)), cfs_rq);

#ifndef CONFIG_64BIT
smp_wmb();
@@ -3487,7 +3487,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 now = cfs_rq_clock_task(cfs_rq);
- struct rq *rq = rq_of(cfs_rq);
+ struct rq *rq = hrq_of(cfs_rq);
int cpu = cpu_of(rq);
int decayed;

@@ -3548,7 +3548,7 @@ void sync_entity_load_avg(struct sched_entity *se)
u64 last_update_time;

last_update_time = cfs_rq_last_update_time(cfs_rq);
- __update_load_avg_blocked_se(last_update_time, cpu_of(rq_of(cfs_rq)), se);
+ __update_load_avg_blocked_se(last_update_time, cpu_of(hrq_of(cfs_rq)), se);
}

/*
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:52:45

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 53/60] cosched: Prevent balancing related functions from crossing hierarchy levels

Modify update_blocked_averages() and update_cfs_rq_h_load() so that they
won't access the next higher hierarchy level, for which they don't hold a
lock.

This will have to be touched again when load balancing is made
functional.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc219c9c3097..210fcd534917 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7686,7 +7686,7 @@ static void update_blocked_averages(int cpu)

/* Propagate pending load changes to the parent, if any: */
se = cfs_rq->my_se;
- if (se && !skip_blocked_update(se))
+ if (se && !is_sd_se(se) && !skip_blocked_update(se))
update_load_avg(cfs_rq_of(se), se, 0);

/*
@@ -7731,6 +7731,8 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)

cfs_rq->h_load_next = NULL;
for_each_sched_entity(se) {
+ if (is_sd_se(se))
+ break;
cfs_rq = cfs_rq_of(se);
cfs_rq->h_load_next = se;
if (cfs_rq->last_h_load_update == now)
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:52:46

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 39/60] cosched: Adjust task group management for hierarchical runqueues

Provide variants of the task group CFS traversal constructs that also
reach the hierarchical runqueues. Adjust task group management functions
where necessary.

Most of the changes are in alloc_fair_sched_group(), where we now need to
be a bit more careful during initialization.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/cosched.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++------
kernel/sched/sched.h | 17 +++++++++++++
3 files changed, 124 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index 48394050ec34..b897319d046c 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -10,6 +10,63 @@

#include "sched.h"

+/*****************************************************************************
+ * Task group traversal
+ *****************************************************************************/
+
+static struct sdrq *leftmost_sdrq(struct sdrq *sdrq)
+{
+ while (!list_empty(&sdrq->children))
+ sdrq = list_first_entry(&sdrq->children, struct sdrq, siblings);
+ return sdrq;
+}
+
+struct cfs_rq *taskgroup_first_cfsrq(struct task_group *tg)
+{
+ if (!tg->top_cfsrq)
+ return NULL;
+ return leftmost_sdrq(&tg->top_cfsrq->sdrq)->cfs_rq;
+}
+
+struct cfs_rq *taskgroup_next_cfsrq(struct task_group *tg, struct cfs_rq *cfs)
+{
+ struct sdrq *sdrq = &cfs->sdrq;
+ struct sdrq *parent = sdrq->sd_parent;
+
+ if (cfs == tg->top_cfsrq)
+ return NULL;
+
+ list_for_each_entry_continue(sdrq, &parent->children, siblings)
+ return leftmost_sdrq(sdrq)->cfs_rq;
+
+ return parent->cfs_rq;
+}
+
+struct cfs_rq *taskgroup_first_cfsrq_topdown(struct task_group *tg)
+{
+ return tg->top_cfsrq;
+}
+
+struct cfs_rq *taskgroup_next_cfsrq_topdown(struct task_group *tg,
+ struct cfs_rq *cfs)
+{
+ struct sdrq *sdrq = &cfs->sdrq;
+ struct sdrq *parent = sdrq->sd_parent;
+
+ if (!list_empty(&sdrq->children)) {
+ sdrq = list_first_entry(&sdrq->children, struct sdrq, siblings);
+ return sdrq->cfs_rq;
+ }
+
+ while (sdrq != &tg->top_cfsrq->sdrq) {
+ list_for_each_entry_continue(sdrq, &parent->children, siblings)
+ return sdrq->cfs_rq;
+ sdrq = parent;
+ parent = sdrq->sd_parent;
+ }
+ return NULL;
+}
+
static int mask_to_node(const struct cpumask *span)
{
int node = cpu_to_node(cpumask_first(span));
@@ -427,3 +484,14 @@ void cosched_init_hierarchy(void)
list_add_tail(&sdrq->siblings, &sdrq->sd_parent->children);
}
}
+
+/*****************************************************************************
+ * Task group management functions
+ *****************************************************************************/
+
+void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs_rq,
+ struct cfs_rq *sd_parent, struct cfs_rq *tg_parent)
+{
+ init_sdrq(tg, &cfs_rq->sdrq, sd_parent ? &sd_parent->sdrq : NULL,
+ &tg_parent->sdrq, tg_parent->sdrq.data);
+}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33e3f759eb99..f72a72c8c3b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9895,10 +9895,29 @@ void free_fair_sched_group(struct task_group *tg)
kfree(tg->cfs_rq);
}

+#ifdef CONFIG_COSCHEDULING
+static struct cfs_rq *find_sd_parent(struct cfs_rq *sd_parent,
+ struct cfs_rq *tg_parent)
+{
+ if (!sd_parent)
+ return NULL;
+
+ while (sd_parent->sdrq.tg_parent != tg_parent->sdrq.sd_parent)
+ sd_parent = sd_parent->sdrq.sd_parent->cfs_rq;
+ return sd_parent;
+}
+#else
+static struct cfs_rq *find_sd_parent(struct cfs_rq *sd_parent,
+ struct cfs_rq *tg_parent)
+{
+ return NULL;
+}
+#endif
+
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
struct sched_entity *se;
- struct cfs_rq *cfs_rq, *pcfs_rq;
+ struct cfs_rq *cfs_rq = NULL, *pcfs_rq;

tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
if (!tg->cfs_rq)
@@ -9908,18 +9927,30 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

init_cfs_bandwidth(tg_cfs_bandwidth(tg));

- taskgroup_for_each_cfsrq(parent, pcfs_rq) {
- struct rq *rq = rq_of(pcfs_rq);
- int node = cpu_to_node(cpu_of(rq));
+#ifdef CONFIG_COSCHEDULING
+ raw_spin_lock_init(&tg->lock);
+#endif
+
+ taskgroup_for_each_cfsrq_topdown(parent, pcfs_rq) {
+ struct rq *rq = hrq_of(pcfs_rq);
+ int node = node_of(rq);
+ struct cfs_rq *sdcfs_rq = find_sd_parent(cfs_rq, pcfs_rq);

cfs_rq = kzalloc_node(sizeof(*cfs_rq), GFP_KERNEL, node);
se = kzalloc_node(sizeof(*se), GFP_KERNEL, node);
if (!cfs_rq || !se)
goto err_free;

- tg->cfs_rq[cpu_of(rq)] = cfs_rq;
+#ifdef CONFIG_COSCHEDULING
+ if (!sdcfs_rq)
+ tg->top_cfsrq = cfs_rq;
+#endif
+ if (is_cpu_rq(rq))
+ tg->cfs_rq[cpu_of(rq)] = cfs_rq;
+
init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, rq, pcfs_rq);
+ cosched_init_sdrq(tg, cfs_rq, sdcfs_rq, pcfs_rq);
}

return 1;
@@ -9938,7 +9969,7 @@ void online_fair_sched_group(struct task_group *tg)
struct rq *rq;

taskgroup_for_each_cfsrq(tg, cfs) {
- rq = rq_of(cfs);
+ rq = hrq_of(cfs);
se = cfs->my_se;

raw_spin_lock_irq(&rq->lock);
@@ -9964,9 +9995,9 @@ void unregister_fair_sched_group(struct task_group *tg)
if (!cfs->on_list)
continue;

- raw_spin_lock_irqsave(&rq_of(cfs)->lock, flags);
+ raw_spin_lock_irqsave(&hrq_of(cfs)->lock, flags);
list_del_leaf_cfs_rq(cfs);
- raw_spin_unlock_irqrestore(&rq_of(cfs)->lock, flags);
+ raw_spin_unlock_irqrestore(&hrq_of(cfs)->lock, flags);
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc3631b8b955..38b4500095ca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1109,6 +1109,17 @@ static inline int cpu_of(struct rq *rq)
(cfs) = (ncfs), \
(ncfs) = (cfs) ? taskgroup_next_cfsrq(tg, cfs) : NULL)

+#ifdef CONFIG_COSCHEDULING
+#define taskgroup_for_each_cfsrq_topdown(tg, cfs) \
+ for ((cfs) = taskgroup_first_cfsrq_topdown(tg); (cfs); \
+ (cfs) = taskgroup_next_cfsrq_topdown(tg, cfs))
+struct cfs_rq *taskgroup_first_cfsrq(struct task_group *tg);
+struct cfs_rq *taskgroup_next_cfsrq(struct task_group *tg, struct cfs_rq *cfs);
+struct cfs_rq *taskgroup_first_cfsrq_topdown(struct task_group *tg);
+struct cfs_rq *taskgroup_next_cfsrq_topdown(struct task_group *tg,
+ struct cfs_rq *cfs);
+#else /* !CONFIG_COSCHEDULING */
+#define taskgroup_for_each_cfsrq_topdown taskgroup_for_each_cfsrq
static inline struct cfs_rq *taskgroup_first_cfsrq(struct task_group *tg)
{
int cpu = cpumask_first(cpu_possible_mask);
@@ -1127,6 +1138,7 @@ static inline struct cfs_rq *taskgroup_next_cfsrq(struct task_group *tg,
return NULL;
return tg->cfs_rq[cpu];
}
+#endif /* !CONFIG_COSCHEDULING */
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_COSCHEDULING
@@ -1181,10 +1193,15 @@ static inline bool is_sd_se(struct sched_entity *se)
void cosched_init_bottom(void);
void cosched_init_topology(void);
void cosched_init_hierarchy(void);
+void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs,
+ struct cfs_rq *sd_parent, struct cfs_rq *tg_parent);
#else /* !CONFIG_COSCHEDULING */
static inline void cosched_init_bottom(void) { }
static inline void cosched_init_topology(void) { }
static inline void cosched_init_hierarchy(void) { }
+static inline void cosched_init_sdrq(struct task_group *tg, struct cfs_rq *cfs,
+ struct cfs_rq *sd_parent,
+ struct cfs_rq *tg_parent) { }
#endif /* !CONFIG_COSCHEDULING */

#ifdef CONFIG_SCHED_SMT
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:53:19

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 34/60] cosched: Add rq_of() variants for different use cases

The rq_of() function is used everywhere. With the introduction of
hierarchical runqueues, we could modify rq_of() to return the
corresponding queue. In fact, no change would be necessary for that.

However, many code paths do not handle a hierarchical runqueue
adequately. Thus, we introduce variants of rq_of() to catch and
handle these code paths on a case-by-case basis.

The normal rq_of() will complain loudly if used with a hierarchical
runqueue. Once a code path has been audited for hierarchical use, it can
be switched to hrq_of(), which returns the proper hierarchical runqueue,
or to cpu_rq_of(), which returns the leader's CPU runqueue.
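
For orientation, a hedged summary of when to use which variant, plus a
made-up helper for illustration (cfs_rq_leader_cpu() is not part of this
series; cpu_rq_of() and hrq_of() are introduced in the hunk below):

	/*
	 * rq_of():     bottom-level cfs_rqs only; warns (and falls back to the
	 *              CPU runqueue) when handed a hierarchical runqueue.
	 * hrq_of():    the runqueue this cfs_rq is actually attached to, which
	 *              may be an SD-level runqueue.
	 * cpu_rq_of(): always a per-CPU runqueue -- the leader's CPU runqueue
	 *              in the hierarchical case.
	 */
	static inline int cfs_rq_leader_cpu(struct cfs_rq *cfs_rq)
	{
		return cpu_of(cpu_rq_of(cfs_rq));
	}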

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cba7b8fb6bd..24d01bf8f796 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -249,9 +249,44 @@ const struct sched_class fair_sched_class;

#ifdef CONFIG_FAIR_GROUP_SCHED

+/* Variant of rq_of() that always returns a per-CPU runqueue */
+static inline struct rq *cpu_rq_of(struct cfs_rq *cfs_rq)
+{
+#ifdef CONFIG_COSCHEDULING
+ if (cfs_rq->sdrq.data->level) {
+ struct rq *rq = cpu_rq(cfs_rq->sdrq.data->leader);
+
+ /* We should have this lock already. */
+ lockdep_assert_held(&rq->lock);
+ return rq;
+ }
+#endif
+ return cfs_rq->rq;
+}
+
/* cpu runqueue to which this cfs_rq is attached */
static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
{
+#ifdef CONFIG_COSCHEDULING
+ if (SCHED_WARN_ON(cfs_rq->sdrq.data->level)) {
+ /*
+ * This should be only called for the bottom level.
+ * If not, then it's an indicator of a not yet adjusted
+ * code path. Return the CPU runqueue as this is more likely
+ * to work than the proper hierarchical runqueue.
+ *
+ * Note, that we don't necessarily have the lock for
+ * this bottom level, which may lead to weird issues.
+ */
+ return cpu_rq_of(cfs_rq);
+ }
+#endif
+ return cfs_rq->rq;
+}
+
+/* Hierarchical variant of rq_of() */
+static inline struct rq *hrq_of(struct cfs_rq *cfs_rq)
+{
return cfs_rq->rq;
}

--
2.9.3.1.gcba166c.dirty


2018-09-07 21:53:25

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 31/60] cosched: Don't disable idle tick for now

Coscheduling relies on the leader to drive preemption of the
group when the time slice is exhausted. This is also the case
when the leader is idle but the group as a whole is not.
Because of that, we currently cannot disable the idle tick.

Keep the tick enabled in code. This relieves the user from having to
disable NOHZ and allows us to gradually improve the situation later.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/time/tick-sched.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5b33e2f5c0ed..5e9c2a7d4ea9 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -861,6 +861,20 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)

static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
{
+#ifdef CONFIG_COSCHEDULING
+ /*
+ * FIXME: Coscheduling relies on the leader to drive preemption of the
+ * group when the time slice is exhausted. This is also the case,
+ * when the leader is idle but the group as a whole is not.
+ * Because of that, we currently cannot disable the idle tick.
+ *
+ * Short-term, we could disable the idle tick, in cases where the
+ * whole group is idle. Long-term, we can consider switching
+ * leadership of the group to a non-idle CPU.
+ */
+ return false;
+#endif
+
/*
* If this CPU is offline and it is the one which updates
* jiffies, then give up the assignment and let it be taken by
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:53:26

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 30/60] cosched: Disallow share modification on task groups for now

The code path is not yet adjusted for coscheduling. Disable
it for now.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 30e5ff30f442..8504790944bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9942,6 +9942,15 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
{
int i;

+#ifdef CONFIG_COSCHEDULING
+ /*
+ * FIXME: This function has not been adjusted for coscheduling.
+ * Disable it completely for now.
+ */
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+#endif
+
/*
* We can't change the weight of the root cgroup.
*/
@@ -9955,6 +9964,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
goto done;

tg->shares = shares;
+
for_each_possible_cpu(i) {
struct rq *rq = cpu_rq(i);
struct sched_entity *se = tg->cfs_rq[i]->my_se;
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:53:29

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 45/60] cosched: Continue to account all load on per-CPU runqueues

Even with coscheduling, we define the fields rq->nr_running and rq->load
of per-CPU runqueues to represent the total number of tasks and the
total amount of load on that CPU, respectively, so that existing code
continues to work as expected.

Make sure to still account load changes on per-CPU runqueues.

The change in set_next_entity() just silences a warning. The code looks
bogus even without coscheduling, as the weight of an SE is independent
of the weight of the runqueue when task groups are involved. It's
just for statistics anyway.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fff88694560c..0bba924b40ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2741,8 +2741,8 @@ static void
account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_add(&cfs_rq->load, se->load.weight);
- if (!parent_entity(se))
- update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
+ if (!parent_entity(se) || is_sd_se(parent_entity(se)))
+ update_load_add(&hrq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
if (entity_is_task(se)) {
struct rq *rq = rq_of(cfs_rq);
@@ -2758,8 +2758,8 @@ static void
account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_sub(&cfs_rq->load, se->load.weight);
- if (!parent_entity(se))
- update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
+ if (!parent_entity(se) || is_sd_se(parent_entity(se)))
+ update_load_sub(&hrq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
if (entity_is_task(se)) {
account_numa_dequeue(rq_of(cfs_rq), task_of(se));
@@ -4122,7 +4122,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* least twice that of our own weight (i.e. dont track it
* when there are only lesser-weight tasks around):
*/
- if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
+ if (schedstat_enabled() &&
+ hrq_of(cfs_rq)->load.weight >= 2 * se->load.weight) {
schedstat_set(se->statistics.slice_max,
max((u64)schedstat_val(se->statistics.slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:54:01

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 22/60] cosched: Add config option for coscheduling support

Scheduled task groups will bring coscheduling to Linux.
The actual functionality will be added incrementally in subsequent patches.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
init/Kconfig | 11 +++++++++++
kernel/sched/Makefile | 1 +
kernel/sched/cosched.c | 9 +++++++++
3 files changed, 21 insertions(+)
create mode 100644 kernel/sched/cosched.c

diff --git a/init/Kconfig b/init/Kconfig
index 1e234e2f1cba..78807f8f2cf7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -762,6 +762,17 @@ config FAIR_GROUP_SCHED
depends on CGROUP_SCHED
default CGROUP_SCHED

+config COSCHEDULING
+ bool "Coscheduling extension for SCHED_OTHER (EXPERIMENTAL)"
+ depends on FAIR_GROUP_SCHED && SMP
+ default n
+ help
+ This feature enables coscheduling for SCHED_OTHER. Unlike
+ regular task groups the system aims to run members of
+ scheduled task groups (STGs) simultaneously.
+
+ Say N if unsure.
+
config CFS_BANDWIDTH
bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
depends on FAIR_GROUP_SCHED
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe183404c38..f32781fe76c5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+obj-$(CONFIG_COSCHEDULING) += cosched.o
diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
new file mode 100644
index 000000000000..3bd4557ca5b7
--- /dev/null
+++ b/kernel/sched/cosched.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Coscheduling extension for CFS.
+ *
+ * Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ * Author: Jan H. Schönherr <[email protected]>
+ */
+
+#include "sched.h"
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:54:18

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 26/60] cosched: Construct runqueue hierarchy

With scheduling domains sufficiently prepared, we can now initialize
the full hierarchy of runqueues and link it with the already existing
bottom level, which we set up earlier.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 1 +
kernel/sched/cosched.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 79 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc801f84bf97..5350cab7ac4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5876,6 +5876,7 @@ void __init sched_init_smp(void)
*/
mutex_lock(&sched_domains_mutex);
sched_init_domains(cpu_active_mask);
+ cosched_init_hierarchy();
mutex_unlock(&sched_domains_mutex);

/* Move init over to a non-isolated CPU */
diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index 7a793aa93114..48394050ec34 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -351,3 +351,79 @@ void cosched_init_topology(void)
/* Make permanent */
set_sched_topology(tl);
}
+
+/*
+ * Build the SD-RQ hierarchy according to the scheduling domains.
+ *
+ * Note, that the scheduler is already live at this point, but the scheduling
+ * domains only just have become available. That means, we only setup everything
+ * above the bottom level of the SD-RQ hierarchy and link it with the already
+ * active bottom level.
+ *
+ * We can do this without any locks, as nothing will automatically traverse into
+ * these data structures. This requires an update of the sdrq.is_root property,
+ * which will happen only later.
+ */
+void cosched_init_hierarchy(void)
+{
+ struct sched_domain *sd;
+ struct sdrq *sdrq;
+ int cpu, level = 1;
+
+ /* Only one CPU in the system, we are finished here */
+ if (cpumask_weight(cpu_possible_mask) == 1)
+ return;
+
+ /* Determine and initialize top */
+ for_each_domain(0, sd) {
+ if (!sd->parent)
+ break;
+ level++;
+ }
+
+ init_sdrq_data(&sd->shared->rq.sdrq_data, NULL, sched_domain_span(sd),
+ level);
+ init_cfs_rq(&sd->shared->rq.cfs);
+ init_tg_cfs_entry(&root_task_group, &sd->shared->rq.cfs, NULL,
+ &sd->shared->rq, NULL);
+ init_sdrq(&root_task_group, &sd->shared->rq.cfs.sdrq, NULL, NULL,
+ &sd->shared->rq.sdrq_data);
+
+ root_task_group.top_cfsrq = &sd->shared->rq.cfs;
+
+ /* Initialize others top-down, per CPU */
+ for_each_possible_cpu(cpu) {
+ /* Find highest not-yet initialized position for this CPU */
+ for_each_domain(cpu, sd) {
+ if (sd->shared->rq.sdrq_data.span_weight)
+ break;
+ }
+ if (WARN(!sd, "SD hierarchy seems to have multiple roots"))
+ continue;
+ sd = sd->child;
+
+ /* Initialize from there downwards */
+ for_each_lower_domain(sd) {
+ init_sdrq_data(&sd->shared->rq.sdrq_data,
+ &sd->parent->shared->rq.sdrq_data,
+ sched_domain_span(sd), -1);
+ init_cfs_rq(&sd->shared->rq.cfs);
+ init_tg_cfs_entry(&root_task_group, &sd->shared->rq.cfs,
+ NULL, &sd->shared->rq, NULL);
+ init_sdrq(&root_task_group, &sd->shared->rq.cfs.sdrq,
+ &sd->parent->shared->rq.cfs.sdrq, NULL,
+ &sd->shared->rq.sdrq_data);
+ }
+
+ /* Link up with local data structures */
+ sdrq = &cpu_rq(cpu)->cfs.sdrq;
+ sd = cpu_rq(cpu)->sd;
+
+ /* sdrq_data */
+ sdrq->data->parent = &sd->shared->rq.sdrq_data;
+
+ /* sdrq */
+ sdrq->sd_parent = &sd->shared->rq.cfs.sdrq;
+ list_add_tail(&sdrq->siblings, &sdrq->sd_parent->children);
+ }
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ed9c526b74ee..d65c98c34c13 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1132,9 +1132,11 @@ static inline struct cfs_rq *taskgroup_next_cfsrq(struct task_group *tg,
#ifdef CONFIG_COSCHEDULING
void cosched_init_bottom(void);
void cosched_init_topology(void);
+void cosched_init_hierarchy(void);
#else /* !CONFIG_COSCHEDULING */
static inline void cosched_init_bottom(void) { }
static inline void cosched_init_topology(void) { }
+static inline void cosched_init_hierarchy(void) { }
#endif /* !CONFIG_COSCHEDULING */

#ifdef CONFIG_SCHED_SMT
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:54:20

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 15/60] sched: Introduce parent_cfs_rq() and use it

Factor out the logic to retrieve the parent CFS runqueue of another
CFS runqueue into its own function and replace open-coded variants.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f0ce4555c26..82cdd75e88b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -282,11 +282,18 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
return grp->my_q;
}

+static inline struct cfs_rq *parent_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ if (!cfs_rq->tg->parent)
+ return NULL;
+ return cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))];
+}
+
static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
if (!cfs_rq->on_list) {
struct rq *rq = rq_of(cfs_rq);
- int cpu = cpu_of(rq);
+ struct cfs_rq *pcfs_rq = parent_cfs_rq(cfs_rq);
/*
* Ensure we either appear before our parent (if already
* enqueued) or force our parent to appear after us when it is
@@ -296,8 +303,7 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
* tmp_alone_branch either when the branch is connected
* to a tree or when we reach the beg of the tree
*/
- if (cfs_rq->tg->parent &&
- cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
+ if (pcfs_rq && pcfs_rq->on_list) {
/*
* If parent is already on the list, we add the child
* just before. Thanks to circular linked property of
@@ -305,14 +311,14 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
* of the list that starts by parent.
*/
list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
- &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
+ &pcfs_rq->leaf_cfs_rq_list);
/*
* The branch is now connected to its tree so we can
* reset tmp_alone_branch to the beginning of the
* list.
*/
rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
- } else if (!cfs_rq->tg->parent) {
+ } else if (!pcfs_rq) {
/*
* cfs rq without parent should be put
* at the tail of the list.
@@ -4716,7 +4722,7 @@ static void sync_throttle(struct cfs_rq *cfs_rq)
if (!cfs_bandwidth_used())
return;

- pcfs_rq = cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))];
+ pcfs_rq = parent_cfs_rq(cfs_rq);

cfs_rq->throttle_count = pcfs_rq->throttle_count;
cfs_rq->throttled_clock_task = rq_clock_task(rq_of(cfs_rq));
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:54:39

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 14/60] sched: Refactor sync_throttle() to accept a CFS runqueue as argument

Prepare for future changes and refactor sync_throttle() to work with
a different set of arguments.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5cad364e3a88..9f0ce4555c26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4709,18 +4709,17 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
throttle_cfs_rq(cfs_rq);
}

-static void sync_throttle(struct task_group *tg, int cpu)
+static void sync_throttle(struct cfs_rq *cfs_rq)
{
- struct cfs_rq *pcfs_rq, *cfs_rq;
+ struct cfs_rq *pcfs_rq;

if (!cfs_bandwidth_used())
return;

- cfs_rq = tg->cfs_rq[cpu];
- pcfs_rq = tg->parent->cfs_rq[cpu];
+ pcfs_rq = cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))];

cfs_rq->throttle_count = pcfs_rq->throttle_count;
- cfs_rq->throttled_clock_task = rq_clock_task(cpu_rq(cpu));
+ cfs_rq->throttled_clock_task = rq_clock_task(rq_of(cfs_rq));
}

/* conditionally throttle active cfs_rq's from put_prev_entity() */
@@ -4887,7 +4886,7 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
-static inline void sync_throttle(struct task_group *tg, int cpu) {}
+static inline void sync_throttle(struct cfs_rq *cfs_rq) {}
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}

static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
@@ -9866,7 +9865,7 @@ void online_fair_sched_group(struct task_group *tg)
raw_spin_lock_irq(&rq->lock);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
- sync_throttle(tg, i);
+ sync_throttle(tg->cfs_rq[i]);
raw_spin_unlock_irq(&rq->lock);
}
}
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:55:23

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 01/60] sched: Store task_group->se[] pointers as part of cfs_rq

Change the storage location of the scheduling entity references
of task groups. Instead of linking them from the task_group struct,
link each SE from the CFS runqueue itself via a new field "my_se".

This resembles the "my_q" field that is already available, just in
the other direction.

Adjust all users, simplifying many of them.
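
To make the new linkage explicit, a minimal sketch of how a group's SE is
now reached (tg_se() is a hypothetical helper, shown only for illustration;
the open-coded form tg->cfs_rq[cpu]->my_se is what the patch actually uses):

	/* Replaces the removed tg->se[cpu] lookup: the SE representing a task
	 * group on a CPU now hangs off the group's cfs_rq for that CPU. */
	static inline struct sched_entity *tg_se(struct task_group *tg, int cpu)
	{
		return tg->cfs_rq[cpu]->my_se;
	}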

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 7 ++-----
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 36 ++++++++++++++++--------------------
kernel/sched/sched.h | 5 ++---
4 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 625bc9897f62..fd1b0abd8474 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5915,7 +5915,7 @@ void __init sched_init(void)
wait_bit_init();

#ifdef CONFIG_FAIR_GROUP_SCHED
- alloc_size += 2 * nr_cpu_ids * sizeof(void **);
+ alloc_size += nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
@@ -5924,9 +5924,6 @@ void __init sched_init(void)
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
- root_task_group.se = (struct sched_entity **)ptr;
- ptr += nr_cpu_ids * sizeof(void **);
-
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);

@@ -6746,7 +6743,7 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
int i;

for_each_possible_cpu(i)
- ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+ ws += schedstat_val(tg->cfs_rq[i]->my_se->statistics.wait_sum);

seq_printf(sf, "wait_sum %llu\n", ws);
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 60caf1fb94e0..4045bd8b2e5d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -385,7 +385,7 @@ void unregister_sched_domain_sysctl(void)
#ifdef CONFIG_FAIR_GROUP_SCHED
static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
{
- struct sched_entity *se = tg->se[cpu];
+ struct sched_entity *se = tg->cfs_rq[cpu]->my_se;

#define P(F) SEQ_printf(m, " .%-30s: %lld\n", #F, (long long)F)
#define P_SCHEDSTAT(F) SEQ_printf(m, " .%-30s: %lld\n", #F, (long long)schedstat_val(F))
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..638fd14bb6c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4367,7 +4367,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
long task_delta, dequeue = 1;
bool empty;

- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+ se = cfs_rq->my_se;

/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
@@ -4421,7 +4421,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
int enqueue = 1;
long task_delta;

- se = cfs_rq->tg->se[cpu_of(rq)];
+ se = cfs_rq->my_se;

cfs_rq->throttled = 0;

@@ -7284,7 +7284,7 @@ static void update_blocked_averages(int cpu)
update_tg_load_avg(cfs_rq, 0);

/* Propagate pending load changes to the parent, if any: */
- se = cfs_rq->tg->se[cpu];
+ se = cfs_rq->my_se;
if (se && !skip_blocked_update(se))
update_load_avg(cfs_rq_of(se), se, 0);

@@ -7321,8 +7321,7 @@ static void update_blocked_averages(int cpu)
*/
static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
{
- struct rq *rq = rq_of(cfs_rq);
- struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+ struct sched_entity *se = cfs_rq->my_se;
unsigned long now = jiffies;
unsigned long load;

@@ -9819,15 +9818,15 @@ void free_fair_sched_group(struct task_group *tg)

destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));

+ if (!tg->cfs_rq)
+ return;
+
for_each_possible_cpu(i) {
- if (tg->cfs_rq)
- kfree(tg->cfs_rq[i]);
- if (tg->se)
- kfree(tg->se[i]);
+ kfree(tg->cfs_rq[i]->my_se);
+ kfree(tg->cfs_rq[i]);
}

kfree(tg->cfs_rq);
- kfree(tg->se);
}

int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
@@ -9839,9 +9838,6 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
if (!tg->cfs_rq)
goto err;
- tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
- if (!tg->se)
- goto err;

tg->shares = NICE_0_LOAD;

@@ -9859,7 +9855,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
goto err_free_rq;

init_cfs_rq(cfs_rq);
- init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
+ init_tg_cfs_entry(tg, cfs_rq, se, i, parent->cfs_rq[i]->my_se);
init_entity_runnable_average(se);
}

@@ -9879,7 +9875,7 @@ void online_fair_sched_group(struct task_group *tg)

for_each_possible_cpu(i) {
rq = cpu_rq(i);
- se = tg->se[i];
+ se = tg->cfs_rq[i]->my_se;

raw_spin_lock_irq(&rq->lock);
update_rq_clock(rq);
@@ -9896,8 +9892,8 @@ void unregister_fair_sched_group(struct task_group *tg)
int cpu;

for_each_possible_cpu(cpu) {
- if (tg->se[cpu])
- remove_entity_load_avg(tg->se[cpu]);
+ if (tg->cfs_rq[cpu]->my_se)
+ remove_entity_load_avg(tg->cfs_rq[cpu]->my_se);

/*
* Only empty task groups can be destroyed; so we can speculatively
@@ -9925,7 +9921,7 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
init_cfs_rq_runtime(cfs_rq);

tg->cfs_rq[cpu] = cfs_rq;
- tg->se[cpu] = se;
+ cfs_rq->my_se = se;

/* se could be NULL for root_task_group */
if (!se)
@@ -9954,7 +9950,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
/*
* We can't change the weight of the root cgroup.
*/
- if (!tg->se[0])
+ if (!tg->cfs_rq[0]->my_se)
return -EINVAL;

shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
@@ -9966,7 +9962,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
tg->shares = shares;
for_each_possible_cpu(i) {
struct rq *rq = cpu_rq(i);
- struct sched_entity *se = tg->se[i];
+ struct sched_entity *se = tg->cfs_rq[i]->my_se;
struct rq_flags rf;

/* Propagate contribution to hierarchy */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8cae63c4..8435bf70a701 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -354,8 +354,6 @@ struct task_group {
struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
- /* schedulable entities of this group on each CPU */
- struct sched_entity **se;
/* runqueue "owned" by this group on each CPU */
struct cfs_rq **cfs_rq;
unsigned long shares;
@@ -537,6 +535,7 @@ struct cfs_rq {

#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */
+ struct sched_entity *my_se; /* entity representing this cfs_rq */

/*
* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
@@ -1301,7 +1300,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
p->se.cfs_rq = tg->cfs_rq[cpu];
- p->se.parent = tg->se[cpu];
+ p->se.parent = tg->cfs_rq[cpu]->my_se;
#endif

#ifdef CONFIG_RT_GROUP_SCHED
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:55:34

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 06/60] sched: Add a lock-free variant of resched_cpu()

Add resched_cpu_locked(), which still works as expected when it is called
while we already hold a runqueue lock of a different CPU.

There is some optimization potential in merging the logic of resched_curr()
and resched_cpu_locked() to avoid redundant IPIs when both functions are called.
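
A hedged sketch of the intended calling pattern (kick_sibling() is
hypothetical; resched_cpu_locked() is the function added below). The point
is that the caller already holds some other runqueue's lock, so acquiring
the target runqueue's lock -- as resched_cpu() does -- would risk nested
runqueue locking:

	/* Called with this_rq->lock held; ask another CPU to reschedule
	 * without touching that CPU's runqueue lock. */
	static void kick_sibling(struct rq *this_rq, int target_cpu)
	{
		lockdep_assert_held(&this_rq->lock);
		resched_cpu_locked(target_cpu);	/* atomic flag + IPI */
	}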

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/core.c | 21 +++++++++++++++++++--
kernel/sched/sched.h | 6 ++++++
2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fd1b0abd8474..c38a54f57e90 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -486,6 +486,15 @@ void resched_cpu(int cpu)
}

#ifdef CONFIG_SMP
+/* resched_cpu() when you're already holding a RQ lock of a different CPU */
+void resched_cpu_locked(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ if (!atomic_read(&rq->resched) && !atomic_xchg(&rq->resched, 1))
+ smp_send_reschedule(cpu);
+}
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* In the semi idle case, use the nearest busy CPU for migrating timers
@@ -1744,6 +1753,14 @@ void sched_ttwu_pending(void)

void scheduler_ipi(void)
{
+ struct rq *rq = this_rq();
+
+ /* Handle lock-free requests to reschedule the current task */
+ if (atomic_read(&rq->resched)) {
+ atomic_set(&rq->resched, 0);
+ set_thread_flag(TIF_NEED_RESCHED);
+ }
+
/*
* Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
* TIF_NEED_RESCHED remotely (for the first time) will also send
@@ -1751,7 +1768,7 @@ void scheduler_ipi(void)
*/
preempt_fold_need_resched();

- if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
+ if (llist_empty(&rq->wake_list) && !got_nohz_idle_kick())
return;

/*
@@ -1774,7 +1791,7 @@ void scheduler_ipi(void)
* Check if someone kicked us for doing the nohz idle load balance.
*/
if (unlikely(got_nohz_idle_kick())) {
- this_rq()->idle_balance = 1;
+ rq->idle_balance = 1;
raise_softirq_irqoff(SCHED_SOFTIRQ);
}
irq_exit();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f6da85447f3c..926a26d816a2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -850,6 +850,9 @@ struct rq {
int cpu;
int online;

+ /* Lock-free rescheduling request for this runqueue */
+ atomic_t resched;
+
struct list_head cfs_tasks;

struct sched_avg avg_rt;
@@ -1647,6 +1650,9 @@ extern void reweight_task(struct task_struct *p, int prio);

extern void resched_curr(struct rq *rq);
extern void resched_cpu(int cpu);
+#ifdef CONFIG_SMP
+void resched_cpu_locked(int cpu);
+#endif

extern struct rt_bandwidth def_rt_bandwidth;
extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
--
2.9.3.1.gcba166c.dirty


2018-09-07 21:56:11

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 02/60] sched: Introduce set_entity_cfs() to place a SE into a certain CFS runqueue

Factor out the logic to place an SE into a CFS runqueue into its own
function.

This consolidates various sprinkled updates of se->cfs_rq, se->parent,
and se->depth at the cost of updating se->depth unnecessarily on
same-group movements between CPUs.

Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/fair.c | 26 ++++----------------------
kernel/sched/sched.h | 12 +++++++++---
2 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 638fd14bb6c4..3de0158729a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9683,14 +9683,6 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);

-#ifdef CONFIG_FAIR_GROUP_SCHED
- /*
- * Since the real-depth could have been changed (only FAIR
- * class maintain depth value), reset depth properly.
- */
- se->depth = se->parent ? se->parent->depth + 1 : 0;
-#endif
-
/* Synchronize entity with its cfs_rq */
update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
attach_entity_load_avg(cfs_rq, se, 0);
@@ -9781,10 +9773,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_FAIR_GROUP_SCHED
static void task_set_group_fair(struct task_struct *p)
{
- struct sched_entity *se = &p->se;
-
set_task_rq(p, task_cpu(p));
- se->depth = se->parent ? se->parent->depth + 1 : 0;
}

static void task_move_group_fair(struct task_struct *p)
@@ -9855,7 +9844,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
goto err_free_rq;

init_cfs_rq(cfs_rq);
- init_tg_cfs_entry(tg, cfs_rq, se, i, parent->cfs_rq[i]->my_se);
+ init_tg_cfs_entry(tg, cfs_rq, se, i, parent->cfs_rq[i]);
init_entity_runnable_average(se);
}

@@ -9912,7 +9901,7 @@ void unregister_fair_sched_group(struct task_group *tg)

void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct sched_entity *se, int cpu,
- struct sched_entity *parent)
+ struct cfs_rq *parent)
{
struct rq *rq = cpu_rq(cpu);

@@ -9927,18 +9916,11 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
if (!se)
return;

- if (!parent) {
- se->cfs_rq = &rq->cfs;
- se->depth = 0;
- } else {
- se->cfs_rq = parent->my_q;
- se->depth = parent->depth + 1;
- }
-
+ set_entity_cfs(se, parent);
se->my_q = cfs_rq;
+
/* guarantee group entities always have weight */
update_load_set(&se->load, NICE_0_LOAD);
- se->parent = parent;
}

static DEFINE_MUTEX(shares_mutex);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8435bf70a701..b4d0e8a68697 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -428,7 +428,7 @@ extern void online_fair_sched_group(struct task_group *tg);
extern void unregister_fair_sched_group(struct task_group *tg);
extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct sched_entity *se, int cpu,
- struct sched_entity *parent);
+ struct cfs_rq *parent);
extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);

extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
@@ -1290,6 +1290,13 @@ static inline struct task_group *task_group(struct task_struct *p)
return p->sched_task_group;
}

+static inline void set_entity_cfs(struct sched_entity *se, struct cfs_rq *cfs_rq)
+{
+ se->cfs_rq = cfs_rq;
+ se->parent = cfs_rq->my_se;
+ se->depth = se->parent ? se->parent->depth + 1 : 0;
+}
+
/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
@@ -1299,8 +1306,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)

#ifdef CONFIG_FAIR_GROUP_SCHED
set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
- p->se.cfs_rq = tg->cfs_rq[cpu];
- p->se.parent = tg->cfs_rq[cpu]->my_se;
+ set_entity_cfs(&p->se, tg->cfs_rq[cpu]);
#endif

#ifdef CONFIG_RT_GROUP_SCHED
--
2.9.3.1.gcba166c.dirty


2018-09-10 02:52:16

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC 60/60] cosched: Add command line argument to enable coscheduling

On 9/7/18 2:40 PM, Jan H. Schönherr wrote:
> Add a new command line argument cosched_max_level=<n>, which allows
> enabling coscheduling at boot. The number corresponds to the scheduling
> domain up to which coscheduling can later be enabled for cgroups.
>
> For example, to enable coscheduling of cgroups at SMT level, one would
> specify cosched_max_level=1.
>
> The use of symbolic names (like off, core, socket, system) is currently
> not possible, but could be added. However, to force coscheduling up to
> system level without knowing the scheduling domain topology in advance,
> it is possible to just specify an overly large number. It will be clamped
> transparently to system level.
>
> Signed-off-by: Jan H. Schönherr <[email protected]>
> ---
> kernel/sched/cosched.c | 32 +++++++++++++++++++++++++++++++-
> 1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
> index eb6a6a61521e..a1f0d3a7b02a 100644
> --- a/kernel/sched/cosched.c
> +++ b/kernel/sched/cosched.c
> @@ -162,6 +162,29 @@ static int __init cosched_split_domains_setup(char *str)
>
> early_param("cosched_split_domains", cosched_split_domains_setup);
>

> +
> +early_param("cosched_max_level", cosched_max_level_setup);
> +

Hi,
Please document both of these kernel parameters in
Documentation/admin-guide/kernel-parameters.txt.

thanks,
--
~Randy

2018-09-12 00:25:42

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

[ I am not subscribed to LKML, please keep me CC'd on replies ]

On 07.09.2018 [23:39:47 +0200], Jan H. Schönherr wrote:
> This patch series extends CFS with support for coscheduling. The
> implementation is versatile enough to cover many different
> coscheduling use-cases, while at the same time being non-intrusive, so
> that behavior of legacy workloads does not change.

I tried a simple test with several VMs (in my initial test, I have 48
idle 1-cpu 512-mb VMs and 2 idle 2-cpu, 2-gb VMs) using libvirt, none
pinned to any CPUs. When I tried to set all of the top-level libvirt cpu
cgroups' to be co-scheduled (/bin/echo 1 >
/sys/fs/cgroup/cpu/machine/<VM-x>.libvirt-qemu/cpu.scheduled), the
machine hangs. This is using cosched_max_level=1.

There are several moving parts there, so I tried narrowing it down by
coscheduling only one VM, and things seemed fine:

/sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# echo 1 > cpu.scheduled
/sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat cpu.scheduled
1

One thing that is not entirely obvious to me (but might be completely
intentional) is that since by default the top-level libvirt cpu cgroups
are empty:

/sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat tasks

the result of this should be a no-op, right? [This becomes relevant
below] Specifically, all of the threads of qemu are in sub-cgroups,
which do not indicate they are co-scheduling:

/sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat emulator/cpu.scheduled
0
/sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat vcpu0/cpu.scheduled
0

When I then try to coschedule the second VM, the machine hangs.

/sys/fs/cgroup/cpu/machine/<VM-2>.libvirt-qemu# echo 1 > cpu.scheduled
Timeout, server <HOST> not responding.

On the console, I see the same backtraces I see when I try to set all of
the VMs to be coscheduled:

[ 144.494091] watchdog: BUG: soft lockup - CPU#87 stuck for 22s! [CPU 0/KVM:25344]
[ 144.507629] Modules linked in: act_police cls_basic ebtable_filter ebtables ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw ip_tables s
[ 144.578858] xxhash raid10 raid0 multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor ses raid6_pq enclosure libcrc32c raid1 scsi
[ 144.599227] CPU: 87 PID: 25344 Comm: CPU 0/KVM Tainted: G O 4.19.0-rc2-amazon-cosched+ #1
[ 144.608819] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 06/29/2018
[ 144.616403] RIP: 0010:smp_call_function_single+0xa7/0xd0
[ 144.621818] Code: 01 48 89 d1 48 89 f2 4c 89 c6 e8 64 fe ff ff c9 c3 48 89 d1 48 89 f2 48 89 e6 e8 54 fe ff ff 8b 54 24 18 83 e2 01 74 0b f3 90 <8b> 54 24 18 83 e25
[ 144.640703] RSP: 0018:ffffb2a4a75abb40 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 144.648390] RAX: 0000000000000000 RBX: 0000000000000057 RCX: 0000000000000000
[ 144.655607] RDX: 0000000000000001 RSI: 00000000000000fb RDI: 0000000000000202
[ 144.662826] RBP: ffffb2a4a75abb60 R08: 0000000000000000 R09: 0000000000000f39
[ 144.670073] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a9c03fc8000
[ 144.677301] R13: ffff8ab4589dc100 R14: 0000000000000057 R15: 0000000000000000
[ 144.684519] FS: 00007f51cd41a700(0000) GS:ffff8ab45fac0000(0000) knlGS:0000000000000000
[ 144.692710] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 144.698542] CR2: 000000c4203c0000 CR3: 000000178a97e005 CR4: 00000000007626e0
[ 144.705771] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 144.712989] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 144.720215] PKRU: 55555554
[ 144.723016] Call Trace:
[ 144.725553] ? vmx_sched_in+0xc0/0xc0 [kvm_intel]
[ 144.730341] vmx_vcpu_load+0x244/0x310 [kvm_intel]
[ 144.735220] ? __switch_to_asm+0x40/0x70
[ 144.739231] ? __switch_to_asm+0x34/0x70
[ 144.743235] ? __switch_to_asm+0x40/0x70
[ 144.747240] ? __switch_to_asm+0x34/0x70
[ 144.751243] ? __switch_to_asm+0x40/0x70
[ 144.755246] ? __switch_to_asm+0x34/0x70
[ 144.759250] ? __switch_to_asm+0x40/0x70
[ 144.763272] ? __switch_to_asm+0x34/0x70
[ 144.767284] ? __switch_to_asm+0x40/0x70
[ 144.771296] ? __switch_to_asm+0x34/0x70
[ 144.775299] ? __switch_to_asm+0x40/0x70
[ 144.779313] ? __switch_to_asm+0x34/0x70
[ 144.783317] ? __switch_to_asm+0x40/0x70
[ 144.787338] kvm_arch_vcpu_load+0x40/0x270 [kvm]
[ 144.792056] finish_task_switch+0xe2/0x260
[ 144.796238] __schedule+0x316/0x890
[ 144.799810] schedule+0x32/0x80
[ 144.803039] kvm_vcpu_block+0x7a/0x2e0 [kvm]
[ 144.807399] kvm_arch_vcpu_ioctl_run+0x1a7/0x1990 [kvm]
[ 144.812705] ? futex_wake+0x84/0x150
[ 144.816368] kvm_vcpu_ioctl+0x3ab/0x5d0 [kvm]
[ 144.820810] ? wake_up_q+0x70/0x70
[ 144.824311] do_vfs_ioctl+0x92/0x600
[ 144.827985] ? syscall_trace_enter+0x1ac/0x290
[ 144.832517] ksys_ioctl+0x60/0x90
[ 144.835913] ? exit_to_usermode_loop+0xa6/0xc2
[ 144.840436] __x64_sys_ioctl+0x16/0x20
[ 144.844267] do_syscall_64+0x55/0x110
[ 144.848012] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 144.853160] RIP: 0033:0x7f51cf82bea7
[ 144.856816] Code: 44 00 00 48 8b 05 e1 cf 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff8
[ 144.875752] RSP: 002b:00007f51cd419a18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010

I am happy to do any further debugging I can do, or try patches on top
of those posted on the mailing list.

Thanks,
Nish

2018-09-12 19:35:03

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 09/12/2018 02:24 AM, Nishanth Aravamudan wrote:
> [ I am not subscribed to LKML, please keep me CC'd on replies ]
>
> I tried a simple test with several VMs (in my initial test, I have 48
> idle 1-cpu 512-mb VMs and 2 idle 2-cpu, 2-gb VMs) using libvirt, none
> pinned to any CPUs. When I tried to set all of the top-level libvirt cpu
> cgroups' to be co-scheduled (/bin/echo 1 >
> /sys/fs/cgroup/cpu/machine/<VM-x>.libvirt-qemu/cpu.scheduled), the
> machine hangs. This is using cosched_max_level=1.
>
> There are several moving parts there, so I tried narrowing it down, by
> only coscheduling one VM, and things seemed fine:
>
> /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# echo 1 > cpu.scheduled
> /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat cpu.scheduled
> 1
>
> One thing that is not entirely obvious to me (but might be completely
> intentional) is that since by default the top-level libvirt cpu cgroups
> are empty:
>
> /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat tasks
>
> the result of this should be a no-op, right? [This becomes relevant
> below] Specifically, all of the threads of qemu are in sub-cgroups,
> which do not indicate they are co-scheduling:
>
> /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat emulator/cpu.scheduled
> 0
> /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat vcpu0/cpu.scheduled
> 0
>

This setup *should* work. It should be possible to set cpu.scheduled
independent of the cpu.scheduled values of parent and child task groups.
Any intermediate regular task group (i.e. cpu.scheduled==0) will still
contribute the group fairness aspects.

That said, I see a hang, too. It seems to happen, when there is a
cpu.scheduled!=0 group that is not a direct child of the root task group.
You seem to have "/sys/fs/cgroup/cpu/machine" as an intermediate group.
(The case ==0 within !=0 within the root task group works for me.)
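To spell out the two cases as I understand them:

	works:  root
	          `-- <VM>.libvirt-qemu           cpu.scheduled=1
	                `-- vcpu0                 cpu.scheduled=0

	hangs:  root
	          `-- machine                     cpu.scheduled=0
	                `-- <VM>.libvirt-qemu     cpu.scheduled=1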

I'm going to dive into the code.

[...]
> I am happy to do any further debugging I can do, or try patches on top
> of those posted on the mailing list.

If you're willing, you can try to get rid of the intermediate "machine"
cgroup in your setup for the moment. This might tell us, whether we're
looking at the same issue.

Thanks,
Jan

2018-09-12 23:16:13

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 12.09.2018 [21:34:14 +0200], Jan H. Schönherr wrote:
> On 09/12/2018 02:24 AM, Nishanth Aravamudan wrote:
> > [ I am not subscribed to LKML, please keep me CC'd on replies ]
> >
> > I tried a simple test with several VMs (in my initial test, I have 48
> > idle 1-cpu 512-mb VMs and 2 idle 2-cpu, 2-gb VMs) using libvirt, none
> > pinned to any CPUs. When I tried to set all of the top-level libvirt cpu
> > cgroups' to be co-scheduled (/bin/echo 1 >
> > /sys/fs/cgroup/cpu/machine/<VM-x>.libvirt-qemu/cpu.scheduled), the
> > machine hangs. This is using cosched_max_level=1.
> >
> > There are several moving parts there, so I tried narrowing it down, by
> > only coscheduling one VM, and things seemed fine:
> >
> > /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# echo 1 > cpu.scheduled
> > /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat cpu.scheduled
> > 1
> >
> > One thing that is not entirely obvious to me (but might be completely
> > intentional) is that since by default the top-level libvirt cpu cgroups
> > are empty:
> >
> > /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat tasks
> >
> > the result of this should be a no-op, right? [This becomes relevant
> > below] Specifically, all of the threads of qemu are in sub-cgroups,
> > which do not indicate they are co-scheduling:
> >
> > /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat emulator/cpu.scheduled
> > 0
> > /sys/fs/cgroup/cpu/machine/<VM-1>.libvirt-qemu# cat vcpu0/cpu.scheduled
> > 0
> >
>
> This setup *should* work. It should be possible to set cpu.scheduled
> independent of the cpu.scheduled values of parent and child task groups.
> Any intermediate regular task group (i.e. cpu.scheduled==0) will still
> contribute the group fairness aspects.

Ah I see, that makes sense, thank you.

> That said, I see a hang, too. It seems to happen, when there is a
> cpu.scheduled!=0 group that is not a direct child of the root task group.
> You seem to have "/sys/fs/cgroup/cpu/machine" as an intermediate group.
> (The case ==0 within !=0 within the root task group works for me.)
>
> I'm going to dive into the code.
>
> [...]
> > I am happy to do any further debugging I can do, or try patches on top
> > of those posted on the mailing list.
>
> If you're willing, you can try to get rid of the intermediate "machine"
> cgroup in your setup for the moment. This might tell us, whether we're
> looking at the same issue.

Yep I will do this now. Note that if I just try to set machine's
cpu.scheduled to 1, with no other changes (not even changing any child
cgroup's cpu.scheduled yet), I get the following trace:

[16052.164259] ------------[ cut here ]------------
[16052.168973] rq->clock_update_flags < RQCF_ACT_SKIP
[16052.168991] WARNING: CPU: 59 PID: 59533 at kernel/sched/sched.h:1303 assert_clock_updated.isra.82.part.83+0x15/0x18
[16052.184424] Modules linked in: act_police cls_basic ebtable_filter ebtables ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw ip_tables s
[16052.255653] xxhash raid10 raid0 multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq ses libcrc32c raid1 enclosure scsi
[16052.276029] CPU: 59 PID: 59533 Comm: bash Tainted: G O 4.19.0-rc2-amazon-cosched+ #1
[16052.291142] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 06/29/2018
[16052.298728] RIP: 0010:assert_clock_updated.isra.82.part.83+0x15/0x18
[16052.305166] Code: 0f 85 75 ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 28 30 eb 94 31 c0 c6 05 47 18 27 01 01 e8 f4 df fb ff <0f> 0b c3 48 8b 970
[16052.324050] RSP: 0018:ffff9cada610bca8 EFLAGS: 00010096
[16052.329361] RAX: 0000000000000026 RBX: ffff8f06d65bae00 RCX: 0000000000000006
[16052.336580] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff8f1edf756620
[16052.343799] RBP: ffff8f06e0462e00 R08: 000000000000079b R09: ffff9cada610bc48
[16052.351018] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f06e0462e80
[16052.358237] R13: 0000000000000001 R14: ffff8f06e0462e00 R15: 0000000000000001
[16052.365458] FS: 00007ff07ab02740(0000) GS:ffff8f1edf740000(0000) knlGS:0000000000000000
[16052.373647] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16052.379480] CR2: 00007ff07ab139d8 CR3: 0000002ca2aea002 CR4: 00000000007626e0
[16052.386698] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16052.393917] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16052.401137] PKRU: 55555554
[16052.403927] Call Trace:
[16052.406460] update_curr+0x19f/0x1c0
[16052.410116] dequeue_entity+0x21/0x8c0
[16052.413950] ? terminate_walk+0x55/0xb0
[16052.417871] dequeue_entity_fair+0x46/0x1c0
[16052.422136] sdrq_update_root+0x35d/0x480
[16052.426227] cosched_set_scheduled+0x80/0x1c0
[16052.430675] cpu_scheduled_write_u64+0x26/0x30
[16052.435209] cgroup_file_write+0xe3/0x140
[16052.439305] kernfs_fop_write+0x110/0x190
[16052.443397] __vfs_write+0x26/0x170
[16052.446974] ? __audit_syscall_entry+0x101/0x130
[16052.451674] ? _cond_resched+0x15/0x30
[16052.455509] ? __sb_start_write+0x41/0x80
[16052.459600] vfs_write+0xad/0x1a0
[16052.462997] ksys_write+0x42/0x90
[16052.466397] do_syscall_64+0x55/0x110
[16052.470152] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[16052.475286] RIP: 0033:0x7ff07a1e93c0
[16052.478943] Code: 73 01 c3 48 8b 0d c8 2a 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d bd 8c 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff4
[16052.497827] RSP: 002b:00007ffc73e335b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[16052.505498] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007ff07a1e93c0
[16052.512715] RDX: 0000000000000002 RSI: 00000000023a0408 RDI: 0000000000000001
[16052.519936] RBP: 00000000023a0408 R08: 000000000000000a R09: 00007ff07ab02740
[16052.527156] R10: 00007ff07a4bb6a0 R11: 0000000000000246 R12: 00007ff07a4bd400
[16052.534374] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
[16052.541593] ---[ end trace b20c73e6c2bec22c ]---

I'll reboot and move some cgroups around :)

Thanks,
Nish

2018-09-12 23:20:23

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 00/60] Coscheduling for Linux

On 09/12/2018 09:34 PM, Jan H. Schönherr wrote:
> That said, I see a hang, too. It seems to happen, when there is a
> cpu.scheduled!=0 group that is not a direct child of the root task group.
> You seem to have "/sys/fs/cgroup/cpu/machine" as an intermediate group.
> (The case ==0 within !=0 within the root task group works for me.)
>
> I'm going to dive into the code.

With the patch below (which technically changes patch 55/60), the hang I experienced is gone.

Please let me know, if it works for you as well.

Regards
Jan


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8da2033596ff..2d8b3f9a275f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7189,23 +7189,26 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
int pse_depth = pse->depth;
+ bool work = false;

- if (se_depth <= pse_depth && leader_of(pse) == cpu) {
+ if (se_depth <= pse_depth && __leader_of(pse) == cpu) {
put_prev_entity(cfs_rq_of(pse), pse);
pse = parent_entity(pse);
+ work = true;
}
- if (se_depth >= pse_depth && leader_of(se) == cpu) {
+ if (se_depth >= pse_depth && __leader_of(se) == cpu) {
set_next_entity(cfs_rq_of(se), se);
se = parent_entity(se);
+ work = true;
}
- if (leader_of(pse) != cpu && leader_of(se) != cpu)
+ if (!work)
break;
}

- if (leader_of(pse) == cpu)
- put_prev_entity(cfs_rq, pse);
- if (leader_of(se) == cpu)
- set_next_entity(cfs_rq, se);
+ if (__leader_of(pse) == cpu)
+ put_prev_entity(cfs_rq_of(pse), pse);
+ if (__leader_of(se) == cpu)
+ set_next_entity(cfs_rq_of(se), se);
}

goto done;

2018-09-13 03:06:05

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 13.09.2018 [01:18:14 +0200], Jan H. Schönherr wrote:
> On 09/12/2018 09:34 PM, Jan H. Schönherr wrote:
> > That said, I see a hang, too. It seems to happen, when there is a
> > cpu.scheduled!=0 group that is not a direct child of the root task group.
> > You seem to have "/sys/fs/cgroup/cpu/machine" as an intermediate group.
> > (The case ==0 within !=0 within the root task group works for me.)
> >
> > I'm going to dive into the code.
>
> With the patch below (which technically changes patch 55/60), the hang
> I experienced is gone.
>
> Please let me know, if it works for you as well.

Yep, this does fix the soft lockups for me, thanks! However, if I do a:

# find /sys/fs/cgroup/cpu/machine -mindepth 2 -maxdepth 2 -name cpu.scheduled -exec /bin/sh -c "echo 1 > {} " \;

which should co-schedule all the cgroups for emulator and vcpu threads,
I see the same warning I mentioned in my other e-mail:

[10469.832822] ------------[ cut here ]------------
[10469.837555] rq->clock_update_flags < RQCF_ACT_SKIP
[10469.837574] WARNING: CPU: 89 PID: 49630 at kernel/sched/sched.h:1303 assert_clock_updated.isra.82.part.83+0x15/0x18
[10469.853042] Modules linked in: act_police cls_basic ebtable_filter ebtables ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw ip_tables s
[10469.924590] xxhash raid10 raid0 multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq ses libcrc32c raid1 enclosure scsi
[10469.945010] CPU: 89 PID: 49630 Comm: sh Tainted: G O 4.19.0-rc2-amazon-cosched+ #2
[10469.960061] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 06/29/2018
[10469.967657] RIP: 0010:assert_clock_updated.isra.82.part.83+0x15/0x18
[10469.974126] Code: 0f 85 75 ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 28 30 eb 8d 31 c0 c6 05 67 18 27 01 01 e8 14 e0 fb ff <0f> 0b c3 48 8b 970
[10469.993018] RSP: 0018:ffffabc0b534fca8 EFLAGS: 00010096
[10469.998341] RAX: 0000000000000026 RBX: ffff9d74d12ede00 RCX: 0000000000000006
[10470.005559] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff9d74dfb16620
[10470.012780] RBP: ffff9d74df562e00 R08: 0000000000000796 R09: ffffabc0b534fc48
[10470.020005] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d74d2849800
[10470.027226] R13: 0000000000000001 R14: ffff9d74df562e00 R15: 0000000000000001
[10470.034445] FS: 00007fea86812740(0000) GS:ffff9d74dfb00000(0000) knlGS:0000000000000000
[10470.042678] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10470.048511] CR2: 00005620f00314d8 CR3: 0000002cc55ea004 CR4: 00000000007626e0
[10470.055739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10470.062965] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10470.070186] PKRU: 55555554
[10470.072976] Call Trace:
[10470.075508] update_curr+0x19f/0x1c0
[10470.079211] dequeue_entity+0x21/0x8c0
[10470.083056] dequeue_entity_fair+0x46/0x1c0
[10470.087321] sdrq_update_root+0x35d/0x480
[10470.091420] cosched_set_scheduled+0x80/0x1c0
[10470.095892] cpu_scheduled_write_u64+0x26/0x30
[10470.100427] cgroup_file_write+0xe3/0x140
[10470.104523] kernfs_fop_write+0x110/0x190
[10470.108624] __vfs_write+0x26/0x170
[10470.112236] ? __audit_syscall_entry+0x101/0x130
[10470.116943] ? _cond_resched+0x15/0x30
[10470.120781] ? __sb_start_write+0x41/0x80
[10470.124871] vfs_write+0xad/0x1a0
[10470.128268] ksys_write+0x42/0x90
[10470.131668] do_syscall_64+0x55/0x110
[10470.135421] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10470.140558] RIP: 0033:0x7fea863253c0
[10470.144213] Code: 73 01 c3 48 8b 0d c8 2a 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d bd 8c 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff4
[10470.163114] RSP: 002b:00007ffe7cb22d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[10470.170783] RAX: ffffffffffffffda RBX: 00005620f002f4d0 RCX: 00007fea863253c0
[10470.178002] RDX: 0000000000000002 RSI: 00005620f002f4d0 RDI: 0000000000000001
[10470.185222] RBP: 0000000000000002 R08: 0000000000000001 R09: 000000000000006b
[10470.192486] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
[10470.199705] R13: 0000000000000002 R14: 7fffffffffffffff R15: 0000000000000002
[10470.206923] ---[ end trace fbf46e2c721c7acb ]---

Thanks,
Nish

2018-09-13 11:32:46

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 09/13/2018 01:15 AM, Nishanth Aravamudan wrote:
> [...] if I just try to set machine's
> cpu.scheduled to 1, with no other changes (not even changing any child
> cgroup's cpu.scheduled yet), I get the following trace:
>
> [16052.164259] ------------[ cut here ]------------
> [16052.168973] rq->clock_update_flags < RQCF_ACT_SKIP
> [16052.168991] WARNING: CPU: 59 PID: 59533 at kernel/sched/sched.h:1303 assert_clock_updated.isra.82.part.83+0x15/0x18
[snip]

This goes away with the change below (which fixes patch 58/60).

Thanks,
Jan

diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index a1f0d3a7b02a..a98ea11ba172 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -588,6 +588,7 @@ static void sdrq_update_root(struct sdrq *sdrq)

/* Get proper locks */
rq_lock_irqsave(rq, &rf);
+ update_rq_clock(rq);

sdrq->is_root = is_root;
if (is_root)
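
(For context, not part of the fix itself: the rule being followed here is
that a path which takes the rq lock and then enqueues/dequeues entities on
that runqueue is expected to refresh the runqueue clock first, roughly:)

	struct rq_flags rf;

	rq_lock_irqsave(rq, &rf);
	update_rq_clock(rq);	/* otherwise update_curr() & friends trip
				 * over assert_clock_updated() */
	/* ... dequeue/enqueue/put_prev/set_next on this rq ... */
	rq_unlock_irqrestore(rq, &rf);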

2018-09-13 18:17:09

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 13.09.2018 [13:31:36 +0200], Jan H. Schönherr wrote:
> On 09/13/2018 01:15 AM, Nishanth Aravamudan wrote:
> > [...] if I just try to set machine's
> > cpu.scheduled to 1, with no other changes (not even changing any child
> > cgroup's cpu.scheduled yet), I get the following trace:
> >
> > [16052.164259] ------------[ cut here ]------------
> > [16052.168973] rq->clock_update_flags < RQCF_ACT_SKIP
> > [16052.168991] WARNING: CPU: 59 PID: 59533 at kernel/sched/sched.h:1303 assert_clock_updated.isra.82.part.83+0x15/0x18
> [snip]
>
> This goes away with the change below (which fixes patch 58/60).

Yep, I can confirm this one as well; it is now fixed.

-Nish

2018-09-13 19:31:33

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 61/60] cosched: Accumulated fixes and improvements

Here is an "extra" patch containing bug fixes and warning removals,
that I have accumulated up to this point.

It goes on top of the other 60 patches. (When it is time for v2,
these fixes will be integrated into the appropriate patches within
the series.)

The changes are:

1. Avoid a hang with nested scheduled task groups.
2. Get rid of a lockdep warning.
3. Get rid of warnings about missed clock updates.
4. Get rid of "untested code path" warnings/reminders (after testing
said code paths).

This should make experimenting with this patch series a little less bumpy.

Partly-reported-by: Nishanth Aravamudan <[email protected]>
Signed-off-by: Jan H. Schönherr <[email protected]>
---
kernel/sched/cosched.c | 2 ++
kernel/sched/fair.c | 35 ++++++++++-------------------------
2 files changed, 12 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/cosched.c b/kernel/sched/cosched.c
index a1f0d3a7b02a..88b7bc6d4bfa 100644
--- a/kernel/sched/cosched.c
+++ b/kernel/sched/cosched.c
@@ -588,6 +588,7 @@ static void sdrq_update_root(struct sdrq *sdrq)

/* Get proper locks */
rq_lock_irqsave(rq, &rf);
+ update_rq_clock(rq);

sdrq->is_root = is_root;
if (is_root)
@@ -644,6 +645,7 @@ static void sdrq_update_root(struct sdrq *sdrq)
}
if (sdrq->cfs_rq->curr) {
rq_lock(prq, &prf);
+ update_rq_clock(prq);
if (sdrq->data->leader == sdrq->sd_parent->data->leader)
put_prev_entity_fair(prq, sdrq->sd_parent->sd_se);
rq_unlock(prq, &prf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8da2033596ff..4c823fa367ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7150,23 +7150,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf

se = pick_next_entity(cfs_rq, curr);
cfs_rq = pick_next_cfs(se);
-
-#ifdef CONFIG_COSCHEDULING
- if (cfs_rq && is_sd_se(se) && cfs_rq->sdrq.is_root) {
- WARN_ON_ONCE(1); /* Untested code path */
- /*
- * Race with is_root update.
- *
- * We just moved downwards in the hierarchy via an
- * SD-SE, the CFS-RQ should have is_root set to zero.
- * However, a reconfiguration may be in progress. We
- * basically ignore that reconfiguration.
- *
- * Contrary to the case below, there is nothing to fix
- * as all the set_next_entity() calls are done later.
- */
- }
-#endif
} while (cfs_rq);

if (is_sd_se(se))
@@ -7189,23 +7172,26 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
int pse_depth = pse->depth;
+ bool work = false;

- if (se_depth <= pse_depth && leader_of(pse) == cpu) {
+ if (se_depth <= pse_depth && __leader_of(pse) == cpu) {
put_prev_entity(cfs_rq_of(pse), pse);
pse = parent_entity(pse);
+ work = true;
}
- if (se_depth >= pse_depth && leader_of(se) == cpu) {
+ if (se_depth >= pse_depth && __leader_of(se) == cpu) {
set_next_entity(cfs_rq_of(se), se);
se = parent_entity(se);
+ work = true;
}
- if (leader_of(pse) != cpu && leader_of(se) != cpu)
+ if (!work)
break;
}

- if (leader_of(pse) == cpu)
- put_prev_entity(cfs_rq, pse);
- if (leader_of(se) == cpu)
- set_next_entity(cfs_rq, se);
+ if (__leader_of(pse) == cpu)
+ put_prev_entity(cfs_rq_of(pse), pse);
+ if (__leader_of(se) == cpu)
+ set_next_entity(cfs_rq_of(se), se);
}

goto done;
@@ -7243,7 +7229,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
struct sched_entity *se2 = cfs_rq->sdrq.tg_se;
bool top = false;

- WARN_ON_ONCE(1); /* Untested code path */
while (se && se != se2) {
if (!top) {
put_prev_entity(cfs_rq_of(se), se);
--
2.9.3.1.gcba166c.dirty


2018-09-14 11:13:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
> This patch series extends CFS with support for coscheduling. The
> implementation is versatile enough to cover many different coscheduling
> use-cases, while at the same time being non-intrusive, so that behavior of
> legacy workloads does not change.

I don't call this non-intrusive.

> Peter Zijlstra once called coscheduling a "scalability nightmare waiting to
> happen". Well, with this patch series, coscheduling certainly happened.

I'll beg to differ; this isn't anywhere near something to consider
merging. Also 'happened' suggests a certain stage of completeness; this
again doesn't qualify.

> However, I disagree on the scalability nightmare. :)

There are known scalability problems with the existing cgroup muck; you
just made things a ton worse. The existing cgroup overhead is
significant, you also made that many times worse.

The cgroup stuff needs cleanups and optimization, not this.

> B) Why would I want this?

> In the L1TF context, it prevents other applications from loading
> additional data into the L1 cache, while one application tries to leak
> data.

That is the whole and only reason you did this; and it doesn't even
begin to cover the requirements for it.

Not to mention I detest cgroups; for their inherent complexity and the
performance costs associated with them. _If_ we're going to do
something for L1TF then I feel it should not depend on cgroups.

It is after all, perfectly possible to run a kvm thingy without cgroups.

> 1. Execute parallel applications that rely on active waiting or synchronous
> execution concurrently with other applications.
>
> The prime example in this class are probably virtual machines. Here,
> coscheduling is an alternative to paravirtualized spinlocks, pause loop
> exiting, and other techniques with its own set of advantages and
> disadvantages over the other approaches.

Note that in order to avoid PLE and paravirt spinlocks and paravirt
tlb-invalidate you have to gang-schedule the _entire_ VM, not just SMT
siblings.

Now explain to me how you're going to gang-schedule a VM with a good
number of vCPU threads (say spanning a number of nodes) and preserving
the rest of CFS without it turning into a massive trainwreck?

Such things (gang scheduling VMs) _are_ possible, but not within the
confines of something like CFS, they are also fairly inefficient
because, as you do note, you will have to explicitly schedule idle time
for idle vCPUs.

Things like the Tableau scheduler are what come to mind; but I'm not
sure how to integrate that with a general purpose scheduling scheme. You
pretty much have to dedicate a set of CPUs to just scheduling VMs with
such a scheduler.

And that would call for cpuset-v2 integration along with a new
scheduling class.

And then people will complain again that partitioning a system isn't
dynamic enough and we need magic :/

(and this too would be tricky to virtualize itself)

> C) How does it work?
> --------------------
>
> This patch series introduces hierarchical runqueues, that represent larger
> and larger fractions of the system. By default, there is one runqueue per
> scheduling domain. These additional levels of runqueues are activated by
> the "cosched_max_level=" kernel command line argument. The bottom level is
> 0.

You gloss over a ton of details here; many of which are non trivial and
marked broken in your patches. Unless you have solid suggestions on how
to deal with all of them, this is a complete non-starter.

The per-cpu IRQ/steal time accounting for example. The task timeline
isn't the same on every CPU because of those.

You now basically require steal time and IRQ load to match between CPUs.
That places very strict requirements and effectively breaks virt
invariance. That is, the scheduler now behaves significantly different
inside a VM than it does outside of it -- without the guest being gang
scheduled itself and having physical pinning to reflect the same
topology the coschedule=1 thing should not be exposed in a guest. And
that is a major failing IMO.

Also; I think you're sharing a cfs_rq between CPUs:

+ init_cfs_rq(&sd->shared->rq.cfs);

that is broken, the virtual runtime stuff needs nontrivial modifications
for multiple CPUs. And if you do that, I've no idea how you're dealing
with SMP affinities.

> You currently have to explicitly set affinities of tasks within coscheduled
> task groups, as load balancing is not implemented for them at this point.

You don't even begin to outline how you preserve smp-nice fairness.

> D) What can I *not* do with this?
> ---------------------------------
>
> Besides the missing load-balancing within coscheduled task-groups, this
> implementation has the following properties, which might be considered
> short-comings.
>
> This particular implementation focuses on SCHED_OTHER tasks managed by CFS
> and allows coscheduling them. Interrupts as well as tasks in higher
> scheduling classes are currently out-of-scope: they are assumed to be
> negligible interruptions as far as coscheduling is concerned and they do
> *not* cause a preemption of a whole group. This implementation could be
> extended to cover higher scheduling classes. Interrupts, however, are an
> orthogonal issue.
>
> The collective context switch from one coscheduled set of tasks to another
> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
> that all tasks of the previous set have stopped executing before any task
> of the next set starts executing, an additional hand-shake/barrier needs to
> be added.

IOW it's completely friggin useless for L1TF.

> E) What's the overhead?
> -----------------------
>
> Each (active) hierarchy level has roughly the same effect as one additional
> level of nested cgroups. In addition -- at this stage -- there may be some
> additional lock contention if you coschedule larger fractions of the system
> with a dynamic task set.

Have you actually read your own code?

What about that atrocious locking you sprinkle all over the place?
'some additional lock contention' doesn't even begin to describe that
horror show.

Hint: we're not going to increase the lockdep subclasses, and most
certainly not for scheduler locking.


All in all, I'm not inclined to consider this approach, it complicates
an already overly complicated thing (cpu-cgroups) and has a ton of
unresolved issues while at the same time it doesn't (and cannot) meet
the goal it was made for.

2018-09-14 16:27:41

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
> On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
>> This patch series extends CFS with support for coscheduling. The
>> implementation is versatile enough to cover many different coscheduling
>> use-cases, while at the same time being non-intrusive, so that behavior of
>> legacy workloads does not change.
>
> I don't call this non-intrusive.

Mm... there is certainly room for interpretation. :) For example, it is still
possible to set affinities, to use nice, and to tune all the other existing CFS
knobs. That is, if you have tuned the scheduler to your workload or your workload
depends on some CFS feature to work efficiently (whether on purpose or not), then
running with this patch set should not change the behavior of said workload.

This patch set should "just" give the user the additional ability to coordinate
scheduling decisions across multiple CPUs. At least, that's my goal.

If someone doesn't need it, they don't have to use it. Just like task groups.

But maybe, people start experimenting with coordinated scheduling decisions --
after all, there is a ton of research on what one *could* do, if there was
coscheduling. I did look over much of that research. What I didn't like about
many of them is that evaluation is based on a "prototype" that -- while
making the point that coscheduling might be beneficial for that use case --
totally screws over the scheduler for any other use case. Like coscheduling
based on deterministic, timed context switches across all CPUs. Bye bye
interactivity. That is what I call intrusive.

As mentioned before, existing scheduler features, like preemption, (should)
still work as before with this variant of coscheduling, with the same look and
feel.

And who knows, maybe someone will come up with a use case that moves coscheduling
out of its niche; like the auto-grouping feature promoted the use of task groups.


>> Peter Zijlstra once called coscheduling a "scalability nightmare waiting to
>> happen". Well, with this patch series, coscheduling certainly happened.
>
> I'll beg to differ; this isn't anywhere near something to consider
> merging. Also 'happened' suggests a certain stage of completeness, this
> again doesn't qualify.

I agree, that this isn't ready to be merged. Still, the current state is good
to start a discussion about the involved mechanics.


>> However, I disagree on the scalability nightmare. :)
>
> There are known scalability problems with the existing cgroup muck; you
> just made things a ton worse. The existing cgroup overhead is
> significant, you also made that many times worse.
>
> The cgroup stuff needs cleanups and optimization, not this.

Are you referring to cgroups in general, or task groups (aka. the cpu
controller) specifically?


With respect to scalability: many coscheduling use cases don't require
synchronization across the whole system. With this patch set, only those
parts that are actually coscheduled are involved in synchronization.
So, conceptually, this scales to larger systems from that point of view.

If coscheduling of a larger fraction of the system is required, costs
increase. So what? It's a trade-off. It may *still* be beneficial for a
use case. If it is, it might get adopted. If not, that particular use
case may be considered impractical unless someone comes up with a better
implementation of coscheduling.


With respect to the need of cleanups and optimizations: I agree, that
task groups are a bit messy. For example, here's my current wish list
off the top of my head:

a) lazy scheduler operations; for example: when dequeuing a task, don't bother
walking up the task group hierarchy to dequeue all the SEs -- do it lazily
when encountering an empty CFS RQ during picking when we hold the lock anyway.

b) ability to move CFS RQs between CPUs: someone changed the affinity of
a cpuset? No problem, just attach the runqueue with all the tasks elsewhere.
No need to touch each and every task.

c) light-weight task groups: don't allocate a runqueue for every CPU in the
system, when it is known that tasks in the task group will only ever run
on at most two CPUs, or so. (And while there is of course a use case for
VMs in this, another class of use cases are auxiliary tasks, see eg, [1-5].)

Is this the level of optimizations, you're thinking about? Or do you want
to throw away the whole nested CFS RQ experience in the code?


>> B) Why would I want this?
>
>> In the L1TF context, it prevents other applications from loading
>> additional data into the L1 cache, while one application tries to leak
>> data.
>
> That is the whole and only reason you did this;
It really isn't. But as your mind seems made up, I'm not going to bother
to argue.


> and it doesn't even
> begin to cover the requirements for it.
>
> Not to mention I detest cgroups; for their inherent complexity and the
> performance costs associated with them. _If_ we're going to do
> something for L1TF then I feel it should not depend on cgroups.
>
> It is after all, perfectly possible to run a kvm thingy without cgroups.

Yes it is. But, for example, you won't have group-based fairness between
multiple kvm thingies.

Assuming there is a cgroup-less solution that can prevent simultaneous
execution of tasks on a core when they're not supposed to: how would you
tell the scheduler which tasks these are?


>> 1. Execute parallel applications that rely on active waiting or synchronous
>> execution concurrently with other applications.
>>
>> The prime example in this class are probably virtual machines. Here,
>> coscheduling is an alternative to paravirtualized spinlocks, pause loop
>> exiting, and other techniques with its own set of advantages and
>> disadvantages over the other approaches.
>
> Note that in order to avoid PLE and paravirt spinlocks and paravirt
> tlb-invalidate you have to gang-schedule the _entire_ VM, not just SMT
> siblings.
>
> Now explain to me how you're going to gang-schedule a VM with a good
> number of vCPU threads (say spanning a number of nodes) and preserving
> the rest of CFS without it turning into a massive trainwreck?

You probably don't -- for the same reason, why it is a bad idea to give
an endless loop realtime priority. It's just a bad idea. As I said in the
text you quoted: coscheduling comes with its own set of advantages and
disadvantages. Just because you find one example, where it is a bad idea,
doesn't make it a bad thing in general.


> Such things (gang scheduling VMs) _are_ possible, but not within the
> confines of something like CFS, they are also fairly inefficient
> because, as you do note, you will have to explicitly schedule idle time
> for idle vCPUs.

With gang scheduling as defined by Feitelson and Rudolph [6], you'd have to
explicitly schedule idle time. With coscheduling as defined by Ousterhout [7],
you don't. In this patch set, the scheduling of idle time is "merely" a quirk
of the implementation. And even with this implementation, there's nothing
stopping you from down-sizing the width of the coscheduled set to take out
the idle vCPUs dynamically, cutting down on fragmentation.


> Things like the Tableau scheduler are what come to mind; but I'm not
> sure how to integrate that with a general purpose scheduling scheme. You
> pretty much have to dedicate a set of CPUs to just scheduling VMs with
> such a scheduler.
>
> And that would call for cpuset-v2 integration along with a new
> scheduling class.
>
> And then people will complain again that partitioning a system isn't
> dynamic enough and we need magic :/
>
> (and this too would be tricky to virtualize itself)

Hence my "counter" suggestion in the form of this patch set: Integrated
into a general purpose scheduler, no need to partition off a part of the system,
not tied to just VM use cases.


>> C) How does it work?
>> --------------------
>>
>> This patch series introduces hierarchical runqueues, that represent larger
>> and larger fractions of the system. By default, there is one runqueue per
>> scheduling domain. These additional levels of runqueues are activated by
>> the "cosched_max_level=" kernel command line argument. The bottom level is
>> 0.
>
> You gloss over a ton of details here;

Yes, I do. :) I wanted a summary, not a design document. Maybe I was a bit
too eager in condensing the design to just a few paragraphs...


> many of which are non trivial and
> marked broken in your patches. Unless you have solid suggestions on how
> to deal with all of them, this is a complete non-starter.

Address them one by one. Probably do some of the optimizations you suggested
to just get rid of some of them. It's work in progress. Though, at this
stage I am also really interested in things that are broken, that I am not
aware of yet.


> The per-cpu IRQ/steal time accounting for example. The task timeline
> isn't the same on every CPU because of those.
>
> You now basically require steal time and IRQ load to match between CPUs.
> That places very strict requirements and effectively breaks virt
> invariance. That is, the scheduler now behaves significantly different
> inside a VM than it does outside of it -- without the guest being gang
> scheduled itself and having physical pinning to reflect the same
> topology the coschedule=1 thing should not be exposed in a guest. And
> that is a major failing IMO.

I'll have to read up some more code to make a qualified statement here.


> Also; I think you're sharing a cfs_rq between CPUs:
>
> + init_cfs_rq(&sd->shared->rq.cfs);
>
> that is broken, the virtual runtime stuff needs nontrivial modifications
> for multiple CPUs. And if you do that, I've no idea how you're dealing
> with SMP affinities.

It is not shared per se. There's only one CPU (the leader) making the scheduling
decision for that runqueue and if another CPU needs to modify the runqueue, it
works like it does for CPU runqueues as well: the other CPU works with the
leader's time. There are also no tasks in a runqueue when it is responsible for
more than one CPU.

Assuming that a runqueue is responsible for a core and there are runnable
tasks within the task group on said core, then there will be one SE enqueued in
that runqueue, a so-called SD-SE (scheduling domain SE, or synchronization
domain SE). This SD-SE represents the per CPU runqueues of this core of this
task group. (As opposed to a "normal" task group SE (TG-SE), which represents
just one runqueue in a different task group.) Tasks are still only enqueued
in the per CPU runqueues.
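
A rough picture of the above (my own sketch, not taken from the patches),
for a task group TG coscheduled at core level:

	TG's core-level CFS runqueue (decisions made by the core's leader CPU)
	    `-- SD-SE: "TG has runnable tasks on this core"
	            stands for:
	                TG's CFS runqueue on CPU 0   <-- tasks enqueued here
	                TG's CFS runqueue on CPU 1   <-- tasks enqueued here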


>> You currently have to explicitly set affinities of tasks within coscheduled
>> task groups, as load balancing is not implemented for them at this point.
>
> You don't even begin to outline how you preserve smp-nice fairness.

Works as before (or will work as before): a coscheduled task group has its
own set of per CPU runqueues that hold the tasks of this group (per CPU).
The load balancer will work on this subset of runqueues as it does on the
"normal" per CPU runqueues -- smp-nice fairness and all.


>> D) What can I *not* do with this?
>> ---------------------------------
>>
>> Besides the missing load-balancing within coscheduled task-groups, this
>> implementation has the following properties, which might be considered
>> short-comings.
>>
>> This particular implementation focuses on SCHED_OTHER tasks managed by CFS
>> and allows coscheduling them. Interrupts as well as tasks in higher
>> scheduling classes are currently out-of-scope: they are assumed to be
>> negligible interruptions as far as coscheduling is concerned and they do
>> *not* cause a preemption of a whole group. This implementation could be
>> extended to cover higher scheduling classes. Interrupts, however, are an
>> orthogonal issue.
>>
>> The collective context switch from one coscheduled set of tasks to another
>> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
>> that all tasks of the previous set have stopped executing before any task
>> of the next set starts executing, an additional hand-shake/barrier needs to
>> be added.
>
> IOW it's completely friggin useless for L1TF.

Do you believe me now, that L1TF is not "the whole and only reason" I did this? :D


>> E) What's the overhead?
>> -----------------------
>>
>> Each (active) hierarchy level has roughly the same effect as one additional
>> level of nested cgroups. In addition -- at this stage -- there may be some
>> additional lock contention if you coschedule larger fractions of the system
>> with a dynamic task set.
>
> Have you actually read your own code?
>
> What about that atrocious locking you sprinkle all over the place?
> 'some additional lock contention' doesn't even begin to describe that
> horror show.

Currently, there are more code paths than I like that climb up the se->parent
relation to the top. They need to go if we want to coschedule larger parts of
the system in a more efficient manner. Hence, parts of my wish list further up.

That said, it is not as bad as you make it sound for the following three reasons:

a) The amount of CPUs that compete for a lock is currently governed by the
"cosched_max_level" command line argument, making it a conscious decision to
increase the overall overhead. Hence, coscheduling at, e.g., core level
does not have too serious an impact on lock contention.

b) The runqueue locks are usually only taken by the leader of said runqueue.
Hence, there is often only one user per lock even at higher levels.
The prominent exception at this stage of the patch set is that enqueue and
dequeue operations walk up the hierarchy up to the "cosched_max_level".
And even then, due to lock chaining, multiple enqueue/dequeue operations
on different CPUs can bubble up the shared part of the hierarchy in parallel.

c) The scheduling decision does not cause any lock contention by itself. Each
CPU only accesses runqueues, where itself is the leader. Hence, once you
have a relatively stable situation, lock contention is not an issue.


> Hint: we're not going to increase the lockdep subclasses, and most
> certainly not for scheduler locking.

That's fine. Due to the overhead of nesting cgroups that you mentioned earlier,
that many levels in the runqueue hierarchy are likely to be impracticable
anyway. For the future, I imagine a more dynamic variant of task groups/scheduling
domains, that can provide all the flexibility one would want without that deep
of a nesting. At this stage, it is just a way to experiment with larger systems
without having to disable lockdep.

Of course, if you have a suggestion for a different locking scheme, we can
discuss that as well. The current one is what I considered most suitable
among some alternatives under the premise I was working from: integrate coscheduling
in a scheduler as an additional feature (instead of, e.g., writing a scheduler
capable of coscheduling). So, I probably haven't considered all alternatives.


> All in all, I'm not inclined to consider this approach, it complicates
> an already overly complicated thing (cpu-cgroups) and has a ton of
> unresolved issues
Even if you're not inclined -- at this stage, if I may be so bold :) --
your feedback is valuable. Thank you for that.

Regards
Jan


References (for those that are into that kind of thing):

[1] D. Kim, S. S.-w. Liao, P. H. Wang, J. del Cuvillo, X. Tian, X. Zou,
H. Wang, D. Yeung, M. Girkar, and J. P. Shen, “Physical experimentation
with prefetching helper threads on Intel’s hyper-threaded processors,”
in Proceedings of the International Symposium on Code Generation and
Optimization (CGO ’04). Los Alamitos, CA, USA: IEEE Computer
Society, Mar. 2004, pp. 27–38.

[2] C. Jung, D. Lim, J. Lee, and D. Solihin, “Helper thread prefetching for
loosely-coupled multiprocessor systems," in Parallel and Distributed
Processing Symposium, 2006. IPDPS 2006. 20th International, April 2006.

[3] C. G. Quiñones, C. Madriles, J. Sánchez, P. Marcuello, A. González,
and D. M. Tullsen, “Mitosis compiler: An infrastructure for speculative
threading based on pre-computation slices,” in Proceedings of the 2005
ACM SIGPLAN Conference on Programming Language Design and
Implementation, ser. PLDI ’05. New York, NY, USA: ACM, 2005, pp.
269–279.

[4] J. Mars, L. Tang, and M. L. Soffa, “Directly characterizing cross
core interference through contention synthesis,” in Proceedings of the
6th International Conference on High Performance and Embedded
Architectures and Compilers, ser. HiPEAC ’11. New York, NY, USA:
ACM, 2011, pp. 167–176.

[5] Q. Zeng, D. Wu, and P. Liu, “Cruiser: Concurrent heap buffer overflow
monitoring using lock-free data structures,” in Proceedings of the 32Nd
ACM SIGPLAN Conference on Programming Language Design and
Implementation, ser. PLDI ’11. New York, NY, USA: ACM, 2011, pp.
367–377.

[6] D. G. Feitelson and L. Rudolph, “Distributed hierarchical control for
parallel processing,” Computer, vol. 23, no. 5, pp. 65–77, May 1990.

[7] J. Ousterhout, “Scheduling techniques for concurrent systems,” in
Proceedings of the 3rd International Conference on Distributed Computing
Systems (ICDCS ’82). Los Alamitos, CA, USA: IEEE Computer Society,
Oct. 1982, pp. 22–30.

2018-09-15 08:49:01

by Jan H. Schönherr

[permalink] [raw]
Subject: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On 09/14/2018 06:25 PM, Jan H. Schönherr wrote:
> On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
>>
>> There are known scalability problems with the existing cgroup muck; you
>> just made things a ton worse. The existing cgroup overhead is
>> significant, you also made that many times worse.
>>
>> The cgroup stuff needs cleanups and optimization, not this.

[...]

> With respect to the need of cleanups and optimizations: I agree, that
> task groups are a bit messy. For example, here's my current wish list
> off the top of my head:
>
> a) lazy scheduler operations; for example: when dequeuing a task, don't bother
> walking up the task group hierarchy to dequeue all the SEs -- do it lazily
> when encountering an empty CFS RQ during picking when we hold the lock anyway.
>
> b) ability to move CFS RQs between CPUs: someone changed the affinity of
> a cpuset? No problem, just attach the runqueue with all the tasks elsewhere.
> No need to touch each and every task.
>
> c) light-weight task groups: don't allocate a runqueue for every CPU in the
> system, when it is known that tasks in the task group will only ever run
> on at most two CPUs, or so. (And while there is of course a use case for
> VMs in this, another class of use cases are auxiliary tasks, see eg, [1-5].)
>
> Is this the level of optimizations, you're thinking about? Or do you want
> to throw away the whole nested CFS RQ experience in the code?

I guess, it would be possible to flatten the task group hierarchy, that is usually
created when nesting cgroups. That is, enqueue task group SEs always within the
root task group.

That should take away much of the (runtime-)overhead, no?

The calculation of shares would need to be a different kind of complex than it is
now. But that might be manageable.

CFS bandwidth control would also need to change significantly as we would now
have to dequeue/enqueue nested cgroups below a throttled/unthrottled hierarchy.
Unless *those* task groups don't participate in this flattening.

(And probably lots of other stuff, I didn't think about right now.)

Regards
Jan


2018-09-17 09:49:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On Sat, Sep 15, 2018 at 10:48:20AM +0200, Jan H. Schönherr wrote:
> On 09/14/2018 06:25 PM, Jan H. Schönherr wrote:
> > On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
> >>
> >> There are known scalability problems with the existing cgroup muck; you
> >> just made things a ton worse. The existing cgroup overhead is
> >> significant, you also made that many times worse.
> >>
> >> The cgroup stuff needs cleanups and optimization, not this.
>
> [...]
>
> > With respect to the need of cleanups and optimizations: I agree, that
> > task groups are a bit messy. For example, here's my current wish list
> > off the top of my head:
> >
> > a) lazy scheduler operations; for example: when dequeuing a task, don't bother
> > walking up the task group hierarchy to dequeue all the SEs -- do it lazily
> > when encountering an empty CFS RQ during picking when we hold the lock anyway.

That sounds like it will wreck the runnable_weight accounting. Although,
if, as you write below, we do away with the hierarchical runqueues, that
isn't in fact needed anymore I think.

Still, even without runnable_weight, I suspect we need the 'runnable'
state, even for the other accounting.

> > b) ability to move CFS RQs between CPUs: someone changed the affinity of
> > a cpuset? No problem, just attach the runqueue with all the tasks elsewhere.
> > No need to touch each and every task.

Can't do that, tasks might have individual constraints that are tighter
than the cpuset. Also, changing affinities isn't really a hot path, so
who cares.

> > c) light-weight task groups: don't allocate a runqueue for every CPU in the
> > system, when it is known that tasks in the task group will only ever run
> > on at most two CPUs, or so. (And while there is of course a use case for
> > VMs in this, another class of use cases are auxiliary tasks, see eg, [1-5].)

I have yet to go over your earlier email; but no. The scheduler is very
much per-cpu. And as I mentioned earlier, CFS as is doesn't work right
if you share the runqueue between multiple CPUs (and 'fixing' that is
non trivial).

> > Is this the level of optimizations, you're thinking about? Or do you want
> > to throw away the whole nested CFS RQ experience in the code?
>
> I guess, it would be possible to flatten the task group hierarchy, that is usually
> created when nesting cgroups. That is, enqueue task group SEs always within the
> root task group.
>
> That should take away much of the (runtime-)overhead, no?

Yes, Rik was going to look at trying this. Put all the tasks in the root
rq and adjust the vtime calculations. Facebook is seeing significant
overhead from cpu-cgroup and has to disable it because of that on at
least part of their setup IIUC.

> The calculation of shares would need to be a different kind of complex than it is
> now. But that might be manageable.

That is the hope; indeed. We'll still need to create the hierarchy for
accounting purposes, but it can be a smaller/simpler data structure.

So the weight computation would be the normalized product of the parents
etc.. and since PELT only updates the values on ~1ms scale, we can keep
a cache of the product -- that is, we don't have to recompute that
product and walk the hierarchy all the time either.
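
(Spelled out as a sketch; tg_runnable_shares_sum() is a made-up placeholder
for however that per-level normalization ends up being tracked:)

	/*
	 * Sketch only: effective weight of a task group's entities when
	 * everything is enqueued in the root runqueue.  The group's shares
	 * are scaled by the fraction each ancestor receives of its parent;
	 * the result can be cached and refreshed on the ~1ms PELT updates.
	 */
	static u64 flattened_weight(struct task_group *tg)
	{
		u64 w = NICE_0_LOAD;

		for (; tg->parent; tg = tg->parent)
			w = div64_u64(w * tg->shares,
				      tg_runnable_shares_sum(tg->parent));

		return w;
	}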

> CFS bandwidth control would also need to change significantly as we would now
> have to dequeue/enqueue nested cgroups below a throttled/unthrottled hierarchy.
> Unless *those* task groups don't participate in this flattening.

Right, so the whole bandwidth thing becomes a pain; the simplest
solution is to detect the throttle at task-pick time, dequeue and try
again. But that is indeed quite horrible.
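
(Purely as an illustration of "detect at pick time", nothing that exists;
se_group_throttled() is a made-up stand-in for "is any ancestor group of
this entity throttled?":)

	again:
		se = pick_next_entity(cfs_rq, curr);
		if (se && se_group_throttled(se)) {
			/* park the throttled entity and retry with the next one */
			dequeue_entity(cfs_rq, se, DEQUEUE_SLEEP);
			curr = NULL;
			goto again;
		}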

I'm not quite sure how this will play out.

Anyway, if we pull off this flattening feat, then you can no longer use
the hierarchy for this co-scheduling stuff.

Now, let me go read your earlier email and reply to that (in parts).

2018-09-17 11:36:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
> On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
> > On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:

> >> B) Why would I want this?
> >
> >> In the L1TF context, it prevents other applications from loading
> >> additional data into the L1 cache, while one application tries to leak
> >> data.
> >
> > That is the whole and only reason you did this;
> It really isn't. But as your mind seems made up, I'm not going to bother
> to argue.

> >> D) What can I *not* do with this?
> >> ---------------------------------
> >>
> >> Besides the missing load-balancing within coscheduled task-groups, this
> >> implementation has the following properties, which might be considered
> >> short-comings.
> >>
> >> This particular implementation focuses on SCHED_OTHER tasks managed by CFS
> >> and allows coscheduling them. Interrupts as well as tasks in higher
> >> scheduling classes are currently out-of-scope: they are assumed to be
> >> negligible interruptions as far as coscheduling is concerned and they do
> >> *not* cause a preemption of a whole group. This implementation could be
> >> extended to cover higher scheduling classes. Interrupts, however, are an
> >> orthogonal issue.
> >>
> >> The collective context switch from one coscheduled set of tasks to another
> >> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
> >> that all tasks of the previous set have stopped executing before any task
> >> of the next set starts executing, an additional hand-shake/barrier needs to
> >> be added.
> >
> > IOW it's completely friggin useless for L1TF.
>
> Do you believe me now, that L1TF is not "the whole and only reason" I did this? :D

You did mention this work first to me in the context of L1TF, so I might
have jumped to conclusions here.

Also, I have, of course, been looking at (SMT) co-scheduling,
specifically in the context of L1TF, myself. I came up with a vastly
different approach. Tim - where are we on getting some of that posted?

Note that even though I wrote much of that code, I don't particularly
like it either :-)

2018-09-17 12:27:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:

> Assuming there is a cgroup-less solution that can prevent simultaneous
> execution of tasks on a core when they're not supposed to: how would you
> tell the scheduler which tasks these are?

Specifically for L1TF I hooked into/extended KVM's preempt_notifier
registration interface, which tells us which tasks are VCPUs and to
which VM they belong.

But if we want to actually expose this to userspace, we can either do a
prctl() or extend struct sched_attr.

> >> 1. Execute parallel applications that rely on active waiting or synchronous
> >> execution concurrently with other applications.
> >>
> >> The prime example in this class are probably virtual machines. Here,
> >> coscheduling is an alternative to paravirtualized spinlocks, pause loop
> >> exiting, and other techniques with its own set of advantages and
> >> disadvantages over the other approaches.
> >
> > Note that in order to avoid PLE and paravirt spinlocks and paravirt
> > tlb-invalidate you have to gang-schedule the _entire_ VM, not just SMT
> > siblings.
> >
> > Now explain to me how you're going to gang-schedule a VM with a good
> > number of vCPU threads (say spanning a number of nodes) and preserving
> > the rest of CFS without it turning into a massive trainwreck?
>
> You probably don't -- for the same reason, why it is a bad idea to give
> an endless loop realtime priority. It's just a bad idea. As I said in the
> text you quoted: coscheduling comes with its own set of advantages and
> disadvantages. Just because you find one example, where it is a bad idea,
> doesn't make it a bad thing in general.

Well, you mentioned it as an alternative to paravirt spinlocks -- I'm
saying that co-scheduling cannot do that; you need full-featured
gang-scheduling for that.

2018-09-17 13:38:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
> On 09/14/2018 01:12 PM, Peter Zijlstra wrote:

> >> 1. Execute parallel applications that rely on active waiting or synchronous
> >> execution concurrently with other applications.
> >>
> >> The prime example in this class are probably virtual machines. Here,
> >> coscheduling is an alternative to paravirtualized spinlocks, pause loop
> >> exiting, and other techniques with its own set of advantages and
> >> disadvantages over the other approaches.
> >
> > Note that in order to avoid PLE and paravirt spinlocks and paravirt
> > tlb-invalidate you have to gang-schedule the _entire_ VM, not just SMT
> > siblings.
> >
> > Now explain to me how you're going to gang-schedule a VM with a good
> > number of vCPU threads (say spanning a number of nodes) and preserving
> > the rest of CFS without it turning into a massive trainwreck?
>
> You probably don't -- for the same reason, why it is a bad idea to give
> an endless loop realtime priority. It's just a bad idea. As I said in the
> text you quoted: coscheduling comes with its own set of advantages and
> disadvantages. Just because you find one example, where it is a bad idea,
> doesn't make it a bad thing in general.
>
>
> > Such things (gang scheduling VMs) _are_ possible, but not within the
> > confines of something like CFS, they are also fairly inefficient
> > because, as you do note, you will have to explicitly schedule idle time
> > for idle vCPUs.
>
> With gang scheduling as defined by Feitelson and Rudolph [6], you'd have to
> explicitly schedule idle time. With coscheduling as defined by Ousterhout [7],
> you don't. In this patch set, the scheduling of idle time is "merely" a quirk
> of the implementation. And even with this implementation, there's nothing
> stopping you from down-sizing the width of the coscheduled set to take out
> the idle vCPUs dynamically, cutting down on fragmentation.

The thing is, if you drop the full width gang scheduling, you instantly
require the paravirt spinlock / tlb-invalidate stuff again.

Of course, the constraints of L1TF itself requires the explicit
scheduling of idle time under a bunch of conditions.

I did not read your [7] in much detail (also a very bad quality scan,
that :-/); but I don't get how they leap from 'thrashing' to co-scheduling.
Their initial problem, where A generates data that B needs and the 3
scenarios:

1) A has to wait for B
2) B has to wait for A
3) the data gets buffered

Seems fairly straight forward and is indeed quite common, needing
co-scheduling for that, I'm not convinced.

We have of course added all sorts of adaptive wait loops in the kernel
to deal with just that issue.

With co-scheduling you 'ensure' B is running when A is, but that doesn't
mean you can actually make more progress, you could just be burning a
lot of CPU cycles (which could've been spent doing other work).

I'm also not convinced co-scheduling makes _any_ sense outside SMT --
does one of the many papers you cite make a good case for !SMT
co-scheduling? It just doesn't make sense to co-schedule the LLC domain,
that's 16+ cores on recent chips.

2018-09-18 00:34:28

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux



On 09/07/2018 02:39 PM, Jan H. Schönherr wrote:
> This patch series extends CFS with support for coscheduling. The
> implementation is versatile enough to cover many different coscheduling
> use-cases, while at the same time being non-intrusive, so that behavior of
> legacy workloads does not change.
>
> Peter Zijlstra once called coscheduling a "scalability nightmare waiting to
> happen". Well, with this patch series, coscheduling certainly happened.
> However, I disagree on the scalability nightmare. :)
>
> In the remainder of this email, you will find:
>
> A) Quickstart guide for the impatient.
> B) Why would I want this?
> C) How does it work?
> D) What can I *not* do with this?
> E) What's the overhead?
> F) High-level overview of the patches in this series.
>
> Regards
> Jan
>
>
> A) Quickstart guide for the impatient.
> --------------------------------------
>
> Here is a quickstart guide to set up coscheduling at core-level for
> selected tasks on an SMT-capable system:
>
> 1. Apply the patch series to v4.19-rc2.
> 2. Compile with "CONFIG_COSCHEDULING=y".
> 3. Boot into the newly built kernel with an additional kernel command line
> argument "cosched_max_level=1" to enable coscheduling up to core-level.
> 4. Create one or more cgroups and set their "cpu.scheduled" to "1".
> 5. Put tasks into the created cgroups and set their affinity explicitly.
> 6. Enjoy tasks of the same group and on the same core executing
> simultaneously, whenever they are executed.
>
> You are not restricted to coscheduling at core-level. Just select higher
> numbers in steps 3 and 4. See also further below for more information, esp.
> when you want to try higher numbers on larger systems.
>
> Setting affinity explicitly for tasks within coscheduled cgroups is
> currently necessary, as the load balancing portion is still missing in this
> series.
>
I don't get the affinity part. If I create two cgroups by giving them only
cpu shares (no cpuset) and set their cpu.scheduled=1, will this ensure
co-scheduling of each group on core level for all cores in the system?

Thanks,
Subhra

2018-09-18 11:45:24

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 09/18/2018 02:33 AM, Subhra Mazumdar wrote:
> On 09/07/2018 02:39 PM, Jan H. Schönherr wrote:
>> A) Quickstart guide for the impatient.
>> --------------------------------------
>>
>> Here is a quickstart guide to set up coscheduling at core-level for
>> selected tasks on an SMT-capable system:
>>
>> 1. Apply the patch series to v4.19-rc2.
>> 2. Compile with "CONFIG_COSCHEDULING=y".
>> 3. Boot into the newly built kernel with an additional kernel command line
>>     argument "cosched_max_level=1" to enable coscheduling up to core-level.
>> 4. Create one or more cgroups and set their "cpu.scheduled" to "1".
>> 5. Put tasks into the created cgroups and set their affinity explicitly.
>> 6. Enjoy tasks of the same group and on the same core executing
>>     simultaneously, whenever they are executed.
>>
>> You are not restricted to coscheduling at core-level. Just select higher
>> numbers in steps 3 and 4. See also further below for more information, esp.
>> when you want to try higher numbers on larger systems.
>>
>> Setting affinity explicitly for tasks within coscheduled cgroups is
>> currently necessary, as the load balancing portion is still missing in this
>> series.
>>
> I don't get the affinity part. If I create two cgroups by giving them only
> cpu shares (no cpuset) and set their cpu.scheduled=1, will this ensure
> co-scheduling of each group on core level for all cores in the system?

Short answer: Yes. But ignoring the affinity part will very likely result in
a poor experience with this patch set.


I was referring to the CPU affinity of a task, that you can set via
sched_setaffinity() from within a program or via taskset from the command
line. For each task/thread within a cgroup, you should set the affinity to
exactly one CPU. Otherwise -- as the load balancing part is still missing --
you might end up with all tasks running on one CPU or some other unfortunate
load distribution.
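
For illustration, a minimal (hypothetical) transcript -- assuming the cpu
controller is mounted at /sys/fs/cgroup/cpu, a coscheduled group "A", and
two of its tasks having PIDs 4711 and 4712 -- could look like this:

# echo 4711 > /sys/fs/cgroup/cpu/A/tasks
# echo 4712 > /sys/fs/cgroup/cpu/A/tasks
# taskset -c -p 0 4711
# taskset -c -p 1 4712

With CPU0 and CPU1 forming one core (as in the example below), these two
tasks will then be coscheduled on that core.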

Coscheduling itself does not care about the load, so each group will be
(co-)scheduled at core level, no matter where the tasks ended up.

Regards
Jan

PS: Below is an example to illustrate the resulting schedules a bit better,
and what might happen, if you don't bind the to-be-coscheduled tasks to
individual CPUs.



For example, consider a dual-core system with SMT (i.e. 4 CPUs in total),
two task groups A and B, and tasks within them a0, a1, .. and b0, b1, ..
respectively.

Let the system topology look like this:

          System                (level 2)
         /      \
    Core 0      Core 1          (level 1)
    /    \      /    \
 CPU0   CPU1  CPU2   CPU3       (level 0)


If you set cpu.scheduled=1 for A and B, each core will be coscheduled
independently, if there are tasks of A or B on the core. Assuming there
are runnable tasks in A and B and some other tasks on a core, you will
see a schedule like:

A -> B -> other tasks -> A -> B -> other tasks -> ...

(or some permutation thereof) happen synchronously across both CPUs
of a core -- with no guarantees which tasks within A/within B/
within the other tasks will execute simultaneously -- and with no
guarantee what will execute on the other two CPUs simultaneously. (The
distribution of CPU time between A, B, and other tasks follows the usual
CFS weight proportional distribution, just at core level.) If neither
CPU of a core has any runnable tasks of a certain group, it won't be part
of the schedule (e.g., A -> other -> A -> other).

With cpu.scheduled=2, you lift this schedule to system-level and you would
see it happen across all four CPUs synchronously. With cpu.scheduled=0, you
get this schedule at CPU-level as we're all used to with no synchronization
between CPUs. (It gets a tad more interesting, when you start mixing groups
with cpu.scheduled=1 and =2.)
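
(As a minimal sketch -- assuming the cpu controller is mounted at
/sys/fs/cgroup/cpu -- the core-level variant for A and B used below comes
down to:

# mkdir /sys/fs/cgroup/cpu/A /sys/fs/cgroup/cpu/B
# echo 1 > /sys/fs/cgroup/cpu/A/cpu.scheduled
# echo 1 > /sys/fs/cgroup/cpu/B/cpu.scheduled

with "echo 2" instead giving the system-level variant.)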


Here are some schedules, that you might see, with A and B coscheduled at
core level (and that can be enforced this way (along the horizontal dimension)
by setting the affinity of tasks; without setting the affinity, it could be
any of them):

Tasks equally distributed within A and B:

t   CPU0    CPU1    CPU2    CPU3
0   a0      a1      b2      b3
1   a0      a1      other   other
2   b0      b1      other   other
3   b0      b1      a2      a3
4   other   other   a2      a3
5   other   other   b2      b3

All tasks within A and B on one CPU:

t   CPU0    CPU1    CPU2    CPU3
0   a0      --      other   other
1   a1      --      other   other
2   b0      --      other   other
3   b1      --      other   other
4   other   other   other   other
5   a2      --      other   other
6   a3      --      other   other
7   b2      --      other   other
8   b3      --      other   other

Tasks within a group equally distributed across one core:

t   CPU0    CPU1    CPU2    CPU3
0   a0      a2      b1      b3
1   a0      a3      other   other
2   a1      a3      other   other
3   a1      a2      b0      b3
4   other   other   b0      b2
5   other   other   b1      b2

You will never see an A-task sharing a core with a B-task at any point in time
(except for the 2 microseconds or so, that the collective context switch takes).


2018-09-18 13:24:57

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On 09/17/2018 11:48 AM, Peter Zijlstra wrote:
> On Sat, Sep 15, 2018 at 10:48:20AM +0200, Jan H. Schönherr wrote:
>> On 09/14/2018 06:25 PM, Jan H. Schönherr wrote:
>
>>> b) ability to move CFS RQs between CPUs: someone changed the affinity of
>>> a cpuset? No problem, just attach the runqueue with all the tasks elsewhere.
>>> No need to touch each and every task.
>
> Can't do that, tasks might have individual constraints that are tighter
> than the cpuset.

AFAIK, changing the affinity of a cpuset overwrites the individual affinities of tasks
within them. Thus, it shouldn't be an issue.

> Also, changing affinities isn't really a hot path, so
> who cares.

This kind of code path gets a little hotter, when a coscheduled set gets load-balanced
from one core to another.

Apart from that, I also think, that normal user-space applications should never have
to concern themselves with actual affinities. More often than not, they only want to
express a relation to some other task (or sometimes resource), like "run on the same
NUMA node", "run on the same core", so that application design assumptions are
fulfilled. That's an interface, that I'd like to see as a cgroup controller at some
point. It would also benefit from the ability to move/balance whole runqueues.

(It might also be a way to just bulk-balance a bunch of tasks in the current code,
by exchanging two CFS runqueues. But that has probably some additional issues.)


>>> c) light-weight task groups: don't allocate a runqueue for every CPU in the
>>> system, when it is known that tasks in the task group will only ever run
>>> on at most two CPUs, or so. (And while there is of course a use case for
>>> VMs in this, another class of use cases are auxiliary tasks, see eg, [1-5].)
>
> I have yet to go over your earlier email; but no. The scheduler is very
> much per-cpu. And as I mentioned earlier, CFS as is doesn't work right
> if you share the runqueue between multiple CPUs (and 'fixing' that is
> non trivial).

No sharing. Just not allocating runqueues that won't be used anyway.
Assume you express this "always run on the same core" or have other reasons
to always restrict tasks in a task group to just one core/node/whatever. On an
SMT system, you would typically need at most two runqueues for a core; the memory
foot-print of a task group would no longer increase linearly with system size.

It would be possible to (space-)efficiently express nested parallelism use cases
without having to resort to managing affinities manually (which restricts the
scheduler more than necessary).

(And it would be okay for an adjustment of the maximum number of runqueues to fail
with an -ENOMEM in dire situations, as this adjustment would be an explicit
(user-)action.)


>>> Is this the level of optimizations, you're thinking about? Or do you want
>>> to throw away the whole nested CFS RQ experience in the code?
>>
>> I guess, it would be possible to flatten the task group hierarchy, that is usually
>> created when nesting cgroups. That is, enqueue task group SEs always within the
>> root task group.
>>
>> That should take away much of the (runtime-)overhead, no?
>
> Yes, Rik was going to look at trying this. Put all the tasks in the root
> rq and adjust the vtime calculations. Facebook is seeing significant
> overhead from cpu-cgroup and has to disable it because of that on at
> least part of their setup IIUC.
>
>> The calculation of shares would need to be a different kind of complex than it is
>> now. But that might be manageable.
>
> That is the hope; indeed. We'll still need to create the hierarchy for
> accounting purposes, but it can be a smaller/simpler data structure.
>
> So the weight computation would be the normalized product of the parents
> etc.. and since PELT only updates the values on ~1ms scale, we can keep
> a cache of the product -- that is, we don't have to recompute that
> product and walk the hierarchy all the time either.
>
>> CFS bandwidth control would also need to change significantly as we would now
>> have to dequeue/enqueue nested cgroups below a throttled/unthrottled hierarchy.
>> Unless *those* task groups don't participate in this flattening.
>
> Right, so the whole bandwidth thing becomes a pain; the simplest
> solution is to detect the throttle at task-pick time, dequeue and try
> again. But that is indeed quite horrible.
>
> I'm not quite sure how this will play out.
>
> Anyway, if we pull off this flattening feat, then you can no longer use
> the hierarchy for this co-scheduling stuff.

Yeah. I might be a bit biased towards keeping or at least not fully throwing away
the nesting of CFS runqueues. ;)

However, the only efficient way that I can currently think of, is a hybrid model
between the "full nesting" that is currently there, and the "no nesting" you were
describing above.

It would flatten all task groups that do not actively contribute some function,
which would be all task groups purely for accounting purposes and those for
*unthrottled* CFS hierarchies (and those for coscheduling that contain exactly
one SE in a runqueue). The nesting would still be kept for *throttled* hierarchies
(and the coscheduling stuff). (And if you wouldn't have mentioned a way to get
rid of nesting completely, I would have kept a single level of nesting for
accounting purposes as well.)

This would allow us to lazily dequeue SEs that have run out of bandwidth when
we encounter them, and already enqueue them in the nested task group (whose SE
is not enqueued at the moment). That way, it's still a O(1) operation to re-enable
all tasks, once runtime is available again. And O(1) to throttle a repeat offender.

Regards
Jan

2018-09-18 13:38:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On Tue, Sep 18, 2018 at 03:22:13PM +0200, Jan H. Schönherr wrote:
> On 09/17/2018 11:48 AM, Peter Zijlstra wrote:
> > On Sat, Sep 15, 2018 at 10:48:20AM +0200, Jan H. Schönherr wrote:
> >> On 09/14/2018 06:25 PM, Jan H. Schönherr wrote:
> >
> >>> b) ability to move CFS RQs between CPUs: someone changed the affinity of
> >>> a cpuset? No problem, just attach the runqueue with all the tasks elsewhere.
> >>> No need to touch each and every task.
> >
> > Can't do that, tasks might have individual constraints that are tighter
> > than the cpuset.
>
> AFAIK, changing the affinity of a cpuset overwrites the individual affinities of tasks
> within them. Thus, it shouldn't be an issue.

No, it only shrinks the set. Also nothing stops you calling
sched_setaffinity() once you're inside the cpuset. The only constraint is
that your mask is a subset of the cpuset mask.

2018-09-18 13:44:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On Tue, Sep 18, 2018 at 03:22:13PM +0200, Jan H. Schönherr wrote:
> to always restrict tasks in a task group to just one core/node/whatever. On an
> SMT system, you would typically need at most two runqueues for a core; the memory

You forget that SMT4 and SMT8 are fairly common outside of x86.

2018-09-18 13:54:35

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On 09/18/2018 03:38 PM, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 03:22:13PM +0200, Jan H. Schönherr wrote:
>> AFAIK, changing the affinity of a cpuset overwrites the individual affinities of tasks
>> within them. Thus, it shouldn't be an issue.
>
> No, it only shrinks the set. Also nothing stops you calling
> sched_setaffinity() once you're inside the cpuset. The only constraint is
> that your mask is a subset of the cpuset mask.

I meant setting the affinity of the cpuset *after* setting an individual affinity.

Like this:

# mkdir /sys/fs/cgroup/cpuset/test
# cat /sys/fs/cgroup/cpuset/cpuset.mems > /sys/fs/cgroup/cpuset/test/cpuset.mems
# cat /sys/fs/cgroup/cpuset/cpuset.cpus > /sys/fs/cgroup/cpuset/test/cpuset.cpus
# echo $$ > /sys/fs/cgroup/cpuset/test/tasks
# taskset -c -p 0 $$
pid 4745's current affinity list: 0-3
pid 4745's new affinity list: 0
# echo "0-1" > /sys/fs/cgroup/cpuset/test/cpuset.cpus
# taskset -c -p $$
pid 4745's current affinity list: 0,1
#

The individual affinity of $$ is lost, despite it being a subset.

2018-09-18 14:36:17

by Rik van Riel

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On Tue, 2018-09-18 at 15:22 +0200, Jan H. Schönherr wrote:
> On 09/17/2018 11:48 AM, Peter Zijlstra wrote:
> > On Sat, Sep 15, 2018 at 10:48:20AM +0200, Jan H. Schönherr wrote:
> >
> > >
> > > CFS bandwidth control would also need to change significantly as
> > > we would now
> > > have to dequeue/enqueue nested cgroups below a
> > > throttled/unthrottled hierarchy.
> > > Unless *those* task groups don't participate in this flattening.
> >
> > Right, so the whole bandwidth thing becomes a pain; the simplest
> > solution is to detect the throttle at task-pick time, dequeue and
> > try
> > again. But that is indeed quite horrible.
> >
> > I'm not quite sure how this will play out.
> >
> > Anyway, if we pull off this flattening feat, then you can no longer
> > use
> > the hierarchy for this co-scheduling stuff.
>
> Yeah. I might be a bit biased towards keeping or at least not fully
> throwing away
> the nesting of CFS runqueues. ;)

I do not have a strong bias either way. However, I
would like the overhead of the cpu controller to be
so low that we can actually use it :)

Task priorities in a flat runqueue are relatively
straightforward, with vruntime scaling just like
done for nice levels, but I have to admit that
throttled groups provide a challenge.

Dequeueing throttled tasks is pretty straightforward,
but requeueing them afterwards when they are no
longer throttled could present a real challenge
in some situations.

> However, the only efficient way that I can currently think of, is a
> hybrid model
> between the "full nesting" that is currently there, and the "no
> nesting" you were
> describing above.
>
> It would flatten all task groups that do not actively contribute some
> function,
> which would be all task groups purely for accounting purposes and
> those for
> *unthrottled* CFS hierarchies (and those for coscheduling that
> contain exactly
> one SE in a runqueue). The nesting would still be kept for
> *throttled* hierarchies
> (and the coscheduling stuff). (And if you wouldn't have mentioned a
> way to get
> rid of nesting completely, I would have kept a single level of
> nesting for
> accounting purposes as well.)
>
> This would allow us to lazily dequeue SEs that have run out of
> bandwidth when
> we encounter them, and already enqueue them in the nested task group
> (whose SE
> is not enqueued at the moment). That way, it's still a O(1) operation
> to re-enable
> all tasks, once runtime is available again. And O(1) to throttle a
> repeat offender.

I suspect most systems will have a number of runnable
tasks no larger than the number of CPUs most of the
time.

That makes "re-enable all the tasks" often equivalent
to "re-enable one task".

Can we handle the re-enabling (or waking up!) of one
task almost as fast as we can without the cpu controller?

--
All Rights Reversed.


2018-09-18 14:42:18

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, 2018-09-14 at 18:25 +0200, Jan H. Schönherr wrote:
> On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
> > On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
> > >
> > > B) Why would I want this?
> > > In the L1TF context, it prevents other applications from
> > > loading
> > > additional data into the L1 cache, while one application tries
> > > to leak
> > > data.
> >
> > That is the whole and only reason you did this;
>
> It really isn't. But as your mind seems made up, I'm not going to
> bother
> to argue.

What are the other use cases, and what kind of performance
numbers do you have to show examples of workloads where
coscheduling provides a performance benefit?

--
All Rights Reversed.


2018-09-19 09:24:13

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On 09/18/2018 04:35 PM, Rik van Riel wrote:
> On Tue, 2018-09-18 at 15:22 +0200, Jan H. Schönherr wrote:
[...]
> Task priorities in a flat runqueue are relatively straightforward, with
> vruntime scaling just like done for nice levels, but I have to admit
> that throttled groups provide a challenge.

Do you know/have an idea how the flat approach would further skew the
approximations currently done in calc_group_shares()/calc_group_runnable()?

With the nested hierarchy the (shared) task group SE is updated
whenever something changes. With the flat approach, you'd only be
able to update a task SE, when you have to touch the task anyway.
Just from thinking briefly about it, it feels like values would be
out of date for longer periods of time.


>> However, the only efficient way that I can currently think of, is a
>> hybrid model between the "full nesting" that is currently there, and
>> the "no nesting" you were describing above.
>>
>> It would flatten all task groups that do not actively contribute some
>> function, which would be all task groups purely for accounting
>> purposes and those for *unthrottled* CFS hierarchies (and those for
>> coscheduling that contain exactly one SE in a runqueue). The nesting
>> would still be kept for *throttled* hierarchies (and the coscheduling
>> stuff). (And if you wouldn't have mentioned a way to get rid of
>> nesting completely, I would have kept a single level of nesting for
>> accounting purposes as well.)
>>
>> This would allow us to lazily dequeue SEs that have run out of
>> bandwidth when we encounter them, and already enqueue them in the
>> nested task group (whose SE is not enqueued at the moment). That way,
>> it's still a O(1) operation to re-enable all tasks, once runtime is
>> available again. And O(1) to throttle a repeat offender.
>
> I suspect most systems will have a number of runnable tasks no larger
> than the number of CPUs most of the time.
>
> That makes "re-enable all the tasks" often equivalent to "re-enable one
> task".
>
> Can we handle the re-enabling (or waking up!) of one task almost as fast
> as we can without the cpu controller?

We could start by transparently special-casing the "just one SE in a
runqueue" case, where that single SE is enqueued directly into the next
parent, and everything falls back to nesting, the moment a second SE pops
up.

That might allow reaping the benefits for many cases without hurting
other cases. It's also more a gradual conversion of code.

Regards
Jan

2018-09-19 21:55:48

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux



On 09/18/2018 04:44 AM, Jan H. Schönherr wrote:
> On 09/18/2018 02:33 AM, Subhra Mazumdar wrote:
>> On 09/07/2018 02:39 PM, Jan H. Schönherr wrote:
>>> A) Quickstart guide for the impatient.
>>> --------------------------------------
>>>
>>> Here is a quickstart guide to set up coscheduling at core-level for
>>> selected tasks on an SMT-capable system:
>>>
>>> 1. Apply the patch series to v4.19-rc2.
>>> 2. Compile with "CONFIG_COSCHEDULING=y".
>>> 3. Boot into the newly built kernel with an additional kernel command line
>>>     argument "cosched_max_level=1" to enable coscheduling up to core-level.
>>> 4. Create one or more cgroups and set their "cpu.scheduled" to "1".
>>> 5. Put tasks into the created cgroups and set their affinity explicitly.
>>> 6. Enjoy tasks of the same group and on the same core executing
>>>     simultaneously, whenever they are executed.
>>>
>>> You are not restricted to coscheduling at core-level. Just select higher
>>> numbers in steps 3 and 4. See also further below for more information, esp.
>>> when you want to try higher numbers on larger systems.
>>>
>>> Setting affinity explicitly for tasks within coscheduled cgroups is
>>> currently necessary, as the load balancing portion is still missing in this
>>> series.
>>>
>> I don't get the affinity part. If I create two cgroups by giving them only
>> cpu shares (no cpuset) and set their cpu.scheduled=1, will this ensure
>> co-scheduling of each group on core level for all cores in the system?
> Short answer: Yes. But ignoring the affinity part will very likely result in
> a poor experience with this patch set.
>
>
> I was referring to the CPU affinity of a task, that you can set via
> sched_setaffinity() from within a program or via taskset from the command
> line. For each task/thread within a cgroup, you should set the affinity to
> exactly one CPU. Otherwise -- as the load balancing part is still missing --
> you might end up with all tasks running on one CPU or some other unfortunate
> load distribution.
>
> Coscheduling itself does not care about the load, so each group will be
> (co-)scheduled at core level, no matter where the tasks ended up.
>
> Regards
> Jan
>
> PS: Below is an example to illustrate the resulting schedules a bit better,
> and what might happen, if you don't bind the to-be-coscheduled tasks to
> individual CPUs.
>
>
>
> For example, consider a dual-core system with SMT (i.e. 4 CPUs in total),
> two task groups A and B, and tasks within them a0, a1, .. and b0, b1, ..
> respectively.
>
> Let the system topology look like this:
>
>           System                (level 2)
>          /      \
>     Core 0      Core 1          (level 1)
>     /    \      /    \
>  CPU0   CPU1  CPU2   CPU3       (level 0)
>
>
> If you set cpu.scheduled=1 for A and B, each core will be coscheduled
> independently, if there are tasks of A or B on the core. Assuming there
> are runnable tasks in A and B and some other tasks on a core, you will
> see a schedule like:
>
> A -> B -> other tasks -> A -> B -> other tasks -> ...
>
> (or some permutation thereof) happen synchronously across both CPUs
> of a core -- with no guarantees which tasks within A/within B/
> within the other tasks will execute simultaneously -- and with no
> guarantee what will execute on the other two CPUs simultaneously. (The
> distribution of CPU time between A, B, and other tasks follows the usual
> CFS weight proportional distribution, just at core level.) If neither
> CPU of a core has any runnable tasks of a certain group, it won't be part
> of the schedule (e.g., A -> other -> A -> other).
>
> With cpu.scheduled=2, you lift this schedule to system-level and you would
> see it happen across all four CPUs synchronously. With cpu.scheduled=0, you
> get this schedule at CPU-level as we're all used to with no synchronization
> between CPUs. (It gets a tad more interesting, when you start mixing groups
> with cpu.scheduled=1 and =2.)
>
>
> Here are some schedules, that you might see, with A and B coscheduled at
> core level (and that can be enforced this way (along the horizontal dimension)
> by setting the affinity of tasks; without setting the affinity, it could be
> any of them):
>
> Tasks equally distributed within A and B:
>
> t   CPU0    CPU1    CPU2    CPU3
> 0   a0      a1      b2      b3
> 1   a0      a1      other   other
> 2   b0      b1      other   other
> 3   b0      b1      a2      a3
> 4   other   other   a2      a3
> 5   other   other   b2      b3
>
> All tasks within A and B on one CPU:
>
> t   CPU0    CPU1    CPU2    CPU3
> 0   a0      --      other   other
> 1   a1      --      other   other
> 2   b0      --      other   other
> 3   b1      --      other   other
> 4   other   other   other   other
> 5   a2      --      other   other
> 6   a3      --      other   other
> 7   b2      --      other   other
> 8   b3      --      other   other
>
> Tasks within a group equally distributed across one core:
>
> t   CPU0    CPU1    CPU2    CPU3
> 0   a0      a2      b1      b3
> 1   a0      a3      other   other
> 2   a1      a3      other   other
> 3   a1      a2      b0      b3
> 4   other   other   b0      b2
> 5   other   other   b1      b2
>
> You will never see an A-task sharing a core with a B-task at any point in time
> (except for the 2 microseconds or so, that the collective context switch takes).
>
Ok got it. Can we have a more generic interface, like specifying a set of
task ids to be co-scheduled with a particular level rather than tying this
with cgroups? KVMs may not always run with cgroups and there might be other
use cases where we might want co-scheduling that doesn't relate to cgroups.

2018-09-24 15:25:47

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 09/18/2018 04:40 PM, Rik van Riel wrote:
> On Fri, 2018-09-14 at 18:25 +0200, Jan H. Schönherr wrote:
>> On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
>>> On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
>>>>
>>>> B) Why would I want this?
>>>> [one quoted use case from the original e-mail]
>
> What are the other use cases, and what kind of performance
> numbers do you have to show examples of workloads where
> coscheduling provides a performance benefit?

For further use cases (still an incomplete list) let me redirect you to the
unabridged Section B of the original e-mail:
https://lkml.org/lkml/2018/9/7/1521

If you want me to, I can go into more detail and make the list from that
e-mail more complete.


Note, that many coscheduling use cases are not primarily about performance.

Sure, there are the resource contention use cases, which are barely about
anything else. See, e.g., [1] for a survey with further pointers to the
potential performance gains. Realizing those use cases would require either
a user space component driving this, or another kernel component performing
a function similar to the current auto-grouping with some more complexity
depending on the desired level of sophistication. This extra component is
out of my scope. But I see a coscheduler like this as an enabler for
practical applications of these kinds of use cases.

If you use coscheduling as part of a solution that closes a side-channel,
performance is a secondary aspect, and hopefully we don't lose much of it.

Then, there's the large fraction of use cases, where coscheduling is
primarily about design flexibility, because it enables different (old and
new) application designs, which usually cannot be executed in an efficient
manner without coscheduling. For these use cases performance is important,
but there is also a trade-off against development costs of alternative
solutions to consider. These are also the use cases where we can do
measurements today, i.e., without some yet-to-be-written extra component.
For example, with coscheduling it is possible to use active waiting instead
of passive waiting/spin-blocking on non-dedicated systems, because lock
holder preemption is not an issue anymore. It also allows using
applications that were developed for dedicated scenarios in non-dedicated
settings without loss in performance -- like an (unmodified) operating
system within a VM, or HPC code. Another example is cache optimization of
parallel algorithms, where you don't have to resort to cache-oblivious
algorithms for efficiency, but where you can stay with manually tuned or
auto-tuned algorithms, even on non-dedicated systems. (You're even able to
do the tuning itself on a system that has other load.)


Now, you asked about performance numbers, that *I* have.

If a workload has issues with lock-holder preemption, I've seen up to 5x to
20x improvement with coscheduling. (This includes parallel programs [2] and
VMs with unmodified guests without PLE [3].) That is of course highly
dependent on the workload. I currently don't have any numbers comparing
coscheduling to other solutions used to reduce/avoid lock holder
preemption, that don't mix in any other aspect like resource contention.
These would have to be micro-benchmarked.

If you're happy to compare across some more moving variables, then more or
less blind coscheduling of parallel applications with some automatic
workload-driven (but application-agnostic) width adjustment of coscheduled
sets yielded an overall performance benefit of roughly 10% to 20%
compared to approaches with passive waiting [2]. It was roughly on par with
pure space-partitioning approaches (slight minus on performance, slight
plus on flexibility/fairness).

I never went much into the resource contention use cases myself. Though, I
did use coscheduling to extend the concept of "nice" to sockets by putting
all niced programs into a coscheduled task group with appropriately reduced
shares. This way, niced programs don't just get any and all idle CPU
capacity -- taking away parts of the energy budget of more important tasks
all the time -- which leads to important tasks running at turbo frequencies
more often. Depending on the parallelism of niced workload and the
parallelism of normal workload, this translates to a performance
improvement of the normal workload that corresponds roughly to
the increase in frequency (for CPU-bound tasks) [4]. Depending on the
processor, that can be anything from just a few percent to about a factor
of 2.
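
(A rough sketch of such a setup -- with hypothetical values, assuming the
cpu controller is mounted at /sys/fs/cgroup/cpu and that level 2 happens to
correspond to a socket on the machine at hand -- would be:

# mkdir /sys/fs/cgroup/cpu/niced
# echo 2 > /sys/fs/cgroup/cpu/niced/cpu.scheduled
# echo 64 > /sys/fs/cgroup/cpu/niced/cpu.shares
# echo <pid of a niced task> > /sys/fs/cgroup/cpu/niced/tasks

plus explicit per-task affinities as described earlier, since the load
balancing part is still missing.)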

Regards
Jan



References:

[1] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto,
“Survey of scheduling techniques for addressing shared resources in
multicore processors,” ACM Computing Surveys, vol. 45, no. 1, pp.
4:1–4:28, Dec. 2012.

[2] J. H. Schönherr, B. Juurlink, and J. Richling, “TACO: A scheduling
scheme for parallel applications on multicore architectures,”
Scientific Programming, vol. 22, no. 3, pp. 223–237, 2014.

[3] J. H. Schönherr, B. Lutz, and J. Richling, “Non-intrusive coscheduling
for general purpose operating systems,” in Proceedings of the
International Conference on Multicore Software Engineering,
Performance, and Tools (MSEPT ’12), ser. Lecture Notes in Computer
Science, vol. 7303. Berlin/Heidelberg, Germany: Springer, May 2012,
pp. 66–77.

[4] J. H. Schönherr, J. Richling, M. Werner, and G. Mühl, “A scheduling
approach for efficient utilization of hardware-driven frequency
scaling,” in Workshop Proceedings of the 23rd International Conference
on Architecture of Computing Systems (ARCS 2010 Workshops), M. Beigl
and F. J. Cazorla-Almeida, Eds. Berlin, Germany: VDE Verlag, Feb.
2010, pp. 367–376.


2018-09-24 15:44:56

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 09/19/2018 11:53 PM, Subhra Mazumdar wrote:
> Can we have a more generic interface, like specifying a set of task ids
> to be co-scheduled with a particular level rather than tying this with
> cgroups? KVMs may not always run with cgroups and there might be other
> use cases where we might want co-scheduling that doesn't relate to
> cgroups.

Currently: no.

At this point the implementation is tightly coupled to the cpu cgroup
controller. This *might* change, if the task group optimizations mentioned
in other parts of this e-mail thread are done, as I think, that it would
decouple the various mechanisms.

That said, what if you were able to disable the "group-based fairness"
aspect of the cpu cgroup controller? Then you would be able to control
just the coscheduling aspects on their own. Would that satisfy the use
case you have in mind?

Regards
Jan

2018-09-24 18:02:43

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Mon, 2018-09-24 at 17:23 +0200, Jan H. Schönherr wrote:
> On 09/18/2018 04:40 PM, Rik van Riel wrote:
> > On Fri, 2018-09-14 at 18:25 +0200, Jan H. Schönherr wrote:
> > > On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
> > > > On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr
> > > > wrote:
> > > > >
> > > > > B) Why would I want this?
> > > > > [one quoted use case from the original e-mail]
> >
> > What are the other use cases, and what kind of performance
> > numbers do you have to show examples of workloads where
> > coscheduling provides a performance benefit?
>
> For further use cases (still an incomplete list) let me redirect you
> to the
> unabridged Section B of the original e-mail:
> https://lkml.org/lkml/2018/9/7/1521
>
> If you want me to, I can go into more detail and make the list from
> that
> e-mail more complete.
>
>
> Note, that many coscheduling use cases are not primarily about
> performance.
>
> Sure, there are the resource contention use cases, which are barely
> about
> anything else. See, e.g., [1] for a survey with further pointers to
> the
> potential performance gains. Realizing those use cases would require
> either
> a user space component driving this, or another kernel component
> performing
> a function similar to the current auto-grouping with some more
> complexity
> depending on the desired level of sophistication. This extra
> component is
> out of my scope. But I see a coscheduler like this as an enabler for
> practical applications of these kinds of use cases.

Sounds like a co-scheduling system would need the
following elements:
1) Identify groups of runnable tasks to run together.
2) Identify hardware that needs to be co-scheduled
(for L1TF reasons, POWER7/8 restrictions, etc).
3) Pack task groups into the system in a way that
allows maximum utilization by co-scheduled tasks.
4) Leave some CPU time for regular time sliced tasks.
5) In some cases, leave some CPU time idle on purpose.

Step 1 would have to be reevaluated periodically, as
tasks (eg. VCPUs) wake up and go to sleep.

I suspect this may be much better done as its own
scheduler class, instead of shoehorned into CFS.

I like the idea of having some co-scheduling functionality
in Linux, but I absolutely abhor the idea of making CFS
even more complicated than it already is.

The current code is already incredibly hard to debug
or improve.

Are you getting much out of CFS with your current code?

It appears that your patches are fighting CFS as much as
they are leveraging it, but admittedly I only looked at
them briefly.

--
All Rights Reversed.


2018-09-26 09:37:33

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 09/17/2018 03:37 PM, Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
>> With gang scheduling as defined by Feitelson and Rudolph [6], you'd have to
>> explicitly schedule idle time. With coscheduling as defined by Ousterhout [7],
>> you don't. In this patch set, the scheduling of idle time is "merely" a quirk
>> of the implementation. And even with this implementation, there's nothing
>> stopping you from down-sizing the width of the coscheduled set to take out
>> the idle vCPUs dynamically, cutting down on fragmentation.
>
> The thing is, if you drop the full width gang scheduling, you instantly
> require the paravirt spinlock / tlb-invalidate stuff again.

Can't say much about tlb-invalidate, but yes to the spinlock stuff: if
there isn't any additional information available, all runnable tasks/vCPUs
have to be coscheduled to avoid lock holder preemption.

With additional information about tasks potentially holding locks or
potentially spinning on a lock, it would be possible to coschedule smaller
subsets -- no idea if that would be any more efficient though.


> Of course, the constraints of L1TF itself requires the explicit
> scheduling of idle time under a bunch of conditions.

That is true for some of the resource contention use cases, too. Though,
they are much more relaxed wrt. their requirements on the simultaneousness
of the context switch.


> I did not read your [7] in much detail (also a very bad quality scan,
> that :-/); but I don't get how they leap from 'thrashing' to co-scheduling.

In my personal interpretation, that analogy refers to the case where the
waiting time for a lock is shorter than the time for a context switch --
but where the context switch was done anyway, "thrashing" the CPU.


Anyway. I only brought it up, because everyone has a different
understanding of what "coscheduling" or "gang scheduling" actually means.
The memorable quotes are from Ousterhout:

"A task force is coscheduled if all of its runnable processes are exe-
cuting simultaneously on different processors. Each of the processes
in that task force is also said to be coscheduled."

(where a "task force" is a group of closely cooperating tasks), and from
Feitelson and Rudolph:

"[Gang scheduling is defined] as the scheduling of a group of threads
to run on a set of processors at the same time, on a one-to-one
basis."

(with the additional assumption of time slices, collective preemption,
and that threads don't relinquish the CPU during their time slice).

That makes gang scheduling much more specific, while coscheduling just
refers to the fact that some things are executed simultaneously.


> Their initial problem, where A generates data that B needs and the 3
> scenarios:
>
> 1) A has to wait for B
> 2) B has to wait for A
> 3) the data gets buffered
>
> Seems fairly straight forward and is indeed quite common, needing
> co-scheduling for that, I'm not convinced.
>
> We have of course added all sorts of adaptive wait loops in the kernel
> to deal with just that issue.
>
> With co-scheduling you 'ensure' B is running when A is, but that doesn't
> mean you can actually make more progress, you could just be burning a
> lot of CPU cycles (which could've been spent doing other work).

I don't think, that coscheduling should be applied blindly.

Just like the adaptive wait loops you mentioned: in the beginning there
was active waiting; it wasn't that great, so passive waiting was invented;
turns out, the overhead is too high in some cases, so let's spin adaptively
for a moment.

We went from uncoordinated scheduling to system-wide coordinated scheduling
(which turned out to be not very efficient for many cases). And now we are
in the phase to find the right adaptiveness. There is work on enabling
coscheduling only on-demand (when a parallel application profits from it)
or to be more fuzzy about it (giving the scheduler more freedom); there is
work to go away from system-wide coordination to (dynamically) smaller
islands (where I see my own work as well). And "recently" we also have the
resource contention and security use cases leaving their impression on the
topic as well.


> I'm also not convinced co-scheduling makes _any_ sense outside SMT --
> does one of the many papers you cite make a good case for !SMT
> co-scheduling? It just doesn't make sense to co-schedule the LLC domain,
> that's 16+ cores on recent chips.

There's the resource contention stuff, much of which targets the last
level cache or memory controller bandwidth. So, that is making a case for
coscheduling larger parts than SMT. However, I didn't find anything in a
short search that would already cover some of the more recent processors
with 16+ cores.

There's the auto-tuning of parallel algorithms to a certain system
architecture. That would also profit from LLC coscheduling (and slightly
larger time slices) to run multiple of those in parallel. Again, no idea
for recent processors.

There's work to coschedule whole clusters, which goes beyond the scope of a
single system, but also predates recent systems. (Search for, e.g.,
"implicit coscheduling").

So, 16+ cores is unknown territory, AFAIK. But not every recent system has
16+ cores, or will have 16+ cores in the near future.

Regards
Jan

2018-09-26 09:58:50

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 09/17/2018 02:25 PM, Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
>
>> Assuming, there is a cgroup-less solution that can prevent simultaneous
>> execution of tasks on a core, when they're not supposed to. How would you
>> tell the scheduler, which tasks these are?
>
> Specifically for L1TF I hooked into/extended KVM's preempt_notifier
> registration interface, which tells us which tasks are VCPUs and to
> which VM they belong.
>
> But if we want to actually expose this to userspace, we can either do a
> prctl() or extend struct sched_attr.

Both Peter and Subhra seem to prefer an interface different from cgroups
to specify what to coschedule.

Can you provide some extra motivation for why you feel that way?
(ignoring the current scalability issues with the cpu group controller)


After all, cgroups were designed to create arbitrary groups of tasks and
to attach functionality to those groups.

If we were to introduce a different interface to control that, we'd need to
introduce a whole new group concept, so that you make tasks part of some
group while at the same time preventing unauthorized tasks from joining a
group.


I currently don't see any wins, just a loss in flexibility.

Regards
Jan

2018-09-26 17:26:06

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: [RFC 61/60] cosched: Accumulated fixes and improvements

On 13.09.2018 [21:19:38 +0200], Jan H. Schönherr wrote:
> Here is an "extra" patch containing bug fixes and warning removals,
> that I have accumulated up to this point.
>
> It goes on top of the other 60 patches. (When it is time for v2,
> these fixes will be integrated into the appropriate patches within
> the series.)

I found another issue today, while attempting to test (with 61/60
applied) separate coscheduling cgroups for vcpus and emulator threads
[the default configuration with libvirt].

/sys/fs/cgroup/cpu# cat cpu.scheduled
1
/sys/fs/cgroup/cpu# cd machine/
/sys/fs/cgroup/cpu/machine# cat cpu.scheduled
0
/sys/fs/cgroup/cpu/machine# cd VM-1.libvirt-qemu/
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu# cat cpu.scheduled
0
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu# cd vcpu0/
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/vcpu0# cat cpu.scheduled
0
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/vcpu0# echo 1 > cpu.scheduled
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/vcpu0# cd ../emulator/
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/emulator# echo 1 > cpu.scheduled
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/emulator# <crash>

Serial console output (I apologize that some lines got truncated)

[ 1060.840120] BUG: unable to handle kernel NULL pointer dere0
[ 1060.848782] PGD 0 P4D 0
[ 1060.852068] Oops: 0000 [#1] SMP PTI
[ 1060.856207] CPU: 44 PID: 0 Comm: swapper/44 Tainted: G OE 4.19b
[ 1060.867029] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/17
[ 1060.874872] RIP: 0010:set_next_entity+0x15/0x1d0
[ 1060.879770] Code: c8 48 8b 7d d0 eb 96 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
[ 1060.899165] RSP: 0018:ffffaa2b98c0fd78 EFLAGS: 00010046
[ 1060.904720] RAX: 0000000000000000 RBX: ffff996940ba2d80 RCX: 0000000000000000
[ 1060.912199] RDX: 0000000000000008 RSI: 0000000000000000 RDI: ffff996940ba2e00
[ 1060.919678] RBP: ffffaa2b98c0fda0 R08: 0000000000000000 R09: 0000000000000000
[ 1060.927174] R10: 0000000000000000 R11: 0000000000000001 R12: ffff996940ba2e00
[ 1060.934655] R13: 0000000000000000 R14: ffff996940ba2e00 R15: 0000000000000000
[ 1060.942134] FS: 0000000000000000(0000) GS:ffff996940b80000(0000) knlGS:00000
[ 1060.950572] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1060.956673] CR2: 0000000000000040 CR3: 00000064af40a006 CR4: 00000000007626e0
[ 1060.964172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1060.971677] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1060.979191] PKRU: 55555554
[ 1060.982282] Call Trace:
[ 1060.985126] pick_next_task_fair+0x8a7/0xa20
[ 1060.989794] __schedule+0x13a/0x8e0
[ 1060.993691] ? update_ts_time_stats+0x59/0x80
[ 1060.998439] schedule_idle+0x2c/0x40
[ 1061.002410] do_idle+0x169/0x280
[ 1061.006032] cpu_startup_entry+0x73/0x80
[ 1061.010348] start_secondary+0x1ab/0x200
[ 1061.014673] secondary_startup_64+0xa4/0xb0
[ 1061.019265] Modules linked in: act_police cls_basic ebtable_filter ebtables i
[ 1061.093145] mac_hid coretemp lp parport btrfs zstd_compress raid456 async_ri
[ 1061.126494] CR2: 0000000000000040
[ 1061.130467] ---[ end trace 3462ef57e3394c4f ]---
[ 1061.147237] RIP: 0010:set_next_entity+0x15/0x1d0
[ 1061.152510] Code: c8 48 8b 7d d0 eb 96 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
[ 1061.172573] RSP: 0018:ffffaa2b98c0fd78 EFLAGS: 00010046
[ 1061.178482] RAX: 0000000000000000 RBX: ffff996940ba2d80 RCX: 0000000000000000
[ 1061.186309] RDX: 0000000000000008 RSI: 0000000000000000 RDI: ffff996940ba2e00
[ 1061.194109] RBP: ffffaa2b98c0fda0 R08: 0000000000000000 R09: 0000000000000000
[ 1061.201908] R10: 0000000000000000 R11: 0000000000000001 R12: ffff996940ba2e00
[ 1061.209698] R13: 0000000000000000 R14: ffff996940ba2e00 R15: 0000000000000000
[ 1061.217490] FS: 0000000000000000(0000) GS:ffff996940b80000(0000) knlGS:00000
[ 1061.226236] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1061.232622] CR2: 0000000000000040 CR3: 00000064af40a006 CR4: 00000000007626e0
[ 1061.240405] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1061.248168] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1061.255909] PKRU: 55555554
[ 1061.259221] Kernel panic - not syncing: Attempted to kill the idle task!
[ 1062.345087] Shutting down cpus with NMI
[ 1062.351037] Kernel Offset: 0x33400000 from 0xffffffff81000000 (relocation ra)
[ 1062.374645] ---[ end Kernel panic - not syncing: Attempted to kill the idle -
[ 1062.383218] WARNING: CPU: 44 PID: 0 at /build/linux-4.19-0rc3.ag.4/kernel/sc0
[ 1062.394380] Modules linked in: act_police cls_basic ebtable_filter ebtables i
[ 1062.469725] mac_hid coretemp lp parport btrfs zstd_compress raid456 async_ri
[ 1062.503656] CPU: 44 PID: 0 Comm: swapper/44 Tainted: G D OE 4.19b
[ 1062.514972] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/17
[ 1062.523357] RIP: 0010:set_task_cpu+0x193/0x1a0
[ 1062.528624] Code: 00 00 04 e9 36 ff ff ff 0f 0b e9 be fe ff ff f7 43 60 fd f5
[ 1062.549066] RSP: 0018:ffff996940b83dc8 EFLAGS: 00010046
[ 1062.555134] RAX: 0000000000000200 RBX: ffff99c90f2a9e00 RCX: 0000000000000080
[ 1062.563096] RDX: ffff99c90f2aa101 RSI: 000000000000000f RDI: ffff99c90f2a9e00
[ 1062.571053] RBP: ffff996940b83de8 R08: 000000000000000f R09: 000000000000002c
[ 1062.578990] R10: 0000000000000001 R11: 0000000000000009 R12: ffff99c90f2aa934
[ 1062.586911] R13: 000000000000000f R14: 000000000000000f R15: 0000000000022d80
[ 1062.594826] FS: 0000000000000000(0000) GS:ffff996940b80000(0000) knlGS:00000
[ 1062.603681] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1062.610182] CR2: 0000000000000040 CR3: 00000064af40a006 CR4: 00000000007626e0
[ 1062.618061] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1062.625919] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1062.633762] PKRU: 55555554
[ 1062.637186] Call Trace:
[ 1062.640350] <IRQ>
[ 1062.643066] try_to_wake_up+0x159/0x4b0
[ 1062.647588] default_wake_function+0x12/0x20
[ 1062.652539] autoremove_wake_function+0x12/0x40
[ 1062.657744] __wake_up_common+0x8c/0x130
[ 1062.662340] __wake_up_common_lock+0x80/0xc0
[ 1062.667277] __wake_up+0x13/0x20
[ 1062.671170] wake_up_klogd_work_func+0x40/0x60
[ 1062.676275] irq_work_run_list+0x55/0x80
[ 1062.680860] irq_work_run+0x2c/0x40
[ 1062.684992] flush_smp_call_function_queue+0xc0/0x100
[ 1062.690687] generic_smp_call_function_single_interrupt+0x13/0x30
[ 1062.697430] smp_call_function_single_interrupt+0x3e/0xe0
[ 1062.703485] call_function_single_interrupt+0xf/0x20
[ 1062.709100] </IRQ>
[ 1062.711851] RIP: 0010:panic+0x1fe/0x244
[ 1062.716329] Code: eb a6 83 3d 17 bc af 01 00 74 05 e8 b0 72 02 00 48 c7 c6 2f
[ 1062.736366] RSP: 0018:ffffaa2b98c0fe60 EFLAGS: 00000286 ORIG_RAX: ffffffffff4
[ 1062.744571] RAX: 000000000000004a RBX: ffff99693243bc00 RCX: 0000000000000006
[ 1062.752328] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff996940b96420
[ 1062.760077] RBP: ffffaa2b98c0fed8 R08: 000000000000002c R09: 0000000000aaaaaa
[ 1062.767814] R10: 0000000000000040 R11: 0000000000000001 R12: 0000000000000000
[ 1062.775536] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
[ 1062.783236] do_exit+0x886/0xb20
[ 1062.787023] ? cpu_startup_entry+0x73/0x80
[ 1062.791659] rewind_stack_do_exit+0x17/0x20
[ 1062.796364] ---[ end trace 3462ef57e3394c50 ]---
[ 1062.801485] ------------[ cut here ]------------
[ 1062.806599] sched: Unexpected reschedule of offline CPU#15!
[ 1062.812655] WARNING: CPU: 44 PID: 0 at /build/linux-4.19-0rc3.ag.4/arch/x86/0
[ 1062.825264] Modules linked in: act_police cls_basic ebtable_filter ebtables i
[ 1062.899387] mac_hid coretemp lp parport btrfs zstd_compress raid456 async_ri
[ 1062.932747] CPU: 44 PID: 0 Comm: swapper/44 Tainted: G D W OE 4.19b
[ 1062.943874] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/17
[ 1062.952057] RIP: 0010:native_smp_send_reschedule+0x3f/0x50
[ 1062.958164] Code: c0 84 c0 74 17 48 8b 05 ff d9 36 01 be fd 00 00 00 48 8b 40
[ 1062.978210] RSP: 0018:ffff996940b83de8 EFLAGS: 00010086
[ 1062.984093] RAX: 0000000000000000 RBX: ffff99c90f2a9e00 RCX: 0000000000000006
[ 1062.991894] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff996940b96420
[ 1062.999695] RBP: ffff996940b83de8 R08: 000000000000002c R09: 0000000000aaaaaa
[ 1063.007501] R10: ffff996940b83dc8 R11: 0000000000000001 R12: ffff99c90f2aa934
[ 1063.015303] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000022d80
[ 1063.023110] FS: 0000000000000000(0000) GS:ffff996940b80000(0000) knlGS:00000
[ 1063.031881] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1063.038312] CR2: 0000000000000040 CR3: 00000064af40a006 CR4: 00000000007626e0
[ 1063.046138] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1063.053973] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1063.061796] PKRU: 55555554
[ 1063.065193] Call Trace:
[ 1063.068323] <IRQ>
[ 1063.071021] try_to_wake_up+0x3e3/0x4b0
[ 1063.075534] default_wake_function+0x12/0x20
[ 1063.080485] autoremove_wake_function+0x12/0x40
[ 1063.085682] __wake_up_common+0x8c/0x130
[ 1063.090259] __wake_up_common_lock+0x80/0xc0
[ 1063.095172] __wake_up+0x13/0x20
[ 1063.099029] wake_up_klogd_work_func+0x40/0x60
[ 1063.104100] irq_work_run_list+0x55/0x80
[ 1063.108649] irq_work_run+0x2c/0x40
[ 1063.112767] flush_smp_call_function_queue+0xc0/0x100
[ 1063.118451] generic_smp_call_function_single_interrupt+0x13/0x30
[ 1063.125174] smp_call_function_single_interrupt+0x3e/0xe0
[ 1063.131209] call_function_single_interrupt+0xf/0x20
[ 1063.136807] </IRQ>
[ 1063.139535] RIP: 0010:panic+0x1fe/0x244
[ 1063.144009] Code: eb a6 83 3d 17 bc af 01 00 74 05 e8 b0 72 02 00 48 c7 c6 2f
[ 1063.164062] RSP: 0018:ffffaa2b98c0fe60 EFLAGS: 00000286 ORIG_RAX: ffffffffff4
[ 1063.172269] RAX: 000000000000004a RBX: ffff99693243bc00 RCX: 0000000000000006
[ 1063.180034] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff996940b96420
[ 1063.187781] RBP: ffffaa2b98c0fed8 R08: 000000000000002c R09: 0000000000aaaaaa
[ 1063.195519] R10: 0000000000000040 R11: 0000000000000001 R12: 0000000000000000
[ 1063.203243] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
[ 1063.210950] do_exit+0x886/0xb20
[ 1063.214736] ? cpu_startup_entry+0x73/0x80
[ 1063.219371] rewind_stack_do_exit+0x17/0x20
[ 1063.224076] ---[ end trace 3462ef57e3394c51 ]---

2018-09-26 21:06:41

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: [RFC 61/60] cosched: Accumulated fixes and improvements

On 26.09.2018 [10:25:19 -0700], Nishanth Aravamudan wrote:
> On 13.09.2018 [21:19:38 +0200], Jan H. Schönherr wrote:
> > Here is an "extra" patch containing bug fixes and warning removals,
> > that I have accumulated up to this point.
> >
> > It goes on top of the other 60 patches. (When it is time for v2,
> > these fixes will be integrated into the appropriate patches within
> > the series.)
>
> I found another issue today, while attempting to test (with 61/60
> applied) separate coscheduling cgroups for vcpus and emulator threads
> [the default configuration with libvirt].

<snip>

> Serial console output (I apologize that some lines got truncated)

I got a non-truncated log as well:

[ 764.132461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[ 764.141001] PGD 0 P4D 0
[ 764.144020] Oops: 0000 [#1] SMP PTI
[ 764.147988] CPU: 70 PID: 0 Comm: swapper/70 Tainted: G OE 4.19-0rc3.ag-generic #4+1536951040do~8680a1b
[ 764.159086] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/19/2017
[ 764.166968] RIP: 0010:set_next_entity+0x15/0x1d0
[ 764.171887] Code: c8 48 8b 7d d0 eb 96 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 <8b> 46 40 48 89 f30
[ 764.191276] RSP: 0018:ffffb97158cdfd78 EFLAGS: 00010046
[ 764.196888] RAX: 0000000000000000 RBX: ffff9806c0ee2d80 RCX: 0000000000000000
[ 764.204403] RDX: 0000000000000022 RSI: 0000000000000000 RDI: ffff9806c0ee2e00
[ 764.211918] RBP: ffffb97158cdfda0 R08: ffffb97178cd8000 R09: 0000000000006080
[ 764.219412] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9806c0ee2e00
[ 764.226903] R13: 0000000000000000 R14: ffff9806c0ee2e00 R15: 0000000000000000
[ 764.234433] FS: 0000000000000000(0000) GS:ffff9806c0ec0000(0000) knlGS:0000000000000000
[ 764.242919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 764.249045] CR2: 0000000000000040 CR3: 00000002d720a004 CR4: 00000000007626e0
[ 764.256558] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 764.264108] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 764.271663] PKRU: 55555554
[ 764.274784] Call Trace:
[ 764.277633] pick_next_task_fair+0x8a7/0xa20
[ 764.282292] __schedule+0x13a/0x8e0
[ 764.286184] schedule_idle+0x2c/0x40
[ 764.290161] do_idle+0x169/0x280
[ 764.293816] cpu_startup_entry+0x73/0x80
[ 764.298151] start_secondary+0x1ab/0x200
[ 764.302513] secondary_startup_64+0xa4/0xb0
[ 764.307127] Modules linked in: act_police cls_basic ebtable_filter ebtables ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw ip_tables r
[ 764.381780] coretemp lp parport btrfs zstd_compress raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linei
[ 764.414567] CR2: 0000000000000040
[ 764.418596] ---[ end trace 9b35e3cb99f8eacb ]---
[ 764.437343] RIP: 0010:set_next_entity+0x15/0x1d0
[ 764.442748] Code: c8 48 8b 7d d0 eb 96 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 <8b> 46 40 48 89 f30
[ 764.462845] RSP: 0018:ffffb97158cdfd78 EFLAGS: 00010046
[ 764.468788] RAX: 0000000000000000 RBX: ffff9806c0ee2d80 RCX: 0000000000000000
[ 764.476633] RDX: 0000000000000022 RSI: 0000000000000000 RDI: ffff9806c0ee2e00
[ 764.484476] RBP: ffffb97158cdfda0 R08: ffffb97178cd8000 R09: 0000000000006080
[ 764.492322] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9806c0ee2e00
[ 764.500143] R13: 0000000000000000 R14: ffff9806c0ee2e00 R15: 0000000000000000
[ 764.507988] FS: 0000000000000000(0000) GS:ffff9806c0ec0000(0000) knlGS:0000000000000000
[ 764.516801] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 764.523258] CR2: 0000000000000040 CR3: 00000002d720a004 CR4: 00000000007626e0
[ 764.531084] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 764.538987] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 764.546813] PKRU: 55555554
[ 764.550185] Kernel panic - not syncing: Attempted to kill the idle task!
[ 764.557615] Kernel Offset: 0x1f400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 764.581890] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
[ 764.590574] WARNING: CPU: 70 PID: 0 at /build/linux-4.19-0rc3.ag.4/kernel/sched/core.c:1187 set_task_cpu+0x193/0x1a0
[ 764.601740] Modules linked in: act_police cls_basic ebtable_filter ebtables ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw ip_tables r
[ 764.677788] coretemp lp parport btrfs zstd_compress raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linei
[ 764.711018] CPU: 70 PID: 0 Comm: swapper/70 Tainted: G D OE 4.19-0rc3.ag-generic #4+1536951040do~8680a1b
[ 764.722332] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/19/2017
[ 764.730716] RIP: 0010:set_task_cpu+0x193/0x1a0
[ 764.735983] Code: 00 00 04 e9 36 ff ff ff 0f 0b e9 be fe ff ff f7 43 60 fd ff ff ff 0f 84 c8 fe ff ff 0f 0b e9 c1 fe ff ff 31 c0 e9 6d ff ff ff <0f> 0b e9 c9 fe ff5
[ 764.756428] RSP: 0018:ffff9806c0ec3e08 EFLAGS: 00010046
[ 764.762512] RAX: 0000000000000200 RBX: ffff980547829e00 RCX: 0000000000000080
[ 764.770492] RDX: ffff98054782a101 RSI: 0000000000000000 RDI: ffff980547829e00
[ 764.778456] RBP: ffff9806c0ec3e28 R08: 0000000000000000 R09: 0000000000000046
[ 764.786412] R10: 0000000000000001 R11: 0000000000000046 R12: ffff98054782a934
[ 764.794351] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000022d80
[ 764.802272] FS: 0000000000000000(0000) GS:ffff9806c0ec0000(0000) knlGS:0000000000000000
[ 764.811138] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 764.817657] CR2: 0000000000000040 CR3: 00000002d720a004 CR4: 00000000007626e0
[ 764.825550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 764.833427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 764.841280] PKRU: 55555554
[ 764.844702] Call Trace:
[ 764.847857] <IRQ>
[ 764.850581] try_to_wake_up+0x159/0x4b0
[ 764.855146] ? apic_timer_expired+0x70/0x70 [kvm]
[ 764.860529] wake_up_process+0x15/0x20
[ 764.864952] swake_up_locked+0x24/0x40
[ 764.869370] swake_up_one+0x1f/0x30
[ 764.873544] apic_timer_expired+0x4b/0x70 [kvm]
[ 764.878739] apic_timer_fn+0x1b/0x50 [kvm]
[ 764.883487] __hrtimer_run_queues+0x106/0x270
[ 764.888496] hrtimer_interrupt+0x116/0x240
[ 764.893237] smp_apic_timer_interrupt+0x6f/0x140
[ 764.898497] apic_timer_interrupt+0xf/0x20
[ 764.903228] </IRQ>
[ 764.905967] RIP: 0010:panic+0x1fe/0x244
[ 764.910438] Code: eb a6 83 3d 17 bc af 01 00 74 05 e8 b0 72 02 00 48 c7 c6 20 f1 f8 a1 48 c7 c7 10 54 6d a1 e8 c0 a3 06 00 fb 66 0f 1f 44 00 00 <31> db e8 3f f5 0df
[ 764.930499] RSP: 0018:ffffb97158cdfe60 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[ 764.938726] RAX: 000000000000004a RBX: ffff9806b2501e00 RCX: 0000000000000006
[ 764.946509] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff9806c0ed6420
[ 764.954282] RBP: ffffb97158cdfed8 R08: 0000000000000046 R09: 0000000000aaaaaa
[ 764.962038] R10: 0000000000000040 R11: 0000000000000001 R12: 0000000000000000
[ 764.969776] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
[ 764.977502] do_exit+0x886/0xb20
[ 764.981305] ? cpu_startup_entry+0x73/0x80
[ 764.985967] rewind_stack_do_exit+0x17/0x20
[ 764.990699] ---[ end trace 9b35e3cb99f8eacc ]---
[ 764.995851] ------------[ cut here ]------------
[ 765.000984] sched: Unexpected reschedule of offline CPU#0!
[ 765.006976] WARNING: CPU: 70 PID: 0 at /build/linux-4.19-0rc3.ag.4/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x3f/0x50
[ 765.019617] Modules linked in: act_police cls_basic ebtable_filter ebtables ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw ip_tables r
[ 765.094470] coretemp lp parport btrfs zstd_compress raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linei
[ 765.127134] CPU: 70 PID: 0 Comm: swapper/70 Tainted: G D W OE 4.19-0rc3.ag-generic #4+1536951040do~8680a1b
[ 765.138261] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/19/2017
[ 765.146443] RIP: 0010:native_smp_send_reschedule+0x3f/0x50
[ 765.152543] Code: c0 84 c0 74 17 48 8b 05 ff d9 36 01 be fd 00 00 00 48 8b 40 30 e8 71 5e da 00 5d c3 89 fe 48 c7 c7 e8 b5 6c a1 e8 31 5b 03 00 <0f> 0b 5d c3 0f 1f0
[ 765.172572] RSP: 0018:ffff9806c0ec3d78 EFLAGS: 00010086
[ 765.178438] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[ 765.186228] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff9806c0ed6420
[ 765.194020] RBP: ffff9806c0ec3d78 R08: 0000000000000046 R09: 0000000000aaaaaa
[ 765.201812] R10: ffff9806c0ec3c98 R11: 0000000000000001 R12: ffff9806c0622d80
[ 765.209601] R13: ffff9806c0622d80 R14: ffff9806c0ec3e48 R15: ffff9806c0622d80
[ 765.217394] FS: 0000000000000000(0000) GS:ffff9806c0ec0000(0000) knlGS:0000000000000000
[ 765.226154] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 765.232575] CR2: 0000000000000040 CR3: 00000002d720a004 CR4: 00000000007626e0
[ 765.240395] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 765.248211] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 765.256028] PKRU: 55555554
[ 765.259416] Call Trace:
[ 765.262547] <IRQ>
[ 765.265232] resched_curr+0x79/0xf0
[ 765.269391] check_preempt_curr+0x78/0xe0
[ 765.274073] ttwu_do_wakeup+0x1e/0x150
[ 765.278485] ttwu_do_activate+0x77/0x80
[ 765.282966] try_to_wake_up+0x1d6/0x4b0
[ 765.287445] ? apic_timer_expired+0x70/0x70 [kvm]
[ 765.292775] wake_up_process+0x15/0x20
[ 765.297151] swake_up_locked+0x24/0x40
[ 765.301518] swake_up_one+0x1f/0x30
[ 765.305637] apic_timer_expired+0x4b/0x70 [kvm]
[ 765.310800] apic_timer_fn+0x1b/0x50 [kvm]
[ 765.315515] __hrtimer_run_queues+0x106/0x270
[ 765.320490] hrtimer_interrupt+0x116/0x240
[ 765.325204] smp_apic_timer_interrupt+0x6f/0x140
[ 765.330439] apic_timer_interrupt+0xf/0x20
[ 765.335151] </IRQ>
[ 765.337865] RIP: 0010:panic+0x1fe/0x244
[ 765.342304] Code: eb a6 83 3d 17 bc af 01 00 74 05 e8 b0 72 02 00 48 c7 c6 20 f1 f8 a1 48 c7 c7 10 54 6d a1 e8 c0 a3 06 00 fb 66 0f 1f 44 00 00 <31> db e8 3f f5 0df
[ 765.362254] RSP: 0018:ffffb97158cdfe60 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[ 765.370407] RAX: 000000000000004a RBX: ffff9806b2501e00 RCX: 0000000000000006
[ 765.378120] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff9806c0ed6420
[ 765.385815] RBP: ffffb97158cdfed8 R08: 0000000000000046 R09: 0000000000aaaaaa
[ 765.393504] R10: 0000000000000040 R11: 0000000000000001 R12: 0000000000000000
[ 765.401172] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
[ 765.408830] do_exit+0x886/0xb20
[ 765.412561] ? cpu_startup_entry+0x73/0x80
[ 765.417147] rewind_stack_do_exit+0x17/0x20
[ 765.421799] ---[ end trace 9b35e3cb99f8eacd ]---

Thanks,
Nish

2018-09-27 18:14:56

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux



On 09/24/2018 08:43 AM, Jan H. Schönherr wrote:
> On 09/19/2018 11:53 PM, Subhra Mazumdar wrote:
>> Can we have a more generic interface, like specifying a set of task ids
>> to be co-scheduled with a particular level rather than tying this with
>> cgroups? KVMs may not always run with cgroups and there might be other
>> use cases where we might want co-scheduling that doesn't relate to
>> cgroups.
> Currently: no.
>
> At this point the implementation is tightly coupled to the cpu cgroup
> controller. This *might* change, if the task group optimizations mentioned
> in other parts of this e-mail thread are done, as I think, that it would
> decouple the various mechanisms.
>
> That said, what if you were able to disable the "group-based fairness"
> aspect of the cpu cgroup controller? Then you would be able to control
> just the coscheduling aspects on their own. Would that satisfy the use
> case you have in mind?
>
> Regards
> Jan
Yes, that will suffice for the use case. We wish to experiment at some point
with co-scheduling of certain worker threads in DB parallel query execution
and see if there is any benefit.

Thanks,
Subhra

2018-09-27 18:38:13

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux



On 09/26/2018 02:58 AM, Jan H. Schönherr wrote:
> On 09/17/2018 02:25 PM, Peter Zijlstra wrote:
>> On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
>>
>>> Assuming, there is a cgroup-less solution that can prevent simultaneous
>>> execution of tasks on a core, when they're not supposed to. How would you
>>> tell the scheduler, which tasks these are?
>> Specifically for L1TF I hooked into/extended KVM's preempt_notifier
>> registration interface, which tells us which tasks are VCPUs and to
>> which VM they belong.
>>
>> But if we want to actually expose this to userspace, we can either do a
>> prctl() or extend struct sched_attr.
> Both, Peter and Subhra, seem to prefer an interface different than cgroups
> to specify what to coschedule.
>
> Can you provide some extra motivation for me, why you feel that way?
> (ignoring the current scalability issues with the cpu group controller)
>
>
> After all, cgroups were designed to create arbitrary groups of tasks and
> to attach functionality to those groups.
>
> If we were to introduce a different interface to control that, we'd need to
> introduce a whole new group concept, so that you make tasks part of some
> group while at the same time preventing unauthorized tasks from joining a
> group.
>
>
> I currently don't see any wins, just a loss in flexibility.
>
> Regards
> Jan
I think cgroups will get the job done for any use case. But we have,
e.g., affinity control via both sched_setaffinity and cgroup cpusets. It
would be good to have an alternative way to specify co-scheduling too, for
those who don't want to use cgroups for some reason. It can be added later
on, though; only how one will override the other will need to be sorted out.

2018-10-01 09:14:13

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 61/60] cosched: Accumulated fixes and improvements

On 09/26/2018 11:05 PM, Nishanth Aravamudan wrote:
> On 26.09.2018 [10:25:19 -0700], Nishanth Aravamudan wrote:
>>
>> I found another issue today, while attempting to test (with 61/60
>> applied) separate coscheduling cgroups for vcpus and emulator threads
>> [the default configuration with libvirt].
>
> <snip>
>
> [ 764.132461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
> [ 764.141001] PGD 0 P4D 0
> [ 764.144020] Oops: 0000 [#1] SMP PTI
> [ 764.147988] CPU: 70 PID: 0 Comm: swapper/70 Tainted: G OE 4.19-0rc3.ag-generic #4+1536951040do~8680a1b
> [ 764.159086] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/19/2017
> [ 764.166968] RIP: 0010:set_next_entity+0x15/0x1d0

I got it reproduced.

I will send a fix, when I have one.

Regards
Jan

2018-10-04 13:30:07

by Jon Masters

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 9/7/18 5:39 PM, Jan H. Schönherr wrote:

> The collective context switch from one coscheduled set of tasks to another
> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
> that all tasks of the previous set have stopped executing before any task
> of the next set starts executing, an additional hand-shake/barrier needs to
> be added.

In case nobody else brought it up yet, you're going to need a handshake
to strengthen protection against L1TF attacks. Otherwise, there's still
a small window where an attack can occur during the reschedule. Perhaps
one could then cause this to happen artificially by repeatedly having a VM
do some kind of pause/mwait-type operation that might trigger a reschedule.
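
A minimal sketch of the kind of hand-shake I have in mind, in plain C with
invented names (not code from the patch set): each sibling announces that it
has stopped running tasks of the old group and then spins until all siblings
have done so, before any of them picks up a task of the new group.

#include <stdatomic.h>

#define NR_SIBLINGS 2                   /* e.g. the two SMT siblings of a core */

static atomic_int arrived;              /* siblings done with the old group */
static atomic_int generation;           /* which collective switch we are in */

static void cosched_rendezvous(void)    /* hypothetical helper */
{
        int gen = atomic_load(&generation);

        if (atomic_fetch_add(&arrived, 1) + 1 == NR_SIBLINGS) {
                /* last sibling in: reset the counter and release everybody */
                atomic_store(&arrived, 0);
                atomic_fetch_add(&generation, 1);
        } else {
                /* spin until the last sibling has arrived */
                while (atomic_load(&generation) == gen)
                        ;               /* cpu_relax() in kernel terms */
        }
}

Each sibling would call this after it stopped executing tasks of the old
group and before it starts executing tasks of the new one; only then is the
window actually closed.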

Jon.

--
Computer Architect | Sent with my Fedora powered laptop

2018-10-17 02:11:39

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
> C) How does it work?
> --------------------
>
> This patch series introduces hierarchical runqueues, that represent larger
> and larger fractions of the system. By default, there is one runqueue per
> scheduling domain. These additional levels of runqueues are activated by
> the "cosched_max_level=" kernel command line argument. The bottom level is
> 0.
>
> One CPU per hierarchical runqueue is considered the leader, who is
> primarily responsible for the scheduling decision at this level. Once the
> leader has selected a task group to execute, it notifies all leaders of the
> runqueues below it to select tasks/task groups within the selected task
> group.
>
> For each task-group, the user can select at which level it should be
> scheduled. If you set "cpu.scheduled" to "1", coscheduling will typically
> happen at core-level on systems with SMT. That is, if one SMT sibling
> executes a task from this task group, the other sibling will do so, too. If
> no task is available, the SMT sibling will be idle. With "cpu.scheduled"
> set to "2" this is extended to the next level, which is typically a whole
> socket on many systems. And so on. If you feel, that this does not provide
> enough flexibility, you can specify "cosched_split_domains" on the kernel
> command line to create more fine-grained scheduling domains for your
> system.

Have you considered using cpuset to specify the set of CPUs inside which
you want to coschedule task groups in? Perhaps that would be more flexible
and intuitive to control than this cpu.scheduled value.

Unless you require this feature to act always symmetrical through the branches
of a given domain tree?

Thanks.

2018-10-19 00:28:16

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

Hi Jan,

On 9/7/18 2:39 PM, Jan H. Schönherr wrote:
> The collective context switch from one coscheduled set of tasks to another
> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
> that all tasks of the previous set have stopped executing before any task
> of the next set starts executing, an additional hand-shake/barrier needs to
> be added.
>
Do you know how long the delay is? I.e., what is the overlap time when a
thread of the new group starts executing on one HT while there is still a
thread of another group running on the other HT?

Thanks,
Subhra

2018-10-19 11:40:57

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 17/10/2018 04.09, Frederic Weisbecker wrote:
> On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
>> C) How does it work?
>> --------------------
[...]
>> For each task-group, the user can select at which level it should be
>> scheduled. If you set "cpu.scheduled" to "1", coscheduling will typically
>> happen at core-level on systems with SMT. That is, if one SMT sibling
>> executes a task from this task group, the other sibling will do so, too. If
>> no task is available, the SMT sibling will be idle. With "cpu.scheduled"
>> set to "2" this is extended to the next level, which is typically a whole
>> socket on many systems. And so on. If you feel, that this does not provide
>> enough flexibility, you can specify "cosched_split_domains" on the kernel
>> command line to create more fine-grained scheduling domains for your
>> system.
>
> Have you considered using cpuset to specify the set of CPUs inside which
> you want to coschedule task groups in? Perhaps that would be more flexible
> and intuitive to control than this cpu.scheduled value.

Yes, I did consider cpusets. Though, there are two dimensions to it:
a) at what fraction of the system tasks shall be coscheduled, and
b) where these tasks shall execute within the system.

cpusets would be the obvious answer to the "where". However, in the current
form they are too inflexible with too much overhead. Suppose, you want to
coschedule two tasks on SMT siblings of a core. You would be able to
restrict the tasks to a specific core with a cpuset. But then, it is bound
to that core, and the load balancer cannot move the group of two tasks to a
different core.

Now, it would be possible to "invent" relocatable cpusets to address that
issue ("I want affinity restricted to a core, I don't care which"), but
then, the current way how cpuset affinity is enforced doesn't scale for
making use of it from within the balancer. (The upcoming load balancing
portion of the coscheduler currently uses a file similar to cpu.scheduled
to restrict affinity to a load-balancer-controlled subset of the system.)


Using cpusets as the mean to describe which parts of the system are to be
coscheduled *may* be possible. But if so, it's a long way out. The current
implementation uses scheduling domains for this, because (a) most
coscheduling use cases require an alignment to the topology, and (b) it
integrates really nicely with the load balancer.

AFAIK, there is already some interaction between cpusets and scheduling
domains. But it is supposed to be rather static and as soon as you have
overlapping cpusets, you end up with the default scheduling domains.
If we were able to make the scheduling domains more dynamic than they are
today, we might be able to couple that to cpusets (or some similar
interface to *define* scheduling domains).

Regards
Jan


2018-10-19 14:53:16

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, Oct 19, 2018 at 01:40:03PM +0200, Jan H. Schönherr wrote:
> On 17/10/2018 04.09, Frederic Weisbecker wrote:
> > On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
> >> C) How does it work?
> >> --------------------
> [...]
> >> For each task-group, the user can select at which level it should be
> >> scheduled. If you set "cpu.scheduled" to "1", coscheduling will typically
> >> happen at core-level on systems with SMT. That is, if one SMT sibling
> >> executes a task from this task group, the other sibling will do so, too. If
> >> no task is available, the SMT sibling will be idle. With "cpu.scheduled"
> >> set to "2" this is extended to the next level, which is typically a whole
> >> socket on many systems. And so on. If you feel, that this does not provide
> >> enough flexibility, you can specify "cosched_split_domains" on the kernel
> >> command line to create more fine-grained scheduling domains for your
> >> system.
> >
> > Have you considered using cpuset to specify the set of CPUs inside which
> > you want to coschedule task groups in? Perhaps that would be more flexible
> > and intuitive to control than this cpu.scheduled value.
>
> Yes, I did consider cpusets. Though, there are two dimensions to it:
> a) at what fraction of the system tasks shall be coscheduled, and
> b) where these tasks shall execute within the system.
>
> cpusets would be the obvious answer to the "where". However, in the current
> form they are too inflexible with too much overhead. Suppose, you want to
> coschedule two tasks on SMT siblings of a core. You would be able to
> restrict the tasks to a specific core with a cpuset. But then, it is bound
> to that core, and the load balancer cannot move the group of two tasks to a
> different core.
>
> Now, it would be possible to "invent" relocatable cpusets to address that
> issue ("I want affinity restricted to a core, I don't care which"), but
> then, the current way how cpuset affinity is enforced doesn't scale for
> making use of it from within the balancer. (The upcoming load balancing
> portion of the coscheduler currently uses a file similar to cpu.scheduled
> to restrict affinity to a load-balancer-controlled subset of the system.)

Oh ok, I understand now. Affinity and node-scope mutual exclusion are
entirely decoupled, I see.

>
>
> Using cpusets as the mean to describe which parts of the system are to be
> coscheduled *may* be possible. But if so, it's a long way out. The current
> implementation uses scheduling domains for this, because (a) most
> coscheduling use cases require an alignment to the topology, and (b) it
> integrates really nicely with the load balancer.

So what is the need for cosched_split_domains? What kind of corner case won't
fit into scheduling domains? Could you perhaps leave that part out of this
patch set to simplify it somewhat? If it happens to be necessary, it can
still be added iteratively.

Thanks.


2018-10-19 15:18:02

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, 2018-10-19 at 13:40 +0200, Jan H. Schönherr wrote:
>
> Now, it would be possible to "invent" relocatable cpusets to address
> that
> issue ("I want affinity restricted to a core, I don't care which"),
> but
> then, the current way how cpuset affinity is enforced doesn't scale
> for
> making use of it from within the balancer. (The upcoming load
> balancing
> portion of the coscheduler currently uses a file similar to
> cpu.scheduled
> to restrict affinity to a load-balancer-controlled subset of the
> system.)

Oh boy, so the coscheduler is going to get its
own load balancer?

At that point, why bother integrating the
coscheduler into CFS, instead of making it its
own scheduling class?

CFS is already complicated enough that it borders
on unmaintainable. I would really prefer to have
the coscheduler code separate from CFS, unless
there is a really compelling reason to do otherwise.

--
All Rights Reversed.



2018-10-19 15:35:52

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, Oct 19, 2018 at 11:16:49AM -0400, Rik van Riel wrote:
> On Fri, 2018-10-19 at 13:40 +0200, Jan H. Schönherr wrote:
> >
> > Now, it would be possible to "invent" relocatable cpusets to address
> > that
> > issue ("I want affinity restricted to a core, I don't care which"),
> > but
> > then, the current way how cpuset affinity is enforced doesn't scale
> > for
> > making use of it from within the balancer. (The upcoming load
> > balancing
> > portion of the coscheduler currently uses a file similar to
> > cpu.scheduled
> > to restrict affinity to a load-balancer-controlled subset of the
> > system.)
>
> Oh boy, so the coscheduler is going to get its
> own load balancer?
>
> At that point, why bother integrating the
> coscheduler into CFS, instead of making it its
> own scheduling class?
>
> CFS is already complicated enough that it borders
> on unmaintainable. I would really prefer to have
> the coscheduler code separate from CFS, unless
> there is a really compelling reason to do otherwise.

I guess he wants to reuse as much as possible from the CFS features and
code present or to come (nice, fairness, load balancing, power aware,
NUMA aware, etc...).

OTOH you're right, the thing has specific enough requirements to consider a new
sched policy. And really I would love to see all that code separate from CFS,
for the reasons you just outlined. So I'm crossing my fingers for Jan's
answer regarding a new policy.


2018-10-19 15:46:45

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Fri, 2018-10-19 at 17:33 +0200, Frederic Weisbecker wrote:
> On Fri, Oct 19, 2018 at 11:16:49AM -0400, Rik van Riel wrote:
> > On Fri, 2018-10-19 at 13:40 +0200, Jan H. Schönherr wrote:
> > >
> > > Now, it would be possible to "invent" relocatable cpusets to
> > > address
> > > that
> > > issue ("I want affinity restricted to a core, I don't care
> > > which"),
> > > but
> > > then, the current way how cpuset affinity is enforced doesn't
> > > scale
> > > for
> > > making use of it from within the balancer. (The upcoming load
> > > balancing
> > > portion of the coscheduler currently uses a file similar to
> > > cpu.scheduled
> > > to restrict affinity to a load-balancer-controlled subset of the
> > > system.)
> >
> > Oh boy, so the coscheduler is going to get its
> > own load balancer?
> >
> > At that point, why bother integrating the
> > coscheduler into CFS, instead of making it its
> > own scheduling class?
> >
> > CFS is already complicated enough that it borders
> > on unmaintainable. I would really prefer to have
> > the coscheduler code separate from CFS, unless
> > there is a really compelling reason to do otherwise.
>
> I guess he wants to reuse as much as possible from the CFS features
> and
> code present or to come (nice, fairness, load balancing, power aware,
> NUMA aware, etc...).

I wonder if things like nice levels, fairness,
and balancing could be broken out into code
that could be reused from both CFS and a new
co-scheduler scheduling class.

A bunch of the cgroup code is already broken
out, but maybe some more could be broken out
and shared, too?

> OTOH you're right, the thing has specific enough requirements to
> consider a new sched policy.

Some bits of functionality come to mind:
- track groups of tasks that should be co-scheduled
(eg all the VCPUs of a virtual machine)
- track the subsets of those groups that are runnable
(eg. the currently runnable VCPUs of a virtual machine)
- figure out time slots and CPU assignments to efficiently
use CPU time for the co-scheduled tasks
(while leaving some configurable(?) amount of CPU time
available for other tasks)
- configuring some lower-level code on each affected CPU
to "run task A in slot X", etc

This really does not seem like something that could be
shoehorned into CFS without making it unmaintainable.

Furthermore, it also seems like the thing that you could
never really get into a highly efficient state as long
as it is weighed down by the rest of CFS.

--
All Rights Reversed.



2018-10-19 19:09:10

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 19/10/2018 17.45, Rik van Riel wrote:
> On Fri, 2018-10-19 at 17:33 +0200, Frederic Weisbecker wrote:
>> On Fri, Oct 19, 2018 at 11:16:49AM -0400, Rik van Riel wrote:
>>> On Fri, 2018-10-19 at 13:40 +0200, Jan H. Schönherr wrote:
>>>>
>>>> Now, it would be possible to "invent" relocatable cpusets to
>>>> address that issue ("I want affinity restricted to a core, I don't
>>>> care which"), but then, the current way how cpuset affinity is
>>>> enforced doesn't scale for making use of it from within the
>>>> balancer. (The upcoming load balancing portion of the coscheduler
>>>> currently uses a file similar to cpu.scheduled to restrict
>>>> affinity to a load-balancer-controlled subset of the system.)
>>>
>>> Oh boy, so the coscheduler is going to get its own load balancer?

Not "its own". The load balancer already aggregates statistics about
sched-groups. With the coscheduler as posted, there is now a runqueue per
scheduling group. The current "ad-hoc" gathering of data per scheduling
group is then basically replaced with looking up that data at the
corresponding runqueue, where it is kept up-to-date automatically.
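
As a rough illustration of that direction (the names below are invented for
this sketch, they are not the structures from the patch set): each level
keeps its aggregate up to date on enqueue/dequeue, and the balancer simply
reads it instead of re-deriving it ad hoc.

struct level_rq {
        struct level_rq *parent;        /* runqueue of the enclosing level    */
        unsigned long    load_sum;      /* aggregated load of all child rqs   */
        unsigned int     nr_running;    /* runnable entities below this level */
};

/* Called on enqueue; dequeue would do the inverse. */
static void level_rq_enqueue(struct level_rq *rq, unsigned long load)
{
        for (; rq; rq = rq->parent) {
                rq->load_sum += load;
                rq->nr_running++;
        }
}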


>>> At that point, why bother integrating the coscheduler into CFS,
>>> instead of making it its own scheduling class?
>>>
>>> CFS is already complicated enough that it borders on unmaintainable.
>>> I would really prefer to have the coscheduler code separate from
>>> CFS, unless there is a really compelling reason to do otherwise.
>>
>> I guess he wants to reuse as much as possible from the CFS features
>> and code present or to come (nice, fairness, load balancing, power
>> aware, NUMA aware, etc...).

Exactly. I want a user to be able to "switch on" coscheduling for those
parts of the workload that profit from it, without affecting the behavior
we are all used to. For both: scheduling behavior for tasks that are not
coscheduled, as well as scheduling behavior for tasks *within* the group of
coscheduled tasks.


> I wonder if things like nice levels, fairness, and balancing could be
> broken out into code that could be reused from both CFS and a new
> co-scheduler scheduling class.
>
> A bunch of the cgroup code is already broken out, but maybe some more
> could be broken out and shared, too?

Maybe.


>> OTOH you're right, the thing has specific enough requirements to
>> consider a new sched policy.

The primary issue that I have with a new scheduling class is that scheduling
classes are strictly priority ordered. If there is a runnable task in a higher class,
it is executed, no matter the situation in lower classes. "Coscheduling"
would have to be higher in the class hierarchy than CFS. And then, all
kinds of issues appear from starvation of CFS threads and other unfairness,
to the necessity of (re-)defining a set of preemption rules, nice and other
things that are given with CFS.
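
For reference, the pick loop at the core of the scheduler is essentially a
strict walk over the classes in priority order; roughly this, simplified
from kernel/sched/core.c (fast path and retry handling omitted):

        for_each_class(class) {
                p = class->pick_next_task(rq, prev, rf);
                if (p)
                        return p;   /* first class with a runnable task wins */
        }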


> Some bits of functionality come to mind:
>
> - track groups of tasks that should be co-scheduled (eg all the VCPUs of
> a virtual machine)

cgroups

> - track the subsets of those groups that are runnable (eg. the currently
> runnable VCPUs of a virtual machine)

runqueues

> - figure out time slots and CPU assignments to efficiently use CPU time
> for the co-scheduled tasks (while leaving some configurable(?) amount of
> CPU time available for other tasks)

CFS runqueues and associated rules for preemption/time slices/etc.

> - configuring some lower-level code on each affected CPU to "run task A
> in slot X", etc

There is no "slot" concept, as it does not fit my idea of interactive
usage. (As in "slot X will execute from time T to T+1.) It is purely
event-driven right now (eg, "group X just became runnable, it is considered
more important than the currently running group Y; all CPUs (in the
affected part of the system) switch to group X", or "group X ran long
enough, next group").

While some planning ahead seems possible (as demonstrated by the Tableau
scheduler that Peter already pointed me at), I currently cannot imagine
such an approach working for general purpose workloads. The absence of true
preemption being my primary concern.


> This really does not seem like something that could be shoehorned into
> CFS without making it unmaintainable.
>
> Furthermore, it also seems like the thing that you could never really
> get into a highly efficient state as long as it is weighed down by the
> rest of CFS.

I still have this idealistic notion, that there is no "weighing down". I
see it more as profiting from all the hard work that went into CFS,
avoiding the same mistakes, being backwards compatible, etc.



If I were to do this "outside of CFS", I'd overhaul the scheduling class
concept as it exists today. Instead, I'd probably attempt to schedule
instantiations of scheduling classes. In its easiest setup, nothing would
change: one CFS instance, one RT instance, one DL instance, strictly
ordered by priority (on each CPU). The coscheduler as it is posted (and
task groups in general), are effectively some form of multiple CFS
instances being governed by a CFS instance.

This approach would allow, for example, multiple CFS instances that are
scheduled with explicit priorities; or some tasks that are scheduled with a
custom scheduling class, while the whole group of tasks competes for time
with other tasks via CFS rules.
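
Purely as a hypothetical sketch of what such an instantiation could look
like (nothing of this exists in the posted code; the names are made up):

struct sched_class;                        /* existing: a set of scheduling rules */

struct sched_instance {
        const struct sched_class *class;   /* rules governing this instance's
                                              entities                            */
        struct sched_instance    *parent;  /* instance that arbitrates between
                                              this instance and its siblings      */
        int                       prio;    /* priority among sibling instances,
                                              for the "explicit priorities" case  */
};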

I'd still keep the feature of "coscheduling" orthogonal to everything else,
though. Essentially, I'd just give the user/admin the possibility to choose
the set of rules that shall be applied to entities in a runqueue.



Your idea of further modularization seems to go in a similar direction, or
at least is not incompatible with that. If it helps keeping things
maintainable, I'm all for it. For example, some of the (upcoming) load
balancing changes are just generalizations, so that the functions don't
operate on *the* set of CFS runqueues, but just *a* set of CFS runqueues.
Similarly in the already posted code, where task picking now starts at
*some* top CFS runqueue, instead of *the* top CFS runqueue.

Regards
Jan

2018-10-26 23:07:04

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux



> D) What can I *not* do with this?
> ---------------------------------
>
> Besides the missing load-balancing within coscheduled task-groups, this
> implementation has the following properties, which might be considered
> short-comings.
>
> This particular implementation focuses on SCHED_OTHER tasks managed by CFS
> and allows coscheduling them. Interrupts as well as tasks in higher
> scheduling classes are currently out-of-scope: they are assumed to be
> negligible interruptions as far as coscheduling is concerned and they do
> *not* cause a preemption of a whole group. This implementation could be
> extended to cover higher scheduling classes. Interrupts, however, are an
> orthogonal issue.
>
> The collective context switch from one coscheduled set of tasks to another
> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
> that all tasks of the previous set have stopped executing before any task
> of the next set starts executing, an additional hand-shake/barrier needs to
> be added.
>
The leader doesn't kick the other cpus _immediately_ to switch to a
different cosched group. So threads from the previous cosched group will keep
running on other HTs till their sched_slice is over (in the worst case). Can
this still keep the window of L1TF vulnerability open?

2018-10-26 23:45:46

by Jan H. Schönherr

[permalink] [raw]
Subject: [RFC 00/60] Coscheduling for Linux

On 19/10/2018 02.26, Subhra Mazumdar wrote:
> Hi Jan,

Hi. Sorry for the delay.

> On 9/7/18 2:39 PM, Jan H. Schönherr wrote:
>> The collective context switch from one coscheduled set of tasks to another
>> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
>> that all tasks of the previous set have stopped executing before any task
>> of the next set starts executing, an additional hand-shake/barrier needs to
>> be added.
>>
> Do you know how long the delay is? I.e., what is the overlap time when a
> thread of the new group starts executing on one HT while there is still a
> thread of another group running on the other HT?

The delay is roughly equivalent to the IPI latency, if we're just talking
about coscheduling at SMT level: one sibling decides to schedule another
group, sends an IPI to the other sibling(s), and may already start
executing a task of that other group, before the IPI is received on the
other end.
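
In kernel terms, that "kick" is just asking each sibling to reschedule and
not waiting for it; roughly like this (a sketch, not the actual patch code):

static void kick_siblings(const struct cpumask *siblings, int this_cpu)
{
        int cpu;

        for_each_cpu(cpu, siblings) {
                if (cpu == this_cpu)
                        continue;
                /*
                 * resched_cpu() marks the sibling's current task for
                 * rescheduling and sends the reschedule IPI; we do not
                 * wait for the sibling to act on it.
                 */
                resched_cpu(cpu);
        }
}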

Now, there are some things that may delay processing an IPI, but in those
cases the target CPU isn't executing user code.

I've yet to produce some current numbers for SMT-only coscheduling. An
older ballpark number I have is about 2 microseconds for the collective
context switch of one hierarchy level, but take that with a grain of salt.

Regards
Jan


2018-10-27 00:08:22

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 27/10/2018 01.05, Subhra Mazumdar wrote:
>
>
>> D) What can I *not* do with this?
>> ---------------------------------
>>
>> Besides the missing load-balancing within coscheduled task-groups, this
>> implementation has the following properties, which might be considered
>> short-comings.
>>
>> This particular implementation focuses on SCHED_OTHER tasks managed by CFS
>> and allows coscheduling them. Interrupts as well as tasks in higher
>> scheduling classes are currently out-of-scope: they are assumed to be
>> negligible interruptions as far as coscheduling is concerned and they do
>> *not* cause a preemption of a whole group. This implementation could be
>> extended to cover higher scheduling classes. Interrupts, however, are an
>> orthogonal issue.
>>
>> The collective context switch from one coscheduled set of tasks to another
>> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
>> that all tasks of the previous set have stopped executing before any task
>> of the next set starts executing, an additional hand-shake/barrier needs to
>> be added.
>>
> The leader doesn't kick the other cpus _immediately_ to switch to a
> different cosched group.

It does. (Or at least, it should, in case you found evidence that it does not.)

Specifically, the logic to not preempt the currently running task before
some minimum time has passed is without effect for a collective context
switch.

> So threads from the previous cosched group will keep
> running on other HTs till their sched_slice is over (in the worst case). Can
> this still keep the window of L1TF vulnerability open?

No. Per the above, the window due to the collective context switch should
not be as long as "the remaining time slice" but more towards the IPI
delay. During this window, tasks of different coscheduling groups may
execute simultaneously.

In addition (as mentioned in the quoted text above), there are more cases where
a task of a coscheduled group on one SMT sibling may execute simultaneously
with some other code not from the same coscheduled group: tasks in
scheduling classes higher than CFS, and interrupts -- as both of them
operate outside the scope of the coscheduler.

Regards
Jan

2018-10-29 22:54:44

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux


On 10/26/18 4:44 PM, Jan H. Schönherr wrote:
> On 19/10/2018 02.26, Subhra Mazumdar wrote:
>> Hi Jan,
> Hi. Sorry for the delay.
>
>> On 9/7/18 2:39 PM, Jan H. Schönherr wrote:
>>> The collective context switch from one coscheduled set of tasks to another
>>> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
>>> that all tasks of the previous set have stopped executing before any task
>>> of the next set starts executing, an additional hand-shake/barrier needs to
>>> be added.
>>>
>> Do you know how long the delay is? I.e., what is the overlap time when a
>> thread of the new group starts executing on one HT while there is still a
>> thread of another group running on the other HT?
> The delay is roughly equivalent to the IPI latency, if we're just talking
> about coscheduling at SMT level: one sibling decides to schedule another
> group, sends an IPI to the other sibling(s), and may already start
> executing a task of that other group, before the IPI is received on the
> other end.
Can you point to where the leader is sending the IPI to other siblings?

I did an experiment and the delay seems to be sub-microsecond. I ran 2
threads that just loop, in one cosched group, affinitized to the 2 HTs of
a core. Then another thread in a different cosched group starts running,
affinitized to the first HT of the same core. I timestamped just before
context_switch() in __schedule() for the threads switching from one group
to another and from one group to idle. The following is what I get on cpus
1 and 45, which are siblings; cpu 1 is where the other thread preempts:

[  403.216625] cpu:45 sub1->idle:403216624579
[  403.238623] cpu:1 sub1->sub2:403238621585
[  403.238624] cpu:45 sub1->idle:403238621787
[  403.260619] cpu:1 sub1->sub2:403260619182
[  403.260620] cpu:45 sub1->idle:403260619413
[  403.282617] cpu:1 sub1->sub2:403282617157
[  403.282618] cpu:45 sub1->idle:403282617317
..

Not sure why the first switch to idle happened. But from then onwards,
the difference in timestamps is less than a microsecond. This is just a crude
way to get a sense of the delay; it may not be exact.
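
The instrumentation is nothing fancy; essentially something along these
lines right before the call to context_switch() in __schedule() (paraphrased,
the exact format and idle handling may have differed):

        printk("cpu:%d %s->%s:%llu\n", smp_processor_id(),
               prev->comm, is_idle_task(next) ? "idle" : next->comm,
               (unsigned long long)sched_clock());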

Thanks,
Subhra
>
> Now, there are some things that may delay processing an IPI, but in those
> cases the target CPU isn't executing user code.
>
> I've yet to produce some current numbers for SMT-only coscheduling. An
> older ballpark number I have is about 2 microseconds for the collective
> context switch of one hierarchy level, but take that with a grain of salt.
>
> Regards
> Jan
>

2018-11-02 22:14:18

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On 17.09.2018 [13:33:15 +0200], Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
> > On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
> > > On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
>
> > >> B) Why would I want this?
> > >
> > >> In the L1TF context, it prevents other applications from
> > >> loading additional data into the L1 cache, while one
> > >> application tries to leak data.
> > >
> > > That is the whole and only reason you did this;
> > It really isn't. But as your mind seems made up, I'm not going to
> > bother to argue.
>
> > >> D) What can I *not* do with this?
> > >> ---------------------------------
> > >>
> > >> Besides the missing load-balancing within coscheduled
> > >> task-groups, this implementation has the following properties,
> > >> which might be considered short-comings.
> > >>
> > >> This particular implementation focuses on SCHED_OTHER tasks
> > >> managed by CFS and allows coscheduling them. Interrupts as well
> > >> as tasks in higher scheduling classes are currently out-of-scope:
> > >> they are assumed to be negligible interruptions as far as
> > >> coscheduling is concerned and they do *not* cause a preemption of
> > >> a whole group. This implementation could be extended to cover
> > >> higher scheduling classes. Interrupts, however, are an orthogonal
> > >> issue.
> > >>
> > >> The collective context switch from one coscheduled set of tasks
> > >> to another -- while fast -- is not atomic. If a use-case needs
> > >> the absolute guarantee that all tasks of the previous set have
> > >> stopped executing before any task of the next set starts
> > >> executing, an additional hand-shake/barrier needs to be added.
> > >
> > > IOW it's completely friggin useless for L1TF.
> >
> > Do you believe me now, that L1TF is not "the whole and only reason"
> > I did this? :D
>
> You did mention this work first to me in the context of L1TF, so I might
> have jumped to conclusions here.
>
> Also, I have, of course, been looking at (SMT) co-scheduling,
> specifically in the context of L1TF, myself. I came up with a vastly
> different approach. Tim - where are we on getting some of that posted?
>
> Note; that even though I wrote much of that code, I don't particularly
> like it either :-)

Did your approach get posted to LKML? I don't think I ever saw it, and
I don't see it on lore. Could it be posted as an RFC, even if it is not yet
suitable for upstreaming, just for comparison?

Thanks!
-Nish

2018-11-24 08:43:33

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On Tue, Sep 18, 2018 at 03:22:13PM +0200, Jan H. Schönherr wrote:
> On 09/17/2018 11:48 AM, Peter Zijlstra wrote:
> > Right, so the whole bandwidth thing becomes a pain; the simplest
> > solution is to detect the throttle at task-pick time, dequeue and try
> > again. But that is indeed quite horrible.
> >
> > I'm not quite sure how this will play out.
> >
> > Anyway, if we pull off this flattening feat, then you can no longer use
> > the hierarchy for this co-scheduling stuff.
>
> Yeah. I might be a bit biased towards keeping or at least not fully throwing away
> the nesting of CFS runqueues. ;)

One detail here: is the hierarchical task group a strong requirement for cosched,
or could you live with it flattened in the end?

2018-11-24 08:43:53

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 00/60] Coscheduling for Linux

On Thu, Sep 27, 2018 at 11:36:34AM -0700, Subhra Mazumdar wrote:
>
>
> >On 09/26/2018 02:58 AM, Jan H. Schönherr wrote:
> >On 09/17/2018 02:25 PM, Peter Zijlstra wrote:
> >>On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
> >>
> >>>Assuming, there is a cgroup-less solution that can prevent simultaneous
> >>>execution of tasks on a core, when they're not supposed to. How would you
> >>>tell the scheduler, which tasks these are?
> >>Specifically for L1TF I hooked into/extended KVM's preempt_notifier
> >>registration interface, which tells us which tasks are VCPUs and to
> >>which VM they belong.
> >>
> >>But if we want to actually expose this to userspace, we can either do a
> >>prctl() or extend struct sched_attr.
> >Both, Peter and Subhra, seem to prefer an interface different than cgroups
> >to specify what to coschedule.
> >
> >Can you provide some extra motivation for me, why you feel that way?
> >(ignoring the current scalability issues with the cpu group controller)
> >
> >
> >After all, cgroups were designed to create arbitrary groups of tasks and
> >to attach functionality to those groups.
> >
> >If we were to introduce a different interface to control that, we'd need to
> >introduce a whole new group concept, so that you make tasks part of some
> >group while at the same time preventing unauthorized tasks from joining a
> >group.
> >
> >
> >I currently don't see any wins, just a loss in flexibility.
> >
> >Regards
> >Jan
> I think cgroups will get the job done for any use case. But we have,
> e.g., affinity control via both sched_setaffinity and cgroup cpusets. It
> would be good to have an alternative way to specify co-scheduling too, for
> those who don't want to use cgroups for some reason. It can be added later
> on, though; only how one will override the other will need to be sorted out.

I kind of agree with Jan here that this is just going to add yet another task
group mechanism, very similar to the existing one, with runqueues inside and all.

Can you imagine kernel/sched/fair.c now dealing with both group implementations?
What happens when cgroup task groups and cosched sched groups don't match wrt.
their tasks, their priorities, etc.?

I understand the cgroup task group mechanism has become infamous. But it may be
a better idea in the long run to fix it.

2018-12-04 13:24:54

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux)

On 23/11/2018 17.51, Frederic Weisbecker wrote:
> On Tue, Sep 18, 2018 at 03:22:13PM +0200, Jan H. Schönherr wrote:
>> On 09/17/2018 11:48 AM, Peter Zijlstra wrote:
>>> Right, so the whole bandwidth thing becomes a pain; the simplest
>>> solution is to detect the throttle at task-pick time, dequeue and try
>>> again. But that is indeed quite horrible.
>>>
>>> I'm not quite sure how this will play out.
>>>
>>> Anyway, if we pull off this flattening feat, then you can no longer use
>>> the hierarchy for this co-scheduling stuff.
>>
>> Yeah. I might be a bit biased towards keeping or at least not fully throwing away
>> the nesting of CFS runqueues. ;)
>
> One detail here: is the hierarchical task group a strong requirement for cosched,
> or could you live with it flattened in the end?

Currently, it is a strong requirement.

As mentioned at the bottom of https://lkml.org/lkml/2018/10/19/859 it should be
possible to pull the hierarchical aspect out of CFS and implement it one level
higher. But that would be a major re-design of everything.

I use the hierarchical aspect to a) keep coscheduled groups in separate sets of runqueues,
so that it is easy to select/balance tasks within a particular group; and b) to implement
per-core, per-node, per-system runqueues that represent larger fractions of the system,
which then fan out into per-CPU runqueues (eventually).

Regards
Jan