2019-10-30 19:12:29

by Vineeth Remanan Pillai

Subject: [RFC PATCH v4 00/19] Core scheduling v4

Fourth iteration of the Core-Scheduling feature.

This version aims mostly at addressing the vruntime comparison
issues seen in v3. The main issue in v3 was the starvation of
interactive tasks when competing with CPU-intensive tasks; that issue
is now mitigated to a large extent.

We have tested and verified that incompatible processes are not
selected during scheduling. In terms of performance, the impact
depends on the workload:
- on CPU-intensive applications that use all the logical CPUs with
  SMT enabled, enabling core scheduling performs better than nosmt.
- on mixed workloads with considerable I/O compared to CPU usage,
  nosmt seems to perform better than core scheduling.

v4 is rebased on top of 5.3.5(dc073f193b70):
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5

Changes in v4
-------------
- Implement a core wide min_vruntime for vruntime comparison of tasks
across cpus in a core.
- Fixes a typo bug in setting the forced_idle cpu.

Changes in v3
-------------
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different CPUs
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

TODO
----
- Decide on the API for exposing the feature to userland
- Investigate the source of the overhead even when no tasks are tagged:
https://lkml.org/lkml/2019/10/29/242
- Investigate the performance scaling issue when we have a high number of
tagged threads: https://lkml.org/lkml/2019/10/29/248
- Try to optimize the performance for IO-demanding applications:
https://lkml.org/lkml/2019/10/29/261

---

Aaron Lu (3):
sched/fair: wrapper for cfs_rq->min_vruntime
sched/fair: core wide vruntime comparison
sched/fair : Wake up forced idle siblings if needed

Peter Zijlstra (16):
stop_machine: Fix stop_cpus_in_progress ordering
sched: Fix kerneldoc comment for ia64_set_curr_task
sched: Wrap rq::lock access
sched/{rt,deadline}: Fix set_next_task vs pick_next_task
sched: Add task_struct pointer to sched_class::set_curr_task
sched/fair: Export newidle_balance()
sched: Allow put_prev_task() to drop rq->lock
sched: Rework pick_next_task() slow-path
sched: Introduce sched_class::pick_task()
sched: Core-wide rq->lock
sched: Basic tracking of matching tasks
sched: A quick and dirty cgroup tagging interface
sched: Add core wide task selection and scheduling.
sched/fair: Add a few assertions
sched: Trivial forced-newidle balancer
sched: Debug bits...

include/linux/sched.h | 9 +-
kernel/Kconfig.preempt | 6 +
kernel/sched/core.c | 847 +++++++++++++++++++++++++++++++++++++--
kernel/sched/cpuacct.c | 12 +-
kernel/sched/deadline.c | 99 +++--
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 346 +++++++++++-----
kernel/sched/idle.c | 42 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 96 ++---
kernel/sched/sched.h | 246 +++++++++---
kernel/sched/stop_task.c | 35 +-
kernel/sched/topology.c | 4 +-
kernel/stop_machine.c | 2 +
14 files changed, 1399 insertions(+), 351 deletions(-)

--
2.17.1


2019-10-30 19:12:49

by Vineeth Remanan Pillai

Subject: [RFC PATCH v4 05/19] sched: Add task_struct pointer to sched_class::set_curr_task

From: Peter Zijlstra <[email protected]>

In preparation for further separating pick_next_task() and
set_curr_task(), we have to pass the actual task into it; while there,
rename the thing to better pair with put_prev_task().
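
A condensed view of the interface change (taken from the sched.h hunk
below, shown here only as a summary, not as new code):

    /* old: the hook implicitly operated on rq->curr */
    void (*set_curr_task)(struct rq *rq);

    /* new: the task is passed explicitly, pairing with put_prev_task() */
    void (*set_next_task)(struct rq *rq, struct task_struct *p);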

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/core.c | 12 ++++++------
kernel/sched/deadline.c | 7 +------
kernel/sched/fair.c | 17 ++++++++++++++---
kernel/sched/idle.c | 27 +++++++++++++++------------
kernel/sched/rt.c | 7 +------
kernel/sched/sched.h | 8 +++++---
kernel/sched/stop_task.c | 17 +++++++----------
7 files changed, 49 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a9c93cf71f5f..95cc10ecc7c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1494,7 +1494,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
if (running)
- set_curr_task(rq, p);
+ set_next_task(rq, p);
}

/*
@@ -4276,7 +4276,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
if (queued)
enqueue_task(rq, p, queue_flag);
if (running)
- set_curr_task(rq, p);
+ set_next_task(rq, p);

check_class_changed(rq, p, prev_class, oldprio);
out_unlock:
@@ -4343,7 +4343,7 @@ void set_user_nice(struct task_struct *p, long nice)
resched_curr(rq);
}
if (running)
- set_curr_task(rq, p);
+ set_next_task(rq, p);
out_unlock:
task_rq_unlock(rq, p, &rf);
}
@@ -4786,7 +4786,7 @@ static int __sched_setscheduler(struct task_struct *p,
enqueue_task(rq, p, queue_flags);
}
if (running)
- set_curr_task(rq, p);
+ set_next_task(rq, p);

check_class_changed(rq, p, prev_class, oldprio);

@@ -5975,7 +5975,7 @@ void sched_setnuma(struct task_struct *p, int nid)
if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
if (running)
- set_curr_task(rq, p);
+ set_next_task(rq, p);
task_rq_unlock(rq, p, &rf);
}
#endif /* CONFIG_NUMA_BALANCING */
@@ -6856,7 +6856,7 @@ void sched_move_task(struct task_struct *tsk)
if (queued)
enqueue_task(rq, tsk, queue_flags);
if (running)
- set_curr_task(rq, tsk);
+ set_next_task(rq, tsk);

task_rq_unlock(rq, tsk, &rf);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4b53e7c696c8..38b45f2f890b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1812,11 +1812,6 @@ static void task_fork_dl(struct task_struct *p)
*/
}

-static void set_curr_task_dl(struct rq *rq)
-{
- set_next_task_dl(rq, rq->curr);
-}
-
#ifdef CONFIG_SMP

/* Only try algorithms three times */
@@ -2396,6 +2391,7 @@ const struct sched_class dl_sched_class = {

.pick_next_task = pick_next_task_dl,
.put_prev_task = put_prev_task_dl,
+ .set_next_task = set_next_task_dl,

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_dl,
@@ -2406,7 +2402,6 @@ const struct sched_class dl_sched_class = {
.task_woken = task_woken_dl,
#endif

- .set_curr_task = set_curr_task_dl,
.task_tick = task_tick_dl,
.task_fork = task_fork_dl,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1e8b811ce1f..a58e5de1732d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10180,9 +10180,19 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
* This routine is mostly called to set cfs_rq->curr field when a task
* migrates between groups/classes.
*/
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
{
- struct sched_entity *se = &rq->curr->se;
+ struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+ if (task_on_rq_queued(p)) {
+ /*
+ * Move the next running task to the front of the list, so our
+ * cfs_tasks list becomes MRU one.
+ */
+ list_move(&se->group_node, &rq->cfs_tasks);
+ }
+#endif

for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10453,7 +10463,9 @@ const struct sched_class fair_sched_class = {
.check_preempt_curr = check_preempt_wakeup,

.pick_next_task = pick_next_task_fair,
+
.put_prev_task = put_prev_task_fair,
+ .set_next_task = set_next_task_fair,

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
@@ -10466,7 +10478,6 @@ const struct sched_class fair_sched_class = {
.set_cpus_allowed = set_cpus_allowed_common,
#endif

- .set_curr_task = set_curr_task_fair,
.task_tick = task_tick_fair,
.task_fork = task_fork_fair,

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 0d2f83899c83..3ff4889196e1 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -374,14 +374,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
resched_curr(rq);
}

+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+{
+}
+
+static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+{
+ update_idle_core(rq);
+ schedstat_inc(rq->sched_goidle);
+}
+
static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
+ struct task_struct *next = rq->idle;
+
put_prev_task(rq, prev);
- update_idle_core(rq);
- schedstat_inc(rq->sched_goidle);
+ set_next_task_idle(rq, next);

- return rq->idle;
+ return next;
}

/*
@@ -397,10 +408,6 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
raw_spin_lock_irq(rq_lockp(rq));
}

-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
-{
-}
-
/*
* scheduler tick hitting a task of our scheduling class.
*
@@ -413,10 +420,6 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
{
}

-static void set_curr_task_idle(struct rq *rq)
-{
-}
-
static void switched_to_idle(struct rq *rq, struct task_struct *p)
{
BUG();
@@ -451,13 +454,13 @@ const struct sched_class idle_sched_class = {

.pick_next_task = pick_next_task_idle,
.put_prev_task = put_prev_task_idle,
+ .set_next_task = set_next_task_idle,

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_idle,
.set_cpus_allowed = set_cpus_allowed_common,
#endif

- .set_curr_task = set_curr_task_idle,
.task_tick = task_tick_idle,

.get_rr_interval = get_rr_interval_idle,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f4590ac6fd92..a857945772d1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2355,11 +2355,6 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
}
}

-static void set_curr_task_rt(struct rq *rq)
-{
- set_next_task_rt(rq, rq->curr);
-}
-
static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
{
/*
@@ -2381,6 +2376,7 @@ const struct sched_class rt_sched_class = {

.pick_next_task = pick_next_task_rt,
.put_prev_task = put_prev_task_rt,
+ .set_next_task = set_next_task_rt,

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_rt,
@@ -2392,7 +2388,6 @@ const struct sched_class rt_sched_class = {
.switched_from = switched_from_rt,
#endif

- .set_curr_task = set_curr_task_rt,
.task_tick = task_tick_rt,

.get_rr_interval = get_rr_interval_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index af2a9972149a..657831e26008 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1715,6 +1715,7 @@ struct sched_class {
struct task_struct *prev,
struct rq_flags *rf);
void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+ void (*set_next_task)(struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
@@ -1729,7 +1730,6 @@ struct sched_class {
void (*rq_offline)(struct rq *rq);
#endif

- void (*set_curr_task)(struct rq *rq);
void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
void (*task_fork)(struct task_struct *p);
void (*task_dead)(struct task_struct *p);
@@ -1759,12 +1759,14 @@ struct sched_class {

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
+ WARN_ON_ONCE(rq->curr != prev);
prev->sched_class->put_prev_task(rq, prev);
}

-static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
+static inline void set_next_task(struct rq *rq, struct task_struct *next)
{
- curr->sched_class->set_curr_task(rq);
+ WARN_ON_ONCE(rq->curr != next);
+ next->sched_class->set_next_task(rq, next);
}

#ifdef CONFIG_SMP
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index c183b790ca54..47a3d2a18a9a 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,6 +23,11 @@ check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
/* we're never preempted */
}

+static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+{
+ stop->se.exec_start = rq_clock_task(rq);
+}
+
static struct task_struct *
pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
@@ -32,8 +37,7 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
return NULL;

put_prev_task(rq, prev);
-
- stop->se.exec_start = rq_clock_task(rq);
+ set_next_task_stop(rq, stop);

return stop;
}
@@ -86,13 +90,6 @@ static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
{
}

-static void set_curr_task_stop(struct rq *rq)
-{
- struct task_struct *stop = rq->stop;
-
- stop->se.exec_start = rq_clock_task(rq);
-}
-
static void switched_to_stop(struct rq *rq, struct task_struct *p)
{
BUG(); /* its impossible to change to this class */
@@ -128,13 +125,13 @@ const struct sched_class stop_sched_class = {

.pick_next_task = pick_next_task_stop,
.put_prev_task = put_prev_task_stop,
+ .set_next_task = set_next_task_stop,

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_stop,
.set_cpus_allowed = set_cpus_allowed_common,
#endif

- .set_curr_task = set_curr_task_stop,
.task_tick = task_tick_stop,

.get_rr_interval = get_rr_interval_stop,
--
2.17.1

2019-10-30 19:12:50

by Vineeth Remanan Pillai

Subject: [RFC PATCH v4 18/19] sched/fair: core wide vruntime comparison

From: Aaron Lu <[email protected]>

This patch provides a vruntime-based way to compare the priority of two
cfs tasks, be they on the same CPU or on different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, the root
level sched_entities to which the two tasks belong will be used to do
the comparison.

An ugly illustration for the cross CPU case:

     cpu0            cpu1
   /  |  \          /  |  \
 se1 se2 se3      se4 se5 se6
     /  \                 /  \
   se21 se22            se61 se62

Assume CPU0 and CPU1 are SMT siblings, task A's se is se21 and task
B's se is se61. To compare the priority of task A and task B, we
compare the vruntime of se2 and se6: whichever is smaller wins.

To make this work, the root level se should have a common cfs_rq min
vruntime, which I call the core cfs_rq min vruntime.

When we adjust the min_vruntime of rq->core, we need to propagate
that down the tree so as not to cause starvation of existing tasks
based on their previous vruntime.
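
As a standalone illustration (a userspace sketch, not part of the
patch): once both root-level vruntimes are measured against the shared
core cfs_rq min vruntime, "smaller vruntime wins" is just a signed
delta, which also stays correct across u64 wraparound:

    #include <stdio.h>
    #include <stdint.h>

    /* Task A wins when its vruntime is not larger than task B's. */
    static int a_wins(uint64_t vruntime_a, uint64_t vruntime_b)
    {
            return (int64_t)(vruntime_a - vruntime_b) <= 0;
    }

    int main(void)
    {
            /* hypothetical root-level vruntimes of se2 (task A) and se6 (task B) */
            printf("%d\n", a_wins(100, 120));           /* 1: A wins */
            printf("%d\n", a_wins(UINT64_MAX - 5, 10)); /* 1: A is just behind B despite wraparound */
            return 0;
    }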

Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/core.c | 15 +------
kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 2 +
3 files changed, 102 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 18fbaa85ec30..09e5c77e54c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -117,19 +117,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);

- if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 vruntime = b->se.vruntime;
-
- /*
- * Normalize the vruntime if tasks are in different cpus.
- */
- if (task_cpu(a) != task_cpu(b)) {
- vruntime -= task_cfs_rq(b)->min_vruntime;
- vruntime += task_cfs_rq(a)->min_vruntime;
- }
-
- return !((s64)(a->se.vruntime - vruntime) <= 0);
- }
+ if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+ return cfs_prio_less(a, b);

return false;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab32b22b0574..e8dd78a8c54d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -450,9 +450,105 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#endif /* CONFIG_FAIR_GROUP_SCHED */

+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ return &rq_of(cfs_rq)->core->cfs;
+}
+
static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
{
- return cfs_rq->min_vruntime;
+ if (!sched_core_enabled(rq_of(cfs_rq)))
+ return cfs_rq->min_vruntime;
+
+ if (is_root_cfs_rq(cfs_rq))
+ return core_cfs_rq(cfs_rq)->min_vruntime;
+ else
+ return cfs_rq->min_vruntime;
+}
+
+static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
+{
+ struct sched_entity *se, *next;
+
+ if (!cfs_rq)
+ return;
+
+ cfs_rq->min_vruntime -= delta;
+ rbtree_postorder_for_each_entry_safe(se, next,
+ &cfs_rq->tasks_timeline.rb_root, run_node) {
+ if (se->vruntime > delta)
+ se->vruntime -= delta;
+ if (se->my_q)
+ coresched_adjust_vruntime(se->my_q, delta);
+ }
+}
+
+static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+ struct cfs_rq *cfs_rq_core;
+
+ if (!sched_core_enabled(rq_of(cfs_rq)))
+ return;
+
+ if (!is_root_cfs_rq(cfs_rq))
+ return;
+
+ cfs_rq_core = core_cfs_rq(cfs_rq);
+ if (cfs_rq_core != cfs_rq &&
+ cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
+ u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
+ coresched_adjust_vruntime(cfs_rq_core, delta);
+ }
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+ struct sched_entity *sea = &a->se;
+ struct sched_entity *seb = &b->se;
+ bool samecpu = task_cpu(a) == task_cpu(b);
+ struct task_struct *p;
+ s64 delta;
+
+ if (samecpu) {
+ /* vruntime is per cfs_rq */
+ while (!is_same_group(sea, seb)) {
+ int sea_depth = sea->depth;
+ int seb_depth = seb->depth;
+
+ if (sea_depth >= seb_depth)
+ sea = parent_entity(sea);
+ if (sea_depth <= seb_depth)
+ seb = parent_entity(seb);
+ }
+
+ delta = (s64)(sea->vruntime - seb->vruntime);
+ goto out;
+ }
+
+ /* crosscpu: compare root level se's vruntime to decide priority */
+ while (sea->parent)
+ sea = sea->parent;
+ while (seb->parent)
+ seb = seb->parent;
+ delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+ p = delta > 0 ? b : a;
+ trace_printk("picked %s/%d %s: %Ld %Ld %Ld\n", p->comm, p->pid,
+ samecpu ? "samecpu" : "crosscpu",
+ sea->vruntime, seb->vruntime, delta);
+
+ return delta > 0;
}

static __always_inline
@@ -512,6 +608,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)

/* ensure we never gain time by being placed backwards. */
cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+ update_core_cfs_rq_min_vruntime(cfs_rq);
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 311ab1e2a00e..4844e703298a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2537,3 +2537,5 @@ static inline bool sched_energy_enabled(void)
static inline bool sched_energy_enabled(void) { return false; }

#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
--
2.17.1

2019-10-30 19:13:11

by Vineeth Remanan Pillai

Subject: [RFC PATCH v4 08/19] sched: Rework pick_next_task() slow-path

From: Peter Zijlstra <[email protected]>

Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast path is fair-only, the other classes will always see
pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.
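
For reference, the resulting slow-path shape, condensed from the
core.c hunk below (no new logic here):

    restart:
            /*
             * put_prev_task() first so RT/DL can pull, then balance if
             * the runqueue went empty; the state is now stable to pick from.
             */
            prev->sched_class->put_prev_task(rq, prev, rf);
            if (!rq->nr_running)
                    newidle_balance(rq, rf);

            for_each_class(class) {
                    p = class->pick_next_task(rq, NULL, NULL);
                    if (p)
                            return p;
            }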

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/core.c | 19 ++++++++++++-------
kernel/sched/deadline.c | 30 ++----------------------------
kernel/sched/fair.c | 9 ++++++---
kernel/sched/idle.c | 4 +++-
kernel/sched/rt.c | 29 +----------------------------
kernel/sched/sched.h | 13 ++++++++-----
kernel/sched/stop_task.c | 3 ++-
7 files changed, 34 insertions(+), 73 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3fa7d37f2879..1f40211ff29a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3739,7 +3739,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

p = fair_sched_class.pick_next_task(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
- goto again;
+ goto restart;

/* Assumes fair_sched_class->next == idle_sched_class */
if (unlikely(!p))
@@ -3748,14 +3748,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
return p;
}

-again:
+restart:
+ /*
+ * Ensure that we put DL/RT tasks before the pick loop, such that they
+ * can PULL higher prio tasks when we lower the RQ 'priority'.
+ */
+ prev->sched_class->put_prev_task(rq, prev, rf);
+ if (!rq->nr_running)
+ newidle_balance(rq, rf);
+
for_each_class(class) {
- p = class->pick_next_task(rq, prev, rf);
- if (p) {
- if (unlikely(p == RETRY_TASK))
- goto again;
+ p = class->pick_next_task(rq, NULL, NULL);
+ if (p)
return p;
- }
}

/* The idle class should always have a runnable task: */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 416e06c97e98..0c7a51745813 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1729,39 +1729,13 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct task_struct *p;
struct dl_rq *dl_rq;

- dl_rq = &rq->dl;
-
- if (need_pull_dl_task(rq, prev)) {
- /*
- * This is OK, because current is on_cpu, which avoids it being
- * picked for load-balance and preemption/IRQs are still
- * disabled avoiding further scheduler activity on it and we're
- * being very careful to re-start the picking loop.
- */
- rq_unpin_lock(rq, rf);
- pull_dl_task(rq);
- rq_repin_lock(rq, rf);
- /*
- * pull_dl_task() can drop (and re-acquire) rq->lock; this
- * means a stop task can slip in, in which case we need to
- * re-start task selection.
- */
- if (rq->stop && task_on_rq_queued(rq->stop))
- return RETRY_TASK;
- }
+ WARN_ON_ONCE(prev || rf);

- /*
- * When prev is DL, we may throttle it in put_prev_task().
- * So, we update time before we check for dl_nr_running.
- */
- if (prev->sched_class == &dl_sched_class)
- update_curr_dl(rq);
+ dl_rq = &rq->dl;

if (unlikely(!dl_rq->dl_nr_running))
return NULL;

- put_prev_task(rq, prev);
-
dl_se = pick_next_dl_entity(rq, dl_rq);
BUG_ON(!dl_se);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ee57472d88e..850e235c9209 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6799,7 +6799,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED
- if (prev->sched_class != &fair_sched_class)
+ if (!prev || prev->sched_class != &fair_sched_class)
goto simple;

/*
@@ -6876,8 +6876,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
goto done;
simple:
#endif
-
- put_prev_task(rq, prev);
+ if (prev)
+ put_prev_task(rq, prev);

do {
se = pick_next_entity(cfs_rq, NULL);
@@ -6905,6 +6905,9 @@ done: __maybe_unused;
return p;

idle:
+ if (!rf)
+ return NULL;
+
new_tasks = newidle_balance(rq, rf);

/*
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5848ae3d5210..5bb8d44fbff9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -389,7 +389,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
{
struct task_struct *next = rq->idle;

- put_prev_task(rq, prev);
+ if (prev)
+ put_prev_task(rq, prev);
+
set_next_task_idle(rq, next);

return next;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a7d9656f70b5..09c317a916aa 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1554,38 +1554,11 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct task_struct *p;
struct rt_rq *rt_rq = &rq->rt;

- if (need_pull_rt_task(rq, prev)) {
- /*
- * This is OK, because current is on_cpu, which avoids it being
- * picked for load-balance and preemption/IRQs are still
- * disabled avoiding further scheduler activity on it and we're
- * being very careful to re-start the picking loop.
- */
- rq_unpin_lock(rq, rf);
- pull_rt_task(rq);
- rq_repin_lock(rq, rf);
- /*
- * pull_rt_task() can drop (and re-acquire) rq->lock; this
- * means a dl or stop task can slip in, in which case we need
- * to re-start task selection.
- */
- if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
- rq->dl.dl_nr_running))
- return RETRY_TASK;
- }
-
- /*
- * We may dequeue prev's rt_rq in put_prev_task().
- * So, we update time before rt_queued check.
- */
- if (prev->sched_class == &rt_sched_class)
- update_curr_rt(rq);
+ WARN_ON_ONCE(prev || rf);

if (!rt_rq->rt_queued)
return NULL;

- put_prev_task(rq, prev);
-
p = _pick_next_task_rt(rq);

set_next_task_rt(rq, p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8b86b24a3ccd..16c9654ac750 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1708,12 +1708,15 @@ struct sched_class {
void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

/*
- * It is the responsibility of the pick_next_task() method that will
- * return the next task to call put_prev_task() on the @prev task or
- * something equivalent.
+ * Both @prev and @rf are optional and may be NULL, in which case the
+ * caller must already have invoked put_prev_task(rq, prev, rf).
*
- * May return RETRY_TASK when it finds a higher prio class has runnable
- * tasks.
+ * Otherwise it is the responsibility of the pick_next_task() to call
+ * put_prev_task() on the @prev task or something equivalent, IFF it
+ * returns a next task.
+ *
+ * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
+ * higher prio class has runnable tasks.
*/
struct task_struct * (*pick_next_task)(struct rq *rq,
struct task_struct *prev,
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 8f414018d5e0..7e1cee4e65b2 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -33,10 +33,11 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
{
struct task_struct *stop = rq->stop;

+ WARN_ON_ONCE(prev || rf);
+
if (!stop || !task_on_rq_queued(stop))
return NULL;

- put_prev_task(rq, prev);
set_next_task_stop(rq, stop);

return stop;
--
2.17.1

2019-10-30 19:13:34

by Vineeth Remanan Pillai

Subject: [RFC PATCH v4 19/19] sched/fair : Wake up forced idle siblings if needed

From: Aaron Lu <[email protected]>

If the sibling of a forced-idle CPU has only one task, and that task
has used up its timeslice, then we should try to wake up the forced-idle
CPU to give the starving task on it a chance to run.
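
Roughly, the tick-time check added below boils down to the following
(a simplified restatement of resched_forceidle_sibling(); the actual
patch is additionally guarded by sched_core_enabled()):

    /* in task_tick_fair(), with curr being the lone runnable cfs task */
    if (rq->cfs.nr_running == 1 && __entity_slice_used(&curr->se)) {
            for_each_cpu(sibling_cpu, cpu_smt_mask(cpu_of(rq))) {
                    if (sibling_cpu == cpu_of(rq) || cpu_is_offline(sibling_cpu))
                            continue;
                    if (cpu_rq(sibling_cpu)->core_forceidle)
                            resched_curr(cpu_rq(sibling_cpu));
            }
    }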

Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8dd78a8c54d..9d4cc97d4dd8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4165,6 +4165,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_min_vruntime(cfs_rq);
}

+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+ return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+ sched_slice(cfs_rq_of(se), se);
+}
+
/*
* Preempt the current task with a newly woken task if needed:
*/
@@ -10052,6 +10059,34 @@ static void rq_offline_fair(struct rq *rq)

#endif /* CONFIG_SMP */

+#ifdef CONFIG_SCHED_CORE
+/*
+ * If runqueue has only one task which used up its slice and
+ * if the sibling is forced idle, then trigger schedule
+ * to give forced idle task a chance.
+ */
+static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
+{
+ int cpu = cpu_of(rq), sibling_cpu;
+ if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+ return;
+
+ for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+ struct rq *sibling_rq;
+ if (sibling_cpu == cpu)
+ continue;
+ if (cpu_is_offline(sibling_cpu))
+ continue;
+
+ sibling_rq = cpu_rq(sibling_cpu);
+ if (sibling_rq->core_forceidle) {
+ resched_curr(sibling_rq);
+ }
+ }
+}
+#endif
+
+
/*
* scheduler tick hitting a task of our scheduling class.
*
@@ -10075,6 +10110,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+
+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(rq))
+ resched_forceidle_sibling(rq, &curr->se);
+#endif
}

/*
--
2.17.1

2019-10-30 19:14:01

by Vineeth Remanan Pillai

Subject: [RFC PATCH v4 10/19] sched: Core-wide rq->lock

From: Peter Zijlstra <[email protected]>

Introduce the basic infrastructure to have a core-wide rq->lock.
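
The net effect in a nutshell (this is the rq_lockp() helper from the
sched.h hunk below, quoted here as a summary rather than new code):
with core scheduling enabled, every SMT sibling's rq lock pointer
resolves to the lock embedded in the shared rq->core, so the siblings
serialize on a single per-core lock:

    static inline raw_spinlock_t *rq_lockp(struct rq *rq)
    {
            if (sched_core_enabled(rq))
                    return &rq->core->__lock;       /* one lock per core */

            return &rq->__lock;                     /* one lock per rq */
    }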

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
kernel/Kconfig.preempt | 6 +++
kernel/sched/core.c | 113 ++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 31 +++++++++++
3 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index deff97217496..63493a1b9ff6 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,9 @@ config PREEMPT_COUNT
config PREEMPTION
bool
select PREEMPT_COUNT
+
+config SCHED_CORE
+ bool
+ default y
+ depends on SCHED_SMT
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1f40211ff29a..0427cf610d6a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,70 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;

+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+ bool enabled = !!(unsigned long)data;
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ cpu_rq(cpu)->core_enabled = enabled;
+
+ return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+ // XXX verify there are no cookie tasks (yet)
+
+ static_branch_enable(&__sched_core_enabled);
+ stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+ // XXX verify there are no cookie tasks (left)
+
+ stop_machine(__sched_core_stopper, (void *)false, NULL);
+ static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+ mutex_lock(&sched_core_mutex);
+ if (!sched_core_count++)
+ __sched_core_enable();
+ mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+ mutex_lock(&sched_core_mutex);
+ if (!--sched_core_count)
+ __sched_core_disable();
+ mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
/*
* __task_rq_lock - lock the rq @p resides on.
*/
@@ -6210,8 +6274,15 @@ int sched_cpu_activate(unsigned int cpu)
/*
* When going up, increment the number of cores with SMT present.
*/
- if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
static_branch_inc_cpuslocked(&sched_smt_present);
+#ifdef CONFIG_SCHED_CORE
+ if (static_branch_unlikely(&__sched_core_enabled)) {
+ rq->core_enabled = true;
+ }
+#endif
+ }
+
#endif
set_cpu_active(cpu, true);

@@ -6259,8 +6330,16 @@ int sched_cpu_deactivate(unsigned int cpu)
/*
* When going down, decrement the number of cores with SMT present.
*/
- if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
+#ifdef CONFIG_SCHED_CORE
+ struct rq *rq = cpu_rq(cpu);
+ if (static_branch_unlikely(&__sched_core_enabled)) {
+ rq->core_enabled = false;
+ }
+#endif
static_branch_dec_cpuslocked(&sched_smt_present);
+
+ }
#endif

if (!sched_smp_initialized)
@@ -6285,6 +6364,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)

int sched_cpu_starting(unsigned int cpu)
{
+#ifdef CONFIG_SCHED_CORE
+ const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+ struct rq *rq, *core_rq = NULL;
+ int i;
+
+ for_each_cpu(i, smt_mask) {
+ rq = cpu_rq(i);
+ if (rq->core && rq->core == rq)
+ core_rq = rq;
+ }
+
+ if (!core_rq)
+ core_rq = cpu_rq(cpu);
+
+ for_each_cpu(i, smt_mask) {
+ rq = cpu_rq(i);
+
+ WARN_ON_ONCE(rq->core && rq->core != core_rq);
+ rq->core = core_rq;
+ }
+#endif /* CONFIG_SCHED_CORE */
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -6313,6 +6414,9 @@ int sched_cpu_dying(unsigned int cpu)
update_max_interval();
nohz_balance_exit_idle(rq);
hrtick_clear(rq);
+#ifdef CONFIG_SCHED_CORE
+ rq->core = NULL;
+#endif
return 0;
}
#endif
@@ -6507,6 +6611,11 @@ void __init sched_init(void)
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+ rq->core = NULL;
+ rq->core_enabled = 0;
+#endif
}

set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 44f45aad3f97..2ba68ab093ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -991,6 +991,12 @@ struct rq {
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
#endif
+
+#ifdef CONFIG_SCHED_CORE
+ /* per rq */
+ struct rq *core;
+ unsigned int core_enabled;
+#endif
};

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1018,11 +1024,36 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+ return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+ if (sched_core_enabled(rq))
+ return &rq->core->__lock;
+
+ return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+ return false;
+}
+
static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
return &rq->__lock;
}

+#endif /* CONFIG_SCHED_CORE */
+
#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);

--
2.17.1

2019-10-31 11:44:43

by Li, Aubrey

Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2019/10/31 2:33, Vineeth Remanan Pillai wrote:
> Fourth iteration of the Core-Scheduling feature.
>
> This version was aiming mostly at addressing the vruntime comparison
> issues with v3. The main issue seen in v3 was the starvation of
> interactive tasks when competing with cpu intensive tasks. This issue
> is mitigated to a large extent.
>
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
> SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
> nosmt seems to perform better than core scheduling.
>
> v4 is rebased on top of 5.3.5(dc073f193b70):
> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5

Thanks for posting v4. I refreshed the data on my side. Since we have
been playing with Aaron's core vruntime patch for a while, there is no
surprise in the results.

Thanks,
-Aubrey

Environment setup
--------------------------
Skylake 8170 server, 2 NUMA nodes, 52 cores, 104 CPUs (HT on)

Case 1:
-------
cgroup1 workload, sysbench CPU mode (non AVX workload)
cgroup2 workload, gemmbench (AVX512 workload)

sysbench throughput result:
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% | no-SMT [std% / sem%] +/- cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
| 1/1 1269.1 [ 0.1%/ 0.0%] 1.9% | 1272.4 [ 0.1%/ 0.0%] 0.3% 1.9% | 1272.0 [ 0.1%/ 0.0%] 0.2% 3.9% |
| 2/2 2466.9 [ 0.6%/ 0.1%] 3.9% | 2534.2 [ 0.6%/ 0.1%] 2.7% 3.8% | 2511.9 [ 0.2%/ 0.0%] 1.8% 7.7% |
| 4/4 4725.2 [ 0.3%/ 0.0%] 7.7% | 4806.3 [ 0.2%/ 0.0%] 1.7% 7.7% | 4786.7 [ 0.9%/ 0.1%] 1.3% 14.6% |
| 8/8 9353.4 [ 0.1%/ 0.0%] 14.6% | 9357.4 [ 0.1%/ 0.0%] 0.0% 14.6% | 9352.3 [ 0.1%/ 0.0%] -0.0% 30.0% |
| 16/16 17543.1 [ 1.0%/ 0.1%] 30.1% | 18120.7 [ 0.2%/ 0.0%] 3.3% 30.1% | 17864.8 [ 1.2%/ 0.1%] 1.8% 60.1% |
| 32/32 26968.8 [ 3.9%/ 0.4%] 60.1% | 29448.9 [ 3.5%/ 0.3%] 9.2% 59.9% | 25308.1 [10.7%/ 0.9%] -6.2% 97.7% |
| 48/48 30466.2 [10.4%/ 1.0%] 89.3% | 38624.4 [ 4.2%/ 0.4%] 26.8% 89.1% | 26891.2 [14.8%/ 1.0%] -11.7% 99.5% |
| 64/64 37909.3 [11.1%/ 1.1%] 97.7% | 41671.7 [ 8.7%/ 0.9%] 9.9% 97.6% | 25898.3 [16.2%/ 1.0%] -31.7% 100.0% |
|128/128 39479.4 [24.6%/ 2.5%] 100.0% | 42119.6 [ 6.3%/ 0.6%] 6.7% 99.5% | 26830.1 [16.5%/ 1.1%] -32.0% 100.0% |
|256/256 42602.1 [16.4%/ 1.6%] 100.0% | 40041.3 [ 7.0%/ 0.7%] -6.0% 99.7% | 27634.7 [15.4%/ 1.1%] -35.1% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

Case 2
------
cgroup1 workload, sysbench MySQL (non AVX workload)
cgroup2 workload, gemmbench (AVX512 workload)

sysbench throughput result:
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% | no-SMT [std% / sem%] +/- cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
| 1/1 1018.2 [ 1.0%/ 0.1%] 1.9% | 915.8 [ 0.9%/ 0.1%] -10.1% 1.9% | 994.0 [ 1.4%/ 0.2%] -2.4% 3.9% |
| 2/2 1941.2 [ 0.7%/ 0.1%] 3.9% | 1746.0 [ 0.5%/ 0.1%] -10.1% 3.9% | 1946.2 [ 0.8%/ 0.1%] 0.3% 7.8% |
| 4/4 3763.9 [ 0.5%/ 0.0%] 7.8% | 3426.0 [ 1.5%/ 0.2%] -9.0% 7.8% | 3745.1 [ 1.1%/ 0.1%] -0.5% 15.6% |
| 8/8 7375.5 [ 1.3%/ 0.1%] 15.5% | 6647.1 [ 1.1%/ 0.1%] -9.9% 16.1% | 7368.4 [ 0.8%/ 0.1%] -0.1% 31.1% |
| 16/16 12990.3 [ 0.6%/ 0.1%] 31.1% | 10903.7 [ 1.9%/ 0.2%] -16.1% 32.0% | 12082.6 [ 4.7%/ 0.5%] -7.0% 62.9% |
| 32/32 18238.1 [ 6.1%/ 0.6%] 62.1% | 16580.8 [ 3.0%/ 0.3%] -9.1% 62.8% | 21193.6 [ 4.9%/ 0.6%] 16.2% 97.8% |
| 48/48 21708.6 [ 8.3%/ 0.8%] 90.3% | 17064.1 [ 9.5%/ 0.9%] -21.4% 90.4% | 18531.4 [16.6%/ 1.8%] -14.6% 99.5% |
| 64/64 18636.9 [13.1%/ 1.3%] 97.9% | 12376.1 [20.9%/ 2.1%] -33.6% 96.8% | 20025.8 [14.9%/ 2.4%] 7.5% 100.0% |
|128/128 16204.2 [16.8%/ 1.7%] 99.4% | 3776.1 [88.7%/ 8.9%] -76.7% 97.6% | 20263.5 [12.7%/ 6.8%] 25.1% 100.0% |
|256/256 16730.5 [17.9%/ 1.8%] 98.9% | 1499.7 [210.3%/21.0%] -91.0% 98.4% | 17633.1 [ 7.5%/ 8.9%] 5.4% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

And for this case, we care about sysbench MySQL latency(ms):
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% | no-SMT [std% / sem%] +/- cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
| 1/1 1.1 [ 3.7%/ 0.4%] 1.9% | 1.1 [ 1.0%/ 0.1%] -8.9% 1.9% | 1.1 [ 4.1%/ 0.4%] -2.0% 3.9% |
| 2/2 1.1 [ 0.7%/ 0.1%] 3.9% | 1.2 [ 0.8%/ 0.1%] -10.8% 3.9% | 1.1 [ 0.8%/ 0.1%] 0.2% 7.8% |
| 4/4 1.1 [ 0.7%/ 0.1%] 7.8% | 1.3 [ 3.8%/ 0.4%] -11.8% 7.8% | 1.2 [ 2.2%/ 0.2%] -1.1% 15.6% |
| 8/8 1.2 [ 2.2%/ 0.2%] 15.5% | 1.3 [ 3.0%/ 0.3%] -11.7% 16.1% | 1.2 [ 1.8%/ 0.2%] 0.4% 31.1% |
| 16/16 1.4 [ 1.5%/ 0.1%] 31.1% | 2.0 [ 8.2%/ 0.8%] -45.8% 32.0% | 1.9 [18.2%/ 1.7%] -33.2% 62.9% |
| 32/32 2.4 [ 6.6%/ 0.7%] 62.1% | 2.6 [ 3.1%/ 0.3%] -6.2% 62.8% | 2.2 [23.5%/ 2.0%] 8.5% 97.8% |
| 48/48 2.7 [ 5.3%/ 0.5%] 90.3% | 3.4 [ 3.5%/ 0.4%] -26.1% 90.4% | 6.2 [19.3%/ 3.5%] -128.0% 99.5% |
| 64/64 5.9 [13.0%/ 1.3%] 97.9% | 8.3 [ 9.8%/ 1.0%] -40.1% 96.8% | 7.4 [16.6%/ 1.5%] -25.1% 100.0% |
|128/128 17.4 [46.8%/ 4.7%] 99.4% | 248.0 [146.9%/14.7%] -1327.8% 97.6% | 11.0 [10.5%/ 0.0%] 36.7% 100.0% |
|256/256 33.5 [67.1%/ 6.7%] 98.9% | 1279.5 [245.6%/24.6%] -3716.6% 98.4% | 21.5 [21.5%/ 0.0%] 36.0% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

Note:
----
64/64: 64 sysbench threads (in one cgroup) and 64 gemmbench threads (in the other cgroup) run simultaneously.
Vanilla-SMT: baseline with HT on
coresched-SMT: core scheduling enabled
no-SMT: HT off via /sys/devices/system/cpu/smt/control
std%: standard deviation
sem%: standard error of the mean
±: improvement/regression against baseline
cpu%: derived from vmstat.idle and vmstat.iowait
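       (roughly: cpu% ~= 100% - idle% - iowait% as sampled by vmstat;
       e.g. the 1/1 rows report ~1.9%, consistent with 2 busy threads
       on 104 CPUs)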

2019-10-31 18:43:48

by Phil Auld

Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Hi Vineeth,

On Wed, Oct 30, 2019 at 06:33:13PM +0000 Vineeth Remanan Pillai wrote:
> Fourth iteration of the Core-Scheduling feature.
>
> This version was aiming mostly at addressing the vruntime comparison
> issues with v3. The main issue seen in v3 was the starvation of
> interactive tasks when competing with cpu intensive tasks. This issue
> is mitigated to a large extent.
>
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
> SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
> nosmt seems to perform better than core scheduling.
>
> v4 is rebased on top of 5.3.5(dc073f193b70):
> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
>
> Changes in v4
> -------------
> - Implement a core wide min_vruntime for vruntime comparison of tasks
> across cpus in a core.
> - Fixes a typo bug in setting the forced_idle cpu.
>
> Changes in v3
> -------------
> - Fixes the issue of sibling picking up an incompatible task
> - Aaron Lu
> - Vineeth Pillai
> - Julien Desfossez
> - Fixes the issue of starving threads due to forced idle
> - Peter Zijlstra
> - Fixes the refcounting issue when deleting a cgroup with tag
> - Julien Desfossez
> - Fixes a crash during cpu offline/online with coresched enabled
> - Vineeth Pillai
> - Fixes a comparison logic issue in sched_core_find
> - Aaron Lu
>
> Changes in v2
> -------------
> - Fixes for couple of NULL pointer dereference crashes
> - Subhra Mazumdar
> - Tim Chen
> - Improves priority comparison logic for process in different cpus
> - Peter Zijlstra
> - Aaron Lu
> - Fixes a hard lockup in rq locking
> - Vineeth Pillai
> - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
> - Vineeth Pillai
> - Julien Desfossez
> - Fix for 32bit build
> - Aubrey Li
>
> TODO
> ----
> - Decide on the API for exposing the feature to userland
> - Investigate the source of the overhead even when no tasks are tagged:
> https://lkml.org/lkml/2019/10/29/242
> - Investigate the performance scaling issue when we have a high number of
> tagged threads: https://lkml.org/lkml/2019/10/29/248
> - Try to optimize the performance for IO-demanding applications:
> https://lkml.org/lkml/2019/10/29/261
>
> ---
>
> Aaron Lu (3):
> sched/fair: wrapper for cfs_rq->min_vruntime
> sched/fair: core wide vruntime comparison
> sched/fair : Wake up forced idle siblings if needed
>
> Peter Zijlstra (16):
> stop_machine: Fix stop_cpus_in_progress ordering
> sched: Fix kerneldoc comment for ia64_set_curr_task
> sched: Wrap rq::lock access
> sched/{rt,deadline}: Fix set_next_task vs pick_next_task
> sched: Add task_struct pointer to sched_class::set_curr_task
> sched/fair: Export newidle_balance()
> sched: Allow put_prev_task() to drop rq->lock
> sched: Rework pick_next_task() slow-path
> sched: Introduce sched_class::pick_task()
> sched: Core-wide rq->lock
> sched: Basic tracking of matching tasks
> sched: A quick and dirty cgroup tagging interface
> sched: Add core wide task selection and scheduling.
> sched/fair: Add a few assertions
> sched: Trivial forced-newidle balancer
> sched: Debug bits...
>
> include/linux/sched.h | 9 +-
> kernel/Kconfig.preempt | 6 +
> kernel/sched/core.c | 847 +++++++++++++++++++++++++++++++++++++--
> kernel/sched/cpuacct.c | 12 +-
> kernel/sched/deadline.c | 99 +++--
> kernel/sched/debug.c | 4 +-
> kernel/sched/fair.c | 346 +++++++++++-----
> kernel/sched/idle.c | 42 +-
> kernel/sched/pelt.h | 2 +-
> kernel/sched/rt.c | 96 ++---
> kernel/sched/sched.h | 246 +++++++++---
> kernel/sched/stop_task.c | 35 +-
> kernel/sched/topology.c | 4 +-
> kernel/stop_machine.c | 2 +
> 14 files changed, 1399 insertions(+), 351 deletions(-)
>
> --
> 2.17.1
>

Unless I'm mistaken, 7 of the first 8 of these went into sched/core
and are now in Linux (as of v5.4-rc1). It may make sense to rebase on
that and simplify the series.


Cheers,
Phil
--

2019-11-01 12:24:03

by Li, Aubrey

Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2019/10/31 19:42, Li, Aubrey wrote:
> On 2019/10/31 2:33, Vineeth Remanan Pillai wrote:
>> Fourth iteration of the Core-Scheduling feature.
>>
>> This version was aiming mostly at addressing the vruntime comparison
>> issues with v3. The main issue seen in v3 was the starvation of
>> interactive tasks when competing with cpu intensive tasks. This issue
>> is mitigated to a large extent.
>>
>> We have tested and verified that incompatible processes are not
>> selected during schedule. In terms of performance, the impact
>> depends on the workload:
>> - on CPU intensive applications that use all the logical CPUs with
>> SMT enabled, enabling core scheduling performs better than nosmt.
>> - on mixed workloads with considerable io compared to cpu usage,
>> nosmt seems to perform better than core scheduling.
>>
>> v4 is rebased on top of 5.3.5(dc073f193b70):
>> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
>
> Thanks to post V4 out. Refresh the data on my side. Since we have played
> with Aaron's core vruntime patch for a while, no surprise in the result.
>
I have three small patches against v4. The basic idea is:
- for load balancing, don't pull/push a task if its cookie does not match
  the destination CPU's core cookie
- for task wakeup, select an idle CPU whose core cookie matches the
  task's cookie.

The sysbench MySQL result shows a significant improvement for the
overloaded cases.

Here is the original data of coresched_v4 (for comparison):
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% | no-SMT [std% / sem%] +/- cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
| 1/1 1018.2 [ 1.0%/ 0.1%] 1.9% | 915.8 [ 0.9%/ 0.1%] -10.1% 1.9% | 994.0 [ 1.4%/ 0.2%] -2.4% 3.9% |
| 2/2 1941.2 [ 0.7%/ 0.1%] 3.9% | 1746.0 [ 0.5%/ 0.1%] -10.1% 3.9% | 1946.2 [ 0.8%/ 0.1%] 0.3% 7.8% |
| 4/4 3763.9 [ 0.5%/ 0.0%] 7.8% | 3426.0 [ 1.5%/ 0.2%] -9.0% 7.8% | 3745.1 [ 1.1%/ 0.1%] -0.5% 15.6% |
| 8/8 7375.5 [ 1.3%/ 0.1%] 15.5% | 6647.1 [ 1.1%/ 0.1%] -9.9% 16.1% | 7368.4 [ 0.8%/ 0.1%] -0.1% 31.1% |
| 16/16 12990.3 [ 0.6%/ 0.1%] 31.1% | 10903.7 [ 1.9%/ 0.2%] -16.1% 32.0% | 12082.6 [ 4.7%/ 0.5%] -7.0% 62.9% |
| 32/32 18238.1 [ 6.1%/ 0.6%] 62.1% | 16580.8 [ 3.0%/ 0.3%] -9.1% 62.8% | 21193.6 [ 4.9%/ 0.6%] 16.2% 97.8% |
| 48/48 21708.6 [ 8.3%/ 0.8%] 90.3% | 17064.1 [ 9.5%/ 0.9%] -21.4% 90.4% | 18531.4 [16.6%/ 1.8%] -14.6% 99.5% |
| 64/64 18636.9 [13.1%/ 1.3%] 97.9% | 12376.1 [20.9%/ 2.1%] -33.6% 96.8% | 20025.8 [14.9%/ 2.4%] 7.5% 100.0% |
|128/128 16204.2 [16.8%/ 1.7%] 99.4% | 3776.1 [88.7%/ 8.9%] -76.7% 97.6% | 20263.5 [12.7%/ 6.8%] 25.1% 100.0% |
|256/256 16730.5 [17.9%/ 1.8%] 98.9% | 1499.7 [210.3%/21.0%] -91.0% 98.4% | 17633.1 [ 7.5%/ 8.9%] 5.4% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

And the following is the new data with cookie-match checking;
coresched-SMT here is coresched_v4 plus my patches.
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% | no-SMT [std% / sem%] +/- cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
| 1/1 1018.2 [ 1.0%/ 0.1%] 1.9% | 901.2 [ 0.7%/ 0.1%] -11.5% 1.9% | 994.0 [ 1.4%/ 0.2%] -2.4% 3.9% |
| 2/2 1941.2 [ 0.7%/ 0.1%] 3.9% | 1733.9 [ 0.8%/ 0.1%] -10.7% 3.9% | 1946.2 [ 0.8%/ 0.1%] 0.3% 7.8% |
| 4/4 3763.9 [ 0.5%/ 0.0%] 7.8% | 3429.4 [ 1.0%/ 0.1%] -8.9% 7.8% | 3745.1 [ 1.1%/ 0.1%] -0.5% 15.6% |
| 8/8 7375.5 [ 1.3%/ 0.1%] 15.5% | 6571.3 [ 1.2%/ 0.1%] -10.9% 16.1% | 7368.4 [ 0.8%/ 0.1%] -0.1% 31.1% |
| 16/16 12990.3 [ 0.6%/ 0.1%] 31.1% | 9365.1 [ 2.6%/ 0.3%] -27.9% 31.1% | 12082.6 [ 4.7%/ 0.6%] -7.0% 62.9% |
| 32/32 18238.1 [ 6.1%/ 0.6%] 62.1% | 16181.9 [ 4.0%/ 0.4%] -11.3% 63.6% | 21193.6 [ 4.9%/ 0.6%] 16.2% 97.8% |
| 48/48 21708.6 [ 8.3%/ 0.8%] 90.3% | 18390.4 [ 6.0%/ 0.6%] -15.3% 85.3% | 18531.4 [16.6%/ 1.7%] -14.6% 99.5% |
| 64/64 18636.9 [13.1%/ 1.3%] 97.9% | 12831.7 [22.5%/ 2.2%] -31.1% 96.2% | 20025.8 [14.9%/ 2.3%] 7.5% 100.0% |
|128/128 16204.2 [16.8%/ 1.7%] 99.4% | 10199.4 [16.3%/ 1.6%] -37.1% 97.9% | 20263.5 [12.7%/ 2.5%] 25.1% 100.0% |
|256/256 16730.5 [17.9%/ 1.8%] 98.9% | 7739.4 [15.6%/ 1.6%] -53.7% 96.4% | 17633.1 [ 7.5%/ 1.7%] 5.4% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

This may be workload specific. Looking forward to more testing and comments.

Thanks,
-Aubrey
----------------------------------------------------------------------
From 48f27dde3a54915c25f08b0dfff0a8d148cd6093 Mon Sep 17 00:00:00 2001
From: Aubrey Li <[email protected]>
Date: Fri, 1 Nov 2019 14:17:01 +0800
Subject: [PATCH 1/3] sched/fair: don't migrate task if cookie not match

Load balancing tries to move a task from the busiest CPU to the
destination CPU. When core scheduling is enabled, if the task's cookie
does not match the destination CPU's core cookie, the task will be
skipped by this CPU. This mitigates the forced idle time on the
destination CPU.

Signed-off-by: Aubrey Li <[email protected]>
---
kernel/sched/fair.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4728f5e..ba937b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7366,8 +7366,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
* 2) cannot be migrated to this CPU due to cpus_ptr, or
- * 3) running (obviously), or
- * 4) are cache-hot on their current CPU.
+ * 3) task's cookie does not match with this CPU's core cookie
+ * 4) running (obviously), or
+ * 5) are cache-hot on their current CPU.
*/
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7402,6 +7403,17 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}

+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(cpu_rq(env->dst_cpu))) {
+ /*
+ * Don't migrate task if task's cookie does not match
+ * with core cookie
+ */
+ if (p->core_cookie != cpu_rq(env->dst_cpu)->core->core_cookie)
+ return 0;
+ }
+#endif
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;

--
2.7.4

----------------------------------------------------------------------
From c6cd478408097c47f60a124bf67562d0e2994f7d Mon Sep 17 00:00:00 2001
From: Aubrey Li <[email protected]>
Date: Fri, 1 Nov 2019 14:28:09 +0800
Subject: [PATCH 2/3] sched/fair: select cookie matched idle CPU

In the fast path of task wakeup, select the first cookie matched idle
CPU instead of the first idle CPU.

Signed-off-by: Aubrey Li <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba937b9..19ea8d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6074,7 +6074,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
if (available_idle_cpu(cpu))
+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(cpu_rq(cpu)) &&
+ (p->core_cookie == cpu_rq(cpu)->core->core_cookie))
+ break;
+#else
break;
+#endif
}

time = cpu_clock(this) - time;
--
2.7.4

----------------------------------------------------------------------
From 1dc91d6b3e95c81d1387f4989f160a1a9924a8f4 Mon Sep 17 00:00:00 2001
From: Aubrey Li <[email protected]>
Date: Fri, 1 Nov 2019 14:49:00 +0800
Subject: [PATCH 3/3] sched/fair: find cookie matched idlest CPU

In the slow path of task wakeup, find the idlest CPU whose core
cookie matches the task's cookie.

Signed-off-by: Aubrey Li <[email protected]>
---
kernel/sched/fair.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19ea8d8..3aec68a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5692,6 +5692,22 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
p->cpus_ptr))
continue;

+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(cpu_rq(this_cpu))) {
+ bool cookie_match = false;
+
+ for_each_cpu(i, sched_group_span(group)) {
+ struct rq *rq = cpu_rq(i);
+
+ if (p->core_cookie == rq->core->core_cookie)
+ cookie_match = true;
+ }
+ /* Skip over this group if no cookie matched */
+ if (!cookie_match)
+ continue;
+ }
+#endif
+
local_group = cpumask_test_cpu(this_cpu,
sched_group_span(group));

@@ -5817,8 +5833,14 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this

/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+ struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(rq) && (p->core_cookie !=
+ rq->core->core_cookie))
+ continue;
+#endif
if (available_idle_cpu(i)) {
- struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
--
2.7.4

2019-11-01 14:15:16

by Vineeth Remanan Pillai

Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Hi Phil,

> Unless I'm mistaken 7 of the first 8 of these went into sched/core
> and are now in linux (from v5.4-rc1). It may make sense to rebase on
> that and simplify the series.
>
Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
and post the changes soon, if the tests go well. For v3, while rebasing
to an RC kernel, we saw perf regressions and hence did not check the
RC kernel this time. You are absolutely right that we can simplify the
patch series with 5.4 RC.


Thanks
Vineeth

2019-11-01 16:45:08

by Greg Kerr

Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, Nov 1, 2019 at 7:03 AM Vineeth Remanan Pillai
<[email protected]> wrote:
>
> Hi Phil,
>
> > Unless I'm mistaken 7 of the first 8 of these went into sched/core
> > and are now in linux (from v5.4-rc1). It may make sense to rebase on
> > that and simplify the series.
> >
> Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
> and post the changes soon, if the tests goes well. For v3, while rebasing
> to an RC kernel, we saw perf regressions and hence did not check the
> RC kernel this time. You are absolutely right that we can simplify the
> patch series with 5.4 RC.

Has anyone considered shipping a v1 implementation which just allows
threads from the same process to share a core, and then iterating on
that? Would that be simpler to implement, or do the same fundamental
problems exist as with tagging arbitrary processes with cookies?

Regards,

Greg Kerr

>
>
> Thanks
> Vineeth

2019-11-01 20:57:23

by Dario Faggioli

Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, 2019-11-01 at 09:35 -0700, Greg Kerr wrote:
> Has anyone considering shipping a V1 implementation which just allows
> threads from the same process to share a core together? And then
> iterating on that? Would that be simpler to implement or do the same
> fundamental problems exist as tagging arbitrary processes with
> cookies?
>
IMO, the latter, i.e., the same fundamental problems exist.

It's _tasks_ having to be scheduled in a certain way.

It does not matter much, as far as the necessary modifications to the
scheduler are concerned, whether those tasks are independent processes,
threads of the same process, vcpus of a VM, etc...
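
For illustration only (this toy, self-contained sketch is not part of the
posted series, and all names in it are made up): whatever grouping policy
is chosen, it boils down to giving the tasks that are allowed to share a
core the same per-task cookie, while the matching rule the scheduler
applies stays the same.

/*
 * Toy model: a per-process policy derives the cookie from the thread
 * group, a cgroup policy (as in the v4 prototype) derives it from the
 * tagged group. The co-scheduling test itself does not change.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
	int pid;
	int tgid;       /* thread group (process) id */
	int cgroup_id;  /* tagged cgroup, made-up identifier */
	unsigned long cookie;
};

/* Same-process policy: all threads of one process share a cookie. */
static void tag_per_process(struct task *t) { t->cookie = t->tgid; }

/* Cgroup policy: cookie derived from the tagged group. */
static void tag_per_cgroup(struct task *t) { t->cookie = t->cgroup_id; }

/* The rule used when picking tasks for the siblings of a core. */
static bool may_share_core(const struct task *a, const struct task *b)
{
	return a->cookie == b->cookie;
}

int main(void)
{
	struct task a = { .pid = 100, .tgid = 100, .cgroup_id = 7 };
	struct task b = { .pid = 101, .tgid = 100, .cgroup_id = 8 };

	tag_per_process(&a); tag_per_process(&b);
	printf("per-process policy: %d\n", may_share_core(&a, &b)); /* 1 */

	tag_per_cgroup(&a); tag_per_cgroup(&b);
	printf("per-cgroup policy:  %d\n", may_share_core(&a, &b)); /* 0 */
	return 0;
}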

Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2019-11-08 03:22:10

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2019/11/1 19:33, Li, Aubrey wrote:
> On 2019/10/31 19:42, Li, Aubrey wrote:
>> On 2019/10/31 2:33, Vineeth Remanan Pillai wrote:
>>> Fourth iteration of the Core-Scheduling feature.
>>>
>>> This version was aiming mostly at addressing the vruntime comparison
>>> issues with v3. The main issue seen in v3 was the starvation of
>>> interactive tasks when competing with cpu intensive tasks. This issue
>>> is mitigated to a large extent.
>>>
>>> We have tested and verified that incompatible processes are not
>>> selected during schedule. In terms of performance, the impact
>>> depends on the workload:
>>> - on CPU intensive applications that use all the logical CPUs with
>>> SMT enabled, enabling core scheduling performs better than nosmt.
>>> - on mixed workloads with considerable io compared to cpu usage,
>>> nosmt seems to perform better than core scheduling.
>>>
>>> v4 is rebased on top of 5.3.5(dc073f193b70):
>>> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
>>
>> Thanks to post V4 out. Refresh the data on my side. Since we have played
>> with Aaron's core vruntime patch for a while, no surprise in the result.
>>
> I have three small patches against V4. The basic idea is,
> - for load balance, don't pull/push task if its cookie does not match with
> destination CPU's core cookie
> - for task wakeup, select idle CPU whose core cookie matches with task's
> cookie.
>
> Sysbench MySQL result shows significant improvement for the overload cases.
>
> This may be the workload specific. Looking forward to more testing and comments.
>

Here is another one, for task NUMA migration. We saw a significant latency
improvement for the sysbench MySQL + gemmbench workload: for the overloaded case
on an 8-node system, the latency is reduced from 93.78ms to 28.36ms. So I
think it's worth posting this tweak to draw more ideas and better solutions.

Thanks,
-Aubrey

----------------------------------------------------------------------
From 89fc5fd70d5dcc426dc4724afdf35d2c916cd303 Mon Sep 17 00:00:00 2001
From: Aubrey Li <[email protected]>
Date: Thu, 7 Nov 2019 14:51:28 +0800
Subject: [PATCH 1/2] sched/numa: don't migrate task if cookie not match

For the NUMA load balance, don't migrate task to the CPU whose core
cookie does not match with task's cookie

Signed-off-by: Aubrey Li <[email protected]>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3aec68a..2909030 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1827,6 +1827,16 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;

+#ifdef CONFIG_SCHED_CORE
+ /*
+ * Skip this cpu if source task's cookie does not match
+ * with CPU's core cookie.
+ */
+ if (sched_core_enabled(cpu_rq(cpu)) && (env->p->core_cookie !=
+ cpu_rq(cpu)->core->core_cookie))
+ continue;
+#endif
+
env->dst_cpu = cpu;
task_numa_compare(env, taskimp, groupimp, maymove);
}
--
2.7.4


2019-11-11 19:11:43

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 10/30/19 11:33 AM, Vineeth Remanan Pillai wrote:
> Fourth iteration of the Core-Scheduling feature.
>
> This version was aiming mostly at addressing the vruntime comparison
> issues with v3. The main issue seen in v3 was the starvation of
> interactive tasks when competing with cpu intensive tasks. This issue
> is mitigated to a large extent.
>
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
> SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
> nosmt seems to perform better than core scheduling.
>
> v4 is rebased on top of 5.3.5(dc073f193b70):
> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
>

The v4 core scheduler will crash when you toggle the core scheduling
tag of the cgroup while there are active tasks in the cgroup running.
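
A minimal user-space sketch of such a toggle, for illustration only (the
cgroup v1 mount point and the cpu.tag knob name below are assumptions
about this prototype's tagging interface, not something taken from this
thread):

/*
 * Keep a runnable task inside a tagged cgroup and flip the tag
 * repeatedly. Paths and the knob name are assumed, not verified.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		exit(1);
	}
	if (write(fd, val, strlen(val)) < 0)
		perror(path);
	close(fd);
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/cpu/test";  /* assumed cgroup v1 path */
	char path[256];
	pid_t child;
	int i;

	/* Move ourselves into the cgroup; a forked child inherits it. */
	snprintf(path, sizeof(path), "%s/tasks", cg);
	write_str(path, "0");  /* writing "0" moves the writing task */

	child = fork();
	if (child == 0)
		for (;;)
			;  /* busy child keeps an active task in the cgroup */

	/* Repeatedly toggle the (assumed) core scheduling tag. */
	snprintf(path, sizeof(path), "%s/cpu.tag", cg);
	for (i = 0; i < 100; i++) {
		write_str(path, "1");
		write_str(path, "0");
	}

	kill(child, SIGKILL);
	return 0;
}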

The reason is that we don't properly move tasks in and out of the
core scheduler queue according to the new core scheduling tag status.
A task's core scheduler queue status will then get misaligned with its core
cookie status.

The patch below fixes that.

Thanks.

Tim

------->8----------------

From 2f3f035a9b74013627069da62d6553548700eeac Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Thu, 7 Nov 2019 15:45:24 -0500
Subject: [PATCH] sched: Update tasks in core queue when its cgroup tag is
changed

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling. Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after we fork a task, its presence in the core scheduler queue will
need to be updated according to its new cgroup's status.

Implement these missing core scheduler queue update mechanisms.
Otherwise, the core scheduler will oops the kernel when we toggle the
cgroup's core scheduling tag, due to inconsistent core scheduler queue
status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and missing the update while we
iterate through all the tasks in the cgroup. A more complicated
scheme could probably avoid the stop machine. Such a scheme would also
need to resolve the inconsistency between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, which avoids
such complications.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 119 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 106 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778b5940c30..08e7fdd5972d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -138,6 +138,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
return false;
}

+static bool sched_core_empty(struct rq *rq)
+{
+ return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+ return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+ struct task_struct *task;
+
+ task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+ return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct task_struct *task;
+
+ while (!sched_core_empty(rq)) {
+ task = sched_core_first(rq);
+ rb_erase(&task->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&task->core_node);
+ }
+ rq->core->core_task_seq++;
+}
+
static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
{
struct rb_node *parent, **node;
@@ -169,10 +200,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
{
rq->core->core_task_seq++;

- if (!p->core_cookie)
+ if (!sched_core_enqueued(p))
return;

rb_erase(&p->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&p->core_node);
}

/*
@@ -236,6 +268,18 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;

+ if (!enabled) {
+ for_each_online_cpu(cpu) {
+ /*
+ * All active and migrating tasks will have already been removed
+ * from core queue when we clear the cgroup tags.
+ * However, dying tasks could still be left in core queue.
+ * Flush them here.
+ */
+ sched_core_flush(cpu);
+ }
+ }
+
for_each_online_cpu(cpu)
cpu_rq(cpu)->core_enabled = enabled;

@@ -247,7 +291,11 @@ static int sched_core_count;

static void __sched_core_enable(void)
{
- // XXX verify there are no cookie tasks (yet)
+ int cpu;
+
+ /* verify there are no cookie tasks (yet) */
+ for_each_online_cpu(cpu)
+ BUG_ON(!sched_core_empty(cpu_rq(cpu)));

static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -257,8 +305,6 @@ static void __sched_core_enable(void)

static void __sched_core_disable(void)
{
- // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);

@@ -285,6 +331,7 @@ void sched_core_put(void)

static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }

#endif /* CONFIG_SCHED_CORE */

@@ -3016,6 +3063,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&p->core_node);
#endif
return 0;
}
@@ -6560,6 +6610,9 @@ void init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP
sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&idle->core_node);
+#endif
}

#ifdef CONFIG_SMP
@@ -7671,7 +7724,12 @@ static void cpu_cgroup_fork(struct task_struct *task)
rq = task_rq_lock(task, &rf);

update_rq_clock(rq);
+ if (sched_core_enqueued(task))
+ sched_core_dequeue(rq, task);
sched_change_group(task, TASK_SET_GROUP);
+ if (sched_core_enabled(rq) && task_on_rq_queued(task) &&
+ task->core_cookie)
+ sched_core_enqueue(rq, task);

task_rq_unlock(rq, task, &rf);
}
@@ -8033,12 +8091,51 @@ static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype
return !!tg->tagged;
}

-static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+struct write_core_tag {
+ struct cgroup_subsys_state *css;
+ int val;
+};
+
+static int __sched_write_tag(void *data)
{
- struct task_group *tg = css_tg(css);
+ struct write_core_tag *tag = (struct write_core_tag *) data;
+ struct cgroup_subsys_state *css = tag->css;
+ int val = tag->val;
+ struct task_group *tg = css_tg(tag->css);
struct css_task_iter it;
struct task_struct *p;

+ tg->tagged = !!val;
+
+ css_task_iter_start(css, 0, &it);
+ /*
+ * Note: css_task_iter_next will skip dying tasks.
+ * There could still be dying tasks left in the core queue
+ * when we set cgroup tag to 0 when the loop is done below.
+ */
+ while ((p = css_task_iter_next(&it))) {
+ p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+ if (sched_core_enqueued(p)) {
+ sched_core_dequeue(task_rq(p), p);
+ if (!p->core_cookie)
+ continue;
+ }
+
+ if (p->core_cookie && task_on_rq_queued(p))
+ sched_core_enqueue(task_rq(p), p);
+
+ }
+ css_task_iter_end(&it);
+
+ return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+ struct task_group *tg = css_tg(css);
+ struct write_core_tag wtag;
+
if (val > 1)
return -ERANGE;

@@ -8048,16 +8145,12 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
if (tg->tagged == !!val)
return 0;

- tg->tagged = !!val;
-
if (!!val)
sched_core_get();

- css_task_iter_start(css, 0, &it);
- while ((p = css_task_iter_next(&it)))
- p->core_cookie = !!val ? (unsigned long)tg : 0UL;
- css_task_iter_end(&it);
-
+ wtag.css = css;
+ wtag.val = val;
+ stop_machine(__sched_write_tag, (void *) &wtag, NULL);
if (!val)
sched_core_put();

--
2.20.1

2019-11-12 01:46:27

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, 2019-11-01 at 10:03 -0400, Vineeth Remanan Pillai wrote:
> Hi Phil,
>
> > Unless I'm mistaken 7 of the first 8 of these went into sched/core
> > and are now in linux (from v5.4-rc1). It may make sense to rebase
> > on
> > that and simplify the series.
> >
> Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
> and post the changes soon, if the tests goes well. For v3, while
> rebasing
> to an RC kernel, we saw perf regressions and hence did not check the
> RC kernel this time. You are absolutely right that we can simplify
> the
> patch series with 5.4 RC.
>
And, in case it's useful to anybody, here's a rebase of this series on
top of 5.4-rc7:

https://github.com/dfaggioli/linux/tree/wip/sched/v5.4-rc7-coresched

This also includes Tim's patch (this one
<[email protected]>) and is, for
now, *only* compile tested.

Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2019-11-13 17:27:11

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 11/11/19 5:45 PM, Dario Faggioli wrote:
> On Fri, 2019-11-01 at 10:03 -0400, Vineeth Remanan Pillai wrote:
>> Hi Phil,
>>
>>> Unless I'm mistaken 7 of the first 8 of these went into sched/core
>>> and are now in linux (from v5.4-rc1). It may make sense to rebase
>>> on
>>> that and simplify the series.
>>>
>> Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
>> and post the changes soon, if the tests goes well. For v3, while
>> rebasing
>> to an RC kernel, we saw perf regressions and hence did not check the
>> RC kernel this time. You are absolutely right that we can simplify
>> the
>> patch series with 5.4 RC.
>>
> And, in case it's useful to anybody, here's a rebase of this series on
> top of 5.4-rc7:
>
> https://github.com/dfaggioli/linux/tree/wip/sched/v5.4-rc7-coresched

Thanks for the port.
>
> This also includes Tim's patch (this one
> <[email protected]>) and is, for
> now, *only* compile tested.
>

I tested my patch on top of your port and it seems to work fine.

Tim

2020-01-02 02:29:10

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Nov 12, 2019 at 9:45 AM Dario Faggioli <[email protected]> wrote:
>
> On Fri, 2019-11-01 at 10:03 -0400, Vineeth Remanan Pillai wrote:
> > Hi Phil,
> >
> > > Unless I'm mistaken 7 of the first 8 of these went into sched/core
> > > and are now in linux (from v5.4-rc1). It may make sense to rebase
> > > on
> > > that and simplify the series.
> > >
> > Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
> > and post the changes soon, if the tests goes well. For v3, while
> > rebasing
> > to an RC kernel, we saw perf regressions and hence did not check the
> > RC kernel this time. You are absolutely right that we can simplify
> > the
> > patch series with 5.4 RC.
> >
> And, in case it's useful to anybody, here's a rebase of this series on
> top of 5.4-rc7:
>
> https://github.com/dfaggioli/linux/tree/wip/sched/v5.4-rc7-coresched
>

In case it's useful to anyone, I rebased the series on top of v5.5-rc4.
https://github.com/aubreyli/linux/tree/coresched_v4-v5.5-rc4

v5.5 includes a few scheduler reworks and fixes, so I modified patches 1/2/6.
Patch-0002 has relatively big changes, but still has no functionality
or logic change.

0001-sched-Wrap-rq-lock-access.patch
0002-sched-Introduce-sched_class-pick_task.patch
0003-sched-Core-wide-rq-lock.patch
0004-sched-Basic-tracking-of-matching-tasks.patch
0005-sched-A-quick-and-dirty-cgroup-tagging-interface.patch
0006-sched-Add-core-wide-task-selection-and-scheduling.patch
0007-sched-fair-Add-a-few-assertions.patch
0008-sched-Trivial-forced-newidle-balancer.patch
0009-sched-Debug-bits.patch
0010-sched-fair-wrapper-for-cfs_rq-min_vruntime.patch
0011-sched-fair-core-wide-vruntime-comparison.patch
0012-sched-fair-Wake-up-forced-idle-siblings-if-needed.patch

I verified with my test suites; it seems to work.

Thanks,
-Aubrey

2020-01-10 23:21:11

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 1/1/20 6:28 PM, Aubrey Li wrote:
> On Tue, Nov 12, 2019 at 9:45 AM Dario Faggioli <[email protected]> wrote:
>>
>> On Fri, 2019-11-01 at 10:03 -0400, Vineeth Remanan Pillai wrote:
>>> Hi Phil,
>>>
>>>> Unless I'm mistaken 7 of the first 8 of these went into sched/core
>>>> and are now in linux (from v5.4-rc1). It may make sense to rebase
>>>> on
>>>> that and simplify the series.
>>>>
>>> Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
>>> and post the changes soon, if the tests goes well. For v3, while
>>> rebasing
>>> to an RC kernel, we saw perf regressions and hence did not check the
>>> RC kernel this time. You are absolutely right that we can simplify
>>> the
>>> patch series with 5.4 RC.
>>>
>> And, in case it's useful to anybody, here's a rebase of this series on
>> top of 5.4-rc7:
>>
>> https://github.com/dfaggioli/linux/tree/wip/sched/v5.4-rc7-coresched
>>
>
> In case it's useful to anyone, I rebased the series on top of v5.5-rc4.
> https://github.com/aubreyli/linux/tree/coresched_v4-v5.5-rc4
>
> v5.5 includes a few scheduler rework and fix, so I modified patch1/2/6,
> patch-0002 has relatively big changes, but still has no functionality
> and logic change.
>
> 0001-sched-Wrap-rq-lock-access.patch
> 0002-sched-Introduce-sched_class-pick_task.patch
> 0003-sched-Core-wide-rq-lock.patch
> 0004-sched-Basic-tracking-of-matching-tasks.patch
> 0005-sched-A-quick-and-dirty-cgroup-tagging-interface.patch
> 0006-sched-Add-core-wide-task-selection-and-scheduling.patch
> 0007-sched-fair-Add-a-few-assertions.patch
> 0008-sched-Trivial-forced-newidle-balancer.patch
> 0009-sched-Debug-bits.patch
> 0010-sched-fair-wrapper-for-cfs_rq-min_vruntime.patch
> 0011-sched-fair-core-wide-vruntime-comparison.patch
> 0012-sched-fair-Wake-up-forced-idle-siblings-if-needed.patch
>
> I verified by my test suites, it seems to work.
>

Peter,

What do you see as the next steps to move the core scheduling
functionality forward?

We'd like to find out what you see as the gaps that need to be filled
to get this patchset considered for mainline.

Thanks.

Tim

2020-01-14 01:14:09

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 11/11/19 11:10 AM, Tim Chen wrote:
> On 10/30/19 11:33 AM, Vineeth Remanan Pillai wrote:
>> Fourth iteration of the Core-Scheduling feature.
>>
>> This version was aiming mostly at addressing the vruntime comparison
>> issues with v3. The main issue seen in v3 was the starvation of
>> interactive tasks when competing with cpu intensive tasks. This issue
>> is mitigated to a large extent.
>>
>> We have tested and verified that incompatible processes are not
>> selected during schedule. In terms of performance, the impact
>> depends on the workload:
>> - on CPU intensive applications that use all the logical CPUs with
>> SMT enabled, enabling core scheduling performs better than nosmt.
>> - on mixed workloads with considerable io compared to cpu usage,
>> nosmt seems to perform better than core scheduling.
>>
>> v4 is rebased on top of 5.3.5(dc073f193b70):
>> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
>>
>
> The v4 core scheduler will crash when you toggle the core scheduling
> tag of the cgroup while there are active tasks in the cgroup running.
>
> The reason is because we don't properly move tasks in and out of the
> core scheduler queue according to the new core scheduling tag status.
> A task's core scheduler status will then get misaligned with its core
> cookie status.
>
> The patch below fixes that.
>
> Thanks.
>
> Tim
>

I also encountered a kernel panic with the v4 code when taking a cpu offline or online
while the core scheduler is running. I've refreshed the previous patch, along
with 3 other patches, to fix problems related to CPU online/offline.

As a side effect of the fix, each core can now operate in core-scheduling
mode or non core-scheduling mode, depending on how many online SMT threads it has.

Vineeth, are you guys planning to refresh v4 and update it to v5? Aubrey posted
a port to the latest kernel earlier.

Tim

--->8---
From 3c00f178538e2a33d075b5705adaf31784b19add Mon Sep 17 00:00:00 2001
Message-Id: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
From: Tim Chen <[email protected]>
Date: Thu, 7 Nov 2019 15:45:24 -0500
Subject: [PATCH 1/4] sched: Update tasks in core queue when its cgroup tag is
changed

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling. Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after we fork a task, its presence in the core scheduler queue will
need to be updated according to its new cgroup's status.

Implement these missing core scheduler queue update mechanisms.
Otherwise, the core scheduler will oops the kernel when we toggle the
cgroup's core scheduling tag, due to inconsistent core scheduler queue
status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and missing the update while we
iterate through all the tasks in the cgroup. A more complicated
scheme could probably avoid the stop machine. Such a scheme would also
need to resolve the inconsistency between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, which avoids
such complications.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 120 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 107 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778b5940c30..44399f5434f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -138,6 +138,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
return false;
}

+static bool sched_core_empty(struct rq *rq)
+{
+ return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+ return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+ struct task_struct *task;
+
+ task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+ return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct task_struct *task;
+
+ while (!sched_core_empty(rq)) {
+ task = sched_core_first(rq);
+ rb_erase(&task->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&task->core_node);
+ }
+ rq->core->core_task_seq++;
+}
+
static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
{
struct rb_node *parent, **node;
@@ -169,10 +200,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
{
rq->core->core_task_seq++;

- if (!p->core_cookie)
+ if (!sched_core_enqueued(p))
return;

rb_erase(&p->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&p->core_node);
}

/*
@@ -236,6 +268,18 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;

+ if (!enabled) {
+ for_each_online_cpu(cpu) {
+ /*
+ * All active and migrating tasks will have already been removed
+ * from core queue when we clear the cgroup tags.
+ * However, dying tasks could still be left in core queue.
+ * Flush them here.
+ */
+ sched_core_flush(cpu);
+ }
+ }
+
for_each_online_cpu(cpu)
cpu_rq(cpu)->core_enabled = enabled;

@@ -247,7 +291,11 @@ static int sched_core_count;

static void __sched_core_enable(void)
{
- // XXX verify there are no cookie tasks (yet)
+ int cpu;
+
+ /* verify there are no cookie tasks (yet) */
+ for_each_online_cpu(cpu)
+ BUG_ON(!sched_core_empty(cpu_rq(cpu)));

static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -257,8 +305,6 @@ static void __sched_core_enable(void)

static void __sched_core_disable(void)
{
- // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);

@@ -285,6 +331,7 @@ void sched_core_put(void)

static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }

#endif /* CONFIG_SCHED_CORE */

@@ -3016,6 +3063,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&p->core_node);
#endif
return 0;
}
@@ -6560,6 +6610,9 @@ void init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP
sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&idle->core_node);
+#endif
}

#ifdef CONFIG_SMP
@@ -7671,7 +7724,12 @@ static void cpu_cgroup_fork(struct task_struct *task)
rq = task_rq_lock(task, &rf);

update_rq_clock(rq);
+ if (sched_core_enqueued(task))
+ sched_core_dequeue(rq, task);
sched_change_group(task, TASK_SET_GROUP);
+ if (sched_core_enabled(rq) && task_on_rq_queued(task) &&
+ task->core_cookie)
+ sched_core_enqueue(rq, task);

task_rq_unlock(rq, task, &rf);
}
@@ -8033,12 +8091,52 @@ static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype
return !!tg->tagged;
}

-static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+struct write_core_tag {
+ struct cgroup_subsys_state *css;
+ int val;
+};
+
+static int __sched_write_tag(void *data)
{
- struct task_group *tg = css_tg(css);
+ struct write_core_tag *tag = (struct write_core_tag *) data;
+ struct cgroup_subsys_state *css = tag->css;
+ int val = tag->val;
+ struct task_group *tg = css_tg(tag->css);
struct css_task_iter it;
struct task_struct *p;

+ tg->tagged = !!val;
+
+ css_task_iter_start(css, 0, &it);
+ /*
+ * Note: css_task_iter_next will skip dying tasks.
+ * There could still be dying tasks left in the core queue
+ * when we set cgroup tag to 0 when the loop is done below.
+ */
+ while ((p = css_task_iter_next(&it))) {
+ p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+ if (sched_core_enqueued(p)) {
+ sched_core_dequeue(task_rq(p), p);
+ if (!p->core_cookie)
+ continue;
+ }
+
+ if (sched_core_enabled(task_rq(p)) &&
+ p->core_cookie && task_on_rq_queued(p))
+ sched_core_enqueue(task_rq(p), p);
+
+ }
+ css_task_iter_end(&it);
+
+ return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+ struct task_group *tg = css_tg(css);
+ struct write_core_tag wtag;
+
if (val > 1)
return -ERANGE;

@@ -8048,16 +8146,12 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
if (tg->tagged == !!val)
return 0;

- tg->tagged = !!val;
-
if (!!val)
sched_core_get();

- css_task_iter_start(css, 0, &it);
- while ((p = css_task_iter_next(&it)))
- p->core_cookie = !!val ? (unsigned long)tg : 0UL;
- css_task_iter_end(&it);
-
+ wtag.css = css;
+ wtag.val = val;
+ stop_machine(__sched_write_tag, (void *) &wtag, NULL);
if (!val)
sched_core_put();

--
2.20.1


From ee10a2f324702b5d940545322686fb6f1ec61431 Mon Sep 17 00:00:00 2001
Message-Id: <ee10a2f324702b5d940545322686fb6f1ec61431.1578960120.git.tim.c.chen@linux.intel.com>
In-Reply-To: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
References: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
From: Tim Chen <[email protected]>
Date: Mon, 22 Jul 2019 16:13:45 -0700
Subject: [PATCH 2/4] sched: Return idle task in pick_next_task for offlined
CPU

There could be races in the core scheduler where a CPU is trying to pick
a task for its sibling in the core scheduler, when that CPU has just been
offlined. We should not schedule any tasks on the CPU in this case.
Return the idle task in pick_next_task for this situation.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44399f5434f8..1a132beba3f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4154,6 +4154,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
int i, j, cpu, occ = 0;
bool need_sync = false;

+ cpu = cpu_of(rq);
+ if (cpu_is_offline(cpu))
+ return idle_sched_class.pick_next_task(rq, prev, rf);
+
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);

@@ -4186,7 +4190,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (!rq->nr_running)
newidle_balance(rq, rf);

- cpu = cpu_of(rq);
smt_mask = cpu_smt_mask(cpu);

/*
@@ -4228,8 +4231,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct rq *rq_i = cpu_rq(i);
struct task_struct *p;

- if (cpu_is_offline(i))
+ if (cpu_is_offline(i)) {
+ rq_i->core_pick = rq_i->idle;
continue;
+ }

if (rq_i->core_pick)
continue;
--
2.20.1


From 226c811b3dc74452dc04244d57919d67a2555cc1 Mon Sep 17 00:00:00 2001
Message-Id: <226c811b3dc74452dc04244d57919d67a2555cc1.1578960120.git.tim.c.chen@linux.intel.com>
In-Reply-To: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
References: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
From: Tim Chen <[email protected]>
Date: Tue, 7 Jan 2020 13:22:28 -0800
Subject: [PATCH 3/4] sched/core: Enable core scheduler only for core with SMT
threads

The core scheduler has extra overhead. Enable it only for cores with
more than one SMT hardware thread.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a132beba3f8..9d875d6ed3f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -280,8 +280,10 @@ static int __sched_core_stopper(void *data)
}
}

- for_each_online_cpu(cpu)
- cpu_rq(cpu)->core_enabled = enabled;
+ for_each_online_cpu(cpu) {
+ if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
+ cpu_rq(cpu)->core_enabled = enabled;
+ }

return 0;
}
--
2.20.1


From 38c85b523a88b7dd0f659a874c7a1793751ca5ab Mon Sep 17 00:00:00 2001
Message-Id: <38c85b523a88b7dd0f659a874c7a1793751ca5ab.1578960120.git.tim.c.chen@linux.intel.com>
In-Reply-To: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
References: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
From: Tim Chen <[email protected]>
Date: Tue, 7 Jan 2020 13:26:54 -0800
Subject: [PATCH 4/4] sched/core: Update core scheduler queue when taking cpu
online/offline

When we bring a CPU online and enable the core scheduler, tasks that need
core scheduling need to be placed in the core's core scheduling queue.
Likewise, when we take a CPU offline or disable core scheduling on a
core, tasks in the core's core scheduling queue need to be removed.
Without such mechanisms, the core scheduler causes oopses due to
inconsistent core scheduling state of a task.

Implement such enqueue and dequeue mechanisms according to a CPU's change
in core scheduling status. The switch of the core scheduling mode of a core,
and the enqueue/dequeue of tasks on a core's queue due to the core scheduling
mode change, have to be run in a separate context as they cannot be done in
the context taking the cpu online/offline.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 156 ++++++++++++++++++++++++++++++++++++----
kernel/sched/deadline.c | 35 +++++++++
kernel/sched/fair.c | 38 ++++++++++
kernel/sched/rt.c | 43 +++++++++++
kernel/sched/sched.h | 7 ++
5 files changed, 264 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d875d6ed3f3..8db8960c6e69 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,11 @@ int sysctl_sched_rt_runtime = 950000;

#ifdef CONFIG_SCHED_CORE

+struct core_sched_cpu_work {
+ struct work_struct work;
+ cpumask_t smt_mask;
+};
+
DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);

/* kernel prio, less is more */
@@ -207,6 +212,18 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
RB_CLEAR_NODE(&p->core_node);
}

+void sched_core_add(struct rq *rq, struct task_struct *p)
+{
+ if (p->core_cookie && task_on_rq_queued(p))
+ sched_core_enqueue(rq, p);
+}
+
+void sched_core_remove(struct rq *rq, struct task_struct *p)
+{
+ if (sched_core_enqueued(p))
+ sched_core_dequeue(rq, p);
+}
+
/*
* Find left-most (aka, highest priority) task matching @cookie.
*/
@@ -329,11 +346,133 @@ void sched_core_put(void)
mutex_unlock(&sched_core_mutex);
}

+enum cpu_action {
+ CPU_ACTIVATE = 1,
+ CPU_DEACTIVATE = 2
+};
+
+static int __activate_cpu_core_sched(void *data);
+static int __deactivate_cpu_core_sched(void *data);
+static void core_sched_cpu_update(unsigned int cpu, enum cpu_action action);
+
+static int activate_cpu_core_sched(struct core_sched_cpu_work *work)
+{
+ if (static_branch_unlikely(&__sched_core_enabled))
+ stop_machine(__activate_cpu_core_sched, (void *) work, NULL);
+
+ return 0;
+}
+
+static int deactivate_cpu_core_sched(struct core_sched_cpu_work *work)
+{
+ if (static_branch_unlikely(&__sched_core_enabled))
+ stop_machine(__deactivate_cpu_core_sched, (void *) work, NULL);
+
+ return 0;
+}
+
+static void core_sched_cpu_activate_fn(struct work_struct *work)
+{
+ struct core_sched_cpu_work *cpu_work;
+
+ cpu_work = container_of(work, struct core_sched_cpu_work, work);
+ activate_cpu_core_sched(cpu_work);
+ kfree(cpu_work);
+}
+
+static void core_sched_cpu_deactivate_fn(struct work_struct *work)
+{
+ struct core_sched_cpu_work *cpu_work;
+
+ cpu_work = container_of(work, struct core_sched_cpu_work, work);
+ deactivate_cpu_core_sched(cpu_work);
+ kfree(cpu_work);
+}
+
+static void core_sched_cpu_update(unsigned int cpu, enum cpu_action action)
+{
+ struct core_sched_cpu_work *work;
+
+ work = kmalloc(sizeof(struct core_sched_cpu_work), GFP_ATOMIC);
+ if (!work)
+ return;
+
+ if (action == CPU_ACTIVATE)
+ INIT_WORK(&work->work, core_sched_cpu_activate_fn);
+ else
+ INIT_WORK(&work->work, core_sched_cpu_deactivate_fn);
+
+ cpumask_copy(&work->smt_mask, cpu_smt_mask(cpu));
+
+ queue_work(system_highpri_wq, &work->work);
+}
+
+static int __activate_cpu_core_sched(void *data)
+{
+ struct core_sched_cpu_work *work = (struct core_sched_cpu_work *) data;
+ struct rq *rq;
+ int i;
+
+ if (cpumask_weight(&work->smt_mask) < 2)
+ return 0;
+
+ for_each_cpu(i, &work->smt_mask) {
+ const struct sched_class *class;
+
+ rq = cpu_rq(i);
+
+ if (rq->core_enabled)
+ continue;
+
+ for_each_class(class) {
+ if (!class->core_sched_activate)
+ continue;
+
+ if (cpu_online(i))
+ class->core_sched_activate(rq);
+ }
+
+ rq->core_enabled = true;
+ }
+ return 0;
+}
+
+static int __deactivate_cpu_core_sched(void *data)
+{
+ struct core_sched_cpu_work *work = (struct core_sched_cpu_work *) data;
+ struct rq *rq;
+ int i;
+
+ if (cpumask_weight(&work->smt_mask) > 2)
+ return 0;
+
+ for_each_cpu(i, &work->smt_mask) {
+ const struct sched_class *class;
+
+ rq = cpu_rq(i);
+
+ if (!rq->core_enabled)
+ continue;
+
+ for_each_class(class) {
+ if (!class->core_sched_deactivate)
+ continue;
+
+ if (cpu_online(i))
+ class->core_sched_deactivate(cpu_rq(i));
+ }
+
+ rq->core_enabled = false;
+ }
+ return 0;
+}
+
#else /* !CONFIG_SCHED_CORE */

static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
static bool sched_core_enqueued(struct task_struct *task) { return false; }
+static inline void core_sched_cpu_update(unsigned int cpu, int action) { }

#endif /* CONFIG_SCHED_CORE */

@@ -6941,13 +7080,8 @@ int sched_cpu_activate(unsigned int cpu)
*/
if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
static_branch_inc_cpuslocked(&sched_smt_present);
-#ifdef CONFIG_SCHED_CORE
- if (static_branch_unlikely(&__sched_core_enabled)) {
- rq->core_enabled = true;
- }
-#endif
}
-
+ core_sched_cpu_update(cpu, CPU_ACTIVATE);
#endif
set_cpu_active(cpu, true);

@@ -6996,15 +7130,10 @@ int sched_cpu_deactivate(unsigned int cpu)
* When going down, decrement the number of cores with SMT present.
*/
if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
-#ifdef CONFIG_SCHED_CORE
- struct rq *rq = cpu_rq(cpu);
- if (static_branch_unlikely(&__sched_core_enabled)) {
- rq->core_enabled = false;
- }
-#endif
static_branch_dec_cpuslocked(&sched_smt_present);

}
+ core_sched_cpu_update(cpu, CPU_DEACTIVATE);
#endif

if (!sched_smp_initialized)
@@ -7081,9 +7210,6 @@ int sched_cpu_dying(unsigned int cpu)
update_max_interval();
nohz_balance_exit_idle(rq);
hrtick_clear(rq);
-#ifdef CONFIG_SCHED_CORE
- rq->core = NULL;
-#endif
return 0;
}
#endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 514b6328262f..6bb69d42965b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1755,6 +1755,37 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
}

+static void for_each_dl_task(struct rq *rq,
+ void (*fn)(struct rq *rq, struct task_struct *p))
+{
+ struct dl_rq *dl_rq = &rq->dl;
+ struct sched_dl_entity *dl_ent;
+ struct task_struct *task;
+ struct rb_node *rb_node;
+
+ rb_node = rb_first_cached(&dl_rq->root);
+ while (rb_node) {
+ dl_ent = rb_entry(rb_node, struct sched_dl_entity, rb_node);
+ task = dl_task_of(dl_ent);
+ fn(rq, task);
+ rb_node = rb_next(rb_node);
+ }
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_dl(struct rq *rq)
+{
+ for_each_dl_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_dl(struct rq *rq)
+{
+ for_each_dl_task(rq, sched_core_remove);
+}
+
+#endif
+
static struct task_struct *pick_task_dl(struct rq *rq)
{
struct sched_dl_entity *dl_se;
@@ -2430,6 +2461,10 @@ const struct sched_class dl_sched_class = {
.rq_online = rq_online_dl,
.rq_offline = rq_offline_dl,
.task_woken = task_woken_dl,
+#ifdef CONFIG_SCHED_CORE
+ .core_sched_activate = core_sched_activate_dl,
+ .core_sched_deactivate = core_sched_deactivate_dl,
+#endif
#endif

.task_tick = task_tick_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4728f5ed45aa..6cfcced2b0bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10058,6 +10058,40 @@ static void rq_offline_fair(struct rq *rq)
unthrottle_offline_cfs_rqs(rq);
}

+static void for_each_fair_task(struct rq *rq,
+ void (*fn)(struct rq *rq, struct task_struct *p))
+{
+ struct cfs_rq *cfs_rq, *pos;
+ struct sched_entity *se;
+ struct task_struct *task;
+
+ for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+ for (se = __pick_first_entity(cfs_rq);
+ se != NULL;
+ se = __pick_next_entity(se)) {
+
+ if (!entity_is_task(se))
+ continue;
+
+ task = task_of(se);
+ fn(rq, task);
+ }
+ }
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_fair(struct rq *rq)
+{
+ for_each_fair_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_fair(struct rq *rq)
+{
+ for_each_fair_task(rq, sched_core_remove);
+}
+
+#endif
#endif /* CONFIG_SMP */

#ifdef CONFIG_SCHED_CORE
@@ -10612,6 +10646,10 @@ const struct sched_class fair_sched_class = {

.task_dead = task_dead_fair,
.set_cpus_allowed = set_cpus_allowed_common,
+#ifdef CONFIG_SCHED_CORE
+ .core_sched_activate = core_sched_activate_fair,
+ .core_sched_deactivate = core_sched_deactivate_fair,
+#endif
#endif

.task_tick = task_tick_fair,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4714630a90b9..c6694e45b255 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1548,6 +1548,45 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
return rt_task_of(rt_se);
}

+static void for_each_rt_task(struct rq *rq,
+ void (*fn)(struct rq *rq, struct task_struct *p))
+{
+ rt_rq_iter_t iter;
+ struct rt_prio_array *array;
+ struct list_head *queue;
+ int i;
+ struct rt_rq *rt_rq = &rq->rt;
+ struct sched_rt_entity *rt_se = NULL;
+ struct task_struct *task;
+
+ for_each_rt_rq(rt_rq, iter, rq) {
+ array = &rt_rq->active;
+ for (i = 0; i < MAX_RT_PRIO; i++) {
+ queue = array->queue + i;
+ list_for_each_entry(rt_se, queue, run_list) {
+ if (rt_entity_is_task(rt_se)) {
+ task = rt_task_of(rt_se);
+ fn(rq, task);
+ }
+ }
+ }
+ }
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_rt(struct rq *rq)
+{
+ for_each_rt_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_rt(struct rq *rq)
+{
+ for_each_rt_task(rq, sched_core_remove);
+}
+
+#endif
+
static struct task_struct *pick_task_rt(struct rq *rq)
{
struct task_struct *p;
@@ -2382,6 +2421,10 @@ const struct sched_class rt_sched_class = {
.rq_offline = rq_offline_rt,
.task_woken = task_woken_rt,
.switched_from = switched_from_rt,
+#ifdef CONFIG_SCHED_CORE
+ .core_sched_activate = core_sched_activate_rt,
+ .core_sched_deactivate = core_sched_deactivate_rt,
+#endif
#endif

.task_tick = task_tick_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4844e703298a..c2068f2e2dd2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,9 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)

extern void queue_core_balance(struct rq *rq);

+void sched_core_add(struct rq *rq, struct task_struct *p);
+void sched_core_remove(struct rq *rq, struct task_struct *p);
+
#else /* !CONFIG_SCHED_CORE */

static inline bool sched_core_enabled(struct rq *rq)
@@ -1838,6 +1841,10 @@ struct sched_class {

void (*rq_online)(struct rq *rq);
void (*rq_offline)(struct rq *rq);
+#ifdef CONFIG_SCHED_CORE
+ void (*core_sched_activate)(struct rq *rq);
+ void (*core_sched_deactivate)(struct rq *rq);
+#endif
#endif

void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
--
2.20.1

2020-01-14 15:41:49

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <[email protected]> wrote:

> I also encountered kernel panic with the v4 code when taking cpu offline or online
> when core scheduler is running. I've refreshed the previous patch, along
> with 3 other patches to fix problems related to CPU online/offline.
>
> As a side effect of the fix, each core can now operate in core-scheduling
> mode or non core-scheduling mode, depending on how many online SMT threads it has.
>
> Vineet, are you guys planning to refresh v4 and update it to v5? Aubrey posted
> a port to the latest kernel earlier.
>
Thanks for the updated patch, Tim.

We have been testing with v4 rebased on 5.4.8, as RC kernels had given us
trouble in the past. v5 is due soon and we are planning to release v5 when
5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
related to load balancing. We are investigating a performance issue with
a high-overcommit IO-intensive workload, and we are also trying to see if
we can add synchronization during VMEXITs so that a guest vm cannot
run alongside the host kernel. We also need to think about the userland
interface for core scheduling in preparation for the upstreaming work.

Does it make sense to wait until the 5.5 release for v5, considering the
above details?

Thanks,
Vineeth

2020-01-15 03:44:31

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2020/1/14 23:40, Vineeth Remanan Pillai wrote:
> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <[email protected]> wrote:
>
>> I also encountered kernel panic with the v4 code when taking cpu offline or online
>> when core scheduler is running. I've refreshed the previous patch, along
>> with 3 other patches to fix problems related to CPU online/offline.
>>
>> As a side effect of the fix, each core can now operate in core-scheduling
>> mode or non core-scheduling mode, depending on how many online SMT threads it has.
>>
>> Vineet, are you guys planning to refresh v4 and update it to v5? Aubrey posted
>> a port to the latest kernel earlier.
>>
> Thanks for the updated patch Tim.
>
> We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
> trouble in the past. v5 is due soon and we are planning to release v5 when
> 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
> related to load balancing.

It turns out my load balancing related changes need to be refined.
For example, we don't migrate a task if the task's core cookie does not match
the CPU's core cookie, but if the entire core is idle, we should allow task
migration; something like the following:

I plan to do this after my Chinese New Year holiday(Feb 3rd).

Thanks,
-Aubrey

----------------------------------------------------------------------------------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4728f5e..75e0172 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7366,8 +7366,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
* 2) cannot be migrated to this CPU due to cpus_ptr, or
- * 3) running (obviously), or
- * 4) are cache-hot on their current CPU.
+ * 3) task's cookie does not match with this CPU's core cookie
+ * 4) running (obviously), or
+ * 5) are cache-hot on their current CPU.
*/
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7402,6 +7403,25 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}

+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(cpu_rq(env->dst_cpu))) {
+ bool idle_core = true;
+ int cpu;
+
+ for_each_cpu(cpu, cpu_smt_mask(env->dst_cpu)) {
+ if (!available_idle_cpu(cpu))
+ idle_core = false;
+ }
+ /*
+ * Don't migrate task if task's cookie does not match
+ * with core cookie, unless the entire core is idle.
+ */
+ if (!idle_core && p->core_cookie !=
+ cpu_rq(env->dst_cpu)->core->core_cookie)
+ return 0;
+ }
+#endif
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;

--
2.7.4

2020-01-15 19:34:51

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 1/14/20 7:43 PM, Li, Aubrey wrote:
> On 2020/1/14 23:40, Vineeth Remanan Pillai wrote:
>> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <[email protected]> wrote:
>>
>>> I also encountered kernel panic with the v4 code when taking cpu offline or online
>>> when core scheduler is running. I've refreshed the previous patch, along
>>> with 3 other patches to fix problems related to CPU online/offline.
>>>
>>> As a side effect of the fix, each core can now operate in core-scheduling
>>> mode or non core-scheduling mode, depending on how many online SMT threads it has.
>>>
>>> Vineet, are you guys planning to refresh v4 and update it to v5? Aubrey posted
>>> a port to the latest kernel earlier.
>>>
>> Thanks for the updated patch Tim.
>>
>> We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
>> trouble in the past. v5 is due soon and we are planning to release v5 when
>> 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
>> related to load balancing.
>
> It turns out my load balancing related changes need to be refined.
> For example, we don't migrate task if the task's core cookie does not match
> with CPU's core cookie, but if the entire core is idle, we should allow task
> migration, something like the following:
>
> I plan to do this after my Chinese New Year holiday(Feb 3rd).
>
> Thanks,
> -Aubrey
>

Aubrey's attached patch should replace his previous patch
"sched/fair: don't migrate task if cookie not match".

I've also added a fix below for Aubrey's patch
"sched/fair: find cookie matched idlest CPU".

Aubrey, can you merge this fix into that patch when you update
your patches?

Tim

---->8----
From 06c09a9c86db6a7c30e5ebca7a635005ac2df37b Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Thu, 12 Dec 2019 09:08:18 -0800
Subject: [PATCH] sched: Check core scheduler in use when comparing cookie

When scanning scheduler groups to find the idlest cpu for a task waking
up, cookie matching should not be considered if a target cpu doesn't
use core scheduling. The code introduced by the patch

sched/fair: find cookie matched idlest CPU

matches the cookie regardless of the cpu's core scheduler state. Fix this.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a90179937f63..55e7b22522db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5656,8 +5656,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
for_each_cpu(i, sched_group_span(group)) {
struct rq *rq = cpu_rq(i);

- if (p->core_cookie == rq->core->core_cookie)
+ if (!sched_core_enabled(rq)) {
cookie_match = true;
+ break;
+ }
+
+ if (p->core_cookie == rq->core->core_cookie) {
+ cookie_match = true;
+ break;
+ }
}
/* Skip over this group if no cookie matched */
if (!cookie_match)
--
2.20.1

2020-01-16 03:04:09

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Thu, Jan 16, 2020 at 3:33 AM Tim Chen <[email protected]> wrote:
>
> On 1/14/20 7:43 PM, Li, Aubrey wrote:
> > On 2020/1/14 23:40, Vineeth Remanan Pillai wrote:
> >> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <[email protected]> wrote:
> >>
> >>> I also encountered kernel panic with the v4 code when taking cpu offline or online
> >>> when core scheduler is running. I've refreshed the previous patch, along
> >>> with 3 other patches to fix problems related to CPU online/offline.
> >>>
> >>> As a side effect of the fix, each core can now operate in core-scheduling
> >>> mode or non core-scheduling mode, depending on how many online SMT threads it has.
> >>>
> >>> Vineet, are you guys planning to refresh v4 and update it to v5? Aubrey posted
> >>> a port to the latest kernel earlier.
> >>>
> >> Thanks for the updated patch Tim.
> >>
> >> We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
> >> trouble in the past. v5 is due soon and we are planning to release v5 when
> >> 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
> >> related to load balancing.
> >
> > It turns out my load balancing related changes need to be refined.
> > For example, we don't migrate task if the task's core cookie does not match
> > with CPU's core cookie, but if the entire core is idle, we should allow task
> > migration, something like the following:
> >
> > I plan to do this after my Chinese New Year holiday(Feb 3rd).
> >
> > Thanks,
> > -Aubrey
> >
>
> Aubrey's attached patch should replace his previous patch
> sched/fair: don't migrate task if cookie not match
>
> I've also added a fix below for Aubrey's patch
> sched/fair: find cookie matched idlest CPU.
>
> Aubrey, can you merge this fix into that patch when you update
> your patches?

Thanks Tim, I'll include your fix when I update my patches.

-Aubrey

2020-01-17 16:02:21

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

> >
> > Aubrey's attached patch should replace his previous patch
> > sched/fair: don't migrate task if cookie not match
> >
> > I've also added a fix below for Aubrey's patch
> > sched/fair: find cookie matched idlest CPU.
> >
> > Aubrey, can you merge this fix into that patch when you update
> > your patches?
>
> Thanks Tim, I'll include your fix when I update my patches.
>
We have included both these changes in our branch rebased to 5.4.y:
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.4.y

This is our testing branch.

Many Thanks,
Vineeth

2020-01-22 18:06:08

by Gruza, Agata

[permalink] [raw]
Subject: RE: [RFC PATCH v4 00/19] Core scheduling v4

-----Original Message-----
From: [email protected] <[email protected]> On Behalf Of Vineeth Remanan Pillai
Sent: Friday, January 17, 2020 8:01 AM
To: Aubrey Li <[email protected]>
Cc: Tim Chen <[email protected]>; Aubrey Li <[email protected]>; Nishanth Aravamudan <[email protected]>; Julien Desfossez <[email protected]>; Peter Zijlstra <[email protected]>; Ingo Molnar <[email protected]>; Thomas Gleixner <[email protected]>; Paul Turner <[email protected]>; Linus Torvalds <[email protected]>; Linux List Kernel Mailing <[email protected]>; Dario Faggioli <[email protected]>; Frédéric Weisbecker <[email protected]>; Kees Cook <[email protected]>; Greg Kerr <[email protected]>; Phil Auld <[email protected]>; Aaron Lu <[email protected]>; Valentin Schneider <[email protected]>; Mel Gorman <[email protected]>; Pawan Gupta <[email protected]>; Paolo Bonzini <[email protected]>
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

> >
> > Aubrey's attached patch should replace his previous patch
> > sched/fair: don't migrate task if cookie not match
> >
> > I've also added a fix below for Aubrey's patch
> > sched/fair: find cookie matched idlest CPU.
> >
> > Aubrey, can you merge this fix into that patch when you update your
> > patches?
>
> Thanks Tim, I'll include your fix when I update my patches.
>
We have included both these changes in our branch rebased to 5.4.y:
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.4.y

This is our testing branch.



----------------------------------------------------------------------
ABOUT:
----------------------------------------------------------------------
Hello,

Core scheduling is required to protect against leakage of sensitive
data allocated on a sibling thread. Our goal is to measure the performance
impact of core scheduling across different workloads and show how it
evolved over time. Below you will find data for core scheduling. The data
presented here are based on the previous version of core-sched (v3) plus
additional kernel patches added by [email protected] and the load
balancer made by [email protected] that are now in the (v4) core
scheduler kernel:
https://github.com/digitalocean/linux-coresched/commits/coresched/v4-v5.4.y
Detailed specifics about each system setup are in the attached PDF.

----------------------------------------------------------------------
BENCHMARKS:
----------------------------------------------------------------------
- SPECrate2017int : industry standard for measuring compute-intensive
CPU performance.
- CASSANDRA : NoSQL distributed DB and benchmark. Provides
high availability with no single point of failure.
- HammerDB : database benchmarking application.
- SPECvirt : industry standard for measuring server performance
in a virtual environment.

----------------------------------------------------------------------
PERFORMANCE IMPACT:
----------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Kernel & Benchmark     | # of cgroups | Overcommit   | Performance Change % | Performance Change %    |
                       |              |              | (HT ON, core-sched)  | (HT OFF, no core-sched) |
---------------------------------------------------------------------------------------------------
SPECint_rate_base 2017 | 2 cgroups,   | 0.5x         |  -0.48               |   0.26                  |
                       | 1 per VM     | 1x           |  -0.54               |   0.66                  |
                       |              | 2x           |  -0.18               |  -0.07                  |
---------------------------------------------------------------------------------------------------
CASSANDRA              | 2 cgroups,   | 0.5x         |  -2.56               | -44.61                  |
                       | 1 per VM     | 1x           |  -4.71               | -47.47                  |
                       |              | 2x           |  -7.63               | -28.47                  |
---------------------------------------------------------------------------------------------------
HammerDB               | 2 cgroups,   | 0.5x         | -19.99               | -23.52                  |
                       | 1 per VM     | 1x           | -14.90               | -24.63                  |
                       |              | 2x           | -13.31               | -25.71                  |
---------------------------------------------------------------------------------------------------
SPECvirt               | (attachment) | (attachment) | -18                  | -31                     |
---------------------------------------------------------------------------------------------------
CASSANDRA              | 2 cgroups,   | 0.5x         |  -5.18               | -44.61                  |
                       | 1 per VM     | 1x           |  -7.90               | -47.47                  |
                       |              | 2x           |  -6.52               | -28.47                  |
---------------------------------------------------------------------------------------------------

----------------------------------------------------------------------
TAKE AWAYS:
----------------------------------------------------------------------
1. Core scheduling performs better than turning off HT in all
overcommit scenarios. However, we observed up to a 20% performance
drop in certain cases.
2. The impact of core scheduling depends on the workload and thread
scheduling intensity.
3. Core scheduling requires cgroups, with a single cgroup per VM. Each
VM runs in its own independent cgroup.

Many thanks,
--Agata


Attachments:
LKML_core_scheduling.pdf (388.18 kB)

2020-01-28 02:41:42

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, 2020-01-14 at 10:40 -0500, Vineeth Remanan Pillai wrote:
> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <[email protected]> wrote:
>
> > As a side effect of the fix, each core can now operate in core-
> > scheduling
> > mode or non core-scheduling mode, depending on how many online SMT
> > threads it has.
> >
> > Vineet, are you guys planning to refresh v4 and update it to
> > v5? Aubrey posted
> > a port to the latest kernel earlier.
> >
> We are investigating a performance issue with high overcommit IO
> intensive workload and also we are trying to see if we can add
> synchronization during VMEXITs so that a guest vm cannot run
> alongside the host kernel.
>
So, about this VMEXIT sync thing. I do agree that we should at least
try and do it (and assess performance).

I was wondering, however, what we think about core-scheduling + address
space isolation (or whatever it is/will be called). More specifically,
whether such a solution wouldn't be considered an equally safe setup
(at least for the virt use-cases, of course).

Basically, core-scheduling would prevent VM-to-VM attacks while ASI
would mitigate VM-to-hypervisor attacks.

Of course, such a solution would need to be fully implemented and
evaluated too... I just wanted to toss it around, mostly to know what
you think about it and whether or not it is already on your radar.

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-01 15:32:25

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, 2020-01-31 at 09:44 -0500, Vineeth Remanan Pillai wrote:
> > Basically, core-scheduling would prevent VM-to-VM attacks while ASI
> > would mitigate VM-to-hypervisor attacks.
> >
> > Of course, such a solution would need to be fully implemented and
> > evaluated too... I just wanted to toss it around, mostly to know
> > what
> > you think about it and whether or not it is already on your radar.
>
> We had this discussion during LPC.
>
I know. I wanted to be there, but couldn't. But I watched the
recordings of the miniconf. :-)

> It's something on the radar, but we haven't yet spent any dedicated
> time looking into it. Theoretically it is very promising. While
> looking into the practical aspects, the main difficulty is to
> determine what is safe/unsafe to expose in the kernel when the
> sibling is running in userland/VM. Coming up with a minimal
> pagetable for the kernel when the sibling is running untrusted code
> would be non-trivial.
>
It is. And this is exactly my point. :-)

I mean, what you're describing is pretty much what the memory isolation
efforts are mostly (all?) about, at least AFAIUI.

Therefore, I think we should see about "joining forces".

FWIW, there's a talk about ASI going on right now at FOSDEM2020:
https://fosdem.org/2020/schedule/event/kernel_address_space_isolation/
(this is also video recorded, so it will be possible for everyone to
watch it, in a few days time).

> It's definitely worth spending some time and effort on this idea.
>
Cool! Happy to hear this. :-)

Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-06 00:31:17

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 1/14/20 7:40 AM, Vineeth Remanan Pillai wrote:
> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <[email protected]> wrote:
>
>> I also encountered kernel panic with the v4 code when taking cpu offline or online
>> when core scheduler is running. I've refreshed the previous patch, along
>> with 3 other patches to fix problems related to CPU online/offline.
>>
>> As a side effect of the fix, each core can now operate in core-scheduling
>> mode or non core-scheduling mode, depending on how many online SMT threads it has.
>>
>> Vineet, are you guys planning to refresh v4 and update it to v5? Aubrey posted
>> a port to the latest kernel earlier.
>>
> Thanks for the updated patch Tim.
>
> We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
> trouble in the past. v5 is due soon and we are planning to release v5 when
> 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
> related to load balancing. We are investigating a performance issue with
> high overcommit io intensive workload and also we are trying to see if
> we can add synchronization during VMEXITs so that a guest vm cannot run
> alongside the host kernel. We also need to think about the userland
> interface for corescheduling in preparation for upstreaming work.
>

Vineet,

Have you guys been able to make progress on the issues with I/O intensive workload?

Can you explain a bit further what the problem is?
Is the VM doing lots of I/O, causing frequent VMEXITs?
And the host thread doing the I/O cannot run alongside the vcpu
thread, resulting in extra forced idle? So you are trying not to put the vcpu
on the same core as such a host thread?

Tim

2020-02-06 22:40:56

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 05-Feb-2020 04:28:18 PM, Tim Chen wrote:
> Have you guys been able to make progress on the issues with I/O intensive workload?

Hi Tim,

We are in the process of retesting everything with the following branch:
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y

The CPU-intensive case that becomes overcommitted when we disable SMT
behaves the same as before (core scheduling is a clear win there). So we
are now digging a bit more in our mixed tests with CPU and IO (MySQL
benchmark running in parallel with low-activity VMs). The tests are
still running, but I expect to be able to present some numbers tomorrow.

We will keep the list informed.

Julien

2020-02-12 23:09:07

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 05-Feb-2020 04:28:18 PM, Tim Chen wrote:
> On 1/14/20 7:40 AM, Vineeth Remanan Pillai wrote:
> > On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <[email protected]> wrote:
> >
> >> I also encountered kernel panic with the v4 code when taking cpu offline or online
> >> when core scheduler is running. I've refreshed the previous patch, along
> >> with 3 other patches to fix problems related to CPU online/offline.
> >>
> >> As a side effect of the fix, each core can now operate in core-scheduling
> >> mode or non core-scheduling mode, depending on how many online SMT threads it has.
> >>
> >> Vineet, are you guys planning to refresh v4 and update it to v5? Aubrey posted
> >> a port to the latest kernel earlier.
> >>
> > Thanks for the updated patch Tim.
> >
> > We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
> > trouble in the past. v5 is due soon and we are planning to release v5 when
> > 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
> > related to load balancing. We are investigating a performance issue with
> > high overcommit io intensive workload and also we are trying to see if
> > we can add synchronization during VMEXITs so that a guest vm cannot run
> > alongside the host kernel. We also need to think about the userland
> > interface for corescheduling in preparation for upstreaming work.
> >
>
> Vineet,
>
> Have you guys been able to make progress on the issues with I/O intensive workload?

I finally have some results with the following branch:
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y

We tested the following classes of workloads in VMs (all vcpus in the
same cgroup/tag):
- linpack (pure CPU work)
- sysbench TPC-C (MySQL benchmark, good mix of CPU/net/disk)
with/without noise VMs around
- FIO randrw VM with/without noise VMs around

Our "noise VMs" are 1-vcpu VMs running a simple workload that wakes up
every 30 seconds, sends a couple of metrics over a VPN and go back to
sleep. They use between 0% and 30% of CPU on the host all the time,
nothing sustained just ups and downs.

# linpack
3x 12-vcpu VMs pinned on a 36-hwthread NUMA node (with SMT on):
- core scheduling manages to perform better than the baseline, by up
to 20% in some cases !
- with nosmt (so 2:1 overcommit) the performance drops by 24%

# sysbench TPC-C
1x 12-vcpus MySQL server on each NUMA node, 48 client threads (running
on a different server):
- without noise: no performance difference between the 3 configurations
- with 96 noise VMs on each NUMA node:
- Performance drops by 54% with core scheduling
- Performance drops by 75% with nosmt
We write at about 130MB/s on disk with that test.

# FIO randrw 50%, 1 thread, libaio, bs=128k, iodepth=32
1x 12-vcpu FIO VM, usually only requiring up to 100% CPU overall (data
thread and all vcpus summed); we read and write at about 350MB/s when
running alone:
- coresched drops 5%
- nosmt drops 1%

1:1 vcpus vs hardware thread on the NUMA node (filled with noise VMs):
- coresched drops 7%
- nosmt drops 22%

3:1 ratio:
- coresched drops 16%
- nosmt drops 22%

5:1 ratio:
- coresched drops 51%
- nosmt drops 61%

So the main conclusion is that for all the test cases we have studied,
core scheduling performs better than nosmt ! This is different than what
we tested a while back, so it's looking really good !

Now I am looking for confirmation from others. Dario did you have time
to re-run your test suite against that same branch ?

After that, our next step is to trash all that with adding VMEXIT
synchronization points ;-)

Thanks,

Julien

2020-02-13 18:37:36

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2/12/20 3:07 PM, Julien Desfossez wrote:

>>
>> Have you guys been able to make progress on the issues with I/O intensive workload?
>
> I finally have some results with the following branch:
> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y
>
>
> So the main conclusion is that for all the test cases we have studied,
> core scheduling performs better than nosmt ! This is different than what
> we tested a while back, so it's looking really good !

Thanks for the data. They look really encouraging.

Aubrey is working on updating his patches so it will load balance
to the idle cores a bit better. We are testing those and will post
the update soon.

Tim

2020-02-14 06:12:40

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, Feb 14, 2020 at 2:37 AM Tim Chen <[email protected]> wrote:
>
> On 2/12/20 3:07 PM, Julien Desfossez wrote:
>
> >>
> >> Have you guys been able to make progress on the issues with I/O intensive workload?
> >
> > I finally have some results with the following branch:
> > https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y
> >
> >
> > So the main conclusion is that for all the test cases we have studied,
> > core scheduling performs better than nosmt ! This is different than what
> > we tested a while back, so it's looking really good !
>
> Thanks for the data. They look really encouraging.
>
> Aubrey is working on updating his patches so it will load balance
> to the idle cores a bit better. We are testing those and will post
> the update soon.

I added a helper to check for a task and cpu cookie match, including the
entire-core-idle case. The refined patchset is updated here:
https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2

This branch also includes Tim's patchset. According to our testing
results, the performance data looks on par with the previous version.
Good news is, the v5.4.y stability issue on our 8-NUMA-node machine
is gone on this v5.5.2 branch.

Thanks,
-Aubrey

2020-02-15 06:04:25

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, Feb 14, 2020 at 11:05 PM Vineeth Remanan Pillai
<[email protected]> wrote:
>
> Hi Aubrey,
>
> Thanks for the updated patches. Merged the changes to our testing branch
> in preparation for v5.
>
>>
>> I added a helper to check task and cpu cookie match, including the
>> entire core idle case. The refined patchset updated at here:
>> https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2
>
>
> I did not go through all the changes thoroughly, but on a quick glance,
> I feel a small change would optimize it a bit.
>
> + /*
> + * A CPU in an idle core is always the best choice for tasks with
> + * cookies, and ignore cookie match if core scheduler is not enabled
> + * on the CPU.
> + */
> + if (idle_core || !sched_core_enabled(rq))
> + return true;
> +
> + return rq->core->core_cookie == p->core_cookie;
> +}
> +
>
> I think the check for sched_core_enabled would make sense above the
> for loop. Something like this:
>
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> + bool idle_core = true;
> + int cpu;
> +
> + if (!sched_core_enabled(rq))
> + return true;
> +
> + for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> + if (!available_idle_cpu(cpu)) {
> + idle_core = false;
> + break;
> + }
> + }
> +
> + /*
> + * A CPU in an idle core is always the best choice for tasks with
> + * cookies.
> + */
> + return idle_core || rq->core->core_cookie == p->core_cookie;
> +}
> +
>
> We can avoid the unnecessary check for idle_core if sched_core is disabled.
> This would be an optimization on systems with lots of cpus.
>
> Please let me know!

Yes, this makes sense. Patch updated here; I put your name there if
you don't mind.
https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2-rc2

Thanks,
-Aubrey

2020-02-21 23:21:21

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 18-Feb-2020 04:58:02 PM, Vineeth Remanan Pillai wrote:
> > Yes, this makes sense, patch updated at here, I put your name there if
> > you don't mind.
> > https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2-rc2
> >
> > Thanks Aubrey!

Just a quick note: I ran a very cpu-intensive benchmark (9x 12-vcpu VMs
running linpack), all affined to an 18-core NUMA node (36 hardware
threads). Each VM runs in its own cgroup/tag with core scheduling
enabled. We know it already performed much better than nosmt, so for
this case I measured various co-scheduling statistics:
- how much time the process spends co-scheduled with idle, a compatible
or an incompatible task
- how long the process spends running in an inefficient
configuration (more than 1 thread running alone on a core)

And I am very happy to report that even though the 9 VMs were configured
to float on the whole NUMA node, the scheduler / load-balancer did a
very good job at keeping an efficient configuration:

Process 10667 (qemu-system-x86), 10 seconds trace:
- total runtime: 46451472309 ns,
- local neighbors (total: 45713285084 ns, 98.411 % of process runtime):
- idle neighbors (total: 484054061 ns, 1.042 % of process runtime):
- foreign neighbors (total: 4191002 ns, 0.009 % of process runtime):
- unknown neighbors (total: 92042503 ns, 0.198 % of process runtime)
- inefficient periods (total: 464832 ns, 0.001 % of process runtime):
- number of periods: 48
- min period duration: 1424 ns
- max period duration: 116988 ns
- average period duration: 9684.000 ns
- stdev: 19282.130

I thought you would enjoy seeing this :-)

Have a good weekend,

Julien

2020-02-25 03:45:24

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, Feb 14, 2020 at 02:10:53PM +0800, Aubrey Li wrote:
> On Fri, Feb 14, 2020 at 2:37 AM Tim Chen <[email protected]> wrote:
> >
> > On 2/12/20 3:07 PM, Julien Desfossez wrote:
> >
> > >>
> > >> Have you guys been able to make progress on the issues with I/O intensive workload?
> > >
> > > I finally have some results with the following branch:
> > > https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y
> > >
> > >
> > > So the main conclusion is that for all the test cases we have studied,
> > > core scheduling performs better than nosmt ! This is different than what
> > > we tested a while back, so it's looking really good !
> >
> > Thanks for the data. They look really encouraging.
> >
> > Aubrey is working on updating his patches so it will load balance
> > to the idle cores a bit better. We are testing those and will post
> > the update soon.
>
> I added a helper to check task and cpu cookie match, including the
> entire core idle case. The refined patchset updated at here:
> https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2
>
> This branch also includes Tim's patchset. According to our testing
> result, the performance data looks on par with the previous version.
> A good news is, v5.4.y stability issue on our 8 numa node machine
> is gone on this v5.5.2 branch.

One problem I have when testing this branch: the weight of the cgroup
seems to be ignored.

On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
into another cgroup(cgB). cgA and cgB's cpusets are set to the same
socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
cpu.shares is set to 2(so consider cgB as noise workload and cgA as
the real workload).

I had expected cgA to occupy 8 cpus(with each cpu on a different core)
most of the time since it has way more weight than cgB, while cgB should
occupy almost no CPUs since:
- when cgB's task is in the same CPU queue as cgA's task, then cgB's
task is given very little CPU due to its small weight;
- when cgB's task is in a CPU queue whose sibling's queue has cgA's
task, cgB's task should be forced idle(again, due to its small weight).

But testing shows cgA occupies only 2 cpus during the entire run while
cgB enjoys the remaining 14 cpus. As a comparison, when coresched is off,
cgA can occupy 8 cpus during its run.

I haven't taken a look at the patches, but would like to raise the
problem first. My gut feeling is that we didn't keep the CPUs' load
balanced.

P.S. it's not that I care about the VM's performance, it's just easier to
test kernel stuff using a VM than on bare metal. Its CPU setup might
seem weird; I just set it up to be the same as my host setup.

2020-02-25 05:33:31

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Feb 25, 2020 at 11:44 AM Aaron Lu <[email protected]> wrote:
>
> On Fri, Feb 14, 2020 at 02:10:53PM +0800, Aubrey Li wrote:
> > On Fri, Feb 14, 2020 at 2:37 AM Tim Chen <[email protected]> wrote:
> > >
> > > On 2/12/20 3:07 PM, Julien Desfossez wrote:
> > >
> > > >>
> > > >> Have you guys been able to make progress on the issues with I/O intensive workload?
> > > >
> > > > I finally have some results with the following branch:
> > > > https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y
> > > >
> > > >
> > > > So the main conclusion is that for all the test cases we have studied,
> > > > core scheduling performs better than nosmt ! This is different than what
> > > > we tested a while back, so it's looking really good !
> > >
> > > Thanks for the data. They look really encouraging.
> > >
> > > Aubrey is working on updating his patches so it will load balance
> > > to the idle cores a bit better. We are testing those and will post
> > > the update soon.
> >
> > I added a helper to check task and cpu cookie match, including the
> > entire core idle case. The refined patchset updated at here:
> > https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2
> >
> > This branch also includes Tim's patchset. According to our testing
> > result, the performance data looks on par with the previous version.
> > A good news is, v5.4.y stability issue on our 8 numa node machine
> > is gone on this v5.5.2 branch.
>
> One problem I have when testing this branch: the weight of the cgroup
> seems to be ignored.
>
> On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> the real workload).
>
> I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> most of the time since it has way more weight than cgB, while cgB should
> occupy almost no CPUs since:
> - when cgB's task is in the same CPU queue as cgA's task, then cgB's
> task is given very little CPU due to its small weight;
> - when cgB's task is in a CPU queue whose sibling's queue has cgA's
> task, cgB's task should be forced idle(again, due to its small weight).
>
> But testing shows cgA occupies only 2 cpus during the entire run while
> cgB enjoys the remaining 14 cpus. As a comparison, when coresched is off,
> cgA can occupy 8 cpus during its run.
>
> I haven't taken a look at the patches, but would like to raise the
> problem first. My gut feeling is that, we didn't make the CPU's load
> balanced.
>
> P.S. it's not that I care about VM's performance, it's just easier to
> test kernel stuff using a VM than on a bare metal. Its CPU setup might
> seem weird, I just set it up to be the same as my host setup.

Aaron - did you test this before? In other words, if you reset repo to your
last commit:

- 5bd3c80 sched/fair : Wake up forced idle siblings if needed

Does the problem remain? Just want to check if this is a regression
introduced by the subsequent patchset.

Thanks,
-Aubrey

2020-02-25 07:36:57

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Feb 25, 2020 at 01:32:35PM +0800, Aubrey Li wrote:
> Aaron - did you test this before? In other words, if you reset repo to your
> last commit:

I did this test only recently when I started to think if I can use
coresched to boost main workload's performance in a colocated
environment.

>
> - 5bd3c80 sched/fair : Wake up forced idle siblings if needed
>
> Does the problem remain? Just want to check if this is a regression
> introduced by the subsequent patchset.

The problem isn't there with commit 5bd3c80 as the head, so yes, it
indeed looks like a regression introduced by the subsequent patchset.

P.S. I will need to take a closer look later at whether each cgA task is
running on a different core, but the cpu usage of cgA is back to 800% with
commit 5bd3c80.

2020-02-25 10:42:20

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Feb 25, 2020 at 3:34 PM Aaron Lu <[email protected]> wrote:
>
> On Tue, Feb 25, 2020 at 01:32:35PM +0800, Aubrey Li wrote:
> > Aaron - did you test this before? In other words, if you reset repo to your
> > last commit:
>
> I did this test only recently when I started to think if I can use
> coresched to boost main workload's performance in a colocated
> environment.
>
> >
> > - 5bd3c80 sched/fair : Wake up forced idle siblings if needed
> >
> > Does the problem remain? Just want to check if this is a regression
> > introduced by the subsequent patchset.
>
> The problem isn't there with commit 5bd3c80 as the head, so yes, it
> looks like indeed a regression introduced by subsequent patchset.
>
> P.S. I will need to take a closer look if each cgA's task is running
> on a different core later but the cpu usage of cgA is back to 800% with
> commit 5bd3c80.

Hmm..., I went through the subsequent patches, and I think this one

- 4041eeb8f3 sched/fair: don't migrate task if cookie not match

is probably the major cause, can you please revert this one to see
if the problem is gone?

From what I can tell, if the 16 threads in cgB occupy the 8 cores, this
patch prevents any thread in cgA from migrating when load balance
is triggered, and yes, cpu.shares is ignored at this point.
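
For reference, a minimal sketch of the kind of check that patch adds (a
simplified illustration only, not the actual diff; it assumes the
sched_core_enabled()/core_cookie plumbing from this series):

/*
 * Simplified illustration only: during load balance, skip any task
 * whose cookie does not match the destination core's cookie, so it
 * stays on its current (possibly overloaded) core.
 */
static int can_migrate_task_cookie(struct task_struct *p, struct lb_env *env)
{
	if (!sched_core_enabled(env->dst_rq))
		return 1;

	/* A mismatched cookie would force-idle the sibling, so refuse. */
	if (env->dst_rq->core->core_cookie != p->core_cookie)
		return 0;

	return 1;
}

The real patch is more involved; this only illustrates why the cgA tasks
stay put once cgB's cookie sits on every other core's queue.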

Thanks,
-Aubrey

2020-02-25 12:10:01

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Feb 25, 2020 at 06:40:02PM +0800, Aubrey Li wrote:
> On Tue, Feb 25, 2020 at 3:34 PM Aaron Lu <[email protected]> wrote:
> >
> > On Tue, Feb 25, 2020 at 01:32:35PM +0800, Aubrey Li wrote:
> > > Aaron - did you test this before? In other words, if you reset repo to your
> > > last commit:
> >
> > I did this test only recently when I started to think if I can use
> > coresched to boost main workload's performance in a colocated
> > environment.
> >
> > >
> > > - 5bd3c80 sched/fair : Wake up forced idle siblings if needed
> > >
> > > Does the problem remain? Just want to check if this is a regression
> > > introduced by the subsequent patchset.
> >
> > The problem isn't there with commit 5bd3c80 as the head, so yes, it
> > looks like indeed a regression introduced by subsequent patchset.
> >
> > P.S. I will need to take a closer look if each cgA's task is running
> > on a different core later but the cpu usage of cgA is back to 800% with
> > commit 5bd3c80.
>
> Hmm..., I went through the subsequent patches, and I think this one
>
> - 4041eeb8f3 sched/fair: don't migrate task if cookie not match
>
> is probably the major cause, can you please revert this one to see
> if the problem is gone?

Yes, reverting this one fixed the problem.

2020-02-25 13:52:35

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Feb 25, 2020 at 7:22 PM Aaron Lu <[email protected]> wrote:
>
> On Tue, Feb 25, 2020 at 06:40:02PM +0800, Aubrey Li wrote:
> > On Tue, Feb 25, 2020 at 3:34 PM Aaron Lu <[email protected]> wrote:
> > >
> > > On Tue, Feb 25, 2020 at 01:32:35PM +0800, Aubrey Li wrote:
> > > > Aaron - did you test this before? In other words, if you reset repo to your
> > > > last commit:
> > >
> > > I did this test only recently when I started to think if I can use
> > > coresched to boost main workload's performance in a colocated
> > > environment.
> > >
> > > >
> > > > - 5bd3c80 sched/fair : Wake up forced idle siblings if needed
> > > >
> > > > Does the problem remain? Just want to check if this is a regression
> > > > introduced by the subsequent patchset.
> > >
> > > The problem isn't there with commit 5bd3c80 as the head, so yes, it
> > > looks like indeed a regression introduced by subsequent patchset.
> > >
> > > P.S. I will need to take a closer look if each cgA's task is running
> > > on a different core later but the cpu usage of cgA is back to 800% with
> > > commit 5bd3c80.
> >
> > Hmm..., I went through the subsequent patches, and I think this one
> >
> > - 4041eeb8f3 sched/fair: don't migrate task if cookie not match
> >
> > is probably the major cause, can you please revert this one to see
> > if the problem is gone?
>
> Yes, reverting this one fixed the problem.

Okay, but this patch also contributed to the improvement of a few benchmarks
on my side, so we need a way to fix your case. My quick thought is to allow
task migration in this case (sounds like a workaround). We need to take a deeper
look at the CPU resource control code path when core scheduling is enabled.

Any ideas?

Thanks,
-Aubrey

2020-02-26 03:15:18

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> Hi Aaron,
> We tried reproducing this with a sample script here:
> https://gist.github.com/vineethrp/4356e66694269d1525ff254d7f213aef

Nice script.

> But the set1 cgroup processes always get their share of cpu time in
> our test. Could you please verify if its the same test that you were
> also doing? The only difference is that we run on a 2 16core/32thread
> socket bare metal using only socket 0. We also tried threads instead of
> processes, but the results are the same.

Sorry for missing one detail: I always start the noise workload first,
and then start the real workload. This is critical for this test,
since only then can the noise workload occupy all CPUs and present a
challenge for the load balancer to balance the real workload's tasks. If
both workloads are started at the same time, the initial task placement
might mitigate the problem.

BTW, your script gives 12 cores/24 CPUs to the workloads, and cgA
spawns 16 tasks while cgB spawns 32. This is an even more complex scenario
to test since the real workload's task count already exceeds the
available core count. Perhaps just starting 12 tasks for cgA and 24
tasks for cgB is enough for now. As for the start sequence, simply sleep
5 seconds after cgB workload is started, then start cgA. I have left a
comment on the script's gist page.

>
>
> On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > the real workload).
> >
> > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
>
> The expected behaviour could also be that 8 processes share 4 cores and
> 8 hw threads right? This is what we are seeing mostly
>
> most of the time since it has way more weight than cgB, while cgB should
> > occupy almost no CPUs since:
> > - when cgB's task is in the same CPU queue as cgA's task, then cgB's
> > task is given very little CPU due to its small weight;
> > - when cgB's task is in a CPU queue whose sibling's queue has cgA's
> > task, cgB's task should be forced idle(again, due to its small weight).
> >
> We are seeing the cgA is taking half the cores and cgB taking rest half
> of the cores. Looks like the scheduler ultimately groups the tasks into
> its own cores.
>
>
> >
> > But testing shows cgA occupies only 2 cpus during the entire run while
> > cgB enjoys the remaining 14 cpus. As a comparison, when coresched is off,
> > cgA can occupy 8 cpus during its run.
> >
> Not sure why we are not able to reproduce this. I have a quick patch
> which might fix this. The idea is that, we allow migration if p's
> hierarchical load or estimated utilization is more than dest_rq->curr.
> While thinking about this fix, I noticed that we are not holding the
> dest_rq lock for any of the migration patches. Migration patches would
> probably need a rework. Attaching my patch down, but it also does not
> take the dest_rq lock. I have also added a case of dest_core being
> forced_idle. I think that would be an opportunity to migrate. Ideally
> we should check if the forced idle task has the same cookie as p.
>
> https://gist.github.com/vineethrp/887743608f42a6ce96bf7847b5b119ae

Is this on top of Aubrey's coresched_v4-v5.5.2 branch?

2020-02-26 07:22:29

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Wed, Feb 26, 2020 at 4:51 AM Vineeth Remanan Pillai
<[email protected]> wrote:
>
> I have a quick patch
> which might fix this. The idea is that, we allow migration if p's
> hierarchical load or estimated utilization is more than dest_rq->curr.
> While thinking about this fix, I noticed that we are not holding the
> dest_rq lock for any of the migration patches. Migration patches would
> probably need a rework. Attaching my patch down, but it also does not
> take the dest_rq lock. I have also added a case of dest_core being
> forced_idle. I think that would be an opportunity to migrate. Ideally
> we should check if the forced idle task has the same cookie as p.
>
> https://gist.github.com/vineethrp/887743608f42a6ce96bf7847b5b119ae
>

Hi Vineeth,

Thanks for the quick fix. I guess this patch should work for Aaron's case
because it almost removed the cookie checking here. Some comments
below:

+ if (rq->core->core_forceidle)
+ return true;

We check for a cookie match during load balance to avoid putting tasks with
two different cookies on the same core. If one cpu is forced idle, its sibling
should be running a non-idle task, so I am not sure why we can return true
directly here. And if we need to check the cookie anyway, why is this forced
idle case special enough to be picked up?

+ if (task_h_load(p) > task_h_load(rq->curr))
+ return true;
+ if (task_util_est(p) > task_util_est(rq->curr))
+ return true;

Do these two need to exclude the rq->curr == rq->idle case?
Otherwise idle balance would always return true?

Thanks,
-Aubrey

2020-02-27 02:05:21

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > the real workload).
> >
> > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
>
> The expected behaviour could also be that 8 processes share 4 cores and
> 8 hw threads right? This is what we are seeing mostly

I expect the 8 cgA tasks to spread on each core, instead of occupying
4 cores/8 hw threads. If they stay on 4 cores/8 hw threads, than on the
core level, these cores' load would be much higher than other cores
which are running cgB's tasks, this doesn't look right to me.

I think the end result should be: each core has two tasks queued, one
cgA task and one cgB task(to maintain load balance on the core level).
The two tasks are queued on different hw thread, with cgA's task runs
most of the time on one thread and cgB's task being forced idle most
of the time on the other thread.

2020-02-27 04:46:03

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Thu, Feb 27, 2020 at 5:54 AM Vineeth Remanan Pillai
<[email protected]> wrote:
>
> Hi Aubrey,
>>
>>
>> Thanks for the quick fix. I guess this patch should work for Aaron's case
>> because it almost removed the cookie checking here. Some comments
>> below:
>>
>> + if (rq->core->core_forceidle)
>> + return true;
>>
>> We check cookie match during load balance to avoid two different
>> cookie tasks on the same core. If one cpu is forced idle, its sibling should
>> be running a non-idle task, not sure why we can return true directly here.
>> And if we need to check cookie, with this forced idle case, why it's special
>> to be picked up?
>
> The idea behind this was: If a core is forced idle, then we have a
> scenario where one task is running without a companion and one task
> is forced to be idle. So if this source task is of the same cookie
> as the forced idle task, then migrating this source task will offer a
> chance for the force idle task to run together with task in subsequent
> schedules. We don't have a framework to check the cookie of forced
> idle tasks and hence this if block is only partial fix.

To mitigate forced idle, this helper is trying to prevent the unmatched case,
not catch the matched case. If the task's cookie matches the forced-idle
cpu's sibling, it is able to pass the check later anyway. Compared to my
original intention, this fix increases the unmatched case, as it returns
immediately and we don't check the cookie later.
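
For illustration, a rough sketch of a cookie-aware variant (hypothetical
and untested; it assumes the core_forceidle/core_cookie fields from this
series and the usual cpu_rq()/cpu_smt_mask() helpers):

/*
 * Hypothetical sketch: only treat a forced-idle destination core as a
 * migration opportunity when a running sibling carries the same cookie
 * as p, so p can fill the idle slot without adding a new mismatch.
 */
static bool dest_core_wants_task(struct rq *dst_rq, struct task_struct *p)
{
	int cpu;

	if (!dst_rq->core->core_forceidle)
		return false;

	for_each_cpu(cpu, cpu_smt_mask(cpu_of(dst_rq))) {
		struct task_struct *curr = cpu_rq(cpu)->curr;

		if (curr != cpu_rq(cpu)->idle &&
		    curr->core_cookie == p->core_cookie)
			return true;
	}

	return false;
}

Whether walking the siblings like this is cheap enough in the
load-balance path is an open question.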

>>
>>
>> + if (task_h_load(p) > task_h_load(rq->curr))
>> + return true;
>> + if (task_util_est(p) > task_util_est(rq->curr))
>> + return true;
>>
>> These two need to exclude rq->curr == rq->idle case?
>> otherwise idle balance always return true?
>
> rq->curr being NULL can mean that the sibling is idle or forced idle.
> In both the cases, I think it makes sense to migrate a task so that it can
> compete with the other sibling for a chance to run. This function
> can_migrate_task actually only says if this task is eligible and
> later part of the code decides whether it is okay to migrate it
> based on factors like load and util and capacity. So I think its
> fine to declare the task as eligible if the dest core is running
> idle. Does this thinking make sense?
>
The intention here is to stop migration, not to allow it.
task_h_load(idle) is zero, so if the idle task's sibling CPU is running
a task with cookie A, the fix returns immediately when
task_h_load(task with cookie B) > 0, which lets another unmatched case in.

> On our testing, it did not show much degradation in performance with
> this change. I am reworking the fix by removing the check for
> task_est_util. It doesn't seem to be valid to check for util to migrate
> the task.

I'm not sure the check for task_h_load is exactly what is wanted here either.
From my understanding, the problem is that we stopped high-weight
task migration because its cookie does not match the target's;
can we check for high load here instead of high weight?

Thanks,
-Aubrey

2020-02-27 14:12:16

by Phil Auld

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Hi Aaron,

On Thu, Feb 27, 2020 at 10:04:32AM +0800 Aaron Lu wrote:
> On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> > On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > > the real workload).
> > >
> > > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> >
> > The expected behaviour could also be that 8 processes share 4 cores and
> > 8 hw threads right? This is what we are seeing mostly
>
> I expect the 8 cgA tasks to spread on each core, instead of occupying
> 4 cores/8 hw threads. If they stay on 4 cores/8 hw threads, than on the
> core level, these cores' load would be much higher than other cores
> which are running cgB's tasks, this doesn't look right to me.
>

I don't think that's a valid assumption, at least since the load balancer rework.

The scheduler will be looking much more at the number of running tasks versus
the group weight. So in this case, 2 running tasks on 2 siblings at the core level
will look fine. There will be no reason to migrate.

> I think the end result should be: each core has two tasks queued, one
> cgA task and one cgB task(to maintain load balance on the core level).
> The two tasks are queued on different hw thread, with cgA's task runs
> most of the time on one thread and cgB's task being forced idle most
> of the time on the other thread.
>

With the core scheduler that does not seem to be a desired outcome. I think
grouping the 8 cgA tasks on the 8 cpus of 4 cores seems right.



Cheers,
Phil

--

2020-02-27 14:39:39

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Thu, Feb 27, 2020 at 10:10 PM Phil Auld <[email protected]> wrote:
>
> Hi Aaron,
>
> On Thu, Feb 27, 2020 at 10:04:32AM +0800 Aaron Lu wrote:
> > On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> > > On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > > > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > > > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > > > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > > > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > > > the real workload).
> > > >
> > > > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> > >
> > > The expected behaviour could also be that 8 processes share 4 cores and
> > > 8 hw threads right? This is what we are seeing mostly
> >
> > I expect the 8 cgA tasks to spread on each core, instead of occupying
> > 4 cores/8 hw threads. If they stay on 4 cores/8 hw threads, than on the
> > core level, these cores' load would be much higher than other cores
> > which are running cgB's tasks, this doesn't look right to me.
> >
>
> I don't think that's a valid assumption, at least since the load balancer rework.
>
> The scheduler will be looking much more at the number of running task versus
> the group weight. So in this case 2 running tasks, 2 siblings at the core level
> will look fine. There will be no reason to migrate.

Can this be replicated?

>
> > I think the end result should be: each core has two tasks queued, one
> > cgA task and one cgB task(to maintain load balance on the core level).
> > The two tasks are queued on different hw thread, with cgA's task runs
> > most of the time on one thread and cgB's task being forced idle most
> > of the time on the other thread.
> >
>
> With the core scheduler that does not seem to be a desired outcome. I think
> grouping the 8 cgA tasks on the 8 cpus of 4 cores seems right.
>
Especially if the load of a cgA task + a cgB task > cpu capacity, grouping the cgA
tasks can avoid forced idle completely. Maintaining core-level balance does not
seem to be the best result. I guess that's why, with the core scheduler enabled, we
saw 10-20% improvement in some cases against the default of core scheduler disabled.

Thanks,
-Aubrey

2020-02-28 02:55:10

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Thu, Feb 27, 2020 at 09:10:33AM -0500, Phil Auld wrote:
> Hi Aaron,
>
> On Thu, Feb 27, 2020 at 10:04:32AM +0800 Aaron Lu wrote:
> > On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> > > On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > > > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > > > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > > > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > > > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > > > the real workload).
> > > >
> > > > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> > >
> > > The expected behaviour could also be that 8 processes share 4 cores and
> > > 8 hw threads right? This is what we are seeing mostly
> >
> > I expect the 8 cgA tasks to spread on each core, instead of occupying
> > 4 cores/8 hw threads. If they stay on 4 cores/8 hw threads, than on the
> > core level, these cores' load would be much higher than other cores
> > which are running cgB's tasks, this doesn't look right to me.
> >
>
> I don't think that's a valid assumption, at least since the load balancer rework.
>
> The scheduler will be looking much more at the number of running task versus
> the group weight. So in this case 2 running tasks, 2 siblings at the core level
> will look fine. There will be no reason to migrate.

In the absence of core scheduling, I agree there is no reason to migrate,
since no matter how we migrate, the end result is one high-weight task
sharing a core with another (high- or low-weight) task. But with core
scheduling, things can be different: if the high-weight tasks are
spread among cores, then each of them can enjoy its core
alone (by force idling its sibling) and get better performance.

I'm thinking to use core scheduling to protect main workload's
performance in a colocated environment, similar to the realtime use
case described here:
https://lwn.net/Articles/799454/

I'll quote the relevant part here:
"
But core scheduling can force sibling CPUs to go idle when a realtime
process is running on the core, thus preventing this kind of
interference. That opens the door to enabling SMT whenever a core has no
realtime work, but effectively disabling it when realtime constraints
apply, getting the best of both worlds.
"

Using cpuset for the main workload, to only allow its tasks to run on one HT
of each core, might also solve this, but I had hoped not to need
cpuset as that can add complexity in deployment.

> > I think the end result should be: each core has two tasks queued, one
> > cgA task and one cgB task(to maintain load balance on the core level).
> > The two tasks are queued on different hw thread, with cgA's task runs
> > most of the time on one thread and cgB's task being forced idle most
> > of the time on the other thread.
> >
>
> With the core scheduler that does not seem to be a desired outcome. I think
> grouping the 8 cgA tasks on the 8 cpus of 4 cores seems right.
>

When the core-wide weight is somewhat balanced, yes, I definitely agree.
But when the core-wide weights mismatch a lot, I'm not so sure, since if
these high-weight tasks are spread among cores, then with core scheduling
they can get better performance. So this appears to me like a question of:
is it desirable to protect/enhance high-weight task performance in the
presence of core scheduling?

2020-02-28 23:57:00

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2/26/20 1:54 PM, Vineeth Remanan Pillai wrote:

> rq->curr being NULL can mean that the sibling is idle or forced idle.
> In both the cases, I think it makes sense to migrate a task so that it can
> compete with the other sibling for a chance to run. This function
> can_migrate_task actually only says if this task is eligible and
> later part of the code decides whether it is okay to migrate it
> based on factors like load and util and capacity. So I think its
> fine to declare the task as eligible if the dest core is running
> idle. Does this thinking make sense?
>
> On our testing, it did not show much degradation in performance with
> this change. I am reworking the fix by removing the check for
> task_est_util. It doesn't seem to be valid to check for util to migrate
> the task.
>

In Aaron's test case, there is a great imbalance in load between the core(s)
where all the grp A tasks are and the other cores where the grp B tasks are
spread around. Normally, the load balancer would move the grp A tasks.

Aubrey's can_migrate_task patch prevented the load balancer from migrating tasks
if the core cookie on the target queue doesn't match. The thought was that it would
induce forced idle and reduce cpu utilization if we migrated a task there.
That kept all the grp A tasks from getting migrated and kept the imbalance
indefinitely in Aaron's test case.

Perhaps we should also look at the load imbalance between the src rq and
target rq. If the imbalance is big (say two full cpu-bound tasks' worth
of load), we should migrate anyway despite the cookie mismatch. We are willing
to pay a bit of forced idle in exchange for balancing the load out more.
I think Aubrey's patch on can_migrate_task would be more friendly to
Aaron's test scenario if such logic were incorporated.
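
As a rough sketch of what I have in mind (hypothetical and untested; the
threshold and the cfs_rq_load_avg()/scale_load_down(NICE_0_LOAD) usage
are my assumptions, not a posted patch):

/*
 * Hypothetical sketch: tolerate a cookie mismatch when the source rq
 * carries much more CFS load than the destination (roughly two fully
 * cpu-bound nice-0 tasks' worth), accepting some forced idle in return
 * for spreading the load out.
 */
static bool ignore_cookie_mismatch(struct rq *src_rq, struct rq *dst_rq)
{
	unsigned long src_load = cfs_rq_load_avg(&src_rq->cfs);
	unsigned long dst_load = cfs_rq_load_avg(&dst_rq->cfs);

	return src_load > dst_load + 2 * scale_load_down(NICE_0_LOAD);
}

The exact threshold would need tuning against the force-idle cost.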

In Vineeth's fix, we only look at the currently running task's weight on the
src and dst rq. Perhaps the load on the src and dst rq needs to be considered
to prevent too great an imbalance between the run queues?

Tim

2020-03-03 15:43:51

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2020/2/29 7:55, Tim Chen wrote:
> On 2/26/20 1:54 PM, Vineeth Remanan Pillai wrote:
>
>> rq->curr being NULL can mean that the sibling is idle or forced idle.
>> In both the cases, I think it makes sense to migrate a task so that it can
>> compete with the other sibling for a chance to run. This function
>> can_migrate_task actually only says if this task is eligible and
>> later part of the code decides whether it is okay to migrate it
>> based on factors like load and util and capacity. So I think its
>> fine to declare the task as eligible if the dest core is running
>> idle. Does this thinking make sense?
>>
>> On our testing, it did not show much degradation in performance with
>> this change. I am reworking the fix by removing the check for
>> task_est_util. It doesn't seem to be valid to check for util to migrate
>> the task.
>>
>
> In Aaron's test case, there is a great imbalance in the load on one core
> where all the grp A tasks are vs the other cores where the grp B tasks are
> spread around. Normally, load balancer will move the tasks for grp A.
>
> Aubrey's can_migrate_task patch prevented the load balancer to migrate tasks if the core
> cookie on the target queue don't match. The thought was it will induce
> force idle and reduces cpu utilization if we migrate task to it.
> That kept all the grp A tasks from getting migrated and kept the imbalance
> indefinitely in Aaron's test case.
>
> Perhaps we should also look at the load imbalance between the src rq and
> target rq. If the imbalance is big (say two full cpu bound tasks worth
> of load), we should migrate anyway despite the cookie mismatch. We are willing
> to pay a bit for the force idle by balancing the load out more.
> I think Aubrey's patch on can_migrate_task should be more friendly to
> Aaron's test scenario if such logic is incorporated.
>
> In Vinnet's fix, we only look at the currently running task's weight in
> src and dst rq. Perhaps the load on the src and dst rq needs to be considered
> to prevent too great an imbalance between the run queues?

We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
is used to find the busiest run queue as well.

Thanks,
-Aubrey

2020-03-03 23:56:34

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2020/3/3 22:59, Li, Aubrey wrote:
> On 2020/2/29 7:55, Tim Chen wrote:
>> On 2/26/20 1:54 PM, Vineeth Remanan Pillai wrote:
>>
>>> rq->curr being NULL can mean that the sibling is idle or forced idle.
>>> In both the cases, I think it makes sense to migrate a task so that it can
>>> compete with the other sibling for a chance to run. This function
>>> can_migrate_task actually only says if this task is eligible and
>>> later part of the code decides whether it is okay to migrate it
>>> based on factors like load and util and capacity. So I think its
>>> fine to declare the task as eligible if the dest core is running
>>> idle. Does this thinking make sense?
>>>
>>> On our testing, it did not show much degradation in performance with
>>> this change. I am reworking the fix by removing the check for
>>> task_est_util. It doesn't seem to be valid to check for util to migrate
>>> the task.
>>>
>>
>> In Aaron's test case, there is a great imbalance in the load on one core
>> where all the grp A tasks are vs the other cores where the grp B tasks are
>> spread around. Normally, load balancer will move the tasks for grp A.
>>
>> Aubrey's can_migrate_task patch prevented the load balancer to migrate tasks if the core
>> cookie on the target queue don't match. The thought was it will induce
>> force idle and reduces cpu utilization if we migrate task to it.
>> That kept all the grp A tasks from getting migrated and kept the imbalance
>> indefinitely in Aaron's test case.
>>
>> Perhaps we should also look at the load imbalance between the src rq and
>> target rq. If the imbalance is big (say two full cpu bound tasks worth
>> of load), we should migrate anyway despite the cookie mismatch. We are willing
>> to pay a bit for the force idle by balancing the load out more.
>> I think Aubrey's patch on can_migrate_task should be more friendly to
>> Aaron's test scenario if such logic is incorporated.
>>
>> In Vinnet's fix, we only look at the currently running task's weight in
>> src and dst rq. Perhaps the load on the src and dst rq needs to be considered
>> to prevent too great an imbalance between the run queues?
>
> We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
> is used to find the busiest run queue as well.

How about this one? the cgroup weight issue seems fixed on my side.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f42ceec..90024cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1767,6 +1767,8 @@ static void task_numa_compare(struct task_numa_env *env,
rcu_read_unlock();
}

+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p);
+
static void task_numa_find_cpu(struct task_numa_env *env,
long taskimp, long groupimp)
{
@@ -5650,6 +5652,44 @@ static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int sd_flag);

+#ifdef CONFIG_SCHED_CORE
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+ struct rq *src_rq = task_rq(p);
+ bool idle_core = true;
+ int cpu;
+
+ /* Ignore cookie match if core scheduler is not enabled on the CPU. */
+ if (!sched_core_enabled(rq))
+ return true;
+
+ if (rq->core->core_cookie == p->core_cookie)
+ return true;
+
+ for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+ if (!available_idle_cpu(cpu)) {
+ idle_core = false;
+ break;
+ }
+ }
+ /*
+ * A CPU in an idle core is always the best choice for tasks with
+ * cookies.
+ */
+ if (idle_core)
+ return true;
+
+ /*
+ * Ignore cookie match if there is a big imbalance between the src rq
+ * and dst rq.
+ */
+ if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
+ return true;
+
+ return false;
+}
+#endif
+
/*
* find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7ae6858..8c607e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1061,28 +1061,6 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
return &rq->__lock;
}

-static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
-{
- bool idle_core = true;
- int cpu;
-
- /* Ignore cookie match if core scheduler is not enabled on the CPU. */
- if (!sched_core_enabled(rq))
- return true;
-
- for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
- if (!available_idle_cpu(cpu)) {
- idle_core = false;
- break;
- }
- }
- /*
- * A CPU in an idle core is always the best choice for tasks with
- * cookies.
- */
- return idle_core || rq->core->core_cookie == p->core_cookie;
-}
-
extern void queue_core_balance(struct rq *rq);

void sched_core_add(struct rq *rq, struct task_struct *p);

2020-03-05 04:34:42

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Wed, Mar 04, 2020 at 07:54:39AM +0800, Li, Aubrey wrote:
> On 2020/3/3 22:59, Li, Aubrey wrote:
> > On 2020/2/29 7:55, Tim Chen wrote:
...
> >> In Vinnet's fix, we only look at the currently running task's weight in
> >> src and dst rq. Perhaps the load on the src and dst rq needs to be considered
> >> to prevent too great an imbalance between the run queues?
> >
> > We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
> > is used to find the busiest run queue as well.
>
> How about this one? the cgroup weight issue seems fixed on my side.

It doesn't apply on top of your coresched_v4-v5.5.2 branch, so I
manually applied it. Not sure if I missed something.

It's now getting 4 cpus in 2 cores. Better, but not back to normal yet..

Thanks,
Aaron

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f42ceec..90024cf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1767,6 +1767,8 @@ static void task_numa_compare(struct task_numa_env *env,
> rcu_read_unlock();
> }
>
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p);
> +
> static void task_numa_find_cpu(struct task_numa_env *env,
> long taskimp, long groupimp)
> {
> @@ -5650,6 +5652,44 @@ static struct sched_group *
> find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> int this_cpu, int sd_flag);
>
> +#ifdef CONFIG_SCHED_CORE
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> + struct rq *src_rq = task_rq(p);
> + bool idle_core = true;
> + int cpu;
> +
> + /* Ignore cookie match if core scheduler is not enabled on the CPU. */
> + if (!sched_core_enabled(rq))
> + return true;
> +
> + if (rq->core->core_cookie == p->core_cookie)
> + return true;
> +
> + for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> + if (!available_idle_cpu(cpu)) {
> + idle_core = false;
> + break;
> + }
> + }
> + /*
> + * A CPU in an idle core is always the best choice for tasks with
> + * cookies.
> + */
> + if (idle_core)
> + return true;
> +
> + /*
> + * Ignore cookie match if there is a big imbalance between the src rq
> + * and dst rq.
> + */
> + if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
> + return true;
> +
> + return false;
> +}
> +#endif
> +
> /*
> * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
> */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7ae6858..8c607e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1061,28 +1061,6 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> return &rq->__lock;
> }
>
> -static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> -{
> - bool idle_core = true;
> - int cpu;
> -
> - /* Ignore cookie match if core scheduler is not enabled on the CPU. */
> - if (!sched_core_enabled(rq))
> - return true;
> -
> - for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> - if (!available_idle_cpu(cpu)) {
> - idle_core = false;
> - break;
> - }
> - }
> - /*
> - * A CPU in an idle core is always the best choice for tasks with
> - * cookies.
> - */
> - return idle_core || rq->core->core_cookie == p->core_cookie;
> -}
> -
> extern void queue_core_balance(struct rq *rq);
>
> void sched_core_add(struct rq *rq, struct task_struct *p);

2020-03-05 06:11:21

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 2020/3/5 12:33, Aaron Lu wrote:
> On Wed, Mar 04, 2020 at 07:54:39AM +0800, Li, Aubrey wrote:
>> On 2020/3/3 22:59, Li, Aubrey wrote:
>>> On 2020/2/29 7:55, Tim Chen wrote:
> ...
>>>> In Vinnet's fix, we only look at the currently running task's weight in
>>>> src and dst rq. Perhaps the load on the src and dst rq needs to be considered
>>>> to prevent too great an imbalance between the run queues?
>>>
>>> We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
>>> is used to find the busiest run queue as well.
>>
>> How about this one? the cgroup weight issue seems fixed on my side.
>
> It doesn't apply on top of your coresched_v4-v5.5.2 branch, so I
> manually applied it. Not sure if I missed something.

Here is a rebase version on coresched_v5 Vineeth released this morning:
https://github.com/aubreyli/linux/tree/coresched_V5-v5.5.y-rc1

>
> It's now getting 4 cpus in 2 cores. Better, but not back to normal yet..

I always saw higher weight tasks getting 8 cpus in 4 cores on my side.
Are you still running 8+16 sysbench cpu threads?

I replicated your setup, the cpuset with 8cores 16threads, cpu mode 8 sysbench
threads with cpu.shares=10240, 16 sysbench threads with cpu.shares=2, and here
is the data on my side.

                     weight(10240)   weight(2)
coresched disabled   324.23(eps)     356.43(eps)
coresched enabled    340.74(eps)     311.62(eps)

It seems the higher weight tasks win this time and the lower weight tasks see a
~15% regression (not a big deal?). Did you see anything worse?

Thanks,
-Aubrey

2020-03-05 08:54:34

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Thu, Mar 05, 2020 at 02:10:36PM +0800, Li, Aubrey wrote:
> On 2020/3/5 12:33, Aaron Lu wrote:
> > On Wed, Mar 04, 2020 at 07:54:39AM +0800, Li, Aubrey wrote:
> >> On 2020/3/3 22:59, Li, Aubrey wrote:
> >>> On 2020/2/29 7:55, Tim Chen wrote:
> > ...
> >>>> In Vinnet's fix, we only look at the currently running task's weight in
> >>>> src and dst rq. Perhaps the load on the src and dst rq needs to be considered
> >>>> to prevent too great an imbalance between the run queues?
> >>>
> >>> We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
> >>> is used to find the busiest run queue as well.
> >>
> >> How about this one? the cgroup weight issue seems fixed on my side.
> >
> > It doesn't apply on top of your coresched_v4-v5.5.2 branch, so I
> > manually applied it. Not sure if I missed something.
>
> Here is a rebase version on coresched_v5 Vineeth released this morning:
> https://github.com/aubreyli/linux/tree/coresched_V5-v5.5.y-rc1
>
> >
> > It's now getting 4 cpus in 2 cores. Better, but not back to normal yet..
>
> I always saw higher weight tasks getting 8 cpus in 4 cores on my side.
> Are you still running 8+16 sysbench cpu threads?

I used the wrong workload for the high weight cgroup. After using sysbench
for the high weight cgroup, I also see the problem fixed. Sorry for the noise.

P.S. I used redis-server as the high weight workload this morning; it has
some softirq processing, and I guess that makes things behave not as expected.

Thanks,
Aaron

>
> I replicated your setup, the cpuset with 8cores 16threads, cpu mode 8 sysbench
> threads with cpu.shares=10240, 16 sysbench threads with cpu.shares=2, and here
> is the data on my side.
>
>                      weight(10240)   weight(2)
> coresched disabled   324.23(eps)     356.43(eps)
> coresched enabled    340.74(eps)     311.62(eps)
>
> It seems higher weight tasks win this time and lower weight tasks have ~15%
> regression(not big deal?), did you see anything worse?
>
> Thanks,
> -Aubrey

2020-03-05 13:46:10

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, Feb 28, 2020 at 10:54 AM Aaron Lu <[email protected]> wrote:
>
> When the core wide weight is somewhat balanced, yes I definitely agree.
> But when core wide weight mismatch a lot, I'm not so sure since if these
> high weight task is spread among cores, with the feature of core
> scheduling, these high weight tasks can get better performance.

It depends.

Say TaskA (cookie 1) and TaskB (cookie 1) have high weight,
and TaskC (cookie 2) and TaskD (cookie 2) have low weight.
And we have two cores with 4 CPUs.

If we dispatch
- TaskA and TaskB on Core0,
- TaskC and TaskD on Core1,

with coresched enabled, all 4 tasks can run all the time.

But if we dispatch
- TaskA on Core0.CPU0, TaskB on Core1.CPU2,
- TaskC on Core0.CPU1, TaskD on Core1.CPU3,

with coresched enabled, when TaskC is running, TaskA will be forced
off CPU and replaced with a forced idle thread.

Things get worse if TaskA and TaskB share some data and can get
benefit from the core level cache.

> So this appeared to me like a question of: is it desirable to protect/enhance
> high weight task performance in the presence of core scheduling?

This sounds to me a policy VS mechanism question. Do you have any idea
how to spread high weight task among the cores with coresched enabled?

Thanks,
-Aubrey

2020-03-06 02:42:31

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Thu, Mar 05, 2020 at 09:45:15PM +0800, Aubrey Li wrote:
> On Fri, Feb 28, 2020 at 10:54 AM Aaron Lu <[email protected]> wrote:
> >
> > When the core wide weight is somewhat balanced, yes I definitely agree.
> > But when core wide weight mismatch a lot, I'm not so sure since if these
> > high weight task is spread among cores, with the feature of core
> > scheduling, these high weight tasks can get better performance.
>
> It depends.
>
> Say TaskA(cookie 1) and TaskB(cookie1) has high weight,
> TaskC(cookie 2) and Task D(cookie 2) has low weight.
> And we have two cores 4 CPU.
>
> If we dispatch
> - TaskA and TaskB on Core0,
> - TaskC and TaskD on Core1,
>
> with coresched enabled, all 4 tasks can run all the time.

Although all tasks get CPU time, TaskA and TaskB are competing for hardware
resources and will run slower.

> But if we dispatch
> - TaskA on Core0.CPU0, TaskB on Core1.CPU2,
> - TaskC on Core0.CPU1, TaskD on Core1.CPU3,
>
> with coresched enabled, when TaskC is running, TaskA will be forced
> off CPU and replaced with a forced idle thread.

Not likely to happen since TaskA and TaskB's share will normally be a
lot higher to make sure they get the CPU most of the time.

>
> Things get worse if TaskA and TaskB share some data and can get
> benefit from the core level cache.

That's a good point and hard to argue.

I'm mostly considering colocating redis-server (the main workload) with
other compute-intensive workloads. redis-server can be idle most of the
time but needs every hardware resource when it runs to meet its latency
and throughput requirements. Tests on my side show redis-server's
throughput can be about 30% lower when two redis-servers run on the
same core (throughput is about 80000 when it runs exclusively on a core vs.
about 56000 when it runs with the sibling thread busy), IIRC.

So my use case here is that I don't really care about the low weight task's
performance when the high weight task demands CPU. I understand that there
will be other use cases that also care about the low weight task's
performance. So what I have done is to make the two tasks' weight
difference as large as possible to signal that the low weight task is
not important. Maybe I can also try to tag low weight tasks as SCHED_IDLE
ones, and then we can happily sacrifice the SCHED_IDLE tasks' performance?
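
If it helps, here is a minimal userspace sketch of that SCHED_IDLE tagging
idea, using the existing sched_setscheduler() interface; the target pid below
is just a placeholder:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch only: demote a noise task to SCHED_IDLE so CFS gives it
 * minimal weight. The pid is a placeholder. */
int main(void)
{
        struct sched_param sp = { .sched_priority = 0 };
        pid_t noise_pid = 1234; /* placeholder pid of the low weight task */

        if (sched_setscheduler(noise_pid, SCHED_IDLE, &sp))
                perror("sched_setscheduler");
        return 0;
}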

> > So this appeared to me like a question of: is it desirable to protect/enhance
> > high weight task performance in the presence of core scheduling?
>
> This sounds to me a policy VS mechanism question. Do you have any idea
> how to spread high weight task among the cores with coresched enabled?

Yes, I would like to get us on the same page about the expected behaviour
before jumping to the implementation details. As for how to achieve
that: I'm thinking about making the load core-wide balanced, so that high
weight tasks spread across different cores. This isn't just about load
balance; the initial task placement will also need to be considered, of
course, if the high weight task only runs for a small period.

2020-03-06 18:06:59

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 3/5/20 6:41 PM, Aaron Lu wrote:

>>> So this appeared to me like a question of: is it desirable to protect/enhance
>>> high weight task performance in the presence of core scheduling?
>>
>> This sounds to me a policy VS mechanism question. Do you have any idea
>> how to spread high weight task among the cores with coresched enabled?
>
> Yes I would like to get us on the same page of the expected behaviour
> before jumping to the implementation details. As for how to achieve
> that: I'm thinking about to make core wide load balanced and then high
> weight task shall spread on different cores. This isn't just about load
> balance, the initial task placement will also need to be considered of
> course if the high weight task only runs a small period.
>

I am wondering why this is not happening:

When the low weight task group has exceeded its cfs allocation during a cfs period, the task group
should be throttled. In that case, the CPU cores that the low
weight task group occupies will become idle, and allow load balance from the
overloaded CPUs for the high weight task group to migrate over.

Tim

2020-03-06 18:34:41

by Phil Auld

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, Mar 06, 2020 at 10:06:16AM -0800 Tim Chen wrote:
> On 3/5/20 6:41 PM, Aaron Lu wrote:
>
> >>> So this appeared to me like a question of: is it desirable to protect/enhance
> >>> high weight task performance in the presence of core scheduling?
> >>
> >> This sounds to me a policy VS mechanism question. Do you have any idea
> >> how to spread high weight task among the cores with coresched enabled?
> >
> > Yes I would like to get us on the same page of the expected behaviour
> > before jumping to the implementation details. As for how to achieve
> > that: I'm thinking about to make core wide load balanced and then high
> > weight task shall spread on different cores. This isn't just about load
> > balance, the initial task placement will also need to be considered of
> > course if the high weight task only runs a small period.
> >
>
> I am wondering why this is not happening:
>
> When the low weight task group has exceeded its cfs allocation during a cfs period, the task group
> should be throttled. In that case, the CPU cores that the low
> weight task group occupies will become idle, and allow load balance from the
> overloaded CPUs for the high weight task group to migrate over.
>

cpu.shares is not a quota. I think a group will only get throttled if it has
a quota and exceeds it. Shares are supposed to be used to weight contention
without providing a hard limit.
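
To make the distinction concrete, a small userspace sketch, assuming a
cgroup-v1 cpu controller mounted at /sys/fs/cgroup/cpu and an existing
"noise" group (paths and values are illustrative only):

#include <stdio.h>

/* cpu.shares only weights contention between runnable groups;
 * cpu.cfs_quota_us/cpu.cfs_period_us impose a hard cap that triggers
 * throttling. */
static void cg_write(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fprintf(f, "%s\n", val);
                fclose(f);
        }
}

int main(void)
{
        /* Relative weight: never throttles the group by itself. */
        cg_write("/sys/fs/cgroup/cpu/noise/cpu.shares", "2");

        /* Hard limit: at most 50ms of CPU every 100ms, then throttled. */
        cg_write("/sys/fs/cgroup/cpu/noise/cpu.cfs_period_us", "100000");
        cg_write("/sys/fs/cgroup/cpu/noise/cpu.cfs_quota_us", "50000");
        return 0;
}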


Cheers,
Phil


> Tim
>


2020-03-06 21:45:10

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On 3/6/20 10:33 AM, Phil Auld wrote:
> On Fri, Mar 06, 2020 at 10:06:16AM -0800 Tim Chen wrote:
>> On 3/5/20 6:41 PM, Aaron Lu wrote:
>>
>>>>> So this appeared to me like a question of: is it desirable to protect/enhance
>>>>> high weight task performance in the presence of core scheduling?
>>>>
>>>> This sounds to me a policy VS mechanism question. Do you have any idea
>>>> how to spread high weight task among the cores with coresched enabled?
>>>
>>> Yes I would like to get us on the same page of the expected behaviour
>>> before jumping to the implementation details. As for how to achieve
>>> that: I'm thinking about to make core wide load balanced and then high
>>> weight task shall spread on different cores. This isn't just about load
>>> balance, the initial task placement will also need to be considered of
>>> course if the high weight task only runs a small period.
>>>
>>
>> I am wondering why this is not happening:
>>
>> When the low weight task group has exceeded its cfs allocation during a cfs period, the task group
>> should be throttled. In that case, the CPU cores that the low
>> weight task group occupies will become idle, and allow load balance from the
>> overloaded CPUs for the high weight task group to migrate over.
>>
>
> cpu.shares is not quota. I think it will only get throttled if it has and
> exceeds quota. Shares are supposed to be used to help weight contention
> without providing a hard limit.
>

Ah yes. cpu.quota is not set in Aaron's test case.

That said, if the time consumed gets out of whack with the cpu shares
assigned, I wonder whether we could leverage the quota mechanism to throttle
those cgroups that have overused their share of cpu. Most of the stats and
machinery needed are already in the throttling mechanism.

I am hoping that allowing task migration with task group mismatch
under large load imbalance between CPUs will be good enough.

Tim


2020-03-07 03:15:57

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Fri, Mar 06, 2020 at 01:44:08PM -0800, Tim Chen wrote:
> On 3/6/20 10:33 AM, Phil Auld wrote:
> > On Fri, Mar 06, 2020 at 10:06:16AM -0800 Tim Chen wrote:
> >> On 3/5/20 6:41 PM, Aaron Lu wrote:
> >>
> >>>>> So this appeared to me like a question of: is it desirable to protect/enhance
> >>>>> high weight task performance in the presence of core scheduling?
> >>>>
> >>>> This sounds to me a policy VS mechanism question. Do you have any idea
> >>>> how to spread high weight task among the cores with coresched enabled?
> >>>
> >>> Yes I would like to get us on the same page of the expected behaviour
> >>> before jumping to the implementation details. As for how to achieve
> >>> that: I'm thinking about to make core wide load balanced and then high
> >>> weight task shall spread on different cores. This isn't just about load
> >>> balance, the initial task placement will also need to be considered of
> >>> course if the high weight task only runs a small period.
> >>>
> >>
> >> I am wondering why this is not happening:
> >>
> >> When the low weight task group has exceeded its cfs allocation during a cfs period, the task group
> >> should be throttled. In that case, the CPU cores that the low
> >> weight task group occupies will become idle, and allow load balance from the
> >> overloaded CPUs for the high weight task group to migrate over.
> >>
> >
> > cpu.shares is not quota. I think it will only get throttled if it has and
> > exceeds quota. Shares are supposed to be used to help weight contention
> > without providing a hard limit.
> >
>
> Ah yes. cpu.quota is not set in Aaron's test case.
>
> That said, I wonder if the time consumed is getting out of whack with the
> cpu shares assigned, we can leverage the quota mechanism to throttle
> those cgroup that have overused their shares of cpu. Most of the stats and machinery
> needed are already in the throttling mechanism.

cpu.quota is not work conserving IIUC; it will reduce the noise workload's
performance even when the real workload has no demand for CPU.

Also, while not exceeding its quota, the noise workload can still hurt the real
workload's performance. To protect the real workload from noise, cpu.shares
and SCHED_IDLE seem appropriate, but the implementation may not be
enough as of now.

>
> I am hoping that allowing task migration with task group mismatch
> under large load imbalance between CPUs will be good enough.

I also hope so :-)

2020-03-17 01:09:12

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Hi Julien, Peter, all,

On Fri, Feb 21, 2020 at 06:20:57PM -0500, Julien Desfossez wrote:
> On 18-Feb-2020 04:58:02 PM, Vineeth Remanan Pillai wrote:
> > > Yes, this makes sense, patch updated at here, I put your name there if
> > > you don't mind.
> > > https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2-rc2
> > >
> > > Thanks Aubrey!
>
> Just a quick note, I ran a very cpu-intensive benchmark (9x12 vcpus VMs
> running linpack), all affined to an 18 cores NUMA node (36 hardware
> threads). Each VM is running in its own cgroup/tag with core scheduling
> enabled. We know it already performed much better than nosmt, so for
> this case, I measured various co-scheduling statistics:
> - how much time the process spends co-scheduled with idle, a compatible
> or an incompatible task
> - how long the process spends running in an inefficient
> configuration (more than 1 thread running alone on a core)
>
> And I am very happy to report that even though the 9 VMs were configured
> to float on the whole NUMA node, the scheduler / load-balancer did a
> very good job at keeping an efficient configuration:
>
> Process 10667 (qemu-system-x86), 10 seconds trace:
> - total runtime: 46451472309 ns,
> - local neighbors (total: 45713285084 ns, 98.411 % of process runtime):
> - idle neighbors (total: 484054061 ns, 1.042 % of process runtime):
> - foreign neighbors (total: 4191002 ns, 0.009 % of process runtime):
> - unknown neighbors (total: 92042503 ns, 0.198 % of process runtime)
> - inefficient periods (total: 464832 ns, 0.001 % of process runtime):
> - number of periods: 48
> - min period duration: 1424 ns
> - max period duration: 116988 ns
> - average period duration: 9684.000 ns
> - stdev: 19282.130
>
> I thought you would enjoy seeing this :-)

Looks quite interesting. We are trying to apply this work to ChromeOS. What we
want to do is selectively mark tasks, instead of grouping sets of trusted
tasks. I have a patch that adds a prctl which a task can call, and it works
well (the task calls the prctl and gets a cookie which gives it a dedicated core).
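
For illustration, a sketch of what such a call might look like from the task's
side; the prctl command number and argument convention below are made-up
placeholders, not the actual interface in my patch:

#include <stdio.h>
#include <sys/prctl.h>

/* Hypothetical sketch only: PR_SET_CORE_SCHED_COOKIE is a placeholder,
 * not a real prctl command. */
#ifndef PR_SET_CORE_SCHED_COOKIE
#define PR_SET_CORE_SCHED_COOKIE 1000
#endif

int main(void)
{
        /* Ask the kernel for a private core-scheduling cookie, so this
         * task never shares a core with differently-tagged tasks. */
        if (prctl(PR_SET_CORE_SCHED_COOKIE, 1, 0, 0, 0) == -1)
                perror("prctl");
        return 0;
}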

However, I have the following questions; in particular, there are 4 scenarios
where I feel the current patches do not resolve MDS/L1TF. Would you guys
please share your thoughts?

1. HT1 is running either hostile guest or host code.
HT2 is running an interrupt handler (victim).

In this case I see there is a possible MDS issue between HT1 and HT2.

2. HT1 is executing hostile host code, and gets interrupted by a victim
interrupt. HT2 is idle.

In this case, I see there is a possible MDS issue between interrupt and
the host code on the same HT1.

3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
handler on the host.

In this case, I see there is a possible L1TF issue between HT1 and HT2.
This issue does not happen if HT1 is running host code, since the host
kernel takes care of inverting PTE bits.

4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
hostile code on guest or host. HT2 is being forced idle. However, there is
an overlap between HT1 starting to execute hostile code and HT2's victim
process getting scheduled out.
Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
counter of the sibling being idled to detect that it is now idle.
However we discussed today that looking at this data, it is not really an
issue since it is such a small window.

My concern is now with cases 1 and 2, to which there does not seem to be a good
solution short of disabling interrupts. For 3, we could still possibly do something on
the guest side, such as using shadow page tables. Any thoughts on all this?

thanks,

- Joel


2020-03-17 19:09:18

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Joel,

>
> Looks quite interesting. We are trying apply this work to ChromeOS. What we
> want to do is selectively marking tasks, instead of grouping sets of trusted
> tasks. I have a patch that adds a prctl which a task can call, and it works
> well (task calls prctl and gets a cookie which gives it a dedicated core).
>
> However, I have the following questions, in particular there are 4 scenarios
> where I feel the current patches do not resolve MDS/L1TF, would you guys
> please share your thoughts?
>
> 1. HT1 is running either hostile guest or host code.
> HT2 is running an interrupt handler (victim).
>
> In this case I see there is a possible MDS issue between HT1 and HT2.

Core scheduling mitigates userspace-to-userspace attacks via MDS between the HTs.
It does not prevent the userspace-to-kernel-space attack. That will
have to be mitigated via other means, e.g. redirecting interrupts to a core
that doesn't run potentially unsafe code.

>
> 2. HT1 is executing hostile host code, and gets interrupted by a victim
> interrupt. HT2 is idle.

Similar to above.

>
> In this case, I see there is a possible MDS issue between interrupt and
> the host code on the same HT1.

The cpu buffers are cleared before return to the hostile host code. So
MDS shouldn't be an issue if the interrupt handler and the hostile code
run on the same HT thread.

>
> 3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
> handler on the host.
>
> In this case, I see there is a possible L1TF issue between HT1 and HT2.
> This issue does not happen if HT1 is running host code, since the host
> kernel takes care of inverting PTE bits.

The interrupt handler will be run with PTE inverted. So I don't think
there's a leak via L1TF in this scenario.

>
> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
> hostile code on guest or host. HT2 is being forced idle. However, there is
> an overlap between HT1 starting to execute hostile code and HT2's victim
> process getting scheduled out.
> Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
> counter of the sibling being idled to detect that it is now idle.
> However we discussed today that looking at this data, it is not really an
> issue since it is such a small window.
>
> My concern is now cases 1, 2 to which there does not seem a good solution,
> short of disabling interrupts. For 3, we could still possibly do something on
> the guest side, such as using shadow page tables. Any thoughts on all this?
>

+ Tony who may have more insights on L1TF and MDS.

Thanks.

Tim

2020-03-17 20:21:24

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4



BTW Joel,

Did you guys have a chance to try out v5 core scheduler?

> - Joel (ChromeOS) found a deadlock and crash on PREEMPT kernel in the
> coresched idle balance logic

We did some patches to fix a few stability issues in v4. I wonder
if v5 still has the deadlock that you saw before?

Tim

2020-03-17 21:20:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Tim,

Tim Chen <[email protected]> writes:
>> However, I have the following questions, in particular there are 4 scenarios
>> where I feel the current patches do not resolve MDS/L1TF, would you guys
>> please share your thoughts?
>>
>> 1. HT1 is running either hostile guest or host code.
>> HT2 is running an interrupt handler (victim).
>>
>> In this case I see there is a possible MDS issue between HT1 and HT2.
>
> Core scheduling mitigates the userspace to userspace attacks via MDS between the HT.
> It does not prevent the userspace to kernel space attack. That will
> have to be mitigated via other means, e.g. redirecting interrupts to a core
> that don't run potentially unsafe code.

Which is in some cases simply impossible. Think multiqueue devices with
managed interrupts. You can't change the affinity of those. Neither can
you do that for the per cpu timer interrupt.

>> 2. HT1 is executing hostile host code, and gets interrupted by a victim
>> interrupt. HT2 is idle.
>
> Similar to above.

No. It's the same HT so not similar at all.

>> In this case, I see there is a possible MDS issue between interrupt and
>> the host code on the same HT1.
>
> The cpu buffers are cleared before return to the hostile host code. So
> MDS shouldn't be an issue if interrupt handler and hostile code
> runs on the same HT thread.

OTOH, that's mostly correct, aside from the "shouldn't" wording:

MDS _is_ no issue in this case when the full mitigation is enabled.

Assuming that I have no less information about MDS than you have :)

>> 3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
>> handler on the host.
>>
>> In this case, I see there is a possible L1TF issue between HT1 and HT2.
>> This issue does not happen if HT1 is running host code, since the host
>> kernel takes care of inverting PTE bits.
>
> The interrupt handler will be run with PTE inverted. So I don't think
> there's a leak via L1TF in this scenario.

How so?

Host memory is attackable, when one of the sibling SMT threads runs in
host OS (hypervisor) context and the other in guest context.

HT1 is in guest mode and attacking (has control over PTEs). HT2 is
running in host mode and executes an interrupt handler. The host PTE
inversion does not matter in this scenario at all.

So HT1 can very well see data which is brought into the shared L1 by
HT2.

The only way to mitigate that aside of disabling HT is disabling EPT.

>> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
>> hostile code on guest or host. HT2 is being forced idle. However, there is
>> an overlap between HT1 starting to execute hostile code and HT2's victim
>> process getting scheduled out.
>> Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
>> counter of the sibling being idled to detect that it is now idle.
>> However we discussed today that looking at this data, it is not really an
>> issue since it is such a small window.

If the victim HT is kicked out of execution with an IPI then the overlap
depends on the contexts:

       HT1 (attack)          HT2 (victim)

 A     idle -> user space    user space -> idle

 B     idle -> user space    guest -> idle

 C     idle -> guest         user space -> idle

 D     idle -> guest         guest -> idle

The IPI from HT1 brings HT2 immediately into the kernel when HT2 is in
host user mode or brings it immediately into VMEXIT when HT2 is in guest
mode.

#A On return from handling the IPI HT2 immediately reschedules to idle.
To have an overlap the return to user space on HT1 must be faster.

#B Coming back from VMEXIT into schedule/idle might take slightly longer
than #A.

#C Similar to #A, but reentering guest mode in HT1 after sending the IPI
will probably take longer.

#D Similar to #C if you make the assumption that VMEXIT on HT2 and
rescheduling into idle is not significantly slower than reaching
VMENTER after sending the IPI.

In all cases the data exposed by a potential overlap shouldn't be that
interesting (e.g. scheduler state), but that obviously depends on what
the attacker is looking for.

But all of them are still problematic vs. interrupts / soft interrupts
which can happen on HT2 on the way to idle or while idling, i.e. #3 of
the original case list. #A and #B are only affected by MDS, #C and #D by
both MDS and L1TF (if EPT is in use).

>> My concern is now cases 1, 2 to which there does not seem a good solution,
>> short of disabling interrupts. For 3, we could still possibly do something on
>> the guest side, such as using shadow page tables. Any thoughts on all this?

#1 can be partially mitigated by changing interrupt affinities, which is
not always possible and in the case of the local timer interrupt
completely impossible. It's not only the timer interrupt itself, the
timer callbacks which can run in the softirq on return from interrupt
might be valuable attack surface depending on the nature of the
callbacks, the random entropy timer just being a random example.

#2 is a non-issue if MDS mitigation is on, i.e. buffers are flushed
before returning to user space. It's pretty much a non-SMT case,
i.e. a same-CPU user to kernel attack.

#3 Can only be fully mitigated by disabling EPT

#4 Assuming that my assumptions about transition times are correct, which
I think they are, #4 is pretty much redirected to #1.

Hope that helps.

Thanks,

tglx

2020-03-17 21:59:19

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4



On 3/17/20 2:17 PM, Thomas Gleixner wrote:

>> The interrupt handler will be run with PTE inverted. So I don't think
>> there's a leak via L1TF in this scenario.
>
> How so?
>
> Host memory is attackable, when one of the sibling SMT threads runs in
> host OS (hypervisor) context and the other in guest context.
>
> HT1 is in guest mode and attacking (has control over PTEs). HT2 is
> running in host mode and executes an interrupt handler. The host PTE
> inversion does not matter in this scenario at all.
>
> So HT1 can very well see data which is brought into the shared L1 by
> HT2.
>
> The only way to mitigate that aside of disabling HT is disabling EPT.
>

I had a brain lapse. Yes, PTE inversion is for mitigating malicious
user space code, not a malicious guest.

Thanks for the correction.

Tim

2020-03-18 00:55:15

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

On Tue, Mar 17, 2020 at 3:07 PM Tim Chen <[email protected]> wrote:
>
> Joel,
>
> >
> > Looks quite interesting. We are trying apply this work to ChromeOS. What we
> > want to do is selectively marking tasks, instead of grouping sets of trusted
> > tasks. I have a patch that adds a prctl which a task can call, and it works
> > well (task calls prctl and gets a cookie which gives it a dedicated core).
> >
> > However, I have the following questions, in particular there are 4 scenarios
> > where I feel the current patches do not resolve MDS/L1TF, would you guys
> > please share your thoughts?
> >
> > 1. HT1 is running either hostile guest or host code.
> > HT2 is running an interrupt handler (victim).
> >
> > In this case I see there is a possible MDS issue between HT1 and HT2.
>
> Core scheduling mitigates the userspace to userspace attacks via MDS between the HT.
> It does not prevent the userspace to kernel space attack. That will
> have to be mitigated via other means, e.g. redirecting interrupts to a core
> that don't run potentially unsafe code.

We have only 2 cores (4 HT) on many devices. It is not an option to
dedicate a core to only running trusted code; that would kill
performance. Another option is to designate a single HT of a
particular core to run both untrusted code and an interrupt handler --
but as Thomas pointed out, this does not work for per-CPU interrupts
or managed interrupts, and the softirqs that they trigger. But if we
just consider interrupts for which we can control the affinities (and
assuming that most interrupts can be controlled like that), then maybe
it will work? In the ChromeOS model, each untrusted task is in its own
domain (cookie). So untrusted tasks cannot benefit from parallelism
(in our case) anyway -- so it seems reasonable to run an affinable
interrupt and an untrusted task on a particular designated core.
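
As a concrete (if simplistic) sketch of the affinity part, something like the
following could steer an affinable IRQ onto the designated HT; the IRQ number
and CPU below are placeholders, and per-CPU/managed interrupts will reject the
write:

#include <stdio.h>

int main(void)
{
        const int irq = 30;     /* placeholder IRQ line */
        const int cpu = 2;      /* the designated "interrupt" HT */
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
        f = fopen(path, "w");
        if (!f) {
                perror("fopen");
                return 1;
        }
        fprintf(f, "%d\n", cpu);
        fclose(f);
        return 0;
}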

(Just thinking out loud...). Another option could be a patch that
Vineeth shared with me (that Peter experimentally wrote) where he
sends an IPI from an interrupt handler to a sibling running untrusted
guest code, which would result in it getting paused. I am hoping
something like this could work on the host side as well (not just for
guests). We could also set per-core state from the interrupted HT,
possibly IPI'ing the untrusted sibling if we have to. If the sibling runs
untrusted code *after* the other sibling's interrupt has already started,
then the schedule() loop on the untrusted sibling would spin knowing
the other sibling has an interrupt in progress. The softirq is a real
problem though. Perhaps it can also set similar per-core state.
Thoughts?
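
To pin the idea down, here is a compilable userspace model of that per-core
"interrupt in progress" state; it is not kernel code, and every name in it is
made up:

#include <stdatomic.h>

/* Model only: a per-core counter bumped around interrupt handling, with
 * the untrusted sibling spinning until it drops to zero. */
struct core_state {
        atomic_int irqs_in_progress;    /* shared by the two siblings */
};

/* Called by the HT that takes the interrupt. */
static void irq_enter_core(struct core_state *cs)
{
        atomic_fetch_add(&cs->irqs_in_progress, 1);
}

static void irq_exit_core(struct core_state *cs)
{
        atomic_fetch_sub(&cs->irqs_in_progress, 1);
}

/* Called by the sibling before (re)entering untrusted code. */
static void wait_for_sibling_irq(struct core_state *cs)
{
        while (atomic_load(&cs->irqs_in_progress) > 0)
                ; /* spin: the other sibling still has an interrupt in flight */
}

int main(void)
{
        struct core_state cs = { 0 };

        irq_enter_core(&cs);
        irq_exit_core(&cs);
        wait_for_sibling_irq(&cs);      /* returns immediately here */
        return 0;
}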

> > 2. HT1 is executing hostile host code, and gets interrupted by a victim
> > interrupt. HT2 is idle.
> > In this case, I see there is a possible MDS issue between interrupt and
> > the host code on the same HT1.
>
> The cpu buffers are cleared before return to the hostile host code. So
> MDS shouldn't be an issue if interrupt handler and hostile code
> runs on the same HT thread.

Got it, agreed this is not an issue.

> > 3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
> > handler on the host.
> >
> > In this case, I see there is a possible L1TF issue between HT1 and HT2.
> > This issue does not happen if HT1 is running host code, since the host
> > kernel takes care of inverting PTE bits.
>
> The interrupt handler will be run with PTE inverted. So I don't think
> there's a leak via L1TF in this scenario.

As Thomas and you later pointed out, this is still an issue and will
require a similar solution as described above.

thanks,

- Joel

2020-03-18 01:04:45

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Hi Thomas,

Thanks for the detailed email. I am on the same page with all your points; I
had a question on one of the points below (which I agree with as well), but
just to confirm:

On Tue, Mar 17, 2020 at 10:17:47PM +0100, Thomas Gleixner wrote:
[..]
> >> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
> >> hostile code on guest or host. HT2 is being forced idle. However, there is
> >> an overlap between HT1 starting to execute hostile code and HT2's victim
> >> process getting scheduled out.
> >> Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
> >> counter of the sibling being idled to detect that it is now idle.
> >> However we discussed today that looking at this data, it is not really an
> >> issue since it is such a small window.
>
> If the victim HT is kicked out of execution with an IPI then the overlap
> depends on the contexts:
>
>        HT1 (attack)          HT2 (victim)
>
>  A     idle -> user space    user space -> idle
>
>  B     idle -> user space    guest -> idle
>
>  C     idle -> guest         user space -> idle
>
>  D     idle -> guest         guest -> idle
>
> The IPI from HT1 brings HT2 immediately into the kernel when HT2 is in
> host user mode or brings it immediately into VMEXIT when HT2 is in guest
> mode.
>
> #A On return from handling the IPI HT2 immediately reschedules to idle.
> To have an overlap the return to user space on HT1 must be faster.
>
> #B Coming back from VMEXIT into schedule/idle might take slightly longer
> than #A.
>
> #C Similar to #A, but reentering guest mode in HT1 after sending the IPI
> will probably take longer.
>
> #D Similar to #C if you make the assumption that VMEXIT on HT2 and
> rescheduling into idle is not significantly slower than reaching
> VMENTER after sending the IPI.
>
> In all cases the data exposed by a potential overlap shouldn't be that
> interesting (e.g. scheduler state), but that obviously depends on what
> the attacker is looking for.

About the "shouldn't be that interesting" part, you are saying, the overlap
should not be that interesting because the act of one sibling IPI'ing the
other implies the sibling HT immediately entering kernel mode, right?

Thanks, your email really helped!!!

- Joel


2020-03-18 01:13:10

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Hi Tim,

On Tue, Mar 17, 2020 at 01:18:45PM -0700, Tim Chen wrote:
[..]
> Did you guys have a chance to try out v5 core scheduler?
>
> > - Joel (ChromeOS) found a deadlock and crash on PREEMPT kernel in the
> > coresched idle balance logic
>
> We did some patches to fix a few stability issues in v4. I wonder
> if v5 still has the deadlock that you saw before?

I have not yet tested v5. I fixed the RCU-related deadlock in the scheduler
with this patch sent a few days ago:
https://lore.kernel.org/patchwork/patch/1209561/

I'll rebase my tree on v5 soon.

Thanks!

- Joel

2020-03-18 02:31:48

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

> On Tue, Mar 17, 2020 at 10:17:47PM +0100, Thomas Gleixner wrote:
> [..]
> > >> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
> > >> hostile code on guest or host. HT2 is being forced idle. However, there is
> > >> an overlap between HT1 starting to execute hostile code and HT2's victim
> > >> process getting scheduled out.
> > >> Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
> > >> counter of the sibling being idled to detect that it is now idle.
> > >> However we discussed today that looking at this data, it is not really an
> > >> issue since it is such a small window.
> >
> > If the victim HT is kicked out of execution with an IPI then the overlap
> > depends on the contexts:
> >
> >        HT1 (attack)          HT2 (victim)
> >
> >  A     idle -> user space    user space -> idle
> >
> >  B     idle -> user space    guest -> idle
> >
> >  C     idle -> guest         user space -> idle
> >
> >  D     idle -> guest         guest -> idle
> >
> > The IPI from HT1 brings HT2 immediately into the kernel when HT2 is in
> > host user mode or brings it immediately into VMEXIT when HT2 is in guest
> > mode.
> >
> > #A On return from handling the IPI HT2 immediately reschedules to idle.
> > To have an overlap the return to user space on HT1 must be faster.
> >
> > #B Coming back from VMEXIT into schedule/idle might take slightly longer
> > than #A.
> >
> > #C Similar to #A, but reentering guest mode in HT1 after sending the IPI
> > will probably take longer.
> >
> > #D Similar to #C if you make the assumption that VMEXIT on HT2 and
> > rescheduling into idle is not significantly slower than reaching
> > VMENTER after sending the IPI.
> >
> > In all cases the data exposed by a potential overlap shouldn't be that
> > interesting (e.g. scheduler state), but that obviously depends on what
> > the attacker is looking for.
>
> About the "shouldn't be that interesting" part, you are saying, the overlap
> should not be that interesting because the act of one sibling IPI'ing the
> other implies the sibling HT immediately entering kernel mode, right?

Hi Thomas,

I see what you mean now. Basically after the IPI is sent, the HT
sending the IPI would have buffers cleared before the attacker gets a
chance to run on it. Because the victim HT enters into the IPI
handler, all the attacker will get to see is possibly uninteresting
data (e.g. scheduler state) as you mentioned.

Thanks!

- Joel

2020-03-18 11:54:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Joel,

Joel Fernandes <[email protected]> writes:
> We have only 2 cores (4 HT) on many devices. It is not an option to
> dedicate a core to only running trusted code, that would kill
> performance. Another option is to designate a single HT of a
> particular core to run both untrusted code and an interrupt handler --
> but as Thomas pointed out, this does not work for per-CPU interrupts
> or managed interrupts, and the softirqs that they trigger. But if we
> just consider interrupts for which we can control the affinities (and
> assuming that most interrupts can be controlled like that), then maybe
> it will work? In the ChromeOS model, each untrusted task is in its own
> domain (cookie). So untrusted tasks cannot benefit from parallelism
> (in our case) anyway -- so it seems reasonable to run an affinable
> interrupt and an untrusted task, on a particular designated core.
>
> (Just thinking out loud...). Another option could be a patch that
> Vineeth shared with me (that Peter experimentally wrote) where he
> sends IPI from an interrupt handler to a sibling running untrusted
> guest code which would result in it getting paused. I am hoping
> something like this could work on the host side as well (not just for
> guests). We could also set per-core state from the interrupted HT,
> possibly IPI'ing the untrusted sibling if we have to. If sibling runs
> untrusted code *after* the other's siblings interrupt already started,
> then the schedule() loop on the untrusted sibling would spin knowing
> the other sibling has an interrupt in progress. The softirq is a real
> problem though. Perhaps it can also set similar per-core state.

There is not much difference between bringing the sibling out of guest
mode and bringing it out of host user mode.

Adding state to force spinning until the other side has completed is not
rocket science either.

But the whole concept is prone to starvation issues and full of nasty
corner cases. From experiments I did back in the L1TF days I'm pretty
much convinced that this can't result in a generally usable solution.

Let me share a few thoughts on what might be doable with fewer horrors, but
be aware that this is mostly a brain dump of half thought out ideas.

1) Managed interrupts on multi queue devices

It should be reasonably simple to force a reduced number of queues,
which would in turn allow shielding one or two CPUs from such
interrupts and queue handling, for the price of indirection.

2) Timers and softirqs

If device interrupts are targeted to "safe" CPUs then the amount of
timers and soft interrupt processing will be reduced as well.

That still leaves e.g. network TX side soft interrupts when the task
running on a shielded core does networking. Maybe that's a non-issue,
but I'm not familiar enough with the network maze to give an answer.

A possible workaround would be to force softirq processing into
thread context so everything is under scheduler control. How well
that scales is a different story.

That would bring out the timer_list timers and reduce the potential
surface to hrtimer expiry callbacks. Most of them should be fine
(doing wakeups or scheduler housekeeping of some sort). For the
others we might just utilize the mechanism which PREEMPT_RT uses and
force them off into softirq expiry mode.

Thanks,

tglx

2020-03-19 01:56:37

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

Hi Thomas,

On Wed, Mar 18, 2020 at 12:53:01PM +0100, Thomas Gleixner wrote:
> Joel,
>
> Joel Fernandes <[email protected]> writes:
> > We have only 2 cores (4 HT) on many devices. It is not an option to
> > dedicate a core to only running trusted code, that would kill
> > performance. Another option is to designate a single HT of a
> > particular core to run both untrusted code and an interrupt handler --
> > but as Thomas pointed out, this does not work for per-CPU interrupts
> > or managed interrupts, and the softirqs that they trigger. But if we
> > just consider interrupts for which we can control the affinities (and
> > assuming that most interrupts can be controlled like that), then maybe
> > it will work? In the ChromeOS model, each untrusted task is in its own
> > domain (cookie). So untrusted tasks cannot benefit from parallelism
> > (in our case) anyway -- so it seems reasonable to run an affinable
> > interrupt and an untrusted task, on a particular designated core.
> >
> > (Just thinking out loud...). Another option could be a patch that
> > Vineeth shared with me (that Peter experimentally wrote) where he
> > sends IPI from an interrupt handler to a sibling running untrusted
> > guest code which would result in it getting paused. I am hoping
> > something like this could work on the host side as well (not just for
> > guests). We could also set per-core state from the interrupted HT,
> > possibly IPI'ing the untrusted sibling if we have to. If sibling runs
> > untrusted code *after* the other's siblings interrupt already started,
> > then the schedule() loop on the untrusted sibling would spin knowing
> > the other sibling has an interrupt in progress. The softirq is a real
> > problem though. Perhaps it can also set similar per-core state.
>
> There is not much difference between bringing the sibling out of guest
> mode and bringing it out of host user mode.
>
> Adding state to force spinning until the other side has completed is not
> rocket science either.
>
> But the whole concept is prone to starvation issues and full of nasty
> corner cases. From experiments I did back in the L1TF days I'm pretty
> much convinced that this can't result in a generaly usable solution.
> Let me share a few thoughts what might be doable with less horrors, but
> be aware that this is mostly a brain dump of half thought out ideas.
>
> 1) Managed interrupts on multi queue devices
>
> It should be reasonably simple to force a reduced number of queues
> which would in turn allow to shield one ore two CPUs from such
> interrupts and queue handling for the price of indirection.
>
> 2) Timers and softirqs
>
> If device interrupts are targeted to "safe" CPUs then the amount of
> timers and soft interrupt processing will be reduced as well.
>
> That still leaves e.g. network TX side soft interrupts when the task
> running on a shielded core does networking. Maybe that's a non issue,
> but I'm not familiar enough with the network maze to give an answer.
>
> A possible workaround would be to force softirq processing into
> thread context so everything is under scheduler control. How well
> that scales is a different story.
>
> That would bring out the timer_list timers and reduce the potential
> surface to hrtimer expiry callbacks. Most of them should be fine
> (doing wakeups or scheduler housekeeping of some sort). For the
> others we might just utilize the mechanism which PREEMPT_RT uses and
> force them off into softirq expiry mode.

Thanks a lot for your thoughts, it is really helpful.

One issue with interrupt affinities is that we have only 2 cores on many devices
(4 HTs). If we assign 1 HT out of the 4 for receiving interrupts (at least
the ones whose affinity we can control), then that means we have to devote that
core to running only 1 trusted task at a time, since one of its 2 HTs is
designated for interrupts. That leaves us with only the 2 HTs on the
other core for simultaneously running the 2 trusted tasks, thus reducing the
parallelism.

Thanks for raising the point about running interrupts and softirqs in process
context to solve these; it turns out we do use threaded IRQs, though not for most
interrupts. We could consider threading as many IRQs (and softirqs) as
possible, and for the ones we cannot thread, maybe we can audit the code and
assess the risk, possibly deferring processing of sensitive data to
workqueue context. Perhaps we can also conditionally force a softirq into
process context, say if we detect that the core has an untrusted task running
on the softirq HT's sibling, and if not let the softirq execute in softirq
context itself.
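
For completeness, a build-only sketch of the IRQ-threading direction; the IRQ
line, names and module boilerplate below are placeholders, not a real driver:

#include <linux/module.h>
#include <linux/interrupt.h>

#define DEMO_IRQ 42     /* placeholder, not a real line on any system */

static irqreturn_t demo_quick_check(int irq, void *dev_id)
{
        /* Hard IRQ context: do the bare minimum, defer the rest. */
        return IRQ_WAKE_THREAD;
}

static irqreturn_t demo_thread_fn(int irq, void *dev_id)
{
        /* Runs in a kernel thread, fully under scheduler control. */
        return IRQ_HANDLED;
}

static int __init demo_init(void)
{
        return request_threaded_irq(DEMO_IRQ, demo_quick_check,
                                    demo_thread_fn, 0, "coresched-demo", NULL);
}

static void __exit demo_exit(void)
{
        free_irq(DEMO_IRQ, NULL);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");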

Thanks, Thomas!!

- Joel