2019-05-29 20:40:10

by Vineeth Remanan Pillai

Subject: [RFC PATCH v3 00/16] Core scheduling v3

Third iteration of the Core-Scheduling feature.

This version mostly fixes correctness-related issues found in v2 and
addresses performance issues. It also fixes some crashes related
to cgroups and cpu hotplugging.

We have tested and verified that incompatible processes are not
selected during schedule. In terms of performance, the impact
depends on the workload:
- on CPU intensive applications that use all the logical CPUs with
SMT enabled, enabling core scheduling performs better than nosmt.
- on mixed workloads with considerable io compared to cpu usage,
nosmt seems to perform better than core scheduling.
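
As a rough illustration of the constraint being enforced (a simplified
userspace sketch with made-up names and cookie values, not the kernel
code): two tasks may run concurrently on the SMT siblings of a core only
when their cookies match, otherwise one sibling is forced idle.

#include <stdio.h>

/*
 * Simplified model: a task's core_cookie identifies its trust domain
 * (cgroup tag); 0 means untagged.
 */
struct task {
        const char *comm;
        unsigned long core_cookie;
};

/* Core scheduling only co-schedules tasks whose cookies match. */
static int can_share_core(const struct task *a, const struct task *b)
{
        return a->core_cookie == b->core_cookie;
}

int main(void)
{
        struct task tagged_a  = { "sysbench-1", 1 };
        struct task tagged_b  = { "sysbench-2", 1 };
        struct task other_tag = { "gemmbench",  2 };

        /* Same cookie: both siblings of the core can be used. */
        printf("%s + %s -> %s\n", tagged_a.comm, tagged_b.comm,
               can_share_core(&tagged_a, &tagged_b) ?
               "co-scheduled" : "sibling forced idle");

        /* Different cookies: the sibling is forced idle instead. */
        printf("%s + %s -> %s\n", tagged_a.comm, other_tag.comm,
               can_share_core(&tagged_a, &other_tag) ?
               "co-scheduled" : "sibling forced idle");

        return 0;
}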

Changes in v3
-------------
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Issues
------
- Comparing process priority across cpus is not accurate

TODO
----
- Decide on the API for exposing the feature to userland

---

Peter Zijlstra (16):
stop_machine: Fix stop_cpus_in_progress ordering
sched: Fix kerneldoc comment for ia64_set_curr_task
sched: Wrap rq::lock access
sched/{rt,deadline}: Fix set_next_task vs pick_next_task
sched: Add task_struct pointer to sched_class::set_curr_task
sched/fair: Export newidle_balance()
sched: Allow put_prev_task() to drop rq->lock
sched: Rework pick_next_task() slow-path
sched: Introduce sched_class::pick_task()
sched: Core-wide rq->lock
sched: Basic tracking of matching tasks
sched: A quick and dirty cgroup tagging interface
sched: Add core wide task selection and scheduling.
sched/fair: Add a few assertions
sched: Trivial forced-newidle balancer
sched: Debug bits...

 include/linux/sched.h    |   9 +-
 kernel/Kconfig.preempt   |   7 +-
 kernel/sched/core.c      | 858 +++++++++++++++++++++++++++++++++++++--
 kernel/sched/cpuacct.c   |  12 +-
 kernel/sched/deadline.c  |  99 +++--
 kernel/sched/debug.c     |   4 +-
 kernel/sched/fair.c      | 180 ++++----
 kernel/sched/idle.c      |  42 +-
 kernel/sched/pelt.h      |   2 +-
 kernel/sched/rt.c        |  96 ++---
 kernel/sched/sched.h     | 237 ++++++++---
 kernel/sched/stop_task.c |  35 +-
 kernel/sched/topology.c  |   4 +-
 kernel/stop_machine.c    |   2 +
 14 files changed, 1250 insertions(+), 337 deletions(-)

--
2.17.1


2019-05-29 20:40:26

by Vineeth Remanan Pillai

Subject: [RFC PATCH v3 10/16] sched: Core-wide rq->lock

From: Peter Zijlstra <[email protected]>

Introduce the basic infrastructure to have a core wide rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---

Changes in v3
-------------
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai

 kernel/Kconfig.preempt |   7 ++-
 kernel/sched/core.c    | 113 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h   |  31 +++++++++++
 3 files changed, 148 insertions(+), 3 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 0fee5fe6c899..02fe0bf26676 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -57,4 +57,9 @@ config PREEMPT
endchoice

config PREEMPT_COUNT
- bool
+ bool
+
+config SCHED_CORE
+ bool
+ default y
+ depends on SCHED_SMT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b883c70674ba..b1ce33f9b106 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,70 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;

+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+ bool enabled = !!(unsigned long)data;
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ cpu_rq(cpu)->core_enabled = enabled;
+
+ return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+ // XXX verify there are no cookie tasks (yet)
+
+ static_branch_enable(&__sched_core_enabled);
+ stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+ // XXX verify there are no cookie tasks (left)
+
+ stop_machine(__sched_core_stopper, (void *)false, NULL);
+ static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+ mutex_lock(&sched_core_mutex);
+ if (!sched_core_count++)
+ __sched_core_enable();
+ mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+ mutex_lock(&sched_core_mutex);
+ if (!--sched_core_count)
+ __sched_core_disable();
+ mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
/*
* __task_rq_lock - lock the rq @p resides on.
*/
@@ -5790,8 +5854,15 @@ int sched_cpu_activate(unsigned int cpu)
/*
* When going up, increment the number of cores with SMT present.
*/
- if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
static_branch_inc_cpuslocked(&sched_smt_present);
+#ifdef CONFIG_SCHED_CORE
+ if (static_branch_unlikely(&__sched_core_enabled)) {
+ rq->core_enabled = true;
+ }
+#endif
+ }
+
#endif
set_cpu_active(cpu, true);

@@ -5839,8 +5910,16 @@ int sched_cpu_deactivate(unsigned int cpu)
/*
* When going down, decrement the number of cores with SMT present.
*/
- if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
+#ifdef CONFIG_SCHED_CORE
+ struct rq *rq = cpu_rq(cpu);
+ if (static_branch_unlikely(&__sched_core_enabled)) {
+ rq->core_enabled = false;
+ }
+#endif
static_branch_dec_cpuslocked(&sched_smt_present);
+
+ }
#endif

if (!sched_smp_initialized)
@@ -5865,6 +5944,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)

int sched_cpu_starting(unsigned int cpu)
{
+#ifdef CONFIG_SCHED_CORE
+ const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+ struct rq *rq, *core_rq = NULL;
+ int i;
+
+ for_each_cpu(i, smt_mask) {
+ rq = cpu_rq(i);
+ if (rq->core && rq->core == rq)
+ core_rq = rq;
+ }
+
+ if (!core_rq)
+ core_rq = cpu_rq(cpu);
+
+ for_each_cpu(i, smt_mask) {
+ rq = cpu_rq(i);
+
+ WARN_ON_ONCE(rq->core && rq->core != core_rq);
+ rq->core = core_rq;
+ }
+#endif /* CONFIG_SCHED_CORE */
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -5893,6 +5994,9 @@ int sched_cpu_dying(unsigned int cpu)
update_max_interval();
nohz_balance_exit_idle(rq);
hrtick_clear(rq);
+#ifdef CONFIG_SCHED_CORE
+ rq->core = NULL;
+#endif
return 0;
}
#endif
@@ -6091,6 +6195,11 @@ void __init sched_init(void)
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+ rq->core = NULL;
+ rq->core_enabled = 0;
+#endif
}

set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a024dd80eeb3..eb38063221d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -952,6 +952,12 @@ struct rq {
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
#endif
+
+#ifdef CONFIG_SCHED_CORE
+ /* per rq */
+ struct rq *core;
+ unsigned int core_enabled;
+#endif
};

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -979,11 +985,36 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+ return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+ if (sched_core_enabled(rq))
+ return &rq->core->__lock;
+
+ return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+ return false;
+}
+
static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
return &rq->__lock;
}

+#endif /* CONFIG_SCHED_CORE */
+
#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);

--
2.17.1
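
To make the locking rule in this patch concrete, here is a minimal
userspace model (pthread mutexes standing in for raw spinlocks and a
made-up two-sibling setup, not the kernel code) of how rq_lockp()
collapses the per-CPU locks of two SMT siblings onto one core-wide lock,
and why the enable/disable flip must happen while nobody holds the
locks:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace model (build with -pthread): each "runqueue" has its own
 * lock plus a pointer to the core leader whose lock both SMT siblings
 * share once core scheduling is enabled.
 */
struct rq {
        pthread_mutex_t __lock;
        struct rq *core;        /* core leader runqueue */
        bool core_enabled;
};

static struct rq sib0 = { PTHREAD_MUTEX_INITIALIZER, &sib0, false };
static struct rq sib1 = { PTHREAD_MUTEX_INITIALIZER, &sib0, false };

/* Mirrors rq_lockp(): per-rq lock normally, the core-wide lock when enabled. */
static pthread_mutex_t *rq_lockp(struct rq *rq)
{
        if (rq->core_enabled)
                return &rq->core->__lock;
        return &rq->__lock;
}

int main(void)
{
        /* Disabled: the two siblings use two different locks. */
        printf("disabled: siblings share a lock? %d\n",
               rq_lockp(&sib0) == rq_lockp(&sib1));

        /*
         * The flip must be done while no CPU is inside a locked section
         * (the patch uses stop_machine for this); otherwise a path could
         * lock &sib1.__lock, observe core_enabled change underneath it,
         * and then unlock &sib0.__lock instead.
         */
        sib0.core_enabled = sib1.core_enabled = true;

        /* Enabled: both siblings now resolve to the core leader's lock. */
        printf("enabled:  siblings share a lock? %d\n",
               rq_lockp(&sib0) == rq_lockp(&sib1));

        pthread_mutex_lock(rq_lockp(&sib1));
        /* ... core-wide critical section ... */
        pthread_mutex_unlock(rq_lockp(&sib1));

        return 0;
}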

2019-05-29 20:40:39

by Vineeth Remanan Pillai

Subject: [RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task()

From: Peter Zijlstra <[email protected]>

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.
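
As a rough sketch of that distinction (a simplified userspace model with
made-up fields, not the kernel implementation): a stateful pick commits
the runqueue to the chosen task, so it cannot be used to ask a sibling
runqueue what it would run, while a state-invariant pick_task() can.

#include <stdio.h>
#include <stddef.h>

/* Toy runqueue: a queued candidate plus the "committed" state that the
 * real set_next_task()/put_prev_task() machinery would update. */
struct task { const char *comm; };
struct rq {
        struct task *queued;    /* highest priority queued task, if any */
        struct task *curr;      /* task this rq has been committed to run */
};

/* pick_task(): state-invariant, safe to call on a remote sibling rq. */
static struct task *pick_task(struct rq *rq)
{
        return rq->queued;
}

/* pick_next_task(): pick_task() plus the set_next_task()-style side
 * effect, so only the CPU that will actually run the task may call it. */
static struct task *pick_next_task(struct rq *rq)
{
        struct task *p = pick_task(rq);

        if (p)
                rq->curr = p;   /* the side effect that forbids remote use */
        return p;
}

int main(void)
{
        struct task a = { "task_a" };
        struct rq sibling = { &a, NULL };

        /* Core-wide selection may ask "what would the sibling run?" ... */
        struct task *candidate = pick_task(&sibling);
        printf("candidate=%s, sibling curr=%p (untouched)\n",
               candidate->comm, (void *)sibling.curr);

        /* ... while only the sibling itself commits to the choice. */
        pick_next_task(&sibling);
        printf("after pick_next_task: sibling curr=%s\n", sibling.curr->comm);
        return 0;
}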

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---

Changes in v3
-------------
- Minor refactor to remove redundant NULL checks

Changes in v2
-------------
- Fixes a NULL pointer dereference crash
  - Subhra Mazumdar
  - Tim Chen

---
 kernel/sched/deadline.c  | 21 ++++++++++++++++-----
 kernel/sched/fair.c      | 36 +++++++++++++++++++++++++++++++++---
 kernel/sched/idle.c      | 10 +++++++++-
 kernel/sched/rt.c        | 21 ++++++++++++++++-----
 kernel/sched/sched.h     |  2 ++
 kernel/sched/stop_task.c | 21 ++++++++++++++++-----
 6 files changed, 92 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d3904168857a..64fc444f44f9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1722,15 +1722,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
}

-static struct task_struct *
-pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_dl(struct rq *rq)
{
struct sched_dl_entity *dl_se;
struct task_struct *p;
struct dl_rq *dl_rq;

- WARN_ON_ONCE(prev || rf);
-
dl_rq = &rq->dl;

if (unlikely(!dl_rq->dl_nr_running))
@@ -1741,7 +1738,19 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

p = dl_task_of(dl_se);

- set_next_task_dl(rq, p);
+ return p;
+}
+
+static struct task_struct *
+pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ struct task_struct *p;
+
+ WARN_ON_ONCE(prev || rf);
+
+ p = pick_task_dl(rq);
+ if (p)
+ set_next_task_dl(rq, p);

return p;
}
@@ -2388,6 +2397,8 @@ const struct sched_class dl_sched_class = {
.set_next_task = set_next_task_dl,

#ifdef CONFIG_SMP
+ .pick_task = pick_task_dl,
+
.select_task_rq = select_task_rq_dl,
.migrate_task_rq = migrate_task_rq_dl,
.set_cpus_allowed = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e65f2dfda77a..02e5dfb85e7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4136,7 +4136,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
* Avoid running the skip buddy, if running something else can
* be done without getting too unfair.
*/
- if (cfs_rq->skip == se) {
+ if (cfs_rq->skip && cfs_rq->skip == se) {
struct sched_entity *second;

if (se == curr) {
@@ -4154,13 +4154,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
/*
* Prefer last buddy, try to return the CPU to a preempted task.
*/
- if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+ if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
se = cfs_rq->last;

/*
* Someone really wants this to run. If it's not unfair, run it.
*/
- if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+ if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
se = cfs_rq->next;

clear_buddies(cfs_rq, se);
@@ -6966,6 +6966,34 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
set_last_buddy(se);
}

+static struct task_struct *
+pick_task_fair(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ struct sched_entity *se;
+
+ if (!cfs_rq->nr_running)
+ return NULL;
+
+ do {
+ struct sched_entity *curr = cfs_rq->curr;
+
+ se = pick_next_entity(cfs_rq, NULL);
+
+ if (curr) {
+ if (se && curr->on_rq)
+ update_curr(cfs_rq);
+
+ if (!se || entity_before(curr, se))
+ se = curr;
+ }
+
+ cfs_rq = group_cfs_rq(se);
+ } while (cfs_rq);
+
+ return task_of(se);
+}
+
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
@@ -10677,6 +10705,8 @@ const struct sched_class fair_sched_class = {
.set_next_task = set_next_task_fair,

#ifdef CONFIG_SMP
+ .pick_task = pick_task_fair,
+
.select_task_rq = select_task_rq_fair,
.migrate_task_rq = migrate_task_rq_fair,

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7ece8e820b5d..e7f38da60373 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,6 +373,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
resched_curr(rq);
}

+static struct task_struct *
+pick_task_idle(struct rq *rq)
+{
+ return rq->idle;
+}
+
static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
}
@@ -386,11 +392,12 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
- struct task_struct *next = rq->idle;
+ struct task_struct *next;

if (prev)
put_prev_task(rq, prev);

+ next = pick_task_idle(rq);
set_next_task_idle(rq, next);

return next;
@@ -458,6 +465,7 @@ const struct sched_class idle_sched_class = {
.set_next_task = set_next_task_idle,

#ifdef CONFIG_SMP
+ .pick_task = pick_task_idle,
.select_task_rq = select_task_rq_idle,
.set_cpus_allowed = set_cpus_allowed_common,
#endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 79f2e60516ef..81557224548c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1548,20 +1548,29 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
return rt_task_of(rt_se);
}

-static struct task_struct *
-pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_rt(struct rq *rq)
{
struct task_struct *p;
struct rt_rq *rt_rq = &rq->rt;

- WARN_ON_ONCE(prev || rf);
-
if (!rt_rq->rt_queued)
return NULL;

p = _pick_next_task_rt(rq);

- set_next_task_rt(rq, p);
+ return p;
+}
+
+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ struct task_struct *p;
+
+ WARN_ON_ONCE(prev || rf);
+
+ p = pick_task_rt(rq);
+ if (p)
+ set_next_task_rt(rq, p);

return p;
}
@@ -2364,6 +2373,8 @@ const struct sched_class rt_sched_class = {
.set_next_task = set_next_task_rt,

#ifdef CONFIG_SMP
+ .pick_task = pick_task_rt,
+
.select_task_rq = select_task_rq_rt,

.set_cpus_allowed = set_cpus_allowed_common,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 460dd04e76af..a024dd80eeb3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1682,6 +1682,8 @@ struct sched_class {
void (*set_next_task)(struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
+ struct task_struct * (*pick_task)(struct rq *rq);
+
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 7e1cee4e65b2..fb6c436cba6c 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -29,20 +29,30 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
}

static struct task_struct *
-pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_task_stop(struct rq *rq)
{
struct task_struct *stop = rq->stop;

- WARN_ON_ONCE(prev || rf);
-
if (!stop || !task_on_rq_queued(stop))
return NULL;

- set_next_task_stop(rq, stop);
-
return stop;
}

+static struct task_struct *
+pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ struct task_struct *p;
+
+ WARN_ON_ONCE(prev || rf);
+
+ p = pick_task_stop(rq);
+ if (p)
+ set_next_task_stop(rq, p);
+
+ return p;
+}
+
static void
enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
{
@@ -129,6 +139,7 @@ const struct sched_class stop_sched_class = {
.set_next_task = set_next_task_stop,

#ifdef CONFIG_SMP
+ .pick_task = pick_task_stop,
.select_task_rq = select_task_rq_stop,
.set_cpus_allowed = set_cpus_allowed_common,
#endif
--
2.17.1

2019-05-30 14:06:54

by Aubrey Li

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, May 30, 2019 at 4:36 AM Vineeth Remanan Pillai
<[email protected]> wrote:
>
> Third iteration of the Core-Scheduling feature.
>
> This version fixes mostly correctness related issues in v2 and
> addresses performance issues. Also, addressed some crashes related
> to cgroups and cpu hotplugging.
>
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
> SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
> nosmt seems to perform better than core scheduling.

My testing scripts could not complete on this version. I found that the
number of cpu utilization report entries didn't reach my minimal requirement.
Then I wrote a simple script to verify.
====================
$ cat test.sh
#!/bin/sh

for i in `seq 1 10`
do
echo `date`, $i
sleep 1
done
====================

Normally it works as below:

Thu May 30 14:13:40 CST 2019, 1
Thu May 30 14:13:41 CST 2019, 2
Thu May 30 14:13:42 CST 2019, 3
Thu May 30 14:13:43 CST 2019, 4
Thu May 30 14:13:44 CST 2019, 5
Thu May 30 14:13:45 CST 2019, 6
Thu May 30 14:13:46 CST 2019, 7
Thu May 30 14:13:47 CST 2019, 8
Thu May 30 14:13:48 CST 2019, 9
Thu May 30 14:13:49 CST 2019, 10

When the system was running 32 sysbench threads and
32 gemmbench threads, it worked as below(the system
has ~38% idle time)
Thu May 30 14:14:20 CST 2019, 1
Thu May 30 14:14:21 CST 2019, 2
Thu May 30 14:14:22 CST 2019, 3
Thu May 30 14:14:24 CST 2019, 4 <=======x=
Thu May 30 14:14:25 CST 2019, 5
Thu May 30 14:14:26 CST 2019, 6
Thu May 30 14:14:28 CST 2019, 7 <=======x=
Thu May 30 14:14:29 CST 2019, 8
Thu May 30 14:14:31 CST 2019, 9 <=======x=
Thu May 30 14:14:34 CST 2019, 10 <=======x=

And it got worse when the system was running 64/64 case,
the system still had ~3% idle time
Thu May 30 14:26:40 CST 2019, 1
Thu May 30 14:26:46 CST 2019, 2
Thu May 30 14:26:53 CST 2019, 3
Thu May 30 14:27:01 CST 2019, 4
Thu May 30 14:27:03 CST 2019, 5
Thu May 30 14:27:11 CST 2019, 6
Thu May 30 14:27:31 CST 2019, 7
Thu May 30 14:27:32 CST 2019, 8
Thu May 30 14:27:41 CST 2019, 9
Thu May 30 14:27:56 CST 2019, 10

Any thoughts?

Thanks,
-Aubrey

2019-05-30 14:21:11

by Julien Desfossez

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 30-May-2019 10:04:39 PM, Aubrey Li wrote:
> On Thu, May 30, 2019 at 4:36 AM Vineeth Remanan Pillai
> <[email protected]> wrote:
> >
> > Third iteration of the Core-Scheduling feature.
> >
> > This version fixes mostly correctness related issues in v2 and
> > addresses performance issues. Also, addressed some crashes related
> > to cgroups and cpu hotplugging.
> >
> > We have tested and verified that incompatible processes are not
> > selected during schedule. In terms of performance, the impact
> > depends on the workload:
> > - on CPU intensive applications that use all the logical CPUs with
> > SMT enabled, enabling core scheduling performs better than nosmt.
> > - on mixed workloads with considerable io compared to cpu usage,
> > nosmt seems to perform better than core scheduling.
>
> My testing scripts can not be completed on this version. I figured out the
> number of cpu utilization report entry didn't reach my minimal requirement.
> Then I wrote a simple script to verify.
> ====================
> $ cat test.sh
> #!/bin/sh
>
> for i in `seq 1 10`
> do
> echo `date`, $i
> sleep 1
> done
> ====================
>
> Normally it works as below:
>
> Thu May 30 14:13:40 CST 2019, 1
> Thu May 30 14:13:41 CST 2019, 2
> Thu May 30 14:13:42 CST 2019, 3
> Thu May 30 14:13:43 CST 2019, 4
> Thu May 30 14:13:44 CST 2019, 5
> Thu May 30 14:13:45 CST 2019, 6
> Thu May 30 14:13:46 CST 2019, 7
> Thu May 30 14:13:47 CST 2019, 8
> Thu May 30 14:13:48 CST 2019, 9
> Thu May 30 14:13:49 CST 2019, 10
>
> When the system was running 32 sysbench threads and
> 32 gemmbench threads, it worked as below(the system
> has ~38% idle time)
> Thu May 30 14:14:20 CST 2019, 1
> Thu May 30 14:14:21 CST 2019, 2
> Thu May 30 14:14:22 CST 2019, 3
> Thu May 30 14:14:24 CST 2019, 4 <=======x=
> Thu May 30 14:14:25 CST 2019, 5
> Thu May 30 14:14:26 CST 2019, 6
> Thu May 30 14:14:28 CST 2019, 7 <=======x=
> Thu May 30 14:14:29 CST 2019, 8
> Thu May 30 14:14:31 CST 2019, 9 <=======x=
> Thu May 30 14:14:34 CST 2019, 10 <=======x=
>
> And it got worse when the system was running 64/64 case,
> the system still had ~3% idle time
> Thu May 30 14:26:40 CST 2019, 1
> Thu May 30 14:26:46 CST 2019, 2
> Thu May 30 14:26:53 CST 2019, 3
> Thu May 30 14:27:01 CST 2019, 4
> Thu May 30 14:27:03 CST 2019, 5
> Thu May 30 14:27:11 CST 2019, 6
> Thu May 30 14:27:31 CST 2019, 7
> Thu May 30 14:27:32 CST 2019, 8
> Thu May 30 14:27:41 CST 2019, 9
> Thu May 30 14:27:56 CST 2019, 10
>
> Any thoughts?

Interesting, could you detail a bit more your test setup (commands used,
type of machine, any cgroup/pinning configuration, etc) ? I would like
to reproduce it and investigate.

Thanks,

Julien

2019-05-31 03:04:56

by Aaron Lu

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/5/30 22:04, Aubrey Li wrote:
> On Thu, May 30, 2019 at 4:36 AM Vineeth Remanan Pillai
> <[email protected]> wrote:
>>
>> Third iteration of the Core-Scheduling feature.
>>
>> This version fixes mostly correctness related issues in v2 and
>> addresses performance issues. Also, addressed some crashes related
>> to cgroups and cpu hotplugging.
>>
>> We have tested and verified that incompatible processes are not
>> selected during schedule. In terms of performance, the impact
>> depends on the workload:
>> - on CPU intensive applications that use all the logical CPUs with
>> SMT enabled, enabling core scheduling performs better than nosmt.
>> - on mixed workloads with considerable io compared to cpu usage,
>> nosmt seems to perform better than core scheduling.
>
> My testing scripts can not be completed on this version. I figured out the
> number of cpu utilization report entry didn't reach my minimal requirement.
> Then I wrote a simple script to verify.
> ====================
> $ cat test.sh
> #!/bin/sh
>
> for i in `seq 1 10`
> do
> echo `date`, $i
> sleep 1
> done
> ====================

Is the shell put to some cgroup and assigned some tag or simply untagged?

>
> Normally it works as below:
>
> Thu May 30 14:13:40 CST 2019, 1
> Thu May 30 14:13:41 CST 2019, 2
> Thu May 30 14:13:42 CST 2019, 3
> Thu May 30 14:13:43 CST 2019, 4
> Thu May 30 14:13:44 CST 2019, 5
> Thu May 30 14:13:45 CST 2019, 6
> Thu May 30 14:13:46 CST 2019, 7
> Thu May 30 14:13:47 CST 2019, 8
> Thu May 30 14:13:48 CST 2019, 9
> Thu May 30 14:13:49 CST 2019, 10
>
> When the system was running 32 sysbench threads and
> 32 gemmbench threads, it worked as below(the system
> has ~38% idle time)

Are the two workloads assigned different tags?
And how many cores/threads do you have?

> Thu May 30 14:14:20 CST 2019, 1
> Thu May 30 14:14:21 CST 2019, 2
> Thu May 30 14:14:22 CST 2019, 3
> Thu May 30 14:14:24 CST 2019, 4 <=======x=
> Thu May 30 14:14:25 CST 2019, 5
> Thu May 30 14:14:26 CST 2019, 6
> Thu May 30 14:14:28 CST 2019, 7 <=======x=
> Thu May 30 14:14:29 CST 2019, 8
> Thu May 30 14:14:31 CST 2019, 9 <=======x=
> Thu May 30 14:14:34 CST 2019, 10 <=======x=

This feels like "date" failed to schedule on some CPU
on time.

> And it got worse when the system was running 64/64 case,
> the system still had ~3% idle time
> Thu May 30 14:26:40 CST 2019, 1
> Thu May 30 14:26:46 CST 2019, 2
> Thu May 30 14:26:53 CST 2019, 3
> Thu May 30 14:27:01 CST 2019, 4
> Thu May 30 14:27:03 CST 2019, 5
> Thu May 30 14:27:11 CST 2019, 6
> Thu May 30 14:27:31 CST 2019, 7
> Thu May 30 14:27:32 CST 2019, 8
> Thu May 30 14:27:41 CST 2019, 9
> Thu May 30 14:27:56 CST 2019, 10
>
> Any thoughts?

My first reaction is: when shell wakes up from sleep, it will
fork date. If the script is untagged and those workloads are
tagged and all available cores are already running workload
threads, the forked date can lose to the running workload
threads due to __prio_less() can't properly do vruntime comparison
for tasks on different CPUs. So those idle siblings can't run
date and are idled instead. See my previous post on this:

https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
(Now that I re-read my post, I see that I didn't make it clear
that se_bash and se_hog are assigned different tags(e.g. hog is
tagged and bash is untagged).

Siblings being forced idle is expected due to the nature of core
scheduling, but when two tasks belonging to two siblings are
fighting for schedule, we should let the higher priority one win.

It used to work on v2 is probably due to we mistakenly
allow different tagged tasks to schedule on the same core at
the same time, but that is fixed in v3.
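
As a rough standalone rendering of the comparison being described
(plain C with made-up numbers, not the kernel code): the fair-class part
of the cross-CPU comparison shifts b's vruntime into a's timeline using
the two runqueues' min_vruntime, and a freshly forked task can then keep
losing to a hog that runs alone on the other sibling.

#include <stdio.h>
#include <stdint.h>

struct cfs_rq { uint64_t min_vruntime; };
struct task   { const char *comm; uint64_t vruntime; struct cfs_rq *cfs_rq; };

/* Returns 1 when a compares as lower priority than b. */
static int prio_less_fair(const struct task *a, const struct task *b)
{
        uint64_t vruntime = b->vruntime;

        if (a->cfs_rq != b->cfs_rq) {           /* tasks on different CPUs */
                vruntime -= b->cfs_rq->min_vruntime;
                vruntime += a->cfs_rq->min_vruntime;
        }
        return (int64_t)(a->vruntime - vruntime) > 0;
}

int main(void)
{
        /* The tagged hog runs alone on its sibling, so its rq's
         * min_vruntime tracks its own vruntime. */
        struct cfs_rq rq_hog  = { .min_vruntime = 100000000 };
        /* The untagged shell's rq; fork placement leaves the new 'date'
         * task a little ahead of this rq's min_vruntime. */
        struct cfs_rq rq_date = { .min_vruntime = 100500000 };

        struct task hog  = { "hog",  100000000, &rq_hog };
        struct task date = { "date", 103500000, &rq_date };

        /*
         * date normalized into hog's timeline:
         *   103500000 - 100500000 + 100000000 = 103000000,
         * so hog never compares as lower priority.  As the hog keeps
         * running alone, its vruntime and its rq's min_vruntime advance
         * together while date's rq stands still, so the relation persists
         * and the forced-idle sibling never picks date.
         */
        printf("prio_less(hog, date) = %d\n", prio_less_fair(&hog, &date));
        printf("prio_less(date, hog) = %d\n", prio_less_fair(&date, &hog));
        return 0;
}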

2019-05-31 04:57:04

by Aubrey Li

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, May 30, 2019 at 10:17 PM Julien Desfossez
<[email protected]> wrote:
>
> Interesting, could you detail a bit more your test setup (commands used,
> type of machine, any cgroup/pinning configuration, etc) ? I would like
> to reproduce it and investigate.

Let me see if I can simplify my test to reproduce it.

Thanks,
-Aubrey

2019-05-31 05:14:16

by Aubrey Li

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, May 31, 2019 at 11:01 AM Aaron Lu <[email protected]> wrote:
>
> This feels like "date" failed to schedule on some CPU
> on time.
>
> My first reaction is: when shell wakes up from sleep, it will
> fork date. If the script is untagged and those workloads are
> tagged and all available cores are already running workload
> threads, the forked date can lose to the running workload
> threads due to __prio_less() can't properly do vruntime comparison
> for tasks on different CPUs. So those idle siblings can't run
> date and are idled instead. See my previous post on this:
> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> (Now that I re-read my post, I see that I didn't make it clear
> that se_bash and se_hog are assigned different tags(e.g. hog is
> tagged and bash is untagged).

Yes, script is untagged. This looks like exactly the problem in you
previous post. I didn't follow that, does that discussion lead to a solution?

>
> Siblings being forced idle is expected due to the nature of core
> scheduling, but when two tasks belonging to two siblings are
> fighting for schedule, we should let the higher priority one win.
>
> It used to work on v2 is probably due to we mistakenly
> allow different tagged tasks to schedule on the same core at
> the same time, but that is fixed in v3.

I have 64 threads running on a 104-CPU server, that is, when the
system has ~40% idle time, and "date" is still failed to be picked
up onto CPU on time. This may be the nature of core scheduling,
but it seems to be far from fairness.

Shouldn't we share the core between (sysbench+gemmbench)
and (date)? I mean core level sharing instead of "date" starvation?

Thanks,
-Aubrey

2019-05-31 06:12:07

by Aaron Lu

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/5/31 13:12, Aubrey Li wrote:
> On Fri, May 31, 2019 at 11:01 AM Aaron Lu <[email protected]> wrote:
>>
>> This feels like "date" failed to schedule on some CPU
>> on time.
>>
>> My first reaction is: when shell wakes up from sleep, it will
>> fork date. If the script is untagged and those workloads are
>> tagged and all available cores are already running workload
>> threads, the forked date can lose to the running workload
>> threads due to __prio_less() can't properly do vruntime comparison
>> for tasks on different CPUs. So those idle siblings can't run
>> date and are idled instead. See my previous post on this:
>> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
>> (Now that I re-read my post, I see that I didn't make it clear
>> that se_bash and se_hog are assigned different tags(e.g. hog is
>> tagged and bash is untagged).
>
> Yes, script is untagged. This looks like exactly the problem in you
> previous post. I didn't follow that, does that discussion lead to a solution?

No immediate solution yet.

>>
>> Siblings being forced idle is expected due to the nature of core
>> scheduling, but when two tasks belonging to two siblings are
>> fighting for schedule, we should let the higher priority one win.
>>
>> It used to work on v2 is probably due to we mistakenly
>> allow different tagged tasks to schedule on the same core at
>> the same time, but that is fixed in v3.
>
> I have 64 threads running on a 104-CPU server, that is, when the

104-CPU means 52 cores I guess.
64 threads may(should?) spread on all the 52 cores and that is enough
to make 'date' suffer.

> system has ~40% idle time, and "date" is still failed to be picked
> up onto CPU on time. This may be the nature of core scheduling,
> but it seems to be far from fairness.

Exactly.

> Shouldn't we share the core between (sysbench+gemmbench)
> and (date)? I mean core level sharing instead of "date" starvation?

We need to make core scheduling fair, but since there is no
immediate solution for cross-CPU vruntime comparison, it's not
done yet.

2019-05-31 06:54:55

by Aubrey Li

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, May 31, 2019 at 2:09 PM Aaron Lu <[email protected]> wrote:
>
> On 2019/5/31 13:12, Aubrey Li wrote:
> > On Fri, May 31, 2019 at 11:01 AM Aaron Lu <[email protected]> wrote:
> >>
> >> This feels like "date" failed to schedule on some CPU
> >> on time.
> >>
> >> My first reaction is: when shell wakes up from sleep, it will
> >> fork date. If the script is untagged and those workloads are
> >> tagged and all available cores are already running workload
> >> threads, the forked date can lose to the running workload
> >> threads due to __prio_less() can't properly do vruntime comparison
> >> for tasks on different CPUs. So those idle siblings can't run
> >> date and are idled instead. See my previous post on this:
> >> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> >> (Now that I re-read my post, I see that I didn't make it clear
> >> that se_bash and se_hog are assigned different tags(e.g. hog is
> >> tagged and bash is untagged).
> >
> > Yes, script is untagged. This looks like exactly the problem in you
> > previous post. I didn't follow that, does that discussion lead to a solution?
>
> No immediate solution yet.
>
> >>
> >> Siblings being forced idle is expected due to the nature of core
> >> scheduling, but when two tasks belonging to two siblings are
> >> fighting for schedule, we should let the higher priority one win.
> >>
> >> It used to work on v2 is probably due to we mistakenly
> >> allow different tagged tasks to schedule on the same core at
> >> the same time, but that is fixed in v3.
> >
> > I have 64 threads running on a 104-CPU server, that is, when the
>
> 104-CPU means 52 cores I guess.
> 64 threads may(should?) spread on all the 52 cores and that is enough
> to make 'date' suffer.

64 threads should spread onto all the 52 cores, but why they can get
scheduled while untagged "date" can not? Is it because in the current
implementation the task with cookie always has higher priority than the
task without a cookie?

Thanks,
-Aubrey

2019-05-31 07:46:36

by Aaron Lu

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, May 31, 2019 at 02:53:21PM +0800, Aubrey Li wrote:
> On Fri, May 31, 2019 at 2:09 PM Aaron Lu <[email protected]> wrote:
> >
> > On 2019/5/31 13:12, Aubrey Li wrote:
> > > On Fri, May 31, 2019 at 11:01 AM Aaron Lu <[email protected]> wrote:
> > >>
> > >> This feels like "date" failed to schedule on some CPU
> > >> on time.
> > >>
> > >> My first reaction is: when shell wakes up from sleep, it will
> > >> fork date. If the script is untagged and those workloads are
> > >> tagged and all available cores are already running workload
> > >> threads, the forked date can lose to the running workload
> > >> threads due to __prio_less() can't properly do vruntime comparison
> > >> for tasks on different CPUs. So those idle siblings can't run
> > >> date and are idled instead. See my previous post on this:
> > >> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> > >> (Now that I re-read my post, I see that I didn't make it clear
> > >> that se_bash and se_hog are assigned different tags(e.g. hog is
> > >> tagged and bash is untagged).
> > >
> > > Yes, script is untagged. This looks like exactly the problem in you
> > > previous post. I didn't follow that, does that discussion lead to a solution?
> >
> > No immediate solution yet.
> >
> > >>
> > >> Siblings being forced idle is expected due to the nature of core
> > >> scheduling, but when two tasks belonging to two siblings are
> > >> fighting for schedule, we should let the higher priority one win.
> > >>
> > >> It used to work on v2 is probably due to we mistakenly
> > >> allow different tagged tasks to schedule on the same core at
> > >> the same time, but that is fixed in v3.
> > >
> > > I have 64 threads running on a 104-CPU server, that is, when the
> >
> > 104-CPU means 52 cores I guess.
> > 64 threads may(should?) spread on all the 52 cores and that is enough
> > to make 'date' suffer.
>
> 64 threads should spread onto all the 52 cores, but why they can get
> scheduled while untagged "date" can not? Is it because in the current

If 'date' didn't get scheduled, there will be no output at all unless
all those workload threads finished :-)

I guess the workload you used is not entirely CPU intensive, or 'date'
can be totally blocked due to START_DEBIT. But note that START_DEBIT
isn't the problem here, cross CPU vruntime comparison is.

> implementation the task with cookie always has higher priority than the
> task without a cookie?

No.

2019-05-31 08:29:44

by Aubrey Li

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, May 31, 2019 at 3:45 PM Aaron Lu <[email protected]> wrote:
>
> On Fri, May 31, 2019 at 02:53:21PM +0800, Aubrey Li wrote:
> > On Fri, May 31, 2019 at 2:09 PM Aaron Lu <[email protected]> wrote:
> > >
> > > On 2019/5/31 13:12, Aubrey Li wrote:
> > > > On Fri, May 31, 2019 at 11:01 AM Aaron Lu <[email protected]> wrote:
> > > >>
> > > >> This feels like "date" failed to schedule on some CPU
> > > >> on time.
> > > >>
> > > >> My first reaction is: when shell wakes up from sleep, it will
> > > >> fork date. If the script is untagged and those workloads are
> > > >> tagged and all available cores are already running workload
> > > >> threads, the forked date can lose to the running workload
> > > >> threads due to __prio_less() can't properly do vruntime comparison
> > > >> for tasks on different CPUs. So those idle siblings can't run
> > > >> date and are idled instead. See my previous post on this:
> > > >> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> > > >> (Now that I re-read my post, I see that I didn't make it clear
> > > >> that se_bash and se_hog are assigned different tags(e.g. hog is
> > > >> tagged and bash is untagged).
> > > >
> > > > Yes, script is untagged. This looks like exactly the problem in you
> > > > previous post. I didn't follow that, does that discussion lead to a solution?
> > >
> > > No immediate solution yet.
> > >
> > > >>
> > > >> Siblings being forced idle is expected due to the nature of core
> > > >> scheduling, but when two tasks belonging to two siblings are
> > > >> fighting for schedule, we should let the higher priority one win.
> > > >>
> > > >> It used to work on v2 is probably due to we mistakenly
> > > >> allow different tagged tasks to schedule on the same core at
> > > >> the same time, but that is fixed in v3.
> > > >
> > > > I have 64 threads running on a 104-CPU server, that is, when the
> > >
> > > 104-CPU means 52 cores I guess.
> > > 64 threads may(should?) spread on all the 52 cores and that is enough
> > > to make 'date' suffer.
> >
> > 64 threads should spread onto all the 52 cores, but why they can get
> > scheduled while untagged "date" can not? Is it because in the current
>
> If 'date' didn't get scheduled, there will be no output at all unless
> all those workload threads finished :-)

Certainly I meant untagged "date" can not be scheduled on time, :)

>
> I guess the workload you used is not entirely CPU intensive, or 'date'
> can be totally blocked due to START_DEBIT. But note that START_DEBIT
> isn't the problem here, cross CPU vruntime comparison is.
>
> > implementation the task with cookie always has higher priority than the
> > task without a cookie?
>
> No.

I checked the benchmark log manually; it looks like the data of the two
benchmarks with cookies is acceptable, but the ones without cookies are
really bad.

2019-05-31 11:11:08

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 10/16] sched: Core-wide rq->lock

On Wed, May 29, 2019 at 08:36:46PM +0000, Vineeth Remanan Pillai wrote:

> + * The static-key + stop-machine variable are needed such that:
> + *
> + * spin_lock(rq_lockp(rq));
> + * ...
> + * spin_unlock(rq_lockp(rq));
> + *
> + * ends up locking and unlocking the _same_ lock, and all CPUs
> + * always agree on what rq has what lock.

> @@ -5790,8 +5854,15 @@ int sched_cpu_activate(unsigned int cpu)
> /*
> * When going up, increment the number of cores with SMT present.
> */
> - if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> + if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
> static_branch_inc_cpuslocked(&sched_smt_present);
> +#ifdef CONFIG_SCHED_CORE
> + if (static_branch_unlikely(&__sched_core_enabled)) {
> + rq->core_enabled = true;
> + }
> +#endif
> + }
> +
> #endif
> set_cpu_active(cpu, true);
>
> @@ -5839,8 +5910,16 @@ int sched_cpu_deactivate(unsigned int cpu)
> /*
> * When going down, decrement the number of cores with SMT present.
> */
> - if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> + if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
> +#ifdef CONFIG_SCHED_CORE
> + struct rq *rq = cpu_rq(cpu);
> + if (static_branch_unlikely(&__sched_core_enabled)) {
> + rq->core_enabled = false;
> + }
> +#endif
> static_branch_dec_cpuslocked(&sched_smt_present);
> +
> + }
> #endif

I'm confused, how doesn't this break the invariant above?

That is, all CPUs must at all times agree on the value of rq_lockp(),
and I'm not seeing how that is true with the above changes.

2019-05-31 15:25:43

by Vineeth Remanan Pillai

Subject: Re: [RFC PATCH v3 10/16] sched: Core-wide rq->lock

>
> I'm confused, how doesn't this break the invariant above?
>
> That is, all CPUs must at all times agree on the value of rq_lockp(),
> and I'm not seeing how that is true with the above changes.
>
While fixing the crash in cpu online/offline, I was focusing on
maintaining the invariant that all online cpus agree on the value of
rq_lockp(). Would it be safe to assume that rq and rq_lock would be
used only after a cpu is onlined (sched:active)?

To maintain the strict invariant, the sibling should also disable
core scheduling, but we need to empty the rbtree before disabling it.
I am trying to see how to empty the rbtree safely in the offline
context.

Thanks,
Vineeth

2019-05-31 21:10:01

by Julien Desfossez

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> My first reaction is: when shell wakes up from sleep, it will
> fork date. If the script is untagged and those workloads are
> tagged and all available cores are already running workload
> threads, the forked date can lose to the running workload
> threads due to __prio_less() can't properly do vruntime comparison
> for tasks on different CPUs. So those idle siblings can't run
> date and are idled instead. See my previous post on this:
>
> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> (Now that I re-read my post, I see that I didn't make it clear
> that se_bash and se_hog are assigned different tags(e.g. hog is
> tagged and bash is untagged).
>
> Siblings being forced idle is expected due to the nature of core
> scheduling, but when two tasks belonging to two siblings are
> fighting for schedule, we should let the higher priority one win.
>
> It used to work on v2 is probably due to we mistakenly
> allow different tagged tasks to schedule on the same core at
> the same time, but that is fixed in v3.

I confirm this is indeed what is happening, we reproduced it with a
simple script that only uses one core (cpu 2 and 38 are sibling on this
machine):

setup:
cgcreate -g cpu,cpuset:test
cgcreate -g cpu,cpuset:test/set1
cgcreate -g cpu,cpuset:test/set2
echo 2,38 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/test/cpuset.mems
echo 2,38 > /sys/fs/cgroup/cpuset/test/set1/cpuset.cpus
echo 2,38 > /sys/fs/cgroup/cpuset/test/set2/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/test/set1/cpuset.mems
echo 0 > /sys/fs/cgroup/cpuset/test/set2/cpuset.mems
echo 1 > /sys/fs/cgroup/cpu,cpuacct/test/set1/cpu.tag

In one terminal:
sudo cgexec -g cpu,cpuset:test/set1 sysbench --threads=1 --time=30
--test=cpu run

In another one:
sudo cgexec -g cpu,cpuset:test/set2 date

It's very clear that 'date' hangs until sysbench is done.

We started experimenting with marking a task on the forced idle sibling
if normalized vruntimes are equal. That way, at the next compare, if the
normalized vruntimes are still equal, it prefers the task on the forced
idle sibling. It still needs more work, but in our early tests it helps.

Thanks,

Julien

2019-06-06 18:15:40

by Julien Desfossez

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 31-May-2019 05:08:16 PM, Julien Desfossez wrote:
> > My first reaction is: when shell wakes up from sleep, it will
> > fork date. If the script is untagged and those workloads are
> > tagged and all available cores are already running workload
> > threads, the forked date can lose to the running workload
> > threads due to __prio_less() can't properly do vruntime comparison
> > for tasks on different CPUs. So those idle siblings can't run
> > date and are idled instead. See my previous post on this:
> >
> > https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> > (Now that I re-read my post, I see that I didn't make it clear
> > that se_bash and se_hog are assigned different tags(e.g. hog is
> > tagged and bash is untagged).
> >
> > Siblings being forced idle is expected due to the nature of core
> > scheduling, but when two tasks belonging to two siblings are
> > fighting for schedule, we should let the higher priority one win.
> >
> > It used to work on v2 is probably due to we mistakenly
> > allow different tagged tasks to schedule on the same core at
> > the same time, but that is fixed in v3.
>
> I confirm this is indeed what is happening, we reproduced it with a
> simple script that only uses one core (cpu 2 and 38 are sibling on this
> machine):
>
> setup:
> cgcreate -g cpu,cpuset:test
> cgcreate -g cpu,cpuset:test/set1
> cgcreate -g cpu,cpuset:test/set2
> echo 2,38 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/test/cpuset.mems
> echo 2,38 > /sys/fs/cgroup/cpuset/test/set1/cpuset.cpus
> echo 2,38 > /sys/fs/cgroup/cpuset/test/set2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/test/set1/cpuset.mems
> echo 0 > /sys/fs/cgroup/cpuset/test/set2/cpuset.mems
> echo 1 > /sys/fs/cgroup/cpu,cpuacct/test/set1/cpu.tag
>
> In one terminal:
> sudo cgexec -g cpu,cpuset:test/set1 sysbench --threads=1 --time=30
> --test=cpu run
>
> In another one:
> sudo cgexec -g cpu,cpuset:test/set2 date
>
> It's very clear that 'date' hangs until sysbench is done.
>
> We started experimenting with marking a task on the forced idle sibling
> if normalized vruntimes are equal. That way, at the next compare, if the
> normalized vruntimes are still equal, it prefers the task on the forced
> idle sibling. It still needs more work, but in our early tests it helps.

As mentioned above, we have come up with a fix for the long starvation
of untagged interactive threads competing for the same core with tagged
threads at the same priority. The idea is to detect the stall and boost
the stalling thread's priority so that it gets a chance next time.
Boosting is done with a new per-task counter (core_vruntime_boost in
the patch below) which we subtract from the vruntime before comparison.
The new logic looks like this:

If we see that the normalized vruntimes are equal, we check the
min_vruntimes of their runqueues and give a chance to the task in the
runqueue with the smaller min_vruntime. That will help it to progress
its vruntime. While doing this, we boost the priority of the task on the
sibling so that we don’t starve the task on the sibling until the
min_vruntime of this runqueue catches up.

If the min_vruntimes are also equal, we do as before and consider task
‘a’ to be of higher priority. Here we boost task ‘b’ so that it gets to
run next time.

The boost counter is reset to zero once the task is on a cpu. So only a
waiting task will have a non-zero value, if it is starved while matching
a task on the other sibling.
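
Here is a small standalone worked example of that tie-breaking (made-up
numbers and simplified types; the real change is in the prio_less() hunk
of the patch below):

#include <stdio.h>
#include <stdint.h>

struct task { const char *comm; uint64_t vruntime, min_vruntime, boost; };

/* Returns 1 when a is considered lower priority than b, boosting the
 * loser of an exact tie so that it wins the next comparison instead of
 * starving. */
static int prio_less(struct task *a, struct task *b)
{
        uint64_t av = a->vruntime - a->boost;
        uint64_t bv = b->vruntime - b->boost;
        int64_t min_diff = (int64_t)(a->min_vruntime - b->min_vruntime);

        bv += min_diff;                 /* normalize b into a's timeline */

        if (av == bv) {                 /* tie: favour the rq that is behind */
                int a_loses = min_diff > 0;

                if (a_loses)
                        a->boost++;     /* remember the stall for next time */
                else
                        b->boost++;
                return a_loses;
        }
        return (int64_t)(av - bv) > 0;
}

int main(void)
{
        /* 'date' waits on the forced-idle sibling; its rq's min_vruntime
         * is ahead of the runner's rq. */
        struct task waiter = { "date",     2000, 2000, 0 };
        struct task runner = { "sysbench", 1000, 1000, 0 };

        /* Round 1 ties after normalization: date loses but is boosted. */
        printf("round 1: date lower prio? %d (boost now %llu)\n",
               prio_less(&waiter, &runner),
               (unsigned long long)waiter.boost);
        /* Round 2: the boost breaks the otherwise identical tie. */
        printf("round 2: date lower prio? %d\n",
               prio_less(&waiter, &runner));
        return 0;
}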

The attached patch adds a sched_feature to enable the above behaviour so
that you can compare the results with and without it.

What we observe with this patch is that it helps untagged interactive
tasks and fairness in general, but it increases the overhead of core
scheduling when there is contention for the CPU among tasks of varying
cpu usage. The general trend we see is that if there is a cpu intensive
thread and multiple relatively idle threads in different tags, the cpu
intensive thread continuously yields to be fair to the relatively idle
threads whenever they become runnable. And if the relatively idle
threads make up most of the tasks in the system and are tagged, the cpu
intensive tasks see a considerable drop in performance.

If you have any feedback or creative ideas to help improve, let us
know !

Thanks


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1a309e8..56cad0e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -642,6 +642,7 @@ struct task_struct {
struct rb_node core_node;
unsigned long core_cookie;
unsigned int core_occupation;
+ unsigned int core_vruntime_boost;
#endif

#ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 73329da..c302853 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -92,6 +92,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)

int pa = __task_prio(a), pb = __task_prio(b);

+ trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+ a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+ b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;

@@ -102,21 +106,36 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
return !dl_time_before(a->dl.deadline, b->dl.deadline);

if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 vruntime = b->se.vruntime;
-
- trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
- a->comm, a->pid, pa, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
- b->comm, b->pid, pb, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+ u64 a_vruntime = a->se.vruntime - a->core_vruntime_boost;
+ u64 b_vruntime = b->se.vruntime - b->core_vruntime_boost;

/*
* Normalize the vruntime if tasks are in different cpus.
*/
if (task_cpu(a) != task_cpu(b)) {
- vruntime -= task_cfs_rq(b)->min_vruntime;
- vruntime += task_cfs_rq(a)->min_vruntime;
+ s64 min_vruntime_diff = task_cfs_rq(a)->min_vruntime -
+ task_cfs_rq(b)->min_vruntime;
+ b_vruntime += min_vruntime_diff;
+
+ trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+ a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+ b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+ if (sched_feat(CORESCHED_STALL_FIX) &&
+ a_vruntime == b_vruntime) {
+ bool less_prio = min_vruntime_diff > 0;
+
+ if (less_prio)
+ a->core_vruntime_boost++;
+ else
+ b->core_vruntime_boost++;
+
+ return less_prio;
+
+ }
}

- return !((s64)(a->se.vruntime - vruntime) <= 0);
+ return !((s64)(a_vruntime - b_vruntime) <= 0);
}

return false;
@@ -2456,6 +2475,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_COMPACTION
p->capture_control = NULL;
#endif
+#ifdef CONFIG_SCHED_CORE
+ p->core_vruntime_boost = 0UL;
+#endif
init_numa_balancing(clone_flags, p);
}

@@ -3723,6 +3745,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
next->comm, next->pid,
next->core_cookie);

+ next->core_vruntime_boost = 0UL;
return next;
}

@@ -3835,6 +3858,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);

if (old_max) {
+ if (old_max->core_vruntime_boost)
+ old_max->core_vruntime_boost--;
+
for_each_cpu(j, smt_mask) {
if (j == i)
continue;
@@ -3905,6 +3931,7 @@ next_class:;

done:
set_next_task(rq, next);
+ next->core_vruntime_boost = 0UL;
return next;
}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..332a092 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Prevent task stall due to vruntime comparison limitation across
+ * cpus.
+ */
+SCHED_FEAT(CORESCHED_STALL_FIX, false)

2019-06-12 07:26:08

by Li, Aubrey

Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/6/6 23:26, Julien Desfossez wrote:
> As mentioned above, we have come up with a fix for the long starvation
> of untagged interactive threads competing for the same core with tagged
> threads at the same priority. The idea is to detect the stall and boost
> the stalling threads priority so that it gets a chance next time.
> Boosting is done by a new counter(min_vruntime_boost) for every task
> which we subtract from vruntime before comparison. The new logic looks
> like this:
>
> If we see that normalized runtimes are equal, we check the min_vruntimes
> of their runqueues and give a chance for the task in the runqueue with
> less min_vruntime. That will help it to progress its vruntime. While
> doing this, we boost the priority of the task in the sibling so that, we
> don’t starve the task in the sibling until the min_vruntime of this
> runqueue catches up.
>
> If min_vruntimes are also equal, we do as before and consider the task
> ‘a’ of higher priority. Here we boost the task ‘b’ so that it gets to
> run next time.
>
> The min_vruntime_boost is reset to zero once the task in on cpu. So only
> waiting tasks will have a non-zero value if it is starved while matching
> a task on the other sibling.
>
> The attached patch has a sched_feature to enable the above feature so
> that you can compare the results with and without this feature.
>
> What we observe with this patch is that it helps for untagged
> interactive tasks and fairness in general, but this increases the
> overhead of core scheduling when there is contention for the CPU with
> tasks of varying cpu usage. The general trend we see is that if there is
> a cpu intensive thread and multiple relatively idle threads in different
> tags, the cpu intensive tasks continuously yields to be fair to the
> relatively idle threads when it becomes runnable. And if the relatively
> idle threads make up for most of the tasks in a system and are tagged,
> the cpu intensive tasks sees a considerable drop in performance.
>
> If you have any feedback or creative ideas to help improve, let us
> know !

The data on my side looks good with CORESCHED_STALL_FIX = true.

Environment setup
--------------------------
Skylake 8170 server, 2 numa nodes, 52 cores, 104 CPUs (HT on)
cgroup1 workload, sysbench (CPU mode, non AVX workload)
cgroup2 workload, gemmbench (AVX512 workload)

sysbench throughput result:
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% | no-SMT [std% / sem%] +/- cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
| 1/1 490.8 [ 0.1%/ 0.0%] 1.9% | 492.6 [ 0.1%/ 0.0%] 0.4% 1.9% | 489.5 [ 0.1%/ 0.0%] -0.3% 3.9% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 2/2 975.0 [ 0.6%/ 0.1%] 3.9% | 970.4 [ 0.4%/ 0.0%] -0.5% 3.9% | 975.6 [ 0.2%/ 0.0%] 0.1% 7.7% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 4/4 1856.9 [ 0.2%/ 0.0%] 7.8% | 1854.5 [ 0.3%/ 0.0%] -0.1% 7.8% | 1849.4 [ 0.8%/ 0.1%] -0.4% 14.8% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 8/8 3622.8 [ 0.2%/ 0.0%] 14.6% | 3618.3 [ 0.1%/ 0.0%] -0.1% 14.7% | 3626.6 [ 0.4%/ 0.0%] 0.1% 30.1% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 16/16 6976.7 [ 0.2%/ 0.0%] 30.1% | 6959.3 [ 0.3%/ 0.0%] -0.2% 30.1% | 6964.4 [ 0.9%/ 0.1%] -0.2% 60.1% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 32/32 10347.7 [ 3.8%/ 0.4%] 60.1% | 11525.4 [ 2.8%/ 0.3%] 11.4% 59.5% | 9810.5 [ 9.4%/ 0.8%] -5.2% 97.7% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 64/64 15284.9 [ 9.0%/ 0.9%] 98.1% | 17022.1 [ 4.5%/ 0.5%] 11.4% 98.2% | 9989.7 [19.3%/ 1.1%] -34.6% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|128/128 16211.3 [18.9%/ 1.9%] 100.0% | 16507.9 [ 6.1%/ 0.6%] 1.8% 99.8% | 10379.0 [12.6%/ 0.8%] -36.0% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|256/256 16667.1 [ 3.1%/ 0.3%] 100.0% | 16499.1 [ 3.2%/ 0.3%] -1.0% 100.0% | 10540.9 [16.2%/ 1.0%] -36.8% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

sysbench latency result:
(The reason we care about latency is that some customers reported their
latency-critical job is affected when a deep learning job (an AVX512
task) is co-located onto the same core: when a core executes AVX512
instructions, the core automatically reduces its frequency. This can
lead to a significant overall performance loss for a non-AVX512 job
on the same core.

And now that we have the core cookie match mechanism, if we put the
AVX512 tasks and the non-AVX512 tasks into different cgroups, they are
not supposed to be co-located. That's why we saw the improvements in the
32/32 and 64/64 cases.)

.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% | no-SMT [std% / sem%] +/- cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
| 1/1 2.1 [ 0.6%/ 0.1%] 1.9% | 2.0 [ 0.2%/ 0.0%] 3.8% 1.9% | 2.1 [ 0.7%/ 0.1%] -0.8% 3.9% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 2/2 2.1 [ 0.7%/ 0.1%] 3.9% | 2.1 [ 0.3%/ 0.0%] 0.2% 3.9% | 2.1 [ 0.6%/ 0.1%] 0.5% 7.7% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 4/4 2.2 [ 0.6%/ 0.1%] 7.8% | 2.2 [ 0.4%/ 0.0%] -0.2% 7.8% | 2.2 [ 0.2%/ 0.0%] -0.3% 14.8% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 8/8 2.2 [ 0.4%/ 0.0%] 14.6% | 2.2 [ 0.0%/ 0.0%] 0.1% 14.7% | 2.2 [ 0.0%/ 0.0%] 0.1% 30.1% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 16/16 2.4 [ 1.6%/ 0.2%] 30.1% | 2.4 [ 1.6%/ 0.2%] -0.9% 30.1% | 2.4 [ 1.9%/ 0.2%] -0.3% 60.1% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 32/32 4.9 [ 6.2%/ 0.6%] 60.1% | 3.1 [ 5.0%/ 0.5%] 36.6% 59.5% | 6.7 [17.3%/ 3.7%] -34.5% 97.7% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 64/64 9.4 [28.3%/ 2.8%] 98.1% | 3.5 [25.6%/ 2.6%] 62.4% 98.2% | 18.5 [ 9.5%/ 5.0%] -97.9% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|128/128 21.3 [10.1%/ 1.0%] 100.0% | 24.8 [ 8.1%/ 0.8%] -16.1% 99.8% | 34.5 [ 4.9%/ 0.7%] -62.0% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|256/256 35.5 [ 7.8%/ 0.8%] 100.0% | 37.3 [ 5.4%/ 0.5%] -5.1% 100.0% | 40.8 [ 5.9%/ 0.6%] -15.0% 100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

Note:
----
64/64: 64 sysbench threads (in one cgroup) and 64 gemmbench threads (in the other cgroup) run simultaneously.
Vanilla-SMT: baseline with HT on
coresched-SMT: core scheduling enabled
no-SMT: HT off through /sys/devices/system/cpu/smt/control
std%: standard deviation
sem%: standard error of the mean
+/-: improvement/regression against baseline
cpu%: derived from vmstat.idle and vmstat.iowait

2019-06-12 18:07:10

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> The data on my side looks good with CORESCHED_STALL_FIX = true.

Thank you for testing this fix; I'm glad it works for this use case as
well.

We will be posting another (simpler) version today, stay tuned :-)

Julien

2019-06-12 18:08:32

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

After reading more traces and trying to understand why only untagged
tasks are starving when there are cpu-intensive tasks running on the
same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
the case where ‘core_cookie’ is 0, we are supposed to prefer the tagged
task only if its priority is higher, but when the priorities are equal
we prefer it as well, which causes the starvation. ‘pick_task’ is biased
toward selecting its first parameter in case of equality, which in this
case was ‘class_pick’ instead of ‘max’. Reversing the order of the
parameters solves this issue and matches the expected behavior.

So we can get rid of this vruntime_boost concept.

We have tested the fix below and it seems to work well with
tagged/untagged tasks.

Here are our initial test results. When core scheduling is enabled,
each VM (and its associated vhost threads) is in its own cgroup/tag.

1 12-vcpu VM MySQL TPC-C benchmark (IO + CPU) with 96 mostly-idle 1-vcpu
VMs on each NUMA node (72 logical CPUs total with SMT on):
+-------------+----------+--------------+------------+--------+
| | baseline | coresched | coresched | nosmt |
| | no tag | VMs tagged | VMs tagged | no tag |
| | v5.1.5 | no stall fix | stall fix | |
+-------------+----------+--------------+------------+--------+
|average TPS | 1474 | 1289 | 1264 | 1339 |
|stdev | 48 | 12 | 17 | 24 |
|overhead | N/A | -12% | -14% | -9% |
+-------------+----------+--------------+------------+--------+

3 12-vcpu VMs running linpack (cpu-intensive), all pinned on the same
NUMA node (36 logical CPUs with SMT enabled on that NUMA node):
+---------------+----------+--------------+-----------+--------+
| | baseline | coresched | coresched | nosmt |
| | no tag | VMs tagged | VMs tagged| no tag |
| | v5.1.5 | no stall fix | stall fix | |
+---------------+----------+--------------+-----------+--------+
|average gflops | 177.9 | 171.3 | 172.7 | 81.9 |
|stdev | 2.6 | 10.6 | 6.4 | 8.1 |
|overhead | N/A | -3.7% | -2.9% | -53.9% |
+---------------+----------+--------------+-----------+--------+

This fix can be toggled dynamically with the ‘CORESCHED_STALL_FIX’
sched_feature so it’s easy to test before/after (it is disabled by
default).

The up-to-date git tree can also be found here in case it’s easier to
follow:
https://github.com/digitalocean/linux-coresched/commits/vpillai/coresched-v3-v5.1.5-test

Feedback welcome !

Thanks,

Julien

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e79421..26fea68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3668,8 +3668,10 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
* If class_pick is tagged, return it only if it has
* higher priority than max.
*/
- if (max && class_pick->core_cookie &&
- prio_less(class_pick, max))
+ bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
+ max && !prio_less(max, class_pick) :
+ max && prio_less(class_pick, max);
+ if (class_pick->core_cookie && max_is_higher)
return idle_sched_class.pick_task(rq);

return class_pick;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..332a092 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Prevent task stall due to vruntime comparison limitation across
+ * cpus.
+ */
+SCHED_FEAT(CORESCHED_STALL_FIX, false)

2019-06-13 16:53:59

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 12-Jun-2019 05:03:08 PM, Subhra Mazumdar wrote:
>
> On 6/12/19 9:33 AM, Julien Desfossez wrote:
> >After reading more traces and trying to understand why only untagged
> >tasks are starving when there are cpu-intensive tasks running on the
> >same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
> >the case where ‘core_cookie’ is 0, we are supposed to only prefer the
> >tagged task if it’s priority is higher, but when the priorities are
> >equal we prefer it as well which causes the starving. ‘pick_task’ is
> >biased toward selecting its first parameter in case of equality which in
> >this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
> >the parameter solves this issue and matches the expected behavior.
> >
> >So we can get rid of this vruntime_boost concept.
> >
> >We have tested the fix below and it seems to work well with
> >tagged/untagged tasks.
> >
> My 2 DB instance runs with this patch are better with CORESCHED_STALL_FIX
> than NO_CORESCHED_STALL_FIX in terms of performance, std deviation and
> idleness. May be enable it by default?

Yes if the fix is approved, we will just remove the option and it will
always be enabled.
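
For reference, with the feature test dropped, the zero-cookie branch of
pick_task() would collapse to something like the sketch below (derived
from the diff above, not a separately posted patch):

	if (!cookie) {
		/*
		 * Return the tagged class_pick only if it is of strictly
		 * higher priority than max; on a tie, pick idle so the
		 * untagged max is not displaced and starved.
		 */
		if (class_pick->core_cookie &&
		    max && !prio_less(max, class_pick))
			return idle_sched_class.pick_task(rq);

		return class_pick;
	}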

Thanks,

Julien

2019-06-13 17:00:19

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3


On 6/12/19 9:33 AM, Julien Desfossez wrote:
> After reading more traces and trying to understand why only untagged
> tasks are starving when there are cpu-intensive tasks running on the
> same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
> the case where ‘core_cookie’ is 0, we are supposed to only prefer the
> tagged task if it’s priority is higher, but when the priorities are
> equal we prefer it as well which causes the starving. ‘pick_task’ is
> biased toward selecting its first parameter in case of equality which in
> this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
> the parameter solves this issue and matches the expected behavior.
>
> So we can get rid of this vruntime_boost concept.
>
> We have tested the fix below and it seems to work well with
> tagged/untagged tasks.
>
My 2 DB instance runs with this patch are better with CORESCHED_STALL_FIX
than NO_CORESCHED_STALL_FIX in terms of performance, standard deviation and
idleness. Maybe enable it by default?

NO_CORESCHED_STALL_FIX:

users     %stdev   %gain   %idle
16        25       -42.4   73
24        32       -26.3   67
32        0.2      -48.9   62


CORESCHED_STALL_FIX:

users     %stdev   %gain   %idle
16        6.5      -23     70
24        0.6      -17     60
32        1.5      -30.2   52

2019-06-17 02:53:12

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Jun 13, 2019 at 11:22 AM Julien Desfossez
<[email protected]> wrote:
>
> On 12-Jun-2019 05:03:08 PM, Subhra Mazumdar wrote:
> >
> > On 6/12/19 9:33 AM, Julien Desfossez wrote:
> > >After reading more traces and trying to understand why only untagged
> > >tasks are starving when there are cpu-intensive tasks running on the
> > >same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
> > >the case where ‘core_cookie’ is 0, we are supposed to only prefer the
> > >tagged task if it’s priority is higher, but when the priorities are
> > >equal we prefer it as well which causes the starving. ‘pick_task’ is
> > >biased toward selecting its first parameter in case of equality which in
> > >this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
> > >the parameter solves this issue and matches the expected behavior.
> > >
> > >So we can get rid of this vruntime_boost concept.
> > >
> > >We have tested the fix below and it seems to work well with
> > >tagged/untagged tasks.
> > >
> > My 2 DB instance runs with this patch are better with CORESCHED_STALL_FIX
> > than NO_CORESCHED_STALL_FIX in terms of performance, std deviation and
> > idleness. May be enable it by default?
>
> Yes if the fix is approved, we will just remove the option and it will
> always be enabled.
>

sysbench --report-interval option unveiled something.

benchmark setup
-------------------------
two cgroups, cpuset.cpus = 1, 53 (one core, two siblings)
sysbench cpu mode, one thread in cgroup1
sysbench memory mode, one thread in cgroup2

no core scheduling
--------------------------
cpu throughput eps: 405.8, std: 0.14%
mem bandwidth MB/s: 5785.7, std: 0.11%

cgroup1 enable core scheduling(cpu mode)
cgroup2 disable core scheduling(memory mode)
-----------------------------------------------------------------
cpu throughput eps: 8.7, std: 519.2%
mem bandwidth MB/s: 6263.2, std: 9.3%

cgroup1 disable core scheduling(cpu mode)
cgroup2 enable core scheduling(memory mode)
-----------------------------------------------------------------
cpu throughput eps: 468.0 , std: 8.7%
mem bandwidth MB/S: 311.6 , std: 169.1%

cgroup1 enable core scheduling(cpu mode)
cgroup2 enable core scheduling(memory mode)
----------------------------------------------------------------
cpu throughput eps: 76.4 , std: 168.0%
mem bandwidth MB/S: 5388.3 , std: 30.9%

The result still looks unfair and, in particular, the variance is too high:
----sysbench cpu log ----
----snip----
[ 10s ] thds: 1 eps: 296.00 lat (ms,95%): 2.03
[ 11s ] thds: 1 eps: 0.00 lat (ms,95%): 1170.65
[ 12s ] thds: 1 eps: 1.00 lat (ms,95%): 0.00
[ 13s ] thds: 1 eps: 0.00 lat (ms,95%): 0.00
[ 14s ] thds: 1 eps: 295.91 lat (ms,95%): 2.03
[ 15s ] thds: 1 eps: 1.00 lat (ms,95%): 170.48
[ 16s ] thds: 1 eps: 0.00 lat (ms,95%): 2009.23
[ 17s ] thds: 1 eps: 1.00 lat (ms,95%): 995.51
[ 18s ] thds: 1 eps: 296.00 lat (ms,95%): 2.03
[ 19s ] thds: 1 eps: 1.00 lat (ms,95%): 170.48
[ 20s ] thds: 1 eps: 0.00 lat (ms,95%): 2009.23
----snip----

Thanks,
-Aubrey

2019-06-19 18:33:40

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> The result looks still unfair, and particularly, the variance is too high,

I just want to confirm that I am also seeing the same issue with a
similar setup. I also tried with the priority boost fix we previously
posted, the results are slightly better, but we are still seeing a very
high variance.

On average, the results I get for 10 30-seconds runs are still much
better than nosmt (both sysbench pinned on the same sibling) for the
memory benchmark, and pretty similar for the CPU benchmark, but the high
variance between runs is indeed concerning.

Still digging :-)

Thanks,

Julien

2019-07-18 10:08:43

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> > The result looks still unfair, and particularly, the variance is too high,
>
> I just want to confirm that I am also seeing the same issue with a
> similar setup. I also tried with the priority boost fix we previously
> posted, the results are slightly better, but we are still seeing a very
> high variance.
>
> On average, the results I get for 10 30-seconds runs are still much
> better than nosmt (both sysbench pinned on the same sibling) for the
> memory benchmark, and pretty similar for the CPU benchmark, but the high
> variance between runs is indeed concerning.

I was thinking of using the util_avg signal to decide which task wins in
__prio_less() in the cross-cpu case. The reason util_avg is chosen is
that it represents how cpu intensive the task is, so the end result is
that the less cpu intensive task will preempt the more cpu intensive
task.

Here is the test I have done to see how util_avg works
(on a single node, 16 cores, 32 cpus vm):
1 Start tmux and then start 3 windows with each running bash;
2 Place two shells into two different cgroups and both have cpu.tag set;
3 Switch to the 1st tmux window, start
will-it-scale/page_fault1_processes -t 16 -s 30
in the first tagged shell;
4 Switch to the 2nd tmux window;
5 Start
will-it-scale/page_fault1_processes -t 16 -s 30
in the 2nd tagged shell;
6 Switch to the 3rd tmux window;
7 Do some simple things in the 3rd untagged shell like ls to see if
untagged task is able to proceed;
8 Wait for the two page_fault workloads to finish.

With v3 here, I cannot do step 4 and the later steps, i.e. the 16
page_fault1 processes started in step 3 occupy all 16 cores and
other tasks do not have a chance to run, including tmux, which makes
switching tmux windows impossible.

With the below patch on top of v3 that makes use of util_avg to decide
which task wins, I can do all 8 steps and the final scores of the 2
workloads are 1796191 and 2199586. The scores are not close, suggesting
some unfairness, but I can finish the test now...

Here is the diff (consider it a POC):

---
kernel/sched/core.c | 35 ++---------------------------------
kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26fea68f7f54..7557a7bbb481 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);

- if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 a_vruntime = a->se.vruntime;
- u64 b_vruntime = b->se.vruntime;
-
- /*
- * Normalize the vruntime if tasks are in different cpus.
- */
- if (task_cpu(a) != task_cpu(b)) {
- b_vruntime -= task_cfs_rq(b)->min_vruntime;
- b_vruntime += task_cfs_rq(a)->min_vruntime;
-
- trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
- a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
- b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
- }
-
- return !((s64)(a_vruntime - b_vruntime) <= 0);
- }
+ if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+ return cfs_prio_less(a, b);

return false;
}
@@ -3663,20 +3646,6 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
if (!class_pick)
return NULL;

- if (!cookie) {
- /*
- * If class_pick is tagged, return it only if it has
- * higher priority than max.
- */
- bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
- max && !prio_less(max, class_pick) :
- max && prio_less(class_pick, max);
- if (class_pick->core_cookie && max_is_higher)
- return idle_sched_class.pick_task(rq);
-
- return class_pick;
- }
-
/*
* If class_pick is idle or matches cookie, return early.
*/
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26d29126d6a5..06fb00689db1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10740,3 +10740,39 @@ __init void init_sched_fair_class(void)
#endif /* SMP */

}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+ struct sched_entity *sea = &a->se;
+ struct sched_entity *seb = &b->se;
+ bool samecore = task_cpu(a) == task_cpu(b);
+ struct task_struct *p;
+ s64 delta;
+
+ if (samecore) {
+ /* vruntime is per cfs_rq */
+ while (!is_same_group(sea, seb)) {
+ int sea_depth = sea->depth;
+ int seb_depth = seb->depth;
+
+ if (sea_depth >= seb_depth)
+ sea = parent_entity(sea);
+ if (sea_depth <= seb_depth)
+ seb = parent_entity(seb);
+ }
+
+ delta = (s64)(sea->vruntime - seb->vruntime);
+ }
+
+ /* across cpu: use util_avg to decide which task to run */
+ delta = (s64)(sea->avg.util_avg - seb->avg.util_avg);
+
+ p = delta > 0 ? b : a;
+ trace_printk("picked %s/%d %s: %Lu %Lu %Ld\n", p->comm, p->pid,
+ samecore ? "vruntime" : "util_avg",
+ samecore ? sea->vruntime : sea->avg.util_avg,
+ samecore ? seb->vruntime : seb->avg.util_avg,
+ delta);
+
+ return delta > 0;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..02a6d71704f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2454,3 +2454,5 @@ static inline bool sched_energy_enabled(void)
static inline bool sched_energy_enabled(void) { return false; }

#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
--
2.19.1.3.ge56e4f7

2019-07-18 23:29:49

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3



On 7/18/19 3:07 AM, Aaron Lu wrote:
> On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:

>
> With the below patch on top of v3 that makes use of util_avg to decide
> which task win, I can do all 8 steps and the final scores of the 2
> workloads are: 1796191 and 2199586. The score number are not close,
> suggesting some unfairness, but I can finish the test now...

Aaron,

Do you still see high variance in terms of workload throughput that
was a problem with the previous version?

>
>
> }
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> + struct sched_entity *sea = &a->se;
> + struct sched_entity *seb = &b->se;
> + bool samecore = task_cpu(a) == task_cpu(b);


Probably "samecpu" instead of "samecore" will be more accurate.
I think task_cpu(a) and task_cpu(b)
can be different, but still belong to the same cpu core.

> + struct task_struct *p;
> + s64 delta;
> +
> + if (samecore) {
> + /* vruntime is per cfs_rq */
> + while (!is_same_group(sea, seb)) {
> + int sea_depth = sea->depth;
> + int seb_depth = seb->depth;
> +
> + if (sea_depth >= seb_depth)

Should this be strictly ">" instead of ">=" ?

> + sea = parent_entity(sea);
> + if (sea_depth <= seb_depth)

Should use "<" ?

> + seb = parent_entity(seb);
> + }
> +
> + delta = (s64)(sea->vruntime - seb->vruntime);
> + }
> +

Thanks.

Tim

2019-07-19 05:55:31

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Jul 18, 2019 at 04:27:19PM -0700, Tim Chen wrote:
>
>
> On 7/18/19 3:07 AM, Aaron Lu wrote:
> > On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
>
> >
> > With the below patch on top of v3 that makes use of util_avg to decide
> > which task win, I can do all 8 steps and the final scores of the 2
> > workloads are: 1796191 and 2199586. The score number are not close,
> > suggesting some unfairness, but I can finish the test now...
>
> Aaron,
>
> Do you still see high variance in terms of workload throughput that
> was a problem with the previous version?

Any suggestion how to measure this?
It's not clear how Aubrey did his test, will need to take a look at
sysbench.

> >
> >
> > }
> > +
> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > + struct sched_entity *sea = &a->se;
> > + struct sched_entity *seb = &b->se;
> > + bool samecore = task_cpu(a) == task_cpu(b);
>
>
> Probably "samecpu" instead of "samecore" will be more accurate.
> I think task_cpu(a) and task_cpu(b)
> can be different, but still belong to the same cpu core.

Right, definitely, guess I'm brain damaged.

>
> > + struct task_struct *p;
> > + s64 delta;
> > +
> > + if (samecore) {
> > + /* vruntime is per cfs_rq */
> > + while (!is_same_group(sea, seb)) {
> > + int sea_depth = sea->depth;
> > + int seb_depth = seb->depth;
> > +
> > + if (sea_depth >= seb_depth)
>
> Should this be strictly ">" instead of ">=" ?

Same depth doesn't necessarily mean same group while the purpose here is
to make sure they are in the same cfs_rq. When they are of the same
depth but in different cfs_rqs, we will continue to go up till we reach
rq->cfs.

>
> > + sea = parent_entity(sea);
> > + if (sea_depth <= seb_depth)
>
> Should use "<" ?

Ditto here.
When they are of the same depth but not in the same cfs_rq, both ses will
move up.

> > + seb = parent_entity(seb);
> > + }
> > +
> > + delta = (s64)(sea->vruntime - seb->vruntime);
> > + }
> > +

Thanks.

2019-07-19 11:51:28

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, Jul 19, 2019 at 1:53 PM Aaron Lu <[email protected]> wrote:
>
> On Thu, Jul 18, 2019 at 04:27:19PM -0700, Tim Chen wrote:
> >
> >
> > On 7/18/19 3:07 AM, Aaron Lu wrote:
> > > On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> >
> > >
> > > With the below patch on top of v3 that makes use of util_avg to decide
> > > which task win, I can do all 8 steps and the final scores of the 2
> > > workloads are: 1796191 and 2199586. The score number are not close,
> > > suggesting some unfairness, but I can finish the test now...
> >
> > Aaron,
> >
> > Do you still see high variance in terms of workload throughput that
> > was a problem with the previous version?
>
> Any suggestion how to measure this?
> It's not clear how Aubrey did his test, will need to take a look at
> sysbench.
>

Well, thanks for posting this at the end of my vacation ;)
I'll go back to the office next week and give it a shot.
I actually have a new setup co-locating AVX512 tasks with
sysbench MySQL. Both throughput and latency were unacceptable
on top of v3; looking forward to seeing the difference with this
patch.

Thanks,
-Aubrey

2019-07-19 21:50:37

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 7/18/19 10:52 PM, Aaron Lu wrote:
> On Thu, Jul 18, 2019 at 04:27:19PM -0700, Tim Chen wrote:
>>
>>
>> On 7/18/19 3:07 AM, Aaron Lu wrote:
>>> On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
>>
>>>
>>> With the below patch on top of v3 that makes use of util_avg to decide
>>> which task win, I can do all 8 steps and the final scores of the 2
>>> workloads are: 1796191 and 2199586. The score number are not close,
>>> suggesting some unfairness, but I can finish the test now...
>>
>> Aaron,
>>
>> Do you still see high variance in terms of workload throughput that
>> was a problem with the previous version?
>
> Any suggestion how to measure this?
> It's not clear how Aubrey did his test, will need to take a look at
> sysbench.
>
>>>
>>>
>>> }
>>> +
>>> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
>>> +{
>>> + struct sched_entity *sea = &a->se;
>>> + struct sched_entity *seb = &b->se;
>>> + bool samecore = task_cpu(a) == task_cpu(b);
>>
>>
>> Probably "samecpu" instead of "samecore" will be more accurate.
>> I think task_cpu(a) and task_cpu(b)
>> can be different, but still belong to the same cpu core.
>
> Right, definitely, guess I'm brain damaged.
>
>>
>>> + struct task_struct *p;
>>> + s64 delta;
>>> +
>>> + if (samecore) {
>>> + /* vruntime is per cfs_rq */
>>> + while (!is_same_group(sea, seb)) {
>>> + int sea_depth = sea->depth;
>>> + int seb_depth = seb->depth;
>>> +
>>> + if (sea_depth >= seb_depth)
>>
>> Should this be strictly ">" instead of ">=" ?
>
> Same depth doesn't necessarily mean same group while the purpose here is
> to make sure they are in the same cfs_rq. When they are of the same
> depth but in different cfs_rqs, we will continue to go up till we reach
> rq->cfs.

Ah, I see what you are doing now. Thanks for the clarification.

Tim

2019-07-22 12:28:32

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Jul 18, 2019 at 6:07 PM Aaron Lu <[email protected]> wrote:
>
> On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> > On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> > > The result looks still unfair, and particularly, the variance is too high,
> >
> > I just want to confirm that I am also seeing the same issue with a
> > similar setup. I also tried with the priority boost fix we previously
> > posted, the results are slightly better, but we are still seeing a very
> > high variance.
> >
> > On average, the results I get for 10 30-seconds runs are still much
> > better than nosmt (both sysbench pinned on the same sibling) for the
> > memory benchmark, and pretty similar for the CPU benchmark, but the high
> > variance between runs is indeed concerning.
>
> I was thinking to use util_avg signal to decide which task win in
> __prio_less() in the cross cpu case. The reason util_avg is chosen
> is because it represents how cpu intensive the task is, so the end
> result is, less cpu intensive task will preempt more cpu intensive
> task.
>
> Here is the test I have done to see how util_avg works
> (on a single node, 16 cores, 32 cpus vm):
> 1 Start tmux and then start 3 windows with each running bash;
> 2 Place two shells into two different cgroups and both have cpu.tag set;
> 3 Switch to the 1st tmux window, start
> will-it-scale/page_fault1_processes -t 16 -s 30
> in the first tagged shell;
> 4 Switch to the 2nd tmux window;
> 5 Start
> will-it-scale/page_fault1_processes -t 16 -s 30
> in the 2nd tagged shell;
> 6 Switch to the 3rd tmux window;
> 7 Do some simple things in the 3rd untagged shell like ls to see if
> untagged task is able to proceed;
> 8 Wait for the two page_fault workloads to finish.
>
> With v3 here, I can not do step 4 and later steps, i.e. the 16
> page_fault1 processes started in step 3 will occupy all 16 cores and
> other tasks do not have a chance to run, including tmux, which made
> switching tmux window impossible.
>
> With the below patch on top of v3 that makes use of util_avg to decide
> which task win, I can do all 8 steps and the final scores of the 2
> workloads are: 1796191 and 2199586. The score number are not close,
> suggesting some unfairness, but I can finish the test now...
>
> Here is the diff(consider it as a POC):
>
> ---
> kernel/sched/core.c | 35 ++---------------------------------
> kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 2 ++
> 3 files changed, 40 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26fea68f7f54..7557a7bbb481 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
> if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
> return !dl_time_before(a->dl.deadline, b->dl.deadline);
>
> - if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
> - u64 a_vruntime = a->se.vruntime;
> - u64 b_vruntime = b->se.vruntime;
> -
> - /*
> - * Normalize the vruntime if tasks are in different cpus.
> - */
> - if (task_cpu(a) != task_cpu(b)) {
> - b_vruntime -= task_cfs_rq(b)->min_vruntime;
> - b_vruntime += task_cfs_rq(a)->min_vruntime;
> -
> - trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
> - a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
> - b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
> -
> - }
> -
> - return !((s64)(a_vruntime - b_vruntime) <= 0);
> - }
> + if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
> + return cfs_prio_less(a, b);
>
> return false;
> }
> @@ -3663,20 +3646,6 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
> if (!class_pick)
> return NULL;
>
> - if (!cookie) {
> - /*
> - * If class_pick is tagged, return it only if it has
> - * higher priority than max.
> - */
> - bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
> - max && !prio_less(max, class_pick) :
> - max && prio_less(class_pick, max);
> - if (class_pick->core_cookie && max_is_higher)
> - return idle_sched_class.pick_task(rq);
> -
> - return class_pick;
> - }
> -
> /*
> * If class_pick is idle or matches cookie, return early.
> */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26d29126d6a5..06fb00689db1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10740,3 +10740,39 @@ __init void init_sched_fair_class(void)
> #endif /* SMP */
>
> }
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> + struct sched_entity *sea = &a->se;
> + struct sched_entity *seb = &b->se;
> + bool samecore = task_cpu(a) == task_cpu(b);
> + struct task_struct *p;
> + s64 delta;
> +
> + if (samecore) {
> + /* vruntime is per cfs_rq */
> + while (!is_same_group(sea, seb)) {
> + int sea_depth = sea->depth;
> + int seb_depth = seb->depth;
> +
> + if (sea_depth >= seb_depth)
> + sea = parent_entity(sea);
> + if (sea_depth <= seb_depth)
> + seb = parent_entity(seb);
> + }
> +
> + delta = (s64)(sea->vruntime - seb->vruntime);
> + }
> +
> + /* across cpu: use util_avg to decide which task to run */
> + delta = (s64)(sea->avg.util_avg - seb->avg.util_avg);

The granularity period of util_avg seems too large to decide task priority
during pick_task(), at least in my case: cfs_prio_less() always picked the
core max task, so pick_task() eventually picked idle, which makes this change
not very helpful for my case.

<idle>-0 [057] dN.. 83.716973: __schedule: max: sysbench/2578
ffff889050f68600
<idle>-0 [057] dN.. 83.716974: __schedule:
(swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
<idle>-0 [057] dN.. 83.716975: __schedule:
(sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
<idle>-0 [057] dN.. 83.716975: cfs_prio_less: picked
sysbench/2578 util_avg: 20 527 -507 <======= here===
<idle>-0 [057] dN.. 83.716976: __schedule: pick_task cookie
pick swapper/5/0 ffff889050f68600

Thanks,
-Aubrey

2019-07-22 14:40:33

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/7/22 18:26, Aubrey Li wrote:
> The granularity period of util_avg seems too large to decide task priority
> during pick_task(), at least it is in my case, cfs_prio_less() always picked
> core max task, so pick_task() eventually picked idle, which causes this change
> not very helpful for my case.
>
> <idle>-0 [057] dN.. 83.716973: __schedule: max: sysbench/2578
> ffff889050f68600
> <idle>-0 [057] dN.. 83.716974: __schedule:
> (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
> <idle>-0 [057] dN.. 83.716975: __schedule:
> (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
> <idle>-0 [057] dN.. 83.716975: cfs_prio_less: picked
> sysbench/2578 util_avg: 20 527 -507 <======= here===
> <idle>-0 [057] dN.. 83.716976: __schedule: pick_task cookie
> pick swapper/5/0 ffff889050f68600

Can you share your setup of the test? I would like to try it locally.
Thanks.

2019-07-23 12:17:06

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Mon, Jul 22, 2019 at 6:43 PM Aaron Lu <[email protected]> wrote:
>
> On 2019/7/22 18:26, Aubrey Li wrote:
> > The granularity period of util_avg seems too large to decide task priority
> > during pick_task(), at least it is in my case, cfs_prio_less() always picked
> > core max task, so pick_task() eventually picked idle, which causes this change
> > not very helpful for my case.
> >
> > <idle>-0 [057] dN.. 83.716973: __schedule: max: sysbench/2578
> > ffff889050f68600
> > <idle>-0 [057] dN.. 83.716974: __schedule:
> > (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
> > <idle>-0 [057] dN.. 83.716975: __schedule:
> > (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
> > <idle>-0 [057] dN.. 83.716975: cfs_prio_less: picked
> > sysbench/2578 util_avg: 20 527 -507 <======= here===
> > <idle>-0 [057] dN.. 83.716976: __schedule: pick_task cookie
> > pick swapper/5/0 ffff889050f68600
>
> Can you share your setup of the test? I would like to try it locally.

My setup is a co-location of AVX512 tasks (gemmbench) and non-AVX512 tasks
(sysbench MySQL). Let me simplify it and send it offline.

Thanks,
-Aubrey

2019-07-25 18:00:40

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Mon, Jul 22, 2019 at 06:26:46PM +0800, Aubrey Li wrote:
> The granularity period of util_avg seems too large to decide task priority
> during pick_task(), at least it is in my case, cfs_prio_less() always picked
> core max task, so pick_task() eventually picked idle, which causes this change
> not very helpful for my case.
>
> <idle>-0 [057] dN.. 83.716973: __schedule: max: sysbench/2578
> ffff889050f68600
> <idle>-0 [057] dN.. 83.716974: __schedule:
> (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
> <idle>-0 [057] dN.. 83.716975: __schedule:
> (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
> <idle>-0 [057] dN.. 83.716975: cfs_prio_less: picked
> sysbench/2578 util_avg: 20 527 -507 <======= here===
> <idle>-0 [057] dN.. 83.716976: __schedule: pick_task cookie
> pick swapper/5/0 ffff889050f68600

I tried a different approach based on vruntime with 3 patches following.

When the two tasks are on the same CPU, no change is made: I still route
the two sched entities up till they are in the same group (cfs_rq) and
then do the vruntime comparison.

When the two tasks are on different threads of the same core, the root
level sched_entities to which the two tasks belong will be used to do
the comparison.

An ugly illustration for the cross CPU case:

          cpu0                     cpu1
        /  |  \                  /  |  \
     se1  se2  se3            se4  se5  se6
          / \                            / \
       se21   se22                   se61   se62

Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
task B's se is se61. To compare priority of task A and B, we compare
priority of se2 and se6. The smaller vruntime wins.

To make this work, the root level ses on both CPUs should have a common
cfs_rq min vruntime, which I call the core cfs_rq min vruntime.

This is mostly done in patch2/3.
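
A rough sketch of the cross-sibling comparison described above (the
helpers below are illustrative only and assume the two root cfs_rq
min_vruntime values are already kept in sync core-wide; this is not the
exact code in patch2/3):

/* Illustrative only: compare two tasks running on SMT siblings. */
static struct sched_entity *root_se(struct sched_entity *se)
{
	while (parent_entity(se))
		se = parent_entity(se);
	return se;
}

static bool core_cfs_prio_less(struct task_struct *a, struct task_struct *b)
{
	struct sched_entity *sea = root_se(&a->se);
	struct sched_entity *seb = root_se(&b->se);

	/*
	 * Both root-level vruntimes are measured against the common
	 * core-wide min_vruntime, so a plain signed comparison works:
	 * the larger vruntime loses.
	 */
	return (s64)(sea->vruntime - seb->vruntime) > 0;
}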

Test:
1 wrote a cpu intensive program that does nothing but while(1) in
main(), let's call it cpuhog;
2 start 2 cgroups, with one cgroup's cpuset binding to CPU2 and the
other binding to cpu3. cpu2 and cpu3 are smt siblings on the test VM;
3 enable cpu.tag for the two cgroups;
4 start one cpuhog task in each cgroup;
5 kill both cpuhog tasks after 10 seconds;
6 check each cgroup's cpu usage.

If the task is scheduled fairly, then each cgroup's cpu usage should be
around 5s.

With v3, the cpu usages of the two cgroups are sometimes 3s and 7s, sometimes
1s and 9s.

With the 3 patches applied, the numbers are mostly around 5s, 5s.

Another test is starting two cgroups simultaneously with cpu.tag set,
with one cgroup running: will-it-scale/page_fault1_processes -t 16 -s 30,
the other running: will-it-scale/page_fault2_processes -t 16 -s 30.
With v3, like I said last time, the later-started page_fault processes
can't start running. With the 3 patches applied, both run at the
same time with each CPU having a relatively fair score:

output line of 16 page_fault1 processes in 1 second interval:
min:105225 max:131716 total:1872322

output line of 16 page_fault2 processes in 1 second interval:
min:86797 max:110554 total:1581177

Note the values of min and max: the smaller the gap is, the better the
fairness is.

Aubrey,

I haven't been able to run your workload yet...

2019-07-25 18:00:42

by Aaron Lu

[permalink] [raw]
Subject: [PATCH 3/3] temp hack to make tick based schedule happen

When a hyperthread is forced idle and the other hyperthread has a single
CPU intensive task running, the running task can occupy the hyperthread
for a long time with no scheduling point and starve the other
hyperthread.

Fix this temporarily by always checking if the task has exceeded its
timeslice and, if so, forcing a reschedule.

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43babc2a12a5..730c9359e9c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4093,6 +4093,9 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
return;
}

+ if (cfs_rq->nr_running <= 1)
+ return;
+
/*
* Ensure that a task that missed wakeup preemption by a
* narrow margin doesn't have to wait for a full slice.
@@ -4261,8 +4264,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
return;
#endif

- if (cfs_rq->nr_running > 1)
- check_preempt_tick(cfs_rq, curr);
+ check_preempt_tick(cfs_rq, curr);
}


--
2.19.1.3.ge56e4f7


2019-07-25 21:44:02

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/7/25 22:30, Aaron Lu wrote:
> On Mon, Jul 22, 2019 at 06:26:46PM +0800, Aubrey Li wrote:
>> The granularity period of util_avg seems too large to decide task priority
>> during pick_task(), at least it is in my case, cfs_prio_less() always picked
>> core max task, so pick_task() eventually picked idle, which causes this change
>> not very helpful for my case.
>>
>> <idle>-0 [057] dN.. 83.716973: __schedule: max: sysbench/2578
>> ffff889050f68600
>> <idle>-0 [057] dN.. 83.716974: __schedule:
>> (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
>> <idle>-0 [057] dN.. 83.716975: __schedule:
>> (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
>> <idle>-0 [057] dN.. 83.716975: cfs_prio_less: picked
>> sysbench/2578 util_avg: 20 527 -507 <======= here===
>> <idle>-0 [057] dN.. 83.716976: __schedule: pick_task cookie
>> pick swapper/5/0 ffff889050f68600
>
> I tried a different approach based on vruntime with 3 patches following.
>
> When the two tasks are on the same CPU, no change is made, I still route
> the two sched entities up till they are in the same group(cfs_rq) and
> then do the vruntime comparison.
>
> When the two tasks are on differen threads of the same core, the root
> level sched_entities to which the two tasks belong will be used to do
> the comparison.
>
> An ugly illustration for the cross CPU case:
>
> cpu0 cpu1
> / | \ / | \
> se1 se2 se3 se4 se5 se6
> / \ / \
> se21 se22 se61 se62
>
> Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> task B's se is se61. To compare priority of task A and B, we compare
> priority of se2 and se6. The smaller vruntime wins.
>
> To make this work, the root level ses on both CPU should have a common
> cfs_rq min vuntime, which I call it the core cfs_rq min vruntime.
>
> This is mostly done in patch2/3.
>
> Test:
> 1 wrote an cpu intensive program that does nothing but while(1) in
> main(), let's call it cpuhog;
> 2 start 2 cgroups, with one cgroup's cpuset binding to CPU2 and the
> other binding to cpu3. cpu2 and cpu3 are smt siblings on the test VM;
> 3 enable cpu.tag for the two cgroups;
> 4 start one cpuhog task in each cgroup;
> 5 kill both cpuhog tasks after 10 seconds;
> 6 check each cgroup's cpu usage.
>
> If the task is scheduled fairly, then each cgroup's cpu usage should be
> around 5s.
>
> With v3, the cpu usage of both cgroups are sometimes 3s, 7s; sometimes
> 1s, 9s.
>
> With the 3 patches applied, the numbers are mostly around 5s, 5s.
>
> Another test is starting two cgroups simultaneously with cpu.tag set,
> with one cgroup running: will-it-scale/page_fault1_processes -t 16 -s 30,
> the other running: will-it-scale/page_fault2_processes -t 16 -s 30.
> With v3, like I said last time, the later started page_fault processes
> can't start running. With the 3 patches applied, both running at the
> same time with each CPU having a relatively fair score:
>
> output line of 16 page_fault1 processes in 1 second interval:
> min:105225 max:131716 total:1872322
>
> output line of 16 page_fault2 processes in 1 second interval:
> min:86797 max:110554 total:1581177
>
> Note the value in min and max, the smaller the gap is, the better the
> faireness is.
>
> Aubrey,
>
> I haven't been able to run your workload yet...
>

No worry, let me try to see how it works.

Thanks,
-Aubrey

2019-07-26 15:22:23

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 25-Jul-2019 10:30:03 PM, Aaron Lu wrote:
>
> I tried a different approach based on vruntime with 3 patches following.
[...]

We have experimented with this new patchset and indeed the fairness is
now much better. Interactive tasks with v3 were completely starved when
there were cpu-intensive tasks running; now they can run consistently.
With my initial test of TPC-C running in large VMs with a lot of
background noise VMs, the results are pretty similar to v3, I will run
more thorough tests and report the results back here.

Instead of the 3/3 hack patch, we were already working on a different
approach to solve the same problem. What we have done so far is create a
very low priority per-cpu coresched_idle kernel thread that we use
instead of idle when we can't co-schedule tasks. This gives us more
control and accounting. It still needs some work, but the initial
results are encouraging, I will post more when we have something that
works well.
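
A minimal sketch of what such a placeholder thread could look like (the
names and helpers below are assumptions for illustration, not the code
referred to above):

/* Per-cpu, lowest-nice placeholder that sleeps until it is woken to
 * occupy a sibling that cannot be co-scheduled.
 */
static int coresched_idle_fn(void *unused)
{
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (kthread_should_stop())
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}

static struct task_struct *start_coresched_idle(unsigned int cpu)
{
	struct task_struct *p;

	p = kthread_create_on_cpu(coresched_idle_fn, NULL, cpu,
				  "coresched_idle/%u");
	if (IS_ERR(p))
		return p;

	set_user_nice(p, MAX_NICE);	/* lowest CFS priority */
	wake_up_process(p);
	return p;
}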

Thanks,

Julien

2019-07-26 21:31:24

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 7/26/19 8:21 AM, Julien Desfossez wrote:
> On 25-Jul-2019 10:30:03 PM, Aaron Lu wrote:
>>
>> I tried a different approach based on vruntime with 3 patches following.
> [...]
>
> We have experimented with this new patchset and indeed the fairness is
> now much better. Interactive tasks with v3 were complete starving when
> there were cpu-intensive tasks running, now they can run consistently.
> With my initial test of TPC-C running in large VMs with a lot of
> background noise VMs, the results are pretty similar to v3, I will run
> more thorough tests and report the results back here.

Aaron's patch inspired me to experiment with another approach to tackle
fairness. The root problem with v3 was that we didn't account for the
forced idle time when comparing the priority of tasks between two threads.

So what I did here is account the forced idle time in the top cfs_rq's
min_vruntime when we update the runqueue clock. When we are comparing
between two cfs runqueues, the task on the cpu being forced idle will now
be credited with the forced idle time. The effect should be similar to
Aaron's patches. The logic is a bit simpler and we don't need to use
one of the siblings' cfs_rq min_vruntime as a time base.

In really limited testing, it seems to have balanced fairness between two
tagged cgroups.

Tim

-------patch 1----------
From: Tim Chen <[email protected]>
Date: Wed, 24 Jul 2019 13:58:18 -0700
Subject: [PATCH 1/2] sched: move sched fair prio comparison to fair.c

Move the priority comparison of two tasks in fair class to fair.c.
There is no functional change.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 21 ++-------------------
kernel/sched/fair.c | 21 +++++++++++++++++++++
kernel/sched/sched.h | 1 +
3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8ea87be56a1e..f78b8fdfd47c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);

- if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 a_vruntime = a->se.vruntime;
- u64 b_vruntime = b->se.vruntime;
-
- /*
- * Normalize the vruntime if tasks are in different cpus.
- */
- if (task_cpu(a) != task_cpu(b)) {
- b_vruntime -= task_cfs_rq(b)->min_vruntime;
- b_vruntime += task_cfs_rq(a)->min_vruntime;
-
- trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
- a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
- b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
- }
-
- return !((s64)(a_vruntime - b_vruntime) <= 0);
- }
+ if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+ return prio_less_fair(a, b);

return false;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02bff10237d4..e289b6e1545b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -602,6 +602,27 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
return delta;
}

+bool prio_less_fair(struct task_struct *a, struct task_struct *b)
+{
+ u64 a_vruntime = a->se.vruntime;
+ u64 b_vruntime = b->se.vruntime;
+
+ /*
+ * Normalize the vruntime if tasks are in different cpus.
+ */
+ if (task_cpu(a) != task_cpu(b)) {
+ b_vruntime -= task_cfs_rq(b)->min_vruntime;
+ b_vruntime += task_cfs_rq(a)->min_vruntime;
+
+ trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+ a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+ b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+ }
+
+ return !((s64)(a_vruntime - b_vruntime) <= 0);
+}
+
/*
* The idea is to set a period in which each task runs once.
*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..bdabe7ce1152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
}

extern void queue_core_balance(struct rq *rq);
+extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);

#else /* !CONFIG_SCHED_CORE */

--
2.20.1


--------------patch 2------------------
From: Tim Chen <[email protected]>
Date: Thu, 25 Jul 2019 13:09:21 -0700
Subject: [PATCH 2/2] sched: Account the forced idle time

We did not account for the forced idle time when comparing two tasks
from different SMT threads in the same core.

Account it in the root cfs_rq min_vruntime when we update the rq clock. This
will allow for a fair comparison of which task has higher priority across
the two SMT threads.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 6 ++++++
kernel/sched/fair.c | 22 ++++++++++++++++++----
2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f78b8fdfd47c..d8fa74810126 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -393,6 +393,12 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)

rq->clock_task += delta;

+#ifdef CONFIG_SCHED_CORE
+ /* Account the forced idle time by sibling */
+ if (rq->core_forceidle)
+ rq->cfs.min_vruntime += delta;
+#endif
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
update_irq_load_avg(rq, irq_delta + steal);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e289b6e1545b..1b2fd1271c51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -604,20 +604,34 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)

bool prio_less_fair(struct task_struct *a, struct task_struct *b)
{
- u64 a_vruntime = a->se.vruntime;
- u64 b_vruntime = b->se.vruntime;
+ u64 a_vruntime;
+ u64 b_vruntime;

/*
* Normalize the vruntime if tasks are in different cpus.
*/
if (task_cpu(a) != task_cpu(b)) {
- b_vruntime -= task_cfs_rq(b)->min_vruntime;
- b_vruntime += task_cfs_rq(a)->min_vruntime;
+ struct sched_entity *sea = &a->se;
+ struct sched_entity *seb = &b->se;
+
+ while (sea->parent)
+ sea = sea->parent;
+ while (seb->parent)
+ seb = seb->parent;
+
+ a_vruntime = sea->vruntime;
+ b_vruntime = seb->vruntime;
+
+ b_vruntime -= task_rq(b)->cfs.min_vruntime;
+ b_vruntime += task_rq(a)->cfs.min_vruntime;

trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);

+ } else {
+ a_vruntime = a->se.vruntime;
+ b_vruntime = b->se.vruntime;
}

return !((s64)(a_vruntime - b_vruntime) <= 0);
--
2.20.1


2019-07-31 05:40:57

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/7/26 23:21, Julien Desfossez wrote:
> On 25-Jul-2019 10:30:03 PM, Aaron Lu wrote:
>>
>> I tried a different approach based on vruntime with 3 patches following.
> [...]
>
> We have experimented with this new patchset and indeed the fairness is
> now much better. Interactive tasks with v3 were complete starving when
> there were cpu-intensive tasks running, now they can run consistently.

Yeah, the fairness is much better now.

I created two cgroups and limited the cpuset of both to one core (two
siblings). I still run gemmbench and sysbench-mysql, and here is the
mysql result:

Latency:
.----------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% |
|----------------------------------------------------------------------------------------------|
| 1/1 6.7 [13.8%/ 1.4%] 2.1% | 6.4 [14.6%/ 1.5%] 4.0% 2.0% |
| 2/2 9.1 [ 5.0%/ 0.5%] 4.0% | 11.4 [ 6.8%/ 0.7%] -24.9% 3.9% |
'----------------------------------------------------------------------------------------------'

Throughput:
.----------------------------------------------------------------------------------------------.
|NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT [std% / sem%] +/- cpu% |
|----------------------------------------------------------------------------------------------|
| 1/1 310.2 [ 4.1%/ 0.4%] 2.1% | 296.2 [ 5.0%/ 0.5%] -4.5% 2.0% |
| 2/2 547.7 [ 3.6%/ 0.4%] 4.0% | 368.3 [ 4.8%/ 0.5%] -32.8% 3.9% |
'----------------------------------------------------------------------------------------------'

Note: the 2/2 case means 4 threads run on one core, which is overloaded. (cpu% is the overall system report)

Though the latency/throughput has regressions, the standard deviation is much better now.

> With my initial test of TPC-C running in large VMs with a lot of
> background noise VMs, the results are pretty similar to v3, I will run
> more thorough tests and report the results back here.

I see something similar. I guess task placement could be another problem.
We don't check cookie matching in load balance and task wakeup, so
- if tasks with different cookies happen to be dispatched onto different
  cores, the result should be good;
- if tasks with different cookies are unfortunately dispatched onto the
  same core, the result should be bad.

This problem is bypassed in my testing setup above, but it may be one cause
of my other scenarios; I need a while to sort that out.
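
A possible shape of such a check, purely as a sketch (the helper name is
made up, the fields follow the v3 patchset; where exactly load balance
and wakeup would consult it is the part that still needs sorting out):

/* Sketch: refuse to place a task onto a core whose currently selected
 * core-wide cookie differs from the task's cookie.
 */
static bool core_cookie_compatible(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (!sched_core_enabled(rq))
		return true;

	return !rq->core->core_cookie ||
	       rq->core->core_cookie == p->core_cookie;
}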

Thanks,
-Aubrey

2019-08-03 14:43:09

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

We tested both Aaron's and Tim's patches and here are our results.

Test setup:
- 2 1-thread sysbench, one running the cpu benchmark, the other one the
mem benchmark
- both started at the same time
- both are pinned on the same core (2 hardware threads)
- 10 30-seconds runs
- test script: https://paste.debian.net/plainh/834cf45c
- only showing the CPU events/sec (higher is better)
- tested 4 tag configurations:
- no tag
- sysbench mem untagged, sysbench cpu tagged
- sysbench mem tagged, sysbench cpu untagged
- both tagged with a different tag
- "Alone" is the sysbench CPU running alone on the core, no tag
- "nosmt" is both sysbench pinned on the same hardware thread, no tag
- "Tim's full patchset + sched" is an experiment with Tim's patchset
combined with Aaron's "hack patch" to get rid of the remaining deep
idle cases
- In all test cases, both tasks can run simultaneously (which was not
the case without those patches), but the standard deviation is a
pretty good indicator of the fairness/consistency.

No tag
------
Test Average Stdev
Alone 1306.90 0.94
nosmt 649.95 1.44
Aaron's full patchset: 828.15 32.45
Aaron's first 2 patches: 832.12 36.53
Aaron's 3rd patch alone: 864.21 3.68
Tim's full patchset: 852.50 4.11
Tim's full patchset + sched: 852.59 8.25

Sysbench mem untagged, sysbench cpu tagged
------------------------------------------
Test Average Stdev
Alone 1306.90 0.94
nosmt 649.95 1.44
Aaron's full patchset: 586.06 1.77
Aaron's first 2 patches: 630.08 47.30
Aaron's 3rd patch alone: 1086.65 246.54
Tim's full patchset: 852.50 4.11
Tim's full patchset + sched: 390.49 15.76

Sysbench mem tagged, sysbench cpu untagged
------------------------------------------
Test Average Stdev
Alone 1306.90 0.94
nosmt 649.95 1.44
Aaron's full patchset: 583.77 3.52
Aaron's first 2 patches: 513.63 63.09
Aaron's 3rd patch alone: 1171.23 3.35
Tim's full patchset: 564.04 58.05
Tim's full patchset + sched: 1026.16 49.43

Both sysbench tagged
--------------------
Test Average Stdev
Alone 1306.90 0.94
nosmt 649.95 1.44
Aaron's full patchset: 582.15 3.75
Aaron's first 2 patches: 561.07 91.61
Aaron's 3rd patch alone: 638.49 231.06
Tim's full patchset: 679.43 70.07
Tim's full patchset + sched: 664.34 210.14

So in terms of fairness, Aaron's full patchset is the most consistent, but only
Tim's patchset performs better than nosmt in some conditions.

Of course, this is one of the worst-case scenarios; as soon as we have
multithreaded applications on overcommitted systems, core scheduling performs
better than nosmt.

Thanks,

Julien

2019-08-05 15:57:44

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/2/19 8:37 AM, Julien Desfossez wrote:
> We tested both Aaron's and Tim's patches and here are our results.
>
> Test setup:
> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> mem benchmark
> - both started at the same time
> - both are pinned on the same core (2 hardware threads)
> - 10 30-seconds runs
> - test script: https://paste.debian.net/plainh/834cf45c
> - only showing the CPU events/sec (higher is better)
> - tested 4 tag configurations:
> - no tag
> - sysbench mem untagged, sysbench cpu tagged
> - sysbench mem tagged, sysbench cpu untagged
> - both tagged with a different tag
> - "Alone" is the sysbench CPU running alone on the core, no tag
> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> - "Tim's full patchset + sched" is an experiment with Tim's patchset
> combined with Aaron's "hack patch" to get rid of the remaining deep
> idle cases
> - In all test cases, both tasks can run simultaneously (which was not
> the case without those patches), but the standard deviation is a
> pretty good indicator of the fairness/consistency.

Thanks for testing the patches and giving such detailed data.

I came to realize that for my scheme, the accumulated deficit of forced idle could be wiped
out in one execution of a task on the forced idle cpu, with the update of the min_vruntime,
even if the execution time could be far less than the accumulated deficit.
That's probably one reason my scheme didn't achieve fairness.

Tim

2019-08-05 20:10:19

by Phil Auld

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

Hi,

On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> We tested both Aaron's and Tim's patches and here are our results.
>
> Test setup:
> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> mem benchmark
> - both started at the same time
> - both are pinned on the same core (2 hardware threads)
> - 10 30-seconds runs
> - test script: https://paste.debian.net/plainh/834cf45c
> - only showing the CPU events/sec (higher is better)
> - tested 4 tag configurations:
> - no tag
> - sysbench mem untagged, sysbench cpu tagged
> - sysbench mem tagged, sysbench cpu untagged
> - both tagged with a different tag
> - "Alone" is the sysbench CPU running alone on the core, no tag
> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> - "Tim's full patchset + sched" is an experiment with Tim's patchset
> combined with Aaron's "hack patch" to get rid of the remaining deep
> idle cases
> - In all test cases, both tasks can run simultaneously (which was not
> the case without those patches), but the standard deviation is a
> pretty good indicator of the fairness/consistency.
>
> No tag
> ------
> Test Average Stdev
> Alone 1306.90 0.94
> nosmt 649.95 1.44
> Aaron's full patchset: 828.15 32.45
> Aaron's first 2 patches: 832.12 36.53
> Aaron's 3rd patch alone: 864.21 3.68
> Tim's full patchset: 852.50 4.11
> Tim's full patchset + sched: 852.59 8.25
>
> Sysbench mem untagged, sysbench cpu tagged
> ------------------------------------------
> Test Average Stdev
> Alone 1306.90 0.94
> nosmt 649.95 1.44
> Aaron's full patchset: 586.06 1.77
> Aaron's first 2 patches: 630.08 47.30
> Aaron's 3rd patch alone: 1086.65 246.54
> Tim's full patchset: 852.50 4.11
> Tim's full patchset + sched: 390.49 15.76
>
> Sysbench mem tagged, sysbench cpu untagged
> ------------------------------------------
> Test Average Stdev
> Alone 1306.90 0.94
> nosmt 649.95 1.44
> Aaron's full patchset: 583.77 3.52
> Aaron's first 2 patches: 513.63 63.09
> Aaron's 3rd patch alone: 1171.23 3.35
> Tim's full patchset: 564.04 58.05
> Tim's full patchset + sched: 1026.16 49.43
>
> Both sysbench tagged
> --------------------
> Test Average Stdev
> Alone 1306.90 0.94
> nosmt 649.95 1.44
> Aaron's full patchset: 582.15 3.75
> Aaron's first 2 patches: 561.07 91.61
> Aaron's 3rd patch alone: 638.49 231.06
> Tim's full patchset: 679.43 70.07
> Tim's full patchset + sched: 664.34 210.14
>

Sorry if I'm missing something obvious here but with only 2 processes
of interest shouldn't one tagged and one untagged be about the same
as both tagged?

In both cases the 2 sysbenches should not be running on the core at
the same time.

There will be times when other non-related threads could share the core
with the untagged one. Is that enough to account for this difference?


Thanks,
Phil


> So in terms of fairness, Aaron's full patchset is the most consistent, but only
> Tim's patchset performs better than nosmt in some conditions.
>
> Of course, this is one of the worst case scenario, as soon as we have
> multithreaded applications on overcommitted systems, core scheduling performs
> better than nosmt.
>
> Thanks,
>
> Julien

--

2019-08-06 03:26:52

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> On 8/2/19 8:37 AM, Julien Desfossez wrote:
> > We tested both Aaron's and Tim's patches and here are our results.
> >
> > Test setup:
> > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > mem benchmark
> > - both started at the same time
> > - both are pinned on the same core (2 hardware threads)
> > - 10 30-seconds runs
> > - test script: https://paste.debian.net/plainh/834cf45c
> > - only showing the CPU events/sec (higher is better)
> > - tested 4 tag configurations:
> > - no tag
> > - sysbench mem untagged, sysbench cpu tagged
> > - sysbench mem tagged, sysbench cpu untagged
> > - both tagged with a different tag
> > - "Alone" is the sysbench CPU running alone on the core, no tag
> > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > combined with Aaron's "hack patch" to get rid of the remaining deep
> > idle cases
> > - In all test cases, both tasks can run simultaneously (which was not
> > the case without those patches), but the standard deviation is a
> > pretty good indicator of the fairness/consistency.
>
> Thanks for testing the patches and giving such detailed data.

Thanks Julien.

> I came to realize that for my scheme, the accumulated deficit of forced idle could be wiped
> out in one execution of a task on the forced idle cpu, with the update of the min_vruntime,
> even if the execution time could be far less than the accumulated deficit.
> That's probably one reason my scheme didn't achieve fairness.

I've been thinking about whether we should consider core-wide tenant fairness.

Let's say there are 3 tasks on the rqs of 2 threads of the same core: 2 tasks
(e.g. A1, A2) belong to tenant A and the 3rd, B1, belongs to another tenant
B. Assume A1 and B1 are queued on the same thread and A2 on the other
thread. When we decide priority for A1 and B1, shall we also consider
A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
belong to the same tenant? I tend to think we should make fairness per
core per tenant, instead of per thread (cpu) per task (sched entity). What
do you guys think?

Implementation of the idea is a mess to me, as I feel I'm duplicating the
existing per-cpu, per-sched_entity enqueue/update vruntime/dequeue logic
for the per-core, per-tenant stuff.
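
For illustration, a minimal sketch of what such a per-core, per-tenant
comparison could look like; core_cookie_vruntime() is a hypothetical
accumulator of a cookie's core-wide runtime, not something in the posted
series:

/*
 * Sketch only: order two tasks by their tenant's core-wide vruntime
 * instead of their own, so A1 and A2 above are treated as one entity.
 */
static bool tenant_preferred(struct task_struct *a, struct task_struct *b)
{
	u64 va = core_cookie_vruntime(task_rq(a), a->core_cookie);	/* A1 + A2 */
	u64 vb = core_cookie_vruntime(task_rq(b), b->core_cookie);	/* B1 */

	/* prefer the tenant that has received less core-wide runtime */
	return (s64)(va - vb) < 0;
}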

2019-08-06 06:58:07

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 6, 2019 at 11:24 AM Aaron Lu <[email protected]> wrote:
>
> On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> > On 8/2/19 8:37 AM, Julien Desfossez wrote:
> > > We tested both Aaron's and Tim's patches and here are our results.
> > >
> > > Test setup:
> > > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > > mem benchmark
> > > - both started at the same time
> > > - both are pinned on the same core (2 hardware threads)
> > > - 10 30-seconds runs
> > > - test script: https://paste.debian.net/plainh/834cf45c
> > > - only showing the CPU events/sec (higher is better)
> > > - tested 4 tag configurations:
> > > - no tag
> > > - sysbench mem untagged, sysbench cpu tagged
> > > - sysbench mem tagged, sysbench cpu untagged
> > > - both tagged with a different tag
> > > - "Alone" is the sysbench CPU running alone on the core, no tag
> > > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > > combined with Aaron's "hack patch" to get rid of the remaining deep
> > > idle cases
> > > - In all test cases, both tasks can run simultaneously (which was not
> > > the case without those patches), but the standard deviation is a
> > > pretty good indicator of the fairness/consistency.
> >
> > Thanks for testing the patches and giving such detailed data.
>
> Thanks Julien.
>
> > I came to realize that for my scheme, the accumulated deficit of forced idle could be wiped
> > out in one execution of a task on the forced idle cpu, with the update of the min_vruntime,
> > even if the execution time could be far less than the accumulated deficit.
> > That's probably one reason my scheme didn't achieve fairness.
>
> I've been thinking if we should consider core wide tenent fairness?
>
> Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
> (e.g. A1, A2) belong to tenent A and the 3rd B1 belong to another tenent
> B. Assume A1 and B1 are queued on the same thread and A2 on the other
> thread, when we decide priority for A1 and B1, shall we also consider
> A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
> belong to the same tenent? I tend to think we should make fairness per
> core per tenent, instead of per thread(cpu) per task(sched entity). What
> do you guys think?
>

I have also been thinking of a way to make fairness per cookie per core;
is this what you want to propose?

Thanks,
-Aubrey

> Implemention of the idea is a mess to me, as I feel I'm duplicating the
> existing per cpu per sched_entity enqueue/update vruntime/dequeue logic
> for the per core per tenent stuff.

2019-08-06 07:05:40

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/8/6 14:56, Aubrey Li wrote:
> On Tue, Aug 6, 2019 at 11:24 AM Aaron Lu <[email protected]> wrote:
>> I've been thinking if we should consider core wide tenent fairness?
>>
>> Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
>> (e.g. A1, A2) belong to tenent A and the 3rd B1 belong to another tenent
>> B. Assume A1 and B1 are queued on the same thread and A2 on the other
>> thread, when we decide priority for A1 and B1, shall we also consider
>> A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
>> belong to the same tenent? I tend to think we should make fairness per
>> core per tenent, instead of per thread(cpu) per task(sched entity). What
>> do you guys think?
>>
>
> I also think a way to make fairness per cookie per core, is this what you
> want to propose?

Yes, that's what I meant.

2019-08-06 12:25:41

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> >
> > I also think a way to make fairness per cookie per core, is this what you
> > want to propose?
>
> Yes, that's what I meant.

I think that would hurt some kinds of workloads badly, especially if one
tenant has far more tasks than the other. The tenant with more tasks on the
same core might have more immediate requirements for some of its threads
than the other, and we would fail to take that into account. With some
hierarchical management we can alleviate this, but as Aaron said, it would
be a bit messy.

Peter's rebalance logic actually takes care of most of the runqueue
imbalance caused by cookie tagging. What we have found from our testing is
that the fairness issue is caused mostly by a hyperthread going idle and not
waking up; Aaron's 3rd patch works around that. As Julien mentioned, we are
working on a per-thread coresched idle thread concept. The problem that we
found was that the idle thread causes accounting issues and wakeup issues,
as it was not designed to be used in this context. So if we can have a
low-priority thread which looks like any other task to the scheduler, things
become easy for the scheduler and we achieve security as well. Please share
your thoughts on this idea.

The results are encouraging, but we have not yet stopped the coresched idle
thread from spinning at 100%. We will post the patch soon, once it is a bit
more stable for running the tests that we all have done so far.

Thanks,
Vineeth

2019-08-06 13:53:01

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 06, 2019 at 08:24:17AM -0400, Vineeth Remanan Pillai wrote:
> > >
> > > I also think a way to make fairness per cookie per core, is this what you
> > > want to propose?
> >
> > Yes, that's what I meant.
>
> I think that would hurt some kind of workloads badly, especially if
> one tenant is
> having way more tasks than the other. Tenant with more task on the same core
> might have immediate requirements from some threads than the other and we
> would fail to take that into account. With some hierarchical management, we can
> alleviate this, but as Aaron said, it would be a bit messy.

I think each tenant would have a per-core weight, similar to a sched entity's
per-cpu weight. The tenant's per-core weight could be derived from its
corresponding taskgroup's per-cpu sched entities' weights (sum them up,
perhaps). A tenant with higher weight will have its core-wide vruntime
advance more slowly than a tenant with lower weight. Does this address the
issue here?
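
As a rough sketch of that scaling (struct core_tenant and its fields are
assumptions for illustration; the arithmetic mirrors how calc_delta_fair()
scales by weight):

/* Sketch only: a heavier tenant's core-wide vruntime advances more slowly. */
struct core_tenant {
	u64		vruntime;	/* core-wide vruntime of this tenant */
	unsigned long	weight;		/* sum of the tenant's per-cpu se weights */
};

static void update_core_tenant_vruntime(struct core_tenant *ct, u64 delta_exec)
{
	ct->vruntime += div64_u64(delta_exec * NICE_0_LOAD, ct->weight);
}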

> Peter's rebalance logic actually takes care of most of the runq
> imbalance caused
> due to cookie tagging. What we have found from our testing is, fairness issue is
> caused mostly due to a Hyperthread going idle and not waking up. Aaron's 3rd
> patch works around that. As Julien mentioned, we are working on a per thread
> coresched idle thread concept. The problem that we found was, idle thread causes
> accounting issues and wakeup issues as it was not designed to be used in this
> context. So if we can have a low priority thread which looks like any other task
> to the scheduler, things becomes easy for the scheduler and we achieve security
> as well. Please share your thoughts on this idea.

Care to elaborate on the coresched idle thread concept?
How does it solve the hyperthread-going-idle problem, and what are the
accounting issues and wakeup issues, etc.?

Thanks,
Aaron

> The results are encouraging, but we do not yet have the coresched idle
> to not spin
> 100%. We will soon post the patch once it is a bit more stable for
> running the tests
> that we all have done so far.
>
> Thanks,
> Vineeth

2019-08-06 13:55:21

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
> Hi,
>
> On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> > We tested both Aaron's and Tim's patches and here are our results.
> >
> > Test setup:
> > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > mem benchmark
> > - both started at the same time
> > - both are pinned on the same core (2 hardware threads)
> > - 10 30-seconds runs
> > - test script: https://paste.debian.net/plainh/834cf45c
> > - only showing the CPU events/sec (higher is better)
> > - tested 4 tag configurations:
> > - no tag
> > - sysbench mem untagged, sysbench cpu tagged
> > - sysbench mem tagged, sysbench cpu untagged
> > - both tagged with a different tag
> > - "Alone" is the sysbench CPU running alone on the core, no tag
> > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > combined with Aaron's "hack patch" to get rid of the remaining deep
> > idle cases
> > - In all test cases, both tasks can run simultaneously (which was not
> > the case without those patches), but the standard deviation is a
> > pretty good indicator of the fairness/consistency.
> >
> > No tag
> > ------
> > Test Average Stdev
> > Alone 1306.90 0.94
> > nosmt 649.95 1.44
> > Aaron's full patchset: 828.15 32.45
> > Aaron's first 2 patches: 832.12 36.53
> > Aaron's 3rd patch alone: 864.21 3.68
> > Tim's full patchset: 852.50 4.11
> > Tim's full patchset + sched: 852.59 8.25
> >
> > Sysbench mem untagged, sysbench cpu tagged
> > ------------------------------------------
> > Test Average Stdev
> > Alone 1306.90 0.94
> > nosmt 649.95 1.44
> > Aaron's full patchset: 586.06 1.77
> > Aaron's first 2 patches: 630.08 47.30
> > Aaron's 3rd patch alone: 1086.65 246.54
> > Tim's full patchset: 852.50 4.11
> > Tim's full patchset + sched: 390.49 15.76
> >
> > Sysbench mem tagged, sysbench cpu untagged
> > ------------------------------------------
> > Test Average Stdev
> > Alone 1306.90 0.94
> > nosmt 649.95 1.44
> > Aaron's full patchset: 583.77 3.52
> > Aaron's first 2 patches: 513.63 63.09
> > Aaron's 3rd patch alone: 1171.23 3.35
> > Tim's full patchset: 564.04 58.05
> > Tim's full patchset + sched: 1026.16 49.43
> >
> > Both sysbench tagged
> > --------------------
> > Test Average Stdev
> > Alone 1306.90 0.94
> > nosmt 649.95 1.44
> > Aaron's full patchset: 582.15 3.75
> > Aaron's first 2 patches: 561.07 91.61
> > Aaron's 3rd patch alone: 638.49 231.06
> > Tim's full patchset: 679.43 70.07
> > Tim's full patchset + sched: 664.34 210.14
> >
>
> Sorry if I'm missing something obvious here but with only 2 processes
> of interest shouldn't one tagged and one untagged be about the same
> as both tagged?

It should.

> In both cases the 2 sysbenches should not be running on the core at
> the same time.

Agree.

> There will be times when oher non-related threads could share the core
> with the untagged one. Is that enough to account for this difference?

What difference do you mean?

Thanks,
Aaron

> > So in terms of fairness, Aaron's full patchset is the most consistent, but only
> > Tim's patchset performs better than nosmt in some conditions.
> >
> > Of course, this is one of the worst case scenario, as soon as we have
> > multithreaded applications on overcommitted systems, core scheduling performs
> > better than nosmt.
> >
> > Thanks,
> >
> > Julien
>
> --

2019-08-06 14:18:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 06, 2019 at 08:24:17AM -0400, Vineeth Remanan Pillai wrote:
> Peter's rebalance logic actually takes care of most of the runq
> imbalance caused
> due to cookie tagging. What we have found from our testing is, fairness issue is
> caused mostly due to a Hyperthread going idle and not waking up. Aaron's 3rd
> patch works around that. As Julien mentioned, we are working on a per thread
> coresched idle thread concept. The problem that we found was, idle thread causes
> accounting issues and wakeup issues as it was not designed to be used in this
> context. So if we can have a low priority thread which looks like any other task
> to the scheduler, things becomes easy for the scheduler and we achieve security
> as well. Please share your thoughts on this idea.

What accounting in particular is upset? Is it things like
select_idle_sibling() that thinks the thread is idle and tries to place
tasks there?

It should be possible to change idle_cpu() to not report a forced-idle
CPU as idle.

(also; it should be possible to optimize select_idle_sibling() for the
core-sched case specifically)
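
For illustration, that change could look roughly like the sketch below,
reusing the rq->core_forceidle flag the series already maintains (where
exactly the check sits is an assumption):

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SCHED_CORE
	/* a forced-idle sibling is not really idle */
	if (sched_core_enabled(rq) && rq->core_forceidle)
		return 0;
#endif

#ifdef CONFIG_SMP
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}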

> The results are encouraging, but we do not yet have the coresched idle
> to not spin 100%. We will soon post the patch once it is a bit more
> stable for running the tests that we all have done so far.

There's play_idle(), which is the entry point for idle injection.

In general, I don't particularly like 'fake' idle threads, please be
very specific in describing what issues it works around such that we can
look at alternatives.

2019-08-06 14:19:32

by Phil Auld

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 06, 2019 at 09:54:01PM +0800 Aaron Lu wrote:
> On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
> > Hi,
> >
> > On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> > > We tested both Aaron's and Tim's patches and here are our results.
> > >
> > > Test setup:
> > > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > > mem benchmark
> > > - both started at the same time
> > > - both are pinned on the same core (2 hardware threads)
> > > - 10 30-seconds runs
> > > - test script: https://paste.debian.net/plainh/834cf45c
> > > - only showing the CPU events/sec (higher is better)
> > > - tested 4 tag configurations:
> > > - no tag
> > > - sysbench mem untagged, sysbench cpu tagged
> > > - sysbench mem tagged, sysbench cpu untagged
> > > - both tagged with a different tag
> > > - "Alone" is the sysbench CPU running alone on the core, no tag
> > > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > > combined with Aaron's "hack patch" to get rid of the remaining deep
> > > idle cases
> > > - In all test cases, both tasks can run simultaneously (which was not
> > > the case without those patches), but the standard deviation is a
> > > pretty good indicator of the fairness/consistency.
> > >
> > > No tag
> > > ------
> > > Test Average Stdev
> > > Alone 1306.90 0.94
> > > nosmt 649.95 1.44
> > > Aaron's full patchset: 828.15 32.45
> > > Aaron's first 2 patches: 832.12 36.53
> > > Aaron's 3rd patch alone: 864.21 3.68
> > > Tim's full patchset: 852.50 4.11
> > > Tim's full patchset + sched: 852.59 8.25
> > >
> > > Sysbench mem untagged, sysbench cpu tagged
> > > ------------------------------------------
> > > Test Average Stdev
> > > Alone 1306.90 0.94
> > > nosmt 649.95 1.44
> > > Aaron's full patchset: 586.06 1.77
> > > Aaron's first 2 patches: 630.08 47.30
> > > Aaron's 3rd patch alone: 1086.65 246.54
> > > Tim's full patchset: 852.50 4.11
> > > Tim's full patchset + sched: 390.49 15.76
> > >
> > > Sysbench mem tagged, sysbench cpu untagged
> > > ------------------------------------------
> > > Test Average Stdev
> > > Alone 1306.90 0.94
> > > nosmt 649.95 1.44
> > > Aaron's full patchset: 583.77 3.52
> > > Aaron's first 2 patches: 513.63 63.09
> > > Aaron's 3rd patch alone: 1171.23 3.35
> > > Tim's full patchset: 564.04 58.05
> > > Tim's full patchset + sched: 1026.16 49.43
> > >
> > > Both sysbench tagged
> > > --------------------
> > > Test Average Stdev
> > > Alone 1306.90 0.94
> > > nosmt 649.95 1.44
> > > Aaron's full patchset: 582.15 3.75
> > > Aaron's first 2 patches: 561.07 91.61
> > > Aaron's 3rd patch alone: 638.49 231.06
> > > Tim's full patchset: 679.43 70.07
> > > Tim's full patchset + sched: 664.34 210.14
> > >
> >
> > Sorry if I'm missing something obvious here but with only 2 processes
> > of interest shouldn't one tagged and one untagged be about the same
> > as both tagged?
>
> It should.
>
> > In both cases the 2 sysbenches should not be running on the core at
> > the same time.
>
> Agree.
>
> > There will be times when oher non-related threads could share the core
> > with the untagged one. Is that enough to account for this difference?
>
> What difference do you mean?


I was looking at the above posted numbers. For example:

> > > Sysbench mem untagged, sysbench cpu tagged
> > > Aaron's 3rd patch alone: 1086.65 246.54

> > > Sysbench mem tagged, sysbench cpu untagged
> > > Aaron's 3rd patch alone: 1171.23 3.35

> > > Both sysbench tagged
> > > Aaron's 3rd patch alone: 638.49 231.06


Admittedly, there's some high variance on some of those numbers.


Cheers,
Phil

>
> Thanks,
> Aaron
>
> > > So in terms of fairness, Aaron's full patchset is the most consistent, but only
> > > Tim's patchset performs better than nosmt in some conditions.
> > >
> > > Of course, this is one of the worst case scenario, as soon as we have
> > > multithreaded applications on overcommitted systems, core scheduling performs
> > > better than nosmt.
> > >
> > > Thanks,
> > >
> > > Julien
> >
> > --

--

2019-08-06 14:56:47

by Phil Auld

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 06, 2019 at 10:41:25PM +0800 Aaron Lu wrote:
> On 2019/8/6 22:17, Phil Auld wrote:
> > On Tue, Aug 06, 2019 at 09:54:01PM +0800 Aaron Lu wrote:
> >> On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
> >>> Hi,
> >>>
> >>> On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> >>>> We tested both Aaron's and Tim's patches and here are our results.
> >>>>
> >>>> Test setup:
> >>>> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> >>>> mem benchmark
> >>>> - both started at the same time
> >>>> - both are pinned on the same core (2 hardware threads)
> >>>> - 10 30-seconds runs
> >>>> - test script: https://paste.debian.net/plainh/834cf45c
> >>>> - only showing the CPU events/sec (higher is better)
> >>>> - tested 4 tag configurations:
> >>>> - no tag
> >>>> - sysbench mem untagged, sysbench cpu tagged
> >>>> - sysbench mem tagged, sysbench cpu untagged
> >>>> - both tagged with a different tag
> >>>> - "Alone" is the sysbench CPU running alone on the core, no tag
> >>>> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> >>>> - "Tim's full patchset + sched" is an experiment with Tim's patchset
> >>>> combined with Aaron's "hack patch" to get rid of the remaining deep
> >>>> idle cases
> >>>> - In all test cases, both tasks can run simultaneously (which was not
> >>>> the case without those patches), but the standard deviation is a
> >>>> pretty good indicator of the fairness/consistency.
> >>>>
> >>>> No tag
> >>>> ------
> >>>> Test Average Stdev
> >>>> Alone 1306.90 0.94
> >>>> nosmt 649.95 1.44
> >>>> Aaron's full patchset: 828.15 32.45
> >>>> Aaron's first 2 patches: 832.12 36.53
> >>>> Aaron's 3rd patch alone: 864.21 3.68
> >>>> Tim's full patchset: 852.50 4.11
> >>>> Tim's full patchset + sched: 852.59 8.25
> >>>>
> >>>> Sysbench mem untagged, sysbench cpu tagged
> >>>> ------------------------------------------
> >>>> Test Average Stdev
> >>>> Alone 1306.90 0.94
> >>>> nosmt 649.95 1.44
> >>>> Aaron's full patchset: 586.06 1.77
> >>>> Aaron's first 2 patches: 630.08 47.30
> >>>> Aaron's 3rd patch alone: 1086.65 246.54
> >>>> Tim's full patchset: 852.50 4.11
> >>>> Tim's full patchset + sched: 390.49 15.76
> >>>>
> >>>> Sysbench mem tagged, sysbench cpu untagged
> >>>> ------------------------------------------
> >>>> Test Average Stdev
> >>>> Alone 1306.90 0.94
> >>>> nosmt 649.95 1.44
> >>>> Aaron's full patchset: 583.77 3.52
> >>>> Aaron's first 2 patches: 513.63 63.09
> >>>> Aaron's 3rd patch alone: 1171.23 3.35
> >>>> Tim's full patchset: 564.04 58.05
> >>>> Tim's full patchset + sched: 1026.16 49.43
> >>>>
> >>>> Both sysbench tagged
> >>>> --------------------
> >>>> Test Average Stdev
> >>>> Alone 1306.90 0.94
> >>>> nosmt 649.95 1.44
> >>>> Aaron's full patchset: 582.15 3.75
> >>>> Aaron's first 2 patches: 561.07 91.61
> >>>> Aaron's 3rd patch alone: 638.49 231.06
> >>>> Tim's full patchset: 679.43 70.07
> >>>> Tim's full patchset + sched: 664.34 210.14
> >>>>
> >>>
> >>> Sorry if I'm missing something obvious here but with only 2 processes
> >>> of interest shouldn't one tagged and one untagged be about the same
> >>> as both tagged?
> >>
> >> It should.
> >>
> >>> In both cases the 2 sysbenches should not be running on the core at
> >>> the same time.
> >>
> >> Agree.
> >>
> >>> There will be times when oher non-related threads could share the core
> >>> with the untagged one. Is that enough to account for this difference?
> >>
> >> What difference do you mean?
> >
> >
> > I was looking at the above posted numbers. For example:
> >
> >>>> Sysbench mem untagged, sysbench cpu tagged
> >>>> Aaron's 3rd patch alone: 1086.65 246.54
> >
> >>>> Sysbench mem tagged, sysbench cpu untagged
> >>>> Aaron's 3rd patch alone: 1171.23 3.35
> >
> >>>> Both sysbench tagged
> >>>> Aaron's 3rd patch alone: 638.49 231.06
> >
> >
> > Admittedly, there's some high variance on some of those numbers.
>
> The high variance suggests the code having some fairness issues :-)
>
> For the test here, I didn't expect the 3rd patch being used alone
> since the fairness is solved by patch2 and patch3 together.

Makes sense, thanks.


--

2019-08-06 15:28:05

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/8/6 22:17, Phil Auld wrote:
> On Tue, Aug 06, 2019 at 09:54:01PM +0800 Aaron Lu wrote:
>> On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
>>> Hi,
>>>
>>> On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
>>>> We tested both Aaron's and Tim's patches and here are our results.
>>>>
>>>> Test setup:
>>>> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
>>>> mem benchmark
>>>> - both started at the same time
>>>> - both are pinned on the same core (2 hardware threads)
>>>> - 10 30-seconds runs
>>>> - test script: https://paste.debian.net/plainh/834cf45c
>>>> - only showing the CPU events/sec (higher is better)
>>>> - tested 4 tag configurations:
>>>> - no tag
>>>> - sysbench mem untagged, sysbench cpu tagged
>>>> - sysbench mem tagged, sysbench cpu untagged
>>>> - both tagged with a different tag
>>>> - "Alone" is the sysbench CPU running alone on the core, no tag
>>>> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
>>>> - "Tim's full patchset + sched" is an experiment with Tim's patchset
>>>> combined with Aaron's "hack patch" to get rid of the remaining deep
>>>> idle cases
>>>> - In all test cases, both tasks can run simultaneously (which was not
>>>> the case without those patches), but the standard deviation is a
>>>> pretty good indicator of the fairness/consistency.
>>>>
>>>> No tag
>>>> ------
>>>> Test Average Stdev
>>>> Alone 1306.90 0.94
>>>> nosmt 649.95 1.44
>>>> Aaron's full patchset: 828.15 32.45
>>>> Aaron's first 2 patches: 832.12 36.53
>>>> Aaron's 3rd patch alone: 864.21 3.68
>>>> Tim's full patchset: 852.50 4.11
>>>> Tim's full patchset + sched: 852.59 8.25
>>>>
>>>> Sysbench mem untagged, sysbench cpu tagged
>>>> ------------------------------------------
>>>> Test Average Stdev
>>>> Alone 1306.90 0.94
>>>> nosmt 649.95 1.44
>>>> Aaron's full patchset: 586.06 1.77
>>>> Aaron's first 2 patches: 630.08 47.30
>>>> Aaron's 3rd patch alone: 1086.65 246.54
>>>> Tim's full patchset: 852.50 4.11
>>>> Tim's full patchset + sched: 390.49 15.76
>>>>
>>>> Sysbench mem tagged, sysbench cpu untagged
>>>> ------------------------------------------
>>>> Test Average Stdev
>>>> Alone 1306.90 0.94
>>>> nosmt 649.95 1.44
>>>> Aaron's full patchset: 583.77 3.52
>>>> Aaron's first 2 patches: 513.63 63.09
>>>> Aaron's 3rd patch alone: 1171.23 3.35
>>>> Tim's full patchset: 564.04 58.05
>>>> Tim's full patchset + sched: 1026.16 49.43
>>>>
>>>> Both sysbench tagged
>>>> --------------------
>>>> Test Average Stdev
>>>> Alone 1306.90 0.94
>>>> nosmt 649.95 1.44
>>>> Aaron's full patchset: 582.15 3.75
>>>> Aaron's first 2 patches: 561.07 91.61
>>>> Aaron's 3rd patch alone: 638.49 231.06
>>>> Tim's full patchset: 679.43 70.07
>>>> Tim's full patchset + sched: 664.34 210.14
>>>>
>>>
>>> Sorry if I'm missing something obvious here but with only 2 processes
>>> of interest shouldn't one tagged and one untagged be about the same
>>> as both tagged?
>>
>> It should.
>>
>>> In both cases the 2 sysbenches should not be running on the core at
>>> the same time.
>>
>> Agree.
>>
>>> There will be times when oher non-related threads could share the core
>>> with the untagged one. Is that enough to account for this difference?
>>
>> What difference do you mean?
>
>
> I was looking at the above posted numbers. For example:
>
>>>> Sysbench mem untagged, sysbench cpu tagged
>>>> Aaron's 3rd patch alone: 1086.65 246.54
>
>>>> Sysbench mem tagged, sysbench cpu untagged
>>>> Aaron's 3rd patch alone: 1171.23 3.35
>
>>>> Both sysbench tagged
>>>> Aaron's 3rd patch alone: 638.49 231.06
>
>
> Admittedly, there's some high variance on some of those numbers.

The high variance suggests the code has some fairness issues :-)

For the test here, I didn't expect the 3rd patch to be used alone,
since fairness is solved by patch2 and patch3 together.

2019-08-06 16:15:50

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> I think tenant will have per core weight, similar to sched entity's per
> cpu weight. The tenant's per core weight could derive from its
> corresponding taskgroup's per cpu sched entities' weight(sum them up
> perhaps). Tenant with higher weight will have its core wide vruntime
> advance slower than tenant with lower weight. Does this address the
> issue here?
>
I think that makes sense and should work. We should also consider how to
classify untagged processes so that they are not starved.

>
> Care to elaborate the idea of coresched idle thread concept?
> How it solved the hyperthread going idle problem and what the accounting
> issues and wakeup issues are, etc.
>
So we have one coresched_idle thread per cpu, and when a sibling cannot
find a match, instead of forcing idle we schedule this new thread. Ideally
this thread would be similar to idle, but the scheduler then doesn't
confuse an idle cpu with the forced-idle state. This also invokes schedule()
as vruntime progresses (an alternative to your 3rd patch) and vruntime
accounting gets more consistent. There are special cases that need to be
handled so that coresched_idle never gets scheduled in the normal
scheduling path (without coresched), etc. Hope this clarifies.
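
In rough terms (the per-cpu variable and helper below are assumptions used
only to sketch the idea, not code from our series):

/* Sketch only: one low-priority kthread per cpu, handed out when the
 * core-wide pick leaves this sibling without a compatible task. */
static DEFINE_PER_CPU(struct task_struct *, coresched_idle_task);

static struct task_struct *pick_coresched_idle(struct rq *rq)
{
	struct task_struct *p = per_cpu(coresched_idle_task, cpu_of(rq));

	/* fall back to the real idle task if the kthread is not set up */
	return p ? p : rq->idle;
}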

But as Peter suggested, if we can differentiate idle from forced idle in
the idle thread and account for the vruntime progress, that would be a
better approach.

Thanks,
Vineeth

2019-08-06 17:13:51

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

>
> What accounting in particular is upset? Is it things like
> select_idle_sibling() that thinks the thread is idle and tries to place
> tasks there?
>
The major issue that we saw was that certain workloads cause the idle cpu
to never wake up and schedule again, even when there are runnable threads
on it. If I remember correctly, this happened when the sibling had only one
cpu-intensive task and did not enter pick_next_task for a long time. There
were other situations as well which caused this prolonged idle state on the
cpu. One was when pick_next_task was called on the sibling, but it always
won there because vruntime was not progressing on the idle cpu.

Having a coresched idle makes sure that the idle thread is not overloaded.
Also vruntime moves forward, and task vruntime comparison across cpus works
when we normalize.

> It should be possible to change idle_cpu() to not report a forced-idle
> CPU as idle.
I agree. If we can identify all the places the idle thread is considered
special and also account for the vruntime progress for forced idle, this
should be a better approach compared to a coresched idle thread per cpu.

>
> (also; it should be possible to optimize select_idle_sibling() for the
> core-sched case specifically)
>
We haven't seen this because most of our micro test cases did not have more
threads than cpus. Thanks for pointing this out; we shall cook up some tests
to observe this behavior.

Thanks,
Vineeth

2019-08-06 17:46:17

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/5/19 8:24 PM, Aaron Lu wrote:

> I've been thinking if we should consider core wide tenent fairness?
>
> Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
> (e.g. A1, A2) belong to tenent A and the 3rd B1 belong to another tenent
> B. Assume A1 and B1 are queued on the same thread and A2 on the other
> thread, when we decide priority for A1 and B1, shall we also consider
> A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
> belong to the same tenent? I tend to think we should make fairness per
> core per tenent, instead of per thread(cpu) per task(sched entity). What
> do you guys think?
>
> Implemention of the idea is a mess to me, as I feel I'm duplicating the
> existing per cpu per sched_entity enqueue/update vruntime/dequeue logic
> for the per core per tenent stuff.

I'm wondering if something simpler will work. It is easier to maintain fairness
between the CPU threads. A simple scheme may be: if the force idle deficit
on a CPU thread exceeds a threshold compared to its sibling, we bias in
favor of choosing the task on the suppressed CPU thread.
The fairness among the tenants per run queue is balanced out by CFS fairness,
so things should be fair if we maintain fairness in CPU utilization between
the two CPU threads.

Tim

2019-08-06 18:20:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 06, 2019 at 10:03:29AM -0700, Tim Chen wrote:
> On 8/5/19 8:24 PM, Aaron Lu wrote:
>
> > I've been thinking if we should consider core wide tenent fairness?
> >
> > Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
> > (e.g. A1, A2) belong to tenent A and the 3rd B1 belong to another tenent
> > B. Assume A1 and B1 are queued on the same thread and A2 on the other
> > thread, when we decide priority for A1 and B1, shall we also consider
> > A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
> > belong to the same tenent? I tend to think we should make fairness per
> > core per tenent, instead of per thread(cpu) per task(sched entity). What
> > do you guys think?
> >
> > Implemention of the idea is a mess to me, as I feel I'm duplicating the
> > existing per cpu per sched_entity enqueue/update vruntime/dequeue logic
> > for the per core per tenent stuff.
>
> I'm wondering if something simpler will work. It is easier to maintain fairness
> between the CPU threads. A simple scheme may be if the force idle deficit
> on a CPU thread exceeds a threshold compared to its sibling, we will
> bias in choosing the task on the suppressed CPU thread.
> The fairness among the tenents per run queue is balanced out by cfq fairness,
> so things should be fair if we maintain fairness in CPU utilization between
> the two CPU threads.

IIRC pjt once did a simple 5ms flip-flop between siblings.

2019-08-06 21:20:50

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/6/19 10:12 AM, Peter Zijlstra wrote:

>> I'm wondering if something simpler will work. It is easier to maintain fairness
>> between the CPU threads. A simple scheme may be if the force idle deficit
>> on a CPU thread exceeds a threshold compared to its sibling, we will
>> bias in choosing the task on the suppressed CPU thread.
>> The fairness among the tenents per run queue is balanced out by cfq fairness,
>> so things should be fair if we maintain fairness in CPU utilization between
>> the two CPU threads.
>
> IIRC pjt once did a simle 5ms flip flop between siblings.
>

Trying out Peter's suggestions in the following two patches on v3 to
provide fairness between the CPU threads.
The change is in patch 2; patch 1 is simply a code reorg.

It is only minimally tested and seems to provide fairness between
two will-it-scale cgroups. Haven't tried it yet on something that
is less CPU intensive with lots of sleep in between.

Also need to add the idle time accounting for the rt and dl sched classes.
Will do that later if this works for the fair class.

Tim

----------------patch 1-----------------------

From ede10309986a6b1bcc82d317f86a5b06459d76bd Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Wed, 24 Jul 2019 13:58:18 -0700
Subject: [PATCH 1/2] sched: Move sched fair prio comparison to fair.c

Consolidate the task priority comparison of the fair class
to fair.c. A simple code reorganization and there are no functional changes.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 21 ++-------------------
kernel/sched/fair.c | 21 +++++++++++++++++++++
kernel/sched/sched.h | 1 +
3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e3cd9cb17809..567eba50dc38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);

- if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 a_vruntime = a->se.vruntime;
- u64 b_vruntime = b->se.vruntime;
-
- /*
- * Normalize the vruntime if tasks are in different cpus.
- */
- if (task_cpu(a) != task_cpu(b)) {
- b_vruntime -= task_cfs_rq(b)->min_vruntime;
- b_vruntime += task_cfs_rq(a)->min_vruntime;
-
- trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
- a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
- b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
- }
-
- return !((s64)(a_vruntime - b_vruntime) <= 0);
- }
+ if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+ return prio_less_fair(a, b);

return false;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02bff10237d4..e289b6e1545b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -602,6 +602,27 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
return delta;
}

+bool prio_less_fair(struct task_struct *a, struct task_struct *b)
+{
+ u64 a_vruntime = a->se.vruntime;
+ u64 b_vruntime = b->se.vruntime;
+
+ /*
+ * Normalize the vruntime if tasks are in different cpus.
+ */
+ if (task_cpu(a) != task_cpu(b)) {
+ b_vruntime -= task_cfs_rq(b)->min_vruntime;
+ b_vruntime += task_cfs_rq(a)->min_vruntime;
+
+ trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+ a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+ b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+ }
+
+ return !((s64)(a_vruntime - b_vruntime) <= 0);
+}
+
/*
* The idea is to set a period in which each task runs once.
*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..bdabe7ce1152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
}

extern void queue_core_balance(struct rq *rq);
+extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);

#else /* !CONFIG_SCHED_CORE */

--
2.20.1


--------------------patch 2------------------------

From e27305b0042631382bd4f72260d579bf4c971d2f Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Tue, 6 Aug 2019 12:50:45 -0700
Subject: [PATCH 2/2] sched: Enforce fairness between cpu threads

CPU thread could be suppressed by its sibling for extended time.
Implement a budget for force idling, making all CPU threads have
equal chance to run.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 12 ++++++++++++
kernel/sched/sched.h | 4 ++++
3 files changed, 57 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 567eba50dc38..e22042883723 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -207,6 +207,44 @@ static struct task_struct *sched_core_next(struct task_struct *p, unsigned long
return p;
}

+void account_core_idletime(struct task_struct *p, u64 exec)
+{
+ const struct cpumask *smt_mask;
+ struct rq *rq;
+ bool force_idle, refill;
+ int i, cpu;
+
+ rq = task_rq(p);
+ if (!sched_core_enabled(rq) || !p->core_cookie)
+ return;
+
+ cpu = task_cpu(p);
+ force_idle = false;
+ refill = true;
+ smt_mask = cpu_smt_mask(cpu);
+
+ for_each_cpu(i, smt_mask) {
+ if (cpu == i)
+ continue;
+
+ if (cpu_rq(i)->core_forceidle)
+ force_idle = true;
+
+ /* Only refill if everyone has run out of allowance */
+ if (cpu_rq(i)->core_idle_allowance > 0)
+ refill = false;
+ }
+
+ if (force_idle)
+ rq->core_idle_allowance -= (s64) exec;
+
+ if (rq->core_idle_allowance < 0 && refill) {
+ for_each_cpu(i, smt_mask) {
+ cpu_rq(i)->core_idle_allowance += (s64) SCHED_IDLE_ALLOWANCE;
+ }
+ }
+}
+
/*
* The static-key + stop-machine variable are needed such that:
*
@@ -273,6 +311,8 @@ void sched_core_put(void)

static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void account_core_idletime(struct task_struct *p, u64 exec) { }
+

#endif /* CONFIG_SCHED_CORE */

@@ -6773,6 +6813,7 @@ void __init sched_init(void)
rq->core_enabled = 0;
rq->core_tree = RB_ROOT;
rq->core_forceidle = false;
+ rq->core_idle_allowance = (s64) SCHED_IDLE_ALLOWANCE;

rq->core_cookie = 0UL;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e289b6e1545b..d4f9ea03296e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -611,6 +611,17 @@ bool prio_less_fair(struct task_struct *a, struct task_struct *b)
* Normalize the vruntime if tasks are in different cpus.
*/
if (task_cpu(a) != task_cpu(b)) {
+
+ if (a->core_cookie && b->core_cookie &&
+ a->core_cookie != b->core_cookie) {
+ /*
+ * Will be force idling one thread,
+ * pick the thread that has more allowance.
+ */
+ return (task_rq(a)->core_idle_allowance <=
+ task_rq(b)->core_idle_allowance) ? true : false;
+ }
+
b_vruntime -= task_cfs_rq(b)->min_vruntime;
b_vruntime += task_cfs_rq(a)->min_vruntime;

@@ -817,6 +828,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cgroup_account_cputime(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
+ account_core_idletime(curtask, delta_exec);
}

account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bdabe7ce1152..927334b2078c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -963,6 +963,7 @@ struct rq {
struct task_struct *core_pick;
unsigned int core_enabled;
unsigned int core_sched_seq;
+ s64 core_idle_allowance;
struct rb_root core_tree;
bool core_forceidle;

@@ -999,6 +1000,8 @@ static inline int cpu_of(struct rq *rq)
}

#ifdef CONFIG_SCHED_CORE
+#define SCHED_IDLE_ALLOWANCE 5000000 /* 5 msec */
+
DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);

static inline bool sched_core_enabled(struct rq *rq)
@@ -1016,6 +1019,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)

extern void queue_core_balance(struct rq *rq);
extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);
+extern void account_core_idletime(struct task_struct *p, u64 exec);

#else /* !CONFIG_SCHED_CORE */

--
2.20.1

2019-08-07 09:03:14

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

Hello everyone,

This is Dario, from SUSE. I'm also interested in core scheduling, and in
using it in virtualization use cases.

Just for context, I've been working in virt for a few years, mostly on Xen,
but I've done Linux stuff before, and I am getting back into it.

For now, I've been looking at the core-scheduling code, and run some
benchmarks myself.

On Fri, 2019-08-02 at 11:37 -0400, Julien Desfossez wrote:
> We tested both Aaron's and Tim's patches and here are our results.
>
> Test setup:
> - 2 1-thread sysbench, one running the cpu benchmark, the other one
> the
> mem benchmark
> - both started at the same time
> - both are pinned on the same core (2 hardware threads)
> - 10 30-seconds runs
> - test script: https://paste.debian.net/plainh/834cf45c
> - only showing the CPU events/sec (higher is better)
> - tested 4 tag configurations:
> - no tag
> - sysbench mem untagged, sysbench cpu tagged
> - sysbench mem tagged, sysbench cpu untagged
> - both tagged with a different tag
> - "Alone" is the sysbench CPU running alone on the core, no tag
> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> - "Tim's full patchset + sched" is an experiment with Tim's patchset
> combined with Aaron's "hack patch" to get rid of the remaining deep
> idle cases
> - In all test cases, both tasks can run simultaneously (which was not
> the case without those patches), but the standard deviation is a
> pretty good indicator of the fairness/consistency.
>
This, and of course the numbers below too, is very interesting.

So, here comes my question: I've done a benchmarking campaign (yes,
I'll post numbers soon) using this branch:

https://github.com/digitalocean/linux-coresched.git vpillai/coresched-v3-v5.1.5-test
https://github.com/digitalocean/linux-coresched/tree/vpillai/coresched-v3-v5.1.5-test

Last commit:
7feb1007f274 "Fix stalling of untagged processes competing with tagged
processes"

Since I see that, in this thread, there are various patches being
proposed and discussed... should I rerun my benchmarks with them
applied? If yes, which ones? And is there, by any chance, one (or maybe
more than one) updated git branch(es)?

Thanks in advance and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-08-07 17:11:40

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/7/19 1:58 AM, Dario Faggioli wrote:

> So, here comes my question: I've done a benchmarking campaign (yes,
> I'll post numbers soon) using this branch:
>
> https://github.com/digitalocean/linux-coresched.git vpillai/coresched-v3-v5.1.5-test
> https://github.com/digitalocean/linux-coresched/tree/vpillai/coresched-v3-v5.1.5-test
>
> Last commit:
> 7feb1007f274 "Fix stalling of untagged processes competing with tagged
> processes"
>
> Since I see that, in this thread, there are various patches being
> proposed and discussed... should I rerun my benchmarks with them
> applied? If yes, which ones? And is there, by any chance, one (or maybe
> more than one) updated git branch(es)?
>
> Thanks in advance and Regards
>

Hi Dario,

Having an extra set of eyes is certainly welcome.
I'll give my 2 cents on the issues with v3.
Others feel free to chime in if my understanding is
incorrect or I'm missing something.

1) Unfairness between the sibling threads
-----------------------------------------
One sibling thread could be suppressing and force idling the other
sibling thread disproportionately, resulting in the forced-idle CPU not
getting to run and tasks stalling on the suppressed CPU.

Status:
i) Aaron has proposed a patchset here based on using one
rq as a base reference for vruntime for task priority
comparison between siblings.

https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
It works well on fairness but has some initialization issues

ii) Tim has proposed a patchset here to account for forced
idle time in rq's min_vruntime
https://lore.kernel.org/lkml/[email protected]/
It improves over v3 with simpler logic compared to
Aaron's patch, but does not work as well on fairness

iii) Tim has proposed yet another patch to maintain fairness
of forced idle time between CPU threads per Peter's suggestion.
https://lore.kernel.org/lkml/[email protected]/
Its performance has yet to be tested.

2) Not rescheduling forced idled CPU
------------------------------------
The forced idled CPU does not get a chance to re-schedule
itself, and will stall for a long time even though it
has eligible tasks to run.

Status:
i) Aaron proposed a patch to fix this by checking whether there
are runnable tasks when the scheduling tick comes in.
https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/

ii) Vineeth has patches for this issue and also issue 1, based
on scheduling in a new "forced idle task" when getting forced
idle, but has yet to post the patches.

3) Load balancing between CPU cores
-----------------------------------
Say one CPU core's sibling threads get force idled a lot because the core
has mostly incompatible tasks between its siblings; moving the incompatible
load to other cores and pulling compatible load to the core could help CPU
utilization.

So just considering the load of a task is not enough during
load balancing; task compatibility also needs to be considered.
Peter has put in mechanisms to balance compatible tasks between
CPU thread siblings, but not across cores.

Status:
I have not seen patches on this issue. This issue could lead to
large variance in workload performance based on your luck
in placing the workload among the cores.

Thanks.

Tim

2019-08-08 06:49:40

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 06, 2019 at 02:19:57PM -0700, Tim Chen wrote:
> +void account_core_idletime(struct task_struct *p, u64 exec)
> +{
> + const struct cpumask *smt_mask;
> + struct rq *rq;
> + bool force_idle, refill;
> + int i, cpu;
> +
> + rq = task_rq(p);
> + if (!sched_core_enabled(rq) || !p->core_cookie)
> + return;

I don't see why we return here for untagged tasks. An untagged task can also
preempt a tagged task and force a CPU thread to enter the idle state.
Untagged is just another tag to me, unless we want to allow an untagged
task to be co-scheduled with a tagged task.

> + cpu = task_cpu(p);
> + force_idle = false;
> + refill = true;
> + smt_mask = cpu_smt_mask(cpu);
> +
> + for_each_cpu(i, smt_mask) {
> + if (cpu == i)
> + continue;
> +
> + if (cpu_rq(i)->core_forceidle)
> + force_idle = true;
> +
> + /* Only refill if everyone has run out of allowance */
> + if (cpu_rq(i)->core_idle_allowance > 0)
> + refill = false;
> + }
> +
> + if (force_idle)
> + rq->core_idle_allowance -= (s64) exec;
> +
> + if (rq->core_idle_allowance < 0 && refill) {
> + for_each_cpu(i, smt_mask) {
> + cpu_rq(i)->core_idle_allowance += (s64) SCHED_IDLE_ALLOWANCE;
> + }
> + }
> +}
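
A sketch of the adjustment being suggested here (not a posted patch) would
be to drop the cookie test from the early return and keep only the
core-scheduling check:

	/* account for every task: an untagged one can force a sibling idle too */
	rq = task_rq(p);
	if (!sched_core_enabled(rq))
		return;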

2019-08-08 12:57:12

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> On 8/2/19 8:37 AM, Julien Desfossez wrote:
> > We tested both Aaron's and Tim's patches and here are our results.
> >
> > Test setup:
> > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > mem benchmark
> > - both started at the same time
> > - both are pinned on the same core (2 hardware threads)
> > - 10 30-seconds runs
> > - test script: https://paste.debian.net/plainh/834cf45c
> > - only showing the CPU events/sec (higher is better)
> > - tested 4 tag configurations:
> > - no tag
> > - sysbench mem untagged, sysbench cpu tagged
> > - sysbench mem tagged, sysbench cpu untagged
> > - both tagged with a different tag
> > - "Alone" is the sysbench CPU running alone on the core, no tag
> > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > combined with Aaron's "hack patch" to get rid of the remaining deep
> > idle cases
> > - In all test cases, both tasks can run simultaneously (which was not
> > the case without those patches), but the standard deviation is a
> > pretty good indicator of the fairness/consistency.
>
> Thanks for testing the patches and giving such detailed data.
>
> I came to realize that for my scheme, the accumulated deficit of forced idle could be wiped
> out in one execution of a task on the forced idle cpu, with the update of the min_vruntime,
> even if the execution time could be far less than the accumulated deficit.
> That's probably one reason my scheme didn't achieve fairness.

Turns out there is a typo in v3 when setting the rq's core_forceidle:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26fea68f7f54..542974a8da18 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3888,7 +3888,7 @@ next_class:;
WARN_ON_ONCE(!rq_i->core_pick);

if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
- rq->core_forceidle = true;
+ rq_i->core_forceidle = true;

rq_i->core_pick->core_occupation = occ;

With this fixed and together with the patch to let schedule always
happen, your latest 2 patches work well for the 10s cpuhog test I
described previously:
https://lore.kernel.org/lkml/20190725143003.GA992@aaronlu/

An overloaded workload without any cpu binding doesn't work well though;
I haven't taken a closer look yet.

2019-08-08 17:42:21

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/8/19 5:55 AM, Aaron Lu wrote:
> On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
>> On 8/2/19 8:37 AM, Julien Desfossez wrote:
>>> We tested both Aaron's and Tim's patches and here are our results.

>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26fea68f7f54..542974a8da18 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3888,7 +3888,7 @@ next_class:;
> WARN_ON_ONCE(!rq_i->core_pick);
>
> if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> - rq->core_forceidle = true;
> + rq_i->core_forceidle = true;

Good catch!

>
> rq_i->core_pick->core_occupation = occ;
>
> With this fixed and together with the patch to let schedule always
> happen, your latest 2 patches work well for the 10s cpuhog test I
> described previously:
> https://lore.kernel.org/lkml/20190725143003.GA992@aaronlu/

That's encouraging. You are talking about my patches
that try to keep the force idle time between sibling threads
balanced, right?

>
> overloaded workload without any cpu binding doesn't work well though, I
> haven't taken a closer look yet.
>

I think we need a load balancing scheme among the cores that will try
to minimize force idle.

One possible metric to measure the load-compatibility imbalance that leads to
force idle is:

Say i, j are sibling threads of a cpu core
imbalance = \sum_{tagged cgroups} abs(Load_cgroup_cpu_i - Load_cgroup_cpu_j)

This gives us a metric to decide if migrating a task will improve the
load-compatibility imbalance. As we already track cgroup load on a CPU,
it should be doable without adding too much overhead.
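
For illustration, a sketch of computing that metric for one core's sibling
pair; iterating the tagged task groups via a tg->tagged flag is an
assumption, while tg->cfs_rq[cpu]->avg.load_avg is the per-cpu cgroup load
we already track:

/* Sketch only: compatibility-imbalance metric for sibling cpus i and j. */
static unsigned long core_cookie_imbalance(int cpu_i, int cpu_j)
{
	struct task_group *tg;
	unsigned long imbalance = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(tg, &task_groups, list) {
		unsigned long li, lj;

		if (!tg->tagged)	/* assumed: only cookie'd (tagged) groups */
			continue;

		li = tg->cfs_rq[cpu_i]->avg.load_avg;
		lj = tg->cfs_rq[cpu_j]->avg.load_avg;
		imbalance += li > lj ? li - lj : lj - li;
	}
	rcu_read_unlock();

	return imbalance;
}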

Tim

2019-08-08 18:01:10

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/7/19 11:47 PM, Aaron Lu wrote:
> On Tue, Aug 06, 2019 at 02:19:57PM -0700, Tim Chen wrote:
>> +void account_core_idletime(struct task_struct *p, u64 exec)
>> +{
>> + const struct cpumask *smt_mask;
>> + struct rq *rq;
>> + bool force_idle, refill;
>> + int i, cpu;
>> +
>> + rq = task_rq(p);
>> + if (!sched_core_enabled(rq) || !p->core_cookie)
>> + return;
>
> I don't see why return here for untagged task. Untagged task can also
> preempt tagged task and force a CPU thread enter idle state.
> Untagged is just another tag to me, unless we want to allow untagged
> task to coschedule with a tagged task.

You are right. This needs to be fixed.

And the cookie check will also need to be changed in prio_less_fair.

@@ -611,6 +611,17 @@ bool prio_less_fair(struct task_struct *a, struct task_struct *b)
* Normalize the vruntime if tasks are in different cpus.
*/
if (task_cpu(a) != task_cpu(b)) {
+
+ if (a->core_cookie && b->core_cookie &&
+ a->core_cookie != b->core_cookie) {

if (!cookie_match(a, b))

+ /*
+ * Will be force idling one thread,
+ * pick the thread that has more allowance.
+ */
+ return (task_rq(a)->core_idle_allowance <=
+ task_rq(b)->core_idle_allowance) ? true : false;
+ }
+

I'll respin my patches.

Tim

2019-08-08 21:44:43

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/8/19 10:27 AM, Tim Chen wrote:
> On 8/7/19 11:47 PM, Aaron Lu wrote:
>> On Tue, Aug 06, 2019 at 02:19:57PM -0700, Tim Chen wrote:
>>> +void account_core_idletime(struct task_struct *p, u64 exec)
>>> +{
>>> + const struct cpumask *smt_mask;
>>> + struct rq *rq;
>>> + bool force_idle, refill;
>>> + int i, cpu;
>>> +
>>> + rq = task_rq(p);
>>> + if (!sched_core_enabled(rq) || !p->core_cookie)
>>> + return;
>>
>> I don't see why return here for untagged task. Untagged task can also
>> preempt tagged task and force a CPU thread enter idle state.
>> Untagged is just another tag to me, unless we want to allow untagged
>> task to coschedule with a tagged task.
>
> You are right. This needs to be fixed.
>

Here's the updated patchset, including Aaron's fix and also
added accounting of force idle time by deadline and rt tasks.

Tim

-----------------patch 1----------------------
From 730dbb125f5f67c75f97f6be154d382767810f8b Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Thu, 8 Aug 2019 08:57:46 -0700
Subject: [PATCH 1/3 v2] sched: Fix incorrect rq tagged as forced idle

Incorrect run queue was tagged as forced idle.
Tag the correct one.

Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e3cd9cb17809..50453e1329f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3903,7 +3903,7 @@ next_class:;
WARN_ON_ONCE(!rq_i->core_pick);

if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
- rq->core_forceidle = true;
+ rq_i->core_forceidle = true;

rq_i->core_pick->core_occupation = occ;

--
2.20.1

--------------patch 2------------------------
From 263ceeb40843b8ca3a91f1b268bec2b836d4986b Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Wed, 24 Jul 2019 13:58:18 -0700
Subject: [PATCH 2/3 v2] sched: Move sched fair prio comparison to fair.c

Consolidate the task priority comparison of the fair class into
fair.c. This is a simple code reorganization with no functional
changes.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 21 ++-------------------
kernel/sched/fair.c | 21 +++++++++++++++++++++
kernel/sched/sched.h | 1 +
3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 50453e1329f3..0f893853766c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);

- if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 a_vruntime = a->se.vruntime;
- u64 b_vruntime = b->se.vruntime;
-
- /*
- * Normalize the vruntime if tasks are in different cpus.
- */
- if (task_cpu(a) != task_cpu(b)) {
- b_vruntime -= task_cfs_rq(b)->min_vruntime;
- b_vruntime += task_cfs_rq(a)->min_vruntime;
-
- trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
- a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
- b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
- }
-
- return !((s64)(a_vruntime - b_vruntime) <= 0);
- }
+ if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+ return prio_less_fair(a, b);

return false;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02bff10237d4..e289b6e1545b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -602,6 +602,27 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
return delta;
}

+bool prio_less_fair(struct task_struct *a, struct task_struct *b)
+{
+ u64 a_vruntime = a->se.vruntime;
+ u64 b_vruntime = b->se.vruntime;
+
+ /*
+ * Normalize the vruntime if tasks are in different cpus.
+ */
+ if (task_cpu(a) != task_cpu(b)) {
+ b_vruntime -= task_cfs_rq(b)->min_vruntime;
+ b_vruntime += task_cfs_rq(a)->min_vruntime;
+
+ trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+ a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+ b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+ }
+
+ return !((s64)(a_vruntime - b_vruntime) <= 0);
+}
+
/*
* The idea is to set a period in which each task runs once.
*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..bdabe7ce1152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
}

extern void queue_core_balance(struct rq *rq);
+extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);

#else /* !CONFIG_SCHED_CORE */

--
2.20.1

------------------------patch 3---------------------------
From 5318e23c741a832140effbaf2f79fdf4b08f883c Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Tue, 6 Aug 2019 12:50:45 -0700
Subject: [PATCH 3/3 v2] sched: Enforce fairness between cpu threads

A CPU thread could be suppressed by its sibling for an extended time.
Implement a budget for force idling so that all CPU threads have an
equal chance to run.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 43 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/deadline.c | 1 +
kernel/sched/fair.c | 11 +++++++++++
kernel/sched/rt.c | 1 +
kernel/sched/sched.h | 4 ++++
5 files changed, 60 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f893853766c..de83dcb84495 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -207,6 +207,46 @@ static struct task_struct *sched_core_next(struct task_struct *p, unsigned long
return p;
}

+void account_core_idletime(struct task_struct *p, u64 exec)
+{
+ const struct cpumask *smt_mask;
+ struct rq *rq;
+ bool force_idle, refill;
+ int i, cpu;
+
+ rq = task_rq(p);
+ if (!sched_core_enabled(rq))
+ return;
+
+ cpu = task_cpu(p);
+ force_idle = false;
+ refill = true;
+ smt_mask = cpu_smt_mask(cpu);
+
+ for_each_cpu(i, smt_mask) {
+ if (cpu == i || cpu_is_offline(i))
+ continue;
+
+ if (cpu_rq(i)->core_forceidle)
+ force_idle = true;
+
+ /* Only refill if everyone has run out of allowance */
+ if (cpu_rq(i)->core_idle_allowance > 0)
+ refill = false;
+ }
+
+ if (force_idle)
+ rq->core_idle_allowance -= (s64) exec;
+
+ if (rq->core_idle_allowance < 0 && refill) {
+ for_each_cpu(i, smt_mask) {
+ if (cpu_is_offline(i))
+ continue;
+ cpu_rq(i)->core_idle_allowance += (s64) SCHED_IDLE_ALLOWANCE;
+ }
+ }
+}
+
/*
* The static-key + stop-machine variable are needed such that:
*
@@ -273,6 +313,8 @@ void sched_core_put(void)

static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void account_core_idletime(struct task_struct *p, u64 exec) { }

#endif /* CONFIG_SCHED_CORE */

@@ -6773,6 +6815,7 @@ void __init sched_init(void)
rq->core_enabled = 0;
rq->core_tree = RB_ROOT;
rq->core_forceidle = false;
+ rq->core_idle_allowance = (s64) SCHED_IDLE_ALLOWANCE;

rq->core_cookie = 0UL;
#endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 64fc444f44f9..684c64a95ec7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1175,6 +1175,7 @@ static void update_curr_dl(struct rq *rq)

curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
+ account_core_idletime(curr, delta_exec);

curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e289b6e1545b..f65270784c28 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -611,6 +611,16 @@ bool prio_less_fair(struct task_struct *a, struct task_struct *b)
* Normalize the vruntime if tasks are in different cpus.
*/
if (task_cpu(a) != task_cpu(b)) {
+
+ if (a->core_cookie != b->core_cookie) {
+ /*
+ * Will be force idling one thread,
+ * pick the thread that has more allowance.
+ */
+ return (task_rq(a)->core_idle_allowance <
+ task_rq(b)->core_idle_allowance) ? true : false;
+ }
+
b_vruntime -= task_cfs_rq(b)->min_vruntime;
b_vruntime += task_cfs_rq(a)->min_vruntime;

@@ -817,6 +827,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cgroup_account_cputime(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
+ account_core_idletime(curtask, delta_exec);
}

account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 81557224548c..6f18e1455778 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -971,6 +971,7 @@ static void update_curr_rt(struct rq *rq)

curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
+ account_core_idletime(curr, delta_exec);

curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bdabe7ce1152..927334b2078c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -963,6 +963,7 @@ struct rq {
struct task_struct *core_pick;
unsigned int core_enabled;
unsigned int core_sched_seq;
+ s64 core_idle_allowance;
struct rb_root core_tree;
bool core_forceidle;

@@ -999,6 +1000,8 @@ static inline int cpu_of(struct rq *rq)
}

#ifdef CONFIG_SCHED_CORE
+#define SCHED_IDLE_ALLOWANCE 5000000 /* 5 msec */
+
DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);

static inline bool sched_core_enabled(struct rq *rq)
@@ -1016,6 +1019,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)

extern void queue_core_balance(struct rq *rq);
extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);
+extern void account_core_idletime(struct task_struct *p, u64 exec);

#else /* !CONFIG_SCHED_CORE */

--
2.20.1

2019-08-10 14:19:05

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Aug 08, 2019 at 02:42:57PM -0700, Tim Chen wrote:
> On 8/8/19 10:27 AM, Tim Chen wrote:
> > On 8/7/19 11:47 PM, Aaron Lu wrote:
> >> On Tue, Aug 06, 2019 at 02:19:57PM -0700, Tim Chen wrote:
> >>> +void account_core_idletime(struct task_struct *p, u64 exec)
> >>> +{
> >>> + const struct cpumask *smt_mask;
> >>> + struct rq *rq;
> >>> + bool force_idle, refill;
> >>> + int i, cpu;
> >>> +
> >>> + rq = task_rq(p);
> >>> + if (!sched_core_enabled(rq) || !p->core_cookie)
> >>> + return;
> >>
> >> I don't see why return here for untagged task. Untagged task can also
> >> preempt tagged task and force a CPU thread enter idle state.
> >> Untagged is just another tag to me, unless we want to allow untagged
> >> task to coschedule with a tagged task.
> >
> > You are right. This needs to be fixed.
> >
>
> Here's the updated patchset, including Aaron's fix and also
> added accounting of force idle time by deadline and rt tasks.

I have two other small changes that I think are worth sending out.

The first simplifies logic in pick_task() and the 2nd avoids the task
pick all over again when max is preempted. I also refined the previous hack patch to
make schedule always happen only for root cfs rq. Please see below for
details, thanks.

patch1:

From cea56db35fe9f393c357cdb1bdcb2ef9b56cfe97 Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Mon, 5 Aug 2019 21:21:25 +0800
Subject: [PATCH 1/3] sched/core: simplify pick_task()

No need to special-case the !cookie case in pick_task(); we just need
to make it possible for sched_core_find() to return idle for a !cookie
query. And cookie_pick will always have lower priority than class_pick,
so remove the redundant check of prio_less(cookie_pick, class_pick).

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/core.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90655c9ad937..84fec9933b74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,6 +186,8 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
* The idle task always matches any cookie!
*/
match = idle_sched_class.pick_task(rq);
+ if (!cookie)
+ goto out;

while (node) {
node_task = container_of(node, struct task_struct, core_node);
@@ -199,7 +201,7 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
node = node->rb_left;
}
}
-
+out:
return match;
}

@@ -3657,18 +3659,6 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
if (!class_pick)
return NULL;

- if (!cookie) {
- /*
- * If class_pick is tagged, return it only if it has
- * higher priority than max.
- */
- if (max && class_pick->core_cookie &&
- prio_less(class_pick, max))
- return idle_sched_class.pick_task(rq);
-
- return class_pick;
- }
-
/*
* If class_pick is idle or matches cookie, return early.
*/
@@ -3682,8 +3672,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
* the core (so far) and it must be selected, otherwise we must go with
* the cookie pick in order to satisfy the constraint.
*/
- if (prio_less(cookie_pick, class_pick) &&
- (!max || prio_less(max, class_pick)))
+ if (!max || prio_less(max, class_pick))
return class_pick;

return cookie_pick;
--
2.19.1.3.ge56e4f7

patch2:

From 487950dc53a40d5c566602f775ce46a0bab7a412 Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Fri, 9 Aug 2019 14:48:01 +0800
Subject: [PATCH 2/3] sched/core: no need to pick again after max is preempted

When a sibling's task preempts the current max, there is no need to do
the pick all over again - the preempted cpu can just pick idle and be
done.

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/core.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84fec9933b74..e88583860abe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3756,7 +3756,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* order.
*/
for_each_class(class) {
-again:
for_each_cpu_wrap(i, smt_mask, cpu) {
struct rq *rq_i = cpu_rq(i);
struct task_struct *p;
@@ -3828,10 +3827,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (j == i)
continue;

- cpu_rq(j)->core_pick = NULL;
+ cpu_rq(j)->core_pick = idle_sched_class.pick_task(cpu_rq(j));
}
occ = 1;
- goto again;
+ goto out;
} else {
/*
* Once we select a task for a cpu, we
@@ -3846,7 +3845,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
next_class:;
}
-
+out:
rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
rq->core_sched_seq = rq->core->core_pick_seq;
--
2.19.1.3.ge56e4f7

patch3:

From 2d396d99e0dd7157b0b4f7a037c8b84ed135ea56 Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Thu, 25 Jul 2019 19:57:21 +0800
Subject: [PATCH 3/3] sched/fair: make tick based schedule always happen

When a hyperthread is forced idle and the other hyperthread has a single
CPU intensive task running, the running task can occupy the hyperthread
for a long time with no scheduling point and starve the other
hyperthread.

Fix this temporarily by always checking if the task has exceeded its
timeslice and, if so, doing a reschedule for the root cfs_rq.

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/fair.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26d29126d6a5..b1f0defdad91 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4011,6 +4011,9 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
return;
}

+ if (cfs_rq->nr_running <= 1)
+ return;
+
/*
* Ensure that a task that missed wakeup preemption by a
* narrow margin doesn't have to wait for a full slice.
@@ -4179,7 +4182,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
return;
#endif

- if (cfs_rq->nr_running > 1)
+ if (cfs_rq->nr_running > 1 || cfs_rq->tg == &root_task_group)
check_preempt_tick(cfs_rq, curr);
}

--
2.19.1.3.ge56e4f7

2019-08-10 14:20:39

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Aug 08, 2019 at 09:39:45AM -0700, Tim Chen wrote:
> On 8/8/19 5:55 AM, Aaron Lu wrote:
> > On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> >> On 8/2/19 8:37 AM, Julien Desfossez wrote:
> >>> We tested both Aaron's and Tim's patches and here are our results.
>
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 26fea68f7f54..542974a8da18 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3888,7 +3888,7 @@ next_class:;
> > WARN_ON_ONCE(!rq_i->core_pick);
> >
> > if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> > - rq->core_forceidle = true;
> > + rq_i->core_forceidle = true;
>
> Good catch!
>
> >
> > rq_i->core_pick->core_occupation = occ;
> >
> > With this fixed and together with the patch to let schedule always
> > happen, your latest 2 patches work well for the 10s cpuhog test I
> > described previously:
> > https://lore.kernel.org/lkml/20190725143003.GA992@aaronlu/
>
> That's encouraging. You are talking about my patches
> that try to keep the force idle time between sibling threads
> balanced, right?

Yes.

> >
> > overloaded workload without any cpu binding doesn't work well though, I
> > haven't taken a closer look yet.
> >
>
> I think we need a load balancing scheme among the cores that will try
> to minimize force idle.

Agree.

>
> One possible metric to measure load compatibility imbalance that leads to
> force idle is
>
> Say i, j are sibling threads of a cpu core
> imbalanace = \sum_tagged_cgroup abs(Load_cgroup_cpui - Load_cgroup_cpuj)
>
> This gives us a metric to decide if migrating a task will improve
> load compatability imbalance. As we already track cgroup load on a CPU,
> it should be doable without adding too much overhead.

2019-08-12 15:40:06

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> I have two other small changes that I think are worth sending out.
>
> The first simplify logic in pick_task() and the 2nd avoid task pick all
> over again when max is preempted. I also refined the previous hack patch to
> make schedule always happen only for root cfs rq. Please see below for
> details, thanks.
>
I see a potential issue here. With the simplification in pick_task,
you might introduce a livelock where the match logic spins forever.
But you avoid that with patch 2, by removing the loop if a pick
preempts max. The potential problem is that you miss a case where
the newly picked task might have a match on the sibling on which max
was selected before. By selecting idle, you ignore the potential match.
As of now, the potential match check does not really work because
sched_core_find will always return the same task and we do not check
the whole core_tree for a next match. It is on my TODO list to have
sched_core_find return the best next match if the match was preempted,
but that is a bit complex and needs more thought.

Third patch looks good to me.

Thanks,
Vineeth

2019-08-13 02:27:35

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 2019/8/12 23:38, Vineeth Remanan Pillai wrote:
>> I have two other small changes that I think are worth sending out.
>>
>> The first simplify logic in pick_task() and the 2nd avoid task pick all
>> over again when max is preempted. I also refined the previous hack patch to
>> make schedule always happen only for root cfs rq. Please see below for
>> details, thanks.
>>
> I see a potential issue here. With the simplification in pick_task,
> you might introduce a livelock where the match logic spins for ever.
> But you avoid that with the patch 2, by removing the loop if a pick
> preempts max. The potential problem is that, you miss a case where
> the newly picked task might have a match in the sibling on which max
> was selected before. By selecting idle, you ignore the potential match.

Oh that's right, I missed this.

> As of now, the potential match check does not really work because,
> sched_core_find will always return the same task and we do not check
> the whole core_tree for a next match. This is in my TODO list to have
> sched_core_find to return the best next match, if match was preempted.
> But its a bit complex and needs more thought.

Sounds worth doing :-)

2019-08-15 16:10:44

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, 2019-08-07 at 10:10 -0700, Tim Chen wrote:
> On 8/7/19 1:58 AM, Dario Faggioli wrote:
>
> > Since I see that, in this thread, there are various patches being
> > proposed and discussed... should I rerun my benchmarks with them
> > applied? If yes, which ones? And is there, by any chance, one (or
> > maybe
> > more than one) updated git branch(es)?
> >
> Hi Dario,
>
Hi Tim!

> Having an extra set of eyes are certainly welcomed.
> I'll give my 2 cents on the issues with v3.
>
Ok, and thanks a lot for this.

> 1) Unfairness between the sibling threads
> -----------------------------------------
> One sibling thread could be suppressing and force idling
> the sibling thread over proportionally. Resulting in
> the force idled CPU not getting run and stall tasks on
> suppressed CPU.
>
>
> [...]
>
> 2) Not rescheduling forced idled CPU
> ------------------------------------
> The forced idled CPU does not get a chance to re-schedule
> itself, and will stall for a long time even though it
> has eligible tasks to run.
>
> [...]
>
> 3) Load balancing between CPU cores
> -----------------------------------
> Say if one CPU core's sibling threads get forced idled
> a lot as it has mostly incompatible tasks between the siblings,
> moving the incompatible load to other cores and pulling
> compatible load to the core could help CPU utilization.
>
> So just considering the load of a task is not enough during
> load balancing, task compatibility also needs to be considered.
> Peter has put in mechanisms to balance compatible tasks between
> CPU thread siblings, but not across cores.
>
> [...]
>
Ok. Yes, as said, I've been trying to follow the thread, but thanks a
lot again for this summary.

As said, I'm about to have numbers for the repo/branch I mentioned.

I was considering whether to also re-run the benchmarking campaign with
some of the patches that floated around within this thread. Now, thanks
to your summary, I have an even clearer picture about which patch does
what, and that is indeed very useful.

I'll see about putting something together. I'm thinking of picking:

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/

And maybe even (part of):
https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/#t

If anyone has ideas or suggestions about whether or not this choice
makes sense, feel free to share. :-)

Also, I only have another week before leaving, so let's see what I
manage to actually run, and then share here, by then.

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-08-16 02:35:49

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Aug 15, 2019 at 06:09:28PM +0200, Dario Faggioli wrote:
> On Wed, 2019-08-07 at 10:10 -0700, Tim Chen wrote:
> > On 8/7/19 1:58 AM, Dario Faggioli wrote:
> >
> > > Since I see that, in this thread, there are various patches being
> > > proposed and discussed... should I rerun my benchmarks with them
> > > applied? If yes, which ones? And is there, by any chance, one (or
> > > maybe
> > > more than one) updated git branch(es)?
> > >
> > Hi Dario,
> >
> Hi Tim!
>
> > Having an extra set of eyes are certainly welcomed.
> > I'll give my 2 cents on the issues with v3.
> >
> Ok, and thanks a lot for this.
>
> > 1) Unfairness between the sibling threads
> > -----------------------------------------
> > One sibling thread could be suppressing and force idling
> > the sibling thread over proportionally. Resulting in
> > the force idled CPU not getting run and stall tasks on
> > suppressed CPU.
> >
> >
> > [...]
> >
> > 2) Not rescheduling forced idled CPU
> > ------------------------------------
> > The forced idled CPU does not get a chance to re-schedule
> > itself, and will stall for a long time even though it
> > has eligible tasks to run.
> >
> > [...]
> >
> > 3) Load balancing between CPU cores
> > -----------------------------------
> > Say if one CPU core's sibling threads get forced idled
> > a lot as it has mostly incompatible tasks between the siblings,
> > moving the incompatible load to other cores and pulling
> > compatible load to the core could help CPU utilization.
> >
> > So just considering the load of a task is not enough during
> > load balancing, task compatibility also needs to be considered.
> > Peter has put in mechanisms to balance compatible tasks between
> > CPU thread siblings, but not across cores.
> >
> > [...]
> >
> Ok. Yes, as said, I've been trying to follow the thread, but thanks a
> lot again for this summary.
>
> As said, I'm about to have numbers for the repo/branch I mentioned.
>
> I was considering whether to also re-run the benchmarking campaign with
> some of the patches that floated around within this thread. Now, thanks
> to your summary, I have an even clearer picture about which patch does
> what, and that is indeed very useful.
>
> I'll see about putting something together. I'm thinking of picking:
>
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/
>
> And maybe even (part of):
> https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/#t
>
> If anyone has ideas or suggestions about whether or not this choice
> makes sense, feel free to share. :-)

Makes sense to me.
patch3 in the last link is slightly better than the one in the 2nd link,
so just use that instead.

Thanks,
Aaron

> Also, I only have another week before leaving, so let's see what I
> manage to actually run, and then share here, by then.
>
> Thanks and Regards
> --
> Dario Faggioli, Ph.D
> http://about.me/dario.faggioli
> Virtualization Software Engineer
> SUSE Labs, SUSE https://www.suse.com/
> -------------------------------------------------------------------
> <<This happens because _I_ choose it to happen!>> (Raistlin Majere)
>


2019-08-26 17:16:04

by mark gross

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task()

On Wed, May 29, 2019 at 08:36:45PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <[email protected]>
>
> Because sched_class::pick_next_task() also implies
> sched_class::set_next_task() (and possibly put_prev_task() and
> newidle_balance) it is not state invariant. This makes it unsuitable
> for remote task selection.
It would be cool if the commit comment would explain what the change is going
to do about pick_next_task being unsuitable for remote task selection.

>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
> Signed-off-by: Julien Desfossez <[email protected]>
> ---
>
> Chnages in v3
> -------------
> - Minor refactor to remove redundant NULL checks
>
> Changes in v2
> -------------
> - Fixes a NULL pointer dereference crash
> - Subhra Mazumdar
> - Tim Chen
>
> ---
> kernel/sched/deadline.c | 21 ++++++++++++++++-----
> kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++---
> kernel/sched/idle.c | 10 +++++++++-
> kernel/sched/rt.c | 21 ++++++++++++++++-----
> kernel/sched/sched.h | 2 ++
> kernel/sched/stop_task.c | 21 ++++++++++++++++-----
> 6 files changed, 92 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d3904168857a..64fc444f44f9 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1722,15 +1722,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
> return rb_entry(left, struct sched_dl_entity, rb_node);
> }
>
> -static struct task_struct *
> -pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +static struct task_struct *pick_task_dl(struct rq *rq)
> {
> struct sched_dl_entity *dl_se;
> struct task_struct *p;
> struct dl_rq *dl_rq;
>
> - WARN_ON_ONCE(prev || rf);
> -
> dl_rq = &rq->dl;
>
> if (unlikely(!dl_rq->dl_nr_running))
> @@ -1741,7 +1738,19 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> p = dl_task_of(dl_se);
>
> - set_next_task_dl(rq, p);
> + return p;
> +}
> +
> +static struct task_struct *
> +pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> + struct task_struct *p;
> +
> + WARN_ON_ONCE(prev || rf);
What is an admin to do with this warning if it shows up in their logs?
Maybe include some text here to help folks that might hit this warn_on.


> +
> + p = pick_task_dl(rq);
> + if (p)
> + set_next_task_dl(rq, p);
>
> return p;
> }
> @@ -2388,6 +2397,8 @@ const struct sched_class dl_sched_class = {
> .set_next_task = set_next_task_dl,
>
> #ifdef CONFIG_SMP
> + .pick_task = pick_task_dl,
> +
> .select_task_rq = select_task_rq_dl,
> .migrate_task_rq = migrate_task_rq_dl,
> .set_cpus_allowed = set_cpus_allowed_dl,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e65f2dfda77a..02e5dfb85e7d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4136,7 +4136,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> * Avoid running the skip buddy, if running something else can
> * be done without getting too unfair.
> */
> - if (cfs_rq->skip == se) {
> + if (cfs_rq->skip && cfs_rq->skip == se) {
> struct sched_entity *second;
>
> if (se == curr) {
> @@ -4154,13 +4154,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> /*
> * Prefer last buddy, try to return the CPU to a preempted task.
> */
> - if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> + if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> se = cfs_rq->last;
>
> /*
> * Someone really wants this to run. If it's not unfair, run it.
> */
> - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> + if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> se = cfs_rq->next;
>
> clear_buddies(cfs_rq, se);
> @@ -6966,6 +6966,34 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> set_last_buddy(se);
> }
>
> +static struct task_struct *
> +pick_task_fair(struct rq *rq)
> +{
> + struct cfs_rq *cfs_rq = &rq->cfs;
> + struct sched_entity *se;
> +
> + if (!cfs_rq->nr_running)
> + return NULL;
> +
> + do {
> + struct sched_entity *curr = cfs_rq->curr;
> +
> + se = pick_next_entity(cfs_rq, NULL);
> +
> + if (curr) {
> + if (se && curr->on_rq)
> + update_curr(cfs_rq);
> +
> + if (!se || entity_before(curr, se))
> + se = curr;
> + }
> +
> + cfs_rq = group_cfs_rq(se);
> + } while (cfs_rq);
> +
> + return task_of(se);
> +}
> +
> static struct task_struct *
> pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
> @@ -10677,6 +10705,8 @@ const struct sched_class fair_sched_class = {
> .set_next_task = set_next_task_fair,
>
> #ifdef CONFIG_SMP
> + .pick_task = pick_task_fair,
> +
> .select_task_rq = select_task_rq_fair,
> .migrate_task_rq = migrate_task_rq_fair,
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 7ece8e820b5d..e7f38da60373 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -373,6 +373,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
> resched_curr(rq);
> }
>
> +static struct task_struct *
> +pick_task_idle(struct rq *rq)
> +{
> + return rq->idle;
> +}
> +
> static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
> }
> @@ -386,11 +392,12 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
> static struct task_struct *
> pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
> - struct task_struct *next = rq->idle;
> + struct task_struct *next;
>
> if (prev)
> put_prev_task(rq, prev);
>
> + next = pick_task_idle(rq);
> set_next_task_idle(rq, next);
>
> return next;
> @@ -458,6 +465,7 @@ const struct sched_class idle_sched_class = {
> .set_next_task = set_next_task_idle,
>
> #ifdef CONFIG_SMP
> + .pick_task = pick_task_idle,
> .select_task_rq = select_task_rq_idle,
> .set_cpus_allowed = set_cpus_allowed_common,
> #endif
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 79f2e60516ef..81557224548c 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1548,20 +1548,29 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
> return rt_task_of(rt_se);
> }
>
> -static struct task_struct *
> -pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +static struct task_struct *pick_task_rt(struct rq *rq)
> {
> struct task_struct *p;
> struct rt_rq *rt_rq = &rq->rt;
>
> - WARN_ON_ONCE(prev || rf);
> -
> if (!rt_rq->rt_queued)
> return NULL;
>
> p = _pick_next_task_rt(rq);
>
> - set_next_task_rt(rq, p);
> + return p;
> +}
> +
> +static struct task_struct *
> +pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> + struct task_struct *p;
> +
> + WARN_ON_ONCE(prev || rf);
What does it mean to an admin if this warn_on goes off?

> +
> + p = pick_task_rt(rq);
> + if (p)
> + set_next_task_rt(rq, p);
>
> return p;
> }
> @@ -2364,6 +2373,8 @@ const struct sched_class rt_sched_class = {
> .set_next_task = set_next_task_rt,
>
> #ifdef CONFIG_SMP
> + .pick_task = pick_task_rt,
> +
> .select_task_rq = select_task_rq_rt,
>
> .set_cpus_allowed = set_cpus_allowed_common,
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 460dd04e76af..a024dd80eeb3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1682,6 +1682,8 @@ struct sched_class {
> void (*set_next_task)(struct rq *rq, struct task_struct *p);
>
> #ifdef CONFIG_SMP
> + struct task_struct * (*pick_task)(struct rq *rq);
> +
> int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
>
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 7e1cee4e65b2..fb6c436cba6c 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -29,20 +29,30 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
> }
>
> static struct task_struct *
> -pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +pick_task_stop(struct rq *rq)
> {
> struct task_struct *stop = rq->stop;
>
> - WARN_ON_ONCE(prev || rf);
> -
> if (!stop || !task_on_rq_queued(stop))
> return NULL;
>
> - set_next_task_stop(rq, stop);
> -
> return stop;
> }
>
> +static struct task_struct *
> +pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> + struct task_struct *p;
> +
> + WARN_ON_ONCE(prev || rf);
> +
> + p = pick_task_stop(rq);
> + if (p)
> + set_next_task_stop(rq, p);
> +
> + return p;
> +}
> +
> static void
> enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
> {
> @@ -129,6 +139,7 @@ const struct sched_class stop_sched_class = {
> .set_next_task = set_next_task_stop,
>
> #ifdef CONFIG_SMP
> + .pick_task = pick_task_stop,
> .select_task_rq = select_task_rq_stop,
> .set_cpus_allowed = set_cpus_allowed_common,
> #endif
> --
> 2.17.1
>

2019-08-27 21:15:49

by Matthew Garrett

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

Apple have provided a sysctl that allows applications to indicate that
specific threads should make use of core isolation while allowing
the rest of the system to make use of SMT, and browsers (Safari, Firefox
and Chrome, at least) are now making use of this. Trying to do something
similar using cgroups seems a bit awkward. Would something like this be
reasonable? Having spoken to the Chrome team, I believe that the
semantics we want are:

1) A thread to be able to indicate that it should not run on the same
core as anything not in possession of the same cookie
2) Descendants of that thread to (by default) have the same cookie
3) No other thread be able to obtain the same cookie
4) Threads not be able to rejoin the global group (ie, threads can
segregate themselves from their parent and peers, but can never rejoin
that group once segregated)

but don't know if that's what everyone else would want.

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 094bb03b9cc2..5d411246d4d5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -229,4 +229,5 @@ struct prctl_mm_map {
# define PR_PAC_APDBKEY (1UL << 3)
# define PR_PAC_APGAKEY (1UL << 4)

+#define PR_CORE_ISOLATE 55
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 12df0e5434b8..a054cfcca511 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = PAC_RESET_KEYS(me, arg2);
break;
+ case PR_CORE_ISOLATE:
+#ifdef CONFIG_SCHED_CORE
+ current->core_cookie = (unsigned long)current;
+#else
+ error = -EINVAL;
+#endif
+ break;
default:
error = -EINVAL;
break;
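
For illustration, a thread would then be able to opt in from userspace
along these lines (a rough sketch: PR_CORE_ISOLATE is the value from the
diff above, everything else is just an example):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_CORE_ISOLATE
#define PR_CORE_ISOLATE 55	/* value proposed in the diff above */
#endif

int main(void)
{
	/*
	 * Request a private core cookie for this thread so that, under the
	 * proposed semantics, it and its future descendants never share a
	 * core with tasks holding a different (or no) cookie.
	 */
	if (prctl(PR_CORE_ISOLATE, 0, 0, 0, 0) != 0) {
		perror("prctl(PR_CORE_ISOLATE)");
		return 1;
	}

	return 0;
}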


--
Matthew Garrett | [email protected]

2019-08-27 21:52:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
> Apple have provided a sysctl that allows applications to indicate that
> specific threads should make use of core isolation while allowing
> the rest of the system to make use of SMT, and browsers (Safari, Firefox
> and Chrome, at least) are now making use of this. Trying to do something
> similar using cgroups seems a bit awkward. Would something like this be
> reasonable?

Sure; like I wrote earlier; I only did the cgroup thing because I was
lazy and it was the easiest interface to hack on in a hurry.

The rest of the ABI nonsense can 'trivially' be done later; if when we
decide to actually do this.

And given MDS, I'm still not entirely convinced it all makes sense. If
it were just L1TF, then yes, but now...

> Having spoken to the Chrome team, I believe that the
> semantics we want are:
>
> 1) A thread to be able to indicate that it should not run on the same
> core as anything not in posession of the same cookie
> 2) Descendents of that thread to (by default) have the same cookie
> 3) No other thread be able to obtain the same cookie
> 4) Threads not be able to rejoin the global group (ie, threads can
> segregate themselves from their parent and peers, but can never rejoin
> that group once segregated)
>
> but don't know if that's what everyone else would want.
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 094bb03b9cc2..5d411246d4d5 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -229,4 +229,5 @@ struct prctl_mm_map {
> # define PR_PAC_APDBKEY (1UL << 3)
> # define PR_PAC_APGAKEY (1UL << 4)
>
> +#define PR_CORE_ISOLATE 55
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 12df0e5434b8..a054cfcca511 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> return -EINVAL;
> error = PAC_RESET_KEYS(me, arg2);
> break;
> + case PR_CORE_ISOLATE:
> +#ifdef CONFIG_SCHED_CORE
> + current->core_cookie = (unsigned long)current;

This needs to then also force a reschedule of current. And there's the
little issue of what happens if 'current' dies while its children live
on, and current gets re-used for a new process and does this again.

> +#else
> + result = -EINVAL;
> +#endif
> + break;
> default:
> error = -EINVAL;
> break;
>
>
> --
> Matthew Garrett | [email protected]

2019-08-27 23:25:31

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Aug 28, 2019 at 5:14 AM Matthew Garrett <[email protected]> wrote:
>
> Apple have provided a sysctl that allows applications to indicate that
> specific threads should make use of core isolation while allowing
> the rest of the system to make use of SMT, and browsers (Safari, Firefox
> and Chrome, at least) are now making use of this. Trying to do something
> similar using cgroups seems a bit awkward. Would something like this be
> reasonable? Having spoken to the Chrome team, I believe that the
> semantics we want are:
>
> 1) A thread to be able to indicate that it should not run on the same
> core as anything not in posession of the same cookie
> 2) Descendents of that thread to (by default) have the same cookie
> 3) No other thread be able to obtain the same cookie
> 4) Threads not be able to rejoin the global group (ie, threads can
> segregate themselves from their parent and peers, but can never rejoin
> that group once segregated)
>
> but don't know if that's what everyone else would want.
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 094bb03b9cc2..5d411246d4d5 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -229,4 +229,5 @@ struct prctl_mm_map {
> # define PR_PAC_APDBKEY (1UL << 3)
> # define PR_PAC_APGAKEY (1UL << 4)
>
> +#define PR_CORE_ISOLATE 55
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 12df0e5434b8..a054cfcca511 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> return -EINVAL;
> error = PAC_RESET_KEYS(me, arg2);
> break;
> + case PR_CORE_ISOLATE:
> +#ifdef CONFIG_SCHED_CORE
> + current->core_cookie = (unsigned long)current;

Because AVX512 instructions could pull down the core frequency,
we also want to give a magic cookie number to all AVX512-using
tasks on the system, so they won't affect the performance/latency
of any other tasks.

This could be done by putting all AVX512 tasks into a cgroup, or by
the AVX512 detection introduced by the following patch.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=2f7726f955572e587d5f50fbe9b2deed5334bd90

Thanks,
-Aubrey

2019-08-28 15:32:43

by Phil Auld

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
> > Apple have provided a sysctl that allows applications to indicate that
> > specific threads should make use of core isolation while allowing
> > the rest of the system to make use of SMT, and browsers (Safari, Firefox
> > and Chrome, at least) are now making use of this. Trying to do something
> > similar using cgroups seems a bit awkward. Would something like this be
> > reasonable?
>
> Sure; like I wrote earlier; I only did the cgroup thing because I was
> lazy and it was the easiest interface to hack on in a hurry.
>
> The rest of the ABI nonsense can 'trivially' be done later; if when we
> decide to actually do this.

I think something that allows the tag to be set may be needed. One of
the use cases for this is virtualization stacks, where you really want
to be able to keep the higher CPU count and to set up the isolation
from management processes on the host.

The current cgroup interface doesn't work for that because it doesn't
apply the tag to children. We've been unable to fully test it in a virt
setup because our VMs are made of a child cgroup per vcpu.

>
> And given MDS, I'm still not entirely convinced it all makes sense. If
> it were just L1TF, then yes, but now...

I was thinking MDS is really the reason for this. L1TF has mitigations but
the only current mitigation for MDS for smt is ... nosmt.

The current core scheduler implementation, I believe, still has
(theoretical?) holes involving interrupts; once/if those are closed it
may be even less attractive.

>
> > Having spoken to the Chrome team, I believe that the
> > semantics we want are:
> >
> > 1) A thread to be able to indicate that it should not run on the same
> > core as anything not in posession of the same cookie
> > 2) Descendents of that thread to (by default) have the same cookie
> > 3) No other thread be able to obtain the same cookie
> > 4) Threads not be able to rejoin the global group (ie, threads can
> > segregate themselves from their parent and peers, but can never rejoin
> > that group once segregated)
> >
> > but don't know if that's what everyone else would want.
> >
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 094bb03b9cc2..5d411246d4d5 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -229,4 +229,5 @@ struct prctl_mm_map {
> > # define PR_PAC_APDBKEY (1UL << 3)
> > # define PR_PAC_APGAKEY (1UL << 4)
> >
> > +#define PR_CORE_ISOLATE 55
> > #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 12df0e5434b8..a054cfcca511 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > return -EINVAL;
> > error = PAC_RESET_KEYS(me, arg2);
> > break;
> > + case PR_CORE_ISOLATE:
> > +#ifdef CONFIG_SCHED_CORE
> > + current->core_cookie = (unsigned long)current;
>
> This needs to then also force a reschedule of current. And there's the
> little issue of what happens if 'current' dies while its children live
> on, and current gets re-used for a new process and does this again.

sched_core_get() too?


Cheers,
Phil

>
> > +#else
> > + result = -EINVAL;
> > +#endif
> > + break;
> > default:
> > error = -EINVAL;
> > break;
> >
> >
> > --
> > Matthew Garrett | [email protected]

--

2019-08-28 16:00:49

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/27/19 2:50 PM, Peter Zijlstra wrote:
> On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
>> Apple have provided a sysctl that allows applications to indicate that
>> specific threads should make use of core isolation while allowing
>> the rest of the system to make use of SMT, and browsers (Safari, Firefox
>> and Chrome, at least) are now making use of this. Trying to do something
>> similar using cgroups seems a bit awkward. Would something like this be
>> reasonable?
>
> Sure; like I wrote earlier; I only did the cgroup thing because I was
> lazy and it was the easiest interface to hack on in a hurry.
>
> The rest of the ABI nonsense can 'trivially' be done later; if when we
> decide to actually do this.
>
> And given MDS, I'm still not entirely convinced it all makes sense. If
> it were just L1TF, then yes, but now...
>

For MDS, core-scheduler does prevent thread-to-thread attacks between
user space threads running on sibling CPU threads. Yes, it doesn't
prevent user-to-kernel attacks from a sibling, which will require
additional mitigation measures. However, it does block a major attack
vector for MDS if HT is enabled.

Tim

2019-08-28 16:03:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:

> > And given MDS, I'm still not entirely convinced it all makes sense. If
> > it were just L1TF, then yes, but now...
>
> I was thinking MDS is really the reason for this. L1TF has mitigations but
> the only current mitigation for MDS for smt is ... nosmt.

L1TF has no known mitigation that is SMT safe. The moment you have
something in your L1, the other sibling can read it using L1TF.

The nice thing about L1TF is that only (malicious) guests can exploit
it, and therefore the synchronization context is VMM. And it so happens
that VMEXITs are 'rare' (and already expensive and thus lots of effort
has already gone into avoiding them).

If you don't use VMs, you're good and SMT is not a problem.

If you do use VMs (and do/can not trust them), _then_ you need
core-scheduling; and in that case, the implementation under discussion
misses things like synchronization on VMEXITs due to interrupts and
things like that.

But under the assumption that VMs don't generate high scheduling rates,
it can work.

> The current core scheduler implementation, I believe, still has (theoretical?)
> holes involving interrupts, once/if those are closed it may be even less
> attractive.

No; so MDS leaks anything the other sibling (currently) does; this
makes _any_ privilege boundary a synchronization context.

Worse still, the exploit doesn't require a VM at all, any other task can
get to it.

That means you get to sync the siblings on lovely things like system
call entry and exit, along with VMM and anything else that one would
consider a privilege boundary. Now, system calls are not rare, they
are really quite common in fact. Trying to sync up siblings at the rate
of system calls is utter madness.

So under MDS, SMT is completely hosed. If you use VMs exclusively, then
it _might_ work because a 'pure' host doesn't schedule that often
(maybe, same assumption as for L1TF).

Now, there have been proposals of moving the privilege boundary further
into the kernel. Just like PTI exposes the entry stack and code to
Meltdown, the thinking is, let's expose more. By moving the priv boundary
the hope is that we can do lots of common system calls without having to
sync up -- lots of details are 'pending'.

2019-08-28 16:19:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Aug 28, 2019 at 08:59:21AM -0700, Tim Chen wrote:
> On 8/27/19 2:50 PM, Peter Zijlstra wrote:
> > On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
> >> Apple have provided a sysctl that allows applications to indicate that
> >> specific threads should make use of core isolation while allowing
> >> the rest of the system to make use of SMT, and browsers (Safari, Firefox
> >> and Chrome, at least) are now making use of this. Trying to do something
> >> similar using cgroups seems a bit awkward. Would something like this be
> >> reasonable?
> >
> > Sure; like I wrote earlier; I only did the cgroup thing because I was
> > lazy and it was the easiest interface to hack on in a hurry.
> >
> > The rest of the ABI nonsense can 'trivially' be done later; if when we
> > decide to actually do this.
> >
> > And given MDS, I'm still not entirely convinced it all makes sense. If
> > it were just L1TF, then yes, but now...
> >
>
> For MDS, core-scheduler does prevent thread to thread
> attack between user space threads running on sibling CPU threads.
> Yes, it doesn't prevent the user to kernel attack from sibling
> which will require additional mitigation measure. However, it does
> block a major attack vector for MDS if HT is enabled.

I'm not sure what your argument is; the dike has two holes; you plug
one, you still drown.

2019-08-28 16:40:14

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/28/19 9:01 AM, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
>> On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
>
>> The current core scheduler implementation, I believe, still has (theoretical?)
>> holes involving interrupts, once/if those are closed it may be even less
>> attractive.
>
> No; so MDS leaks anything the other sibling (currently) does, this makes
> _any_ privilidge boundary a synchronization context.
>
> Worse still, the exploit doesn't require a VM at all, any other task can
> get to it.
>
> That means you get to sync the siblings on lovely things like system
> call entry and exit, along with VMM and anything else that one would
> consider a privilidge boundary. Now, system calls are not rare, they
> are really quite common in fact. Trying to sync up siblings at the rate
> of system calls is utter madness.
>
> So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> it _might_ work because a 'pure' host doesn't schedule that often
> (maybe, same assumption as for L1TF).
>
> Now, there have been proposals of moving the privilidge boundary further
> into the kernel. Just like PTI exposes the entry stack and code to
> Meltdown, the thinking is, lets expose more. By moving the priv boundary
> the hope is that we can do lots of common system calls without having to
> sync up -- lots of details are 'pending'.
>

If we are willing to consider the idea that we sync with the sibling
only if we touch potential user data, then a significant portion of
syscalls may not need to sync. Yeah, it still sucks because of the
complexity added to audit all the places in the kernel that may touch
privileged data and require synchronization.

I did a prototype (without core sched); the kernel build slows by 2.5%.
So this use case still seems reasonable.

A worst case scenario is concurrent SMT FIO writes to an encrypted
file, which involve a lot of synchronization due to crypto's extended
access to privileged data; there we slow down by 9%.

Tim

2019-08-29 14:33:54

by Phil Auld

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Aug 28, 2019 at 06:01:14PM +0200 Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> > On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
>
> > > And given MDS, I'm still not entirely convinced it all makes sense. If
> > > it were just L1TF, then yes, but now...
> >
> > I was thinking MDS is really the reason for this. L1TF has mitigations but
> > the only current mitigation for MDS for smt is ... nosmt.
>
> L1TF has no known mitigation that is SMT safe. The moment you have
> something in your L1, the other sibling can read it using L1TF.
>
> The nice thing about L1TF is that only (malicious) guests can exploit
> it, and therefore the synchronizatin context is VMM. And it so happens
> that VMEXITs are 'rare' (and already expensive and thus lots of effort
> has already gone into avoiding them).
>
> If you don't use VMs, you're good and SMT is not a problem.
>
> If you do use VMs (and do/can not trust them), _then_ you need
> core-scheduling; and in that case, the implementation under discussion
> misses things like synchronization on VMEXITs due to interrupts and
> things like that.
>
> But under the assumption that VMs don't generate high scheduling rates,
> it can work.
>
> > The current core scheduler implementation, I believe, still has (theoretical?)
> > holes involving interrupts, once/if those are closed it may be even less
> > attractive.
>
> No; so MDS leaks anything the other sibling (currently) does, this makes
> _any_ privilidge boundary a synchronization context.
>
> Worse still, the exploit doesn't require a VM at all, any other task can
> get to it.
>
> That means you get to sync the siblings on lovely things like system
> call entry and exit, along with VMM and anything else that one would
> consider a privilidge boundary. Now, system calls are not rare, they
> are really quite common in fact. Trying to sync up siblings at the rate
> of system calls is utter madness.
>
> So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> it _might_ work because a 'pure' host doesn't schedule that often
> (maybe, same assumption as for L1TF).
>
> Now, there have been proposals of moving the privilidge boundary further
> into the kernel. Just like PTI exposes the entry stack and code to
> Meltdown, the thinking is, lets expose more. By moving the priv boundary
> the hope is that we can do lots of common system calls without having to
> sync up -- lots of details are 'pending'.


Thanks for clarifying. My understanding is (somewhat) less fuzzy now. :)

I think, though, that you were basically agreeing with me that the
current core scheduler does not close the holes, or am I reading that
wrong?


Cheers,
Phil

--

2019-08-29 14:42:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Aug 29, 2019 at 10:30:51AM -0400, Phil Auld wrote:
> On Wed, Aug 28, 2019 at 06:01:14PM +0200 Peter Zijlstra wrote:
> > On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> > > On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> >
> > > > And given MDS, I'm still not entirely convinced it all makes sense. If
> > > > it were just L1TF, then yes, but now...
> > >
> > > I was thinking MDS is really the reason for this. L1TF has mitigations but
> > > the only current mitigation for MDS for smt is ... nosmt.
> >
> > L1TF has no known mitigation that is SMT safe. The moment you have
> > something in your L1, the other sibling can read it using L1TF.
> >
> > The nice thing about L1TF is that only (malicious) guests can exploit
> > it, and therefore the synchronization context is VMM. And it so happens
> > that VMEXITs are 'rare' (and already expensive and thus lots of effort
> > has already gone into avoiding them).
> >
> > If you don't use VMs, you're good and SMT is not a problem.
> >
> > If you do use VMs (and do/can not trust them), _then_ you need
> > core-scheduling; and in that case, the implementation under discussion
> > misses things like synchronization on VMEXITs due to interrupts and
> > things like that.
> >
> > But under the assumption that VMs don't generate high scheduling rates,
> > it can work.
> >
> > > The current core scheduler implementation, I believe, still has (theoretical?)
> > > holes involving interrupts, once/if those are closed it may be even less
> > > attractive.
> >
> > No; so MDS leaks anything the other sibling (currently) does, this makes
> > _any_ privilege boundary a synchronization context.
> >
> > Worse still, the exploit doesn't require a VM at all, any other task can
> > get to it.
> >
> > That means you get to sync the siblings on lovely things like system
> > call entry and exit, along with VMM and anything else that one would
> > consider a privilege boundary. Now, system calls are not rare, they
> > are really quite common in fact. Trying to sync up siblings at the rate
> > of system calls is utter madness.
> >
> > So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> > it _might_ work because a 'pure' host doesn't schedule that often
> > (maybe, same assumption as for L1TF).
> >
> > Now, there have been proposals of moving the privilege boundary further
> > into the kernel. Just like PTI exposes the entry stack and code to
> > Meltdown, the thinking is, let's expose more. By moving the priv boundary
> > the hope is that we can do lots of common system calls without having to
> > sync up -- lots of details are 'pending'.
>
>
> Thanks for clarifying. My understanding is (somewhat) less fuzzy now. :)
>
> I think, though, that you were basically agreeing with me that the current
> core scheduler does not close the holes, or am I reading that wrong.

Agreed; the missing bits for L1TF are ugly but doable (I've actually
done them before, Tim has that _somewhere_), but I've not seen a
'workable' solution for MDS yet.

2019-09-05 02:30:26

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> 1) Unfairness between the sibling threads
> -----------------------------------------
> One sibling thread could be suppressing and force idling
> the other sibling thread disproportionately, resulting in
> the force idled CPU not getting to run and stalling tasks on
> the suppressed CPU.
>
> Status:
> i) Aaron has proposed a patchset here based on using one
> rq as a base reference for vruntime for task priority
> comparison between siblings.
>
> https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
> It works well on fairness but has some initialization issues
>
> ii) Tim has proposed a patchset here to account for forced
> idle time in rq's min_vruntime
> https://lore.kernel.org/lkml/[email protected]/
> It improves over v3 with simpler logic compared to
> Aaron's patch, but does not work as well on fairness
>
> iii) Tim has proposed yet another patch to maintain fairness
> of forced idle time between CPU threads per Peter's suggestion.
> https://lore.kernel.org/lkml/[email protected]/
> Its performance has yet to be tested.
>
> 2) Not rescheduling forced idled CPU
> ------------------------------------
> The forced idled CPU does not get a chance to re-schedule
> itself, and will stall for a long time even though it
> has eligible tasks to run.
>
> Status:
> i) Aaron proposed a patch to fix this to check if there
> are runnable tasks when scheduling tick comes in.
> https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/
>
> ii) Vineeth has patches to this issue and also issue 1, based
> on scheduling in a new "forced idle task" when getting forced
> idle, but has yet to post the patches.

We finished writing and debugging the PoC for the coresched_idle task
and here are the results and the code.

Those patches are applied on top of Aaron's patches:
- sched: Fix incorrect rq tagged as forced idle
- wrapper for cfs_rq->min_vruntime
https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
- core vruntime comparison
https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/

For the testing, we used the same strategy as described in
https://lore.kernel.org/lkml/20190802153715.GA18075@sinkpad/

No tag
------
Test                        Average   Stdev
Alone                       1306.90    0.94
nosmt                        649.95    1.44
Aaron's full patchset:       828.15   32.45
Aaron's first 2 patches:     832.12   36.53
Tim's first patchset:        852.50    4.11
Tim's second patchset:       855.11    9.89
coresched_idle               985.67    0.83

Sysbench mem untagged, sysbench cpu tagged
------------------------------------------
Test                        Average   Stdev
Alone                       1306.90    0.94
nosmt                        649.95    1.44
Aaron's full patchset:       586.06    1.77
Tim's first patchset:        852.50    4.11
Tim's second patchset:       663.88   44.43
coresched_idle               653.58    0.49

Sysbench mem tagged, sysbench cpu untagged
------------------------------------------
Test                        Average   Stdev
Alone                       1306.90    0.94
nosmt                        649.95    1.44
Aaron's full patchset:       583.77    3.52
Tim's first patchset:        564.04   58.05
Tim's second patchset:       524.72   55.24
coresched_idle               653.30    0.81

Both sysbench tagged
--------------------
Test                        Average   Stdev
Alone                       1306.90    0.94
nosmt                        649.95    1.44
Aaron's full patchset:       582.15    3.75
Tim's first patchset:        679.43   70.07
Tim's second patchset:       563.10   34.58
coresched_idle               653.12    1.68

As we can see from this stress test, with the coresched_idle thread
being a real process, fairness is more consistent (low stdev). Also,
the performance remains the same regardless of the tagging, and is
even consistently slightly better than nosmt.

Thanks,

Julien

From: vpillai <[email protected]>
Date: Wed, 4 Sep 2019 17:41:38 +0000
Subject: [RFC PATCH 1/2] coresched_idle thread

---
kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 47 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f7839bf96e8b..fe560739c247 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3639,6 +3639,51 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+static int coresched_idle_worker(void *data)
+{
+ struct rq *rq = (struct rq *)data;
+
+ /*
+ * Transition to parked state and dequeue from runqueue.
+ * pick_task() will select us if needed without enqueueing.
+ */
+ set_special_state(TASK_PARKED);
+ schedule();
+
+ while (true) {
+ if (kthread_should_stop())
+ break;
+
+ play_idle(1);
+ }
+
+ return 0;
+}
+
+static void coresched_idle_worker_init(struct rq *rq)
+{
+
+ // XXX core_idle_task needs lock protection?
+ if (!rq->core_idle_task) {
+ rq->core_idle_task = kthread_create_on_cpu(coresched_idle_worker,
+ (void *)rq, cpu_of(rq), "coresched_idle");
+ if (rq->core_idle_task) {
+ wake_up_process(rq->core_idle_task);
+ }
+
+ }
+
+ return;
+}
+
+static void coresched_idle_worker_fini(struct rq *rq)
+{
+ if (rq->core_idle_task) {
+ kthread_stop(rq->core_idle_task);
+ rq->core_idle_task = NULL;
+ }
+}
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -6774,6 +6819,7 @@ void __init sched_init(void)
atomic_set(&rq->nr_iowait, 0);

#ifdef CONFIG_SCHED_CORE
+ rq->core_idle_task = NULL;
rq->core = NULL;
rq->core_pick = NULL;
rq->core_enabled = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..c3ae0af55b05 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -965,6 +965,7 @@ struct rq {
unsigned int core_sched_seq;
struct rb_root core_tree;
bool core_forceidle;
+ struct task_struct *core_idle_task;

/* shared state */
unsigned int core_task_seq;
--
2.17.1

From: vpillai <[email protected]>
Date: Wed, 4 Sep 2019 18:22:55 +0000
Subject: [RFC PATCH 2/2] Use coresched_idle to force idle a sibling

Currently we use the idle thread to force idle on a sibling. Let's
use the new coresched_idle thread so that the scheduler sees a valid
task during forced idle.
---
kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 56 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe560739c247..e35d69a81adb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -244,23 +244,33 @@ static int __sched_core_stopper(void *data)
static DEFINE_MUTEX(sched_core_mutex);
static int sched_core_count;

+static void coresched_idle_worker_init(struct rq *rq);
+static void coresched_idle_worker_fini(struct rq *rq);
static void __sched_core_enable(void)
{
+ int cpu;
+
// XXX verify there are no cookie tasks (yet)

static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);

+ for_each_online_cpu(cpu)
+ coresched_idle_worker_init(cpu_rq(cpu));
printk("core sched enabled\n");
}

static void __sched_core_disable(void)
{
+ int cpu;
+
// XXX verify there are no cookie tasks (left)

stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);

+ for_each_online_cpu(cpu)
+ coresched_idle_worker_fini(cpu_rq(cpu));
printk("core sched disabled\n");
}

@@ -3626,14 +3636,25 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

#ifdef CONFIG_SCHED_CORE

+static inline bool is_force_idle_task(struct task_struct *p)
+{
+ BUG_ON(task_rq(p)->core_idle_task == NULL);
+ return task_rq(p)->core_idle_task == p;
+}
+
+static inline bool is_core_idle_task(struct task_struct *p)
+{
+ return is_idle_task(p) || is_force_idle_task(p);
+}
+
static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
{
- return is_idle_task(a) || (a->core_cookie == cookie);
+ return is_core_idle_task(a) || (a->core_cookie == cookie);
}

static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
{
- if (is_idle_task(a) || is_idle_task(b))
+ if (is_core_idle_task(a) || is_core_idle_task(b))
return true;

return a->core_cookie == b->core_cookie;
@@ -3641,8 +3662,6 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)

static int coresched_idle_worker(void *data)
{
- struct rq *rq = (struct rq *)data;
-
/*
* Transition to parked state and dequeue from runqueue.
* pick_task() will select us if needed without enqueueing.
@@ -3666,7 +3685,7 @@ static void coresched_idle_worker_init(struct rq *rq)
// XXX core_idle_task needs lock protection?
if (!rq->core_idle_task) {
rq->core_idle_task = kthread_create_on_cpu(coresched_idle_worker,
- (void *)rq, cpu_of(rq), "coresched_idle");
+ NULL, cpu_of(rq), "coresched_idle");
if (rq->core_idle_task) {
wake_up_process(rq->core_idle_task);
}
@@ -3684,6 +3703,14 @@ static void coresched_idle_worker_fini(struct rq *rq)
}
}

+static inline struct task_struct *core_idle_task(struct rq *rq)
+{
+ BUG_ON(rq->core_idle_task == NULL);
+
+ return rq->core_idle_task;
+
+}
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -3709,7 +3736,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
*/
if (max && class_pick->core_cookie &&
prio_less(class_pick, max))
- return idle_sched_class.pick_task(rq);
+ return core_idle_task(rq);

return class_pick;
}
@@ -3853,7 +3880,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
goto done;
}

- if (!is_idle_task(p))
+ if (!is_force_idle_task(p))
occ++;

rq_i->core_pick = p;
@@ -3906,7 +3933,6 @@ next_class:;
rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
rq->core_sched_seq = rq->core->core_pick_seq;
- trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);

/*
* Reschedule siblings
@@ -3924,13 +3950,24 @@ next_class:;

WARN_ON_ONCE(!rq_i->core_pick);

- if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+ if (is_core_idle_task(rq_i->core_pick) && rq_i->nr_running) {
+ /*
+ * Matching logic can sometimes select idle_task when
+ * iterating the sched_classes. If that selection is
+ * actually a forced idle case, we need to update the
+ * core_pick to coresched_idle.
+ */
+ if (is_idle_task(rq_i->core_pick))
+ rq_i->core_pick = core_idle_task(rq_i);
rq_i->core_forceidle = true;
+ }

rq_i->core_pick->core_occupation = occ;

- if (i == cpu)
+ if (i == cpu) {
+ next = rq_i->core_pick;
continue;
+ }

if (rq_i->curr != rq_i->core_pick) {
trace_printk("IPI(%d)\n", i);
@@ -3947,6 +3984,7 @@ next_class:;
WARN_ON_ONCE(1);
}
}
+ trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);

done:
set_next_task(rq, next);
@@ -4200,6 +4238,12 @@ static void __sched notrace __schedule(bool preempt)
* is a RELEASE barrier),
*/
++*switch_count;
+#ifdef CONFIG_SCHED_CORE
+ if (next == rq->core_idle_task)
+ next->state = TASK_RUNNING;
+ else if (prev == rq->core_idle_task)
+ prev->state = TASK_PARKED;
+#endif

trace_sched_switch(preempt, prev, next);

@@ -6479,6 +6523,7 @@ int sched_cpu_activate(unsigned int cpu)
#ifdef CONFIG_SCHED_CORE
if (static_branch_unlikely(&__sched_core_enabled)) {
rq->core_enabled = true;
+ coresched_idle_worker_init(rq);
}
#endif
}
@@ -6535,6 +6580,7 @@ int sched_cpu_deactivate(unsigned int cpu)
struct rq *rq = cpu_rq(cpu);
if (static_branch_unlikely(&__sched_core_enabled)) {
rq->core_enabled = false;
+ coresched_idle_worker_fini(rq);
}
#endif
static_branch_dec_cpuslocked(&sched_smt_present);
--
2.17.1

2019-09-07 16:21:50

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 8/7/19 10:10 AM, Tim Chen wrote:

> 3) Load balancing between CPU cores
> -----------------------------------
> Say if one CPU core's sibling threads get forced idled
> a lot as it has mostly incompatible tasks between the siblings,
> moving the incompatible load to other cores and pulling
> compatible load to the core could help CPU utilization.
>
> So just considering the load of a task is not enough during
> load balancing, task compatibility also needs to be considered.
> Peter has put in mechanisms to balance compatible tasks between
> CPU thread siblings, but not across cores.
>
> Status:
> I have not seen patches on this issue. This issue could lead to
> large variance in workload performance based on your luck
> in placing the workload among the cores.
>

I've made an attempt in the following two patches to address
the load balancing of mismatched load between the siblings.

It is applied on top of Aaron's patches:
- sched: Fix incorrect rq tagged as forced idle
- wrapper for cfs_rq->min_vruntime
https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
- core vruntime comparison
https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/

I would love Julien, Aaron and others to try it out. Suggestions
on tuning it are welcome.

Tim

---

From c7b91fb26d787d020f0795c3fbec82914889dc67 Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Wed, 21 Aug 2019 15:48:15 -0700
Subject: [PATCH 1/2] sched: scan core sched load mismatch

Calculate the mismatched load imbalance on a core running the
core scheduler when we update the load balance statistics.
This will later guide the load balancer to move load to
another CPU that can reduce the mismatched load.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 150 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 149 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 730c9359e9c9..b3d6a6482553 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7507,6 +7507,9 @@ static inline int migrate_degrades_locality(struct task_struct *p,
}
#endif

+static inline s64 core_sched_imbalance_improvement(int src_cpu, int dst_cpu,
+ struct task_struct *p);
+
/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
*/
@@ -7970,6 +7973,11 @@ struct sg_lb_stats {
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
#endif
+#ifdef CONFIG_SCHED_CORE
+ int imbl_cpu;
+ struct task_group *imbl_tg;
+ s64 imbl_load;
+#endif
};

/*
@@ -8314,6 +8322,145 @@ static bool update_nohz_stats(struct rq *rq, bool force)
#endif
}

+#ifdef CONFIG_SCHED_CORE
+static inline int cpu_sibling(int cpu)
+{
+ int i;
+
+ for_each_cpu(i, cpu_smt_mask(cpu)) {
+ if (i == cpu)
+ continue;
+ return i;
+ }
+ return -1;
+}
+
+static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
+ int src_sibling, int dst_sibling,
+ struct task_group *tg, u64 task_load)
+{
+ struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
+ s64 excess, deficit, old_mismatch, new_mismatch;
+
+ if (src_cpu == dst_cpu)
+ return -1;
+
+ /* XXX SMT4 will require additional logic */
+
+ se = tg->se[src_cpu];
+ se_sibling = tg->se[src_sibling];
+
+ excess = se->avg.load_avg - se_sibling->avg.load_avg;
+ if (src_sibling == dst_cpu) {
+ old_mismatch = abs(excess);
+ new_mismatch = abs(excess - 2*task_load);
+ return old_mismatch - new_mismatch;
+ }
+
+ dst_se = tg->se[dst_cpu];
+ dst_se_sibling = tg->se[dst_sibling];
+ deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
+
+ old_mismatch = abs(excess) + abs(deficit);
+ new_mismatch = abs(excess - (s64) task_load) +
+ abs(deficit + (s64) task_load);
+
+ if (excess > 0 && deficit < 0)
+ return old_mismatch - new_mismatch;
+ else
+ /* no mismatch improvement */
+ return -1;
+}
+
+static inline s64 core_sched_imbalance_improvement(int src_cpu, int dst_cpu,
+ struct task_struct *p)
+{
+ int src_sibling, dst_sibling;
+ unsigned long task_load = task_h_load(p);
+ struct task_group *tg;
+
+ if (!p->se.parent)
+ return 0;
+
+ tg = p->se.parent->cfs_rq->tg;
+ if (!tg->tagged)
+ return 0;
+
+ /* XXX SMT4 will require additional logic */
+ src_sibling = cpu_sibling(src_cpu);
+ dst_sibling = cpu_sibling(dst_cpu);
+
+ if (src_sibling == -1 || dst_sibling == -1)
+ return 0;
+
+ return core_sched_imbalance_delta(src_cpu, dst_cpu,
+ src_sibling, dst_sibling,
+ tg, task_load);
+}
+
+static inline void core_sched_imbalance_scan(struct sg_lb_stats *sgs,
+ int src_cpu,
+ int dst_cpu)
+{
+ struct rq *rq;
+ struct cfs_rq *cfs_rq, *pos;
+ struct task_group *tg;
+ s64 mismatch;
+ int src_sibling, dst_sibling;
+ u64 src_avg_load_task;
+
+ if (!sched_core_enabled(cpu_rq(src_cpu)) ||
+ !sched_core_enabled(cpu_rq(dst_cpu)) ||
+ src_cpu == dst_cpu)
+ return;
+
+ rq = cpu_rq(src_cpu);
+
+ src_sibling = cpu_sibling(src_cpu);
+ dst_sibling = cpu_sibling(dst_cpu);
+
+ if (src_sibling == -1 || dst_sibling == -1)
+ return;
+
+ src_avg_load_task = cpu_avg_load_per_task(src_cpu);
+
+ if (src_avg_load_task == 0)
+ return;
+
+ /*
+ * Imbalance in tagged task group's load causes forced
+ * idle time in sibling, that will be counted as mismatched load
+ * on the forced idled cpu. Record the source cpu in the sched
+ * group causing the largest mismatched load.
+ */
+ for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+
+ tg = cfs_rq->tg;
+ if (!tg->tagged)
+ continue;
+
+ mismatch = core_sched_imbalance_delta(src_cpu, dst_cpu,
+ src_sibling, dst_sibling,
+ tg, src_avg_load_task);
+
+ if (mismatch > sgs->imbl_load &&
+ mismatch > src_avg_load_task) {
+ sgs->imbl_load = mismatch;
+ sgs->imbl_tg = tg;
+ sgs->imbl_cpu = src_cpu;
+ }
+ }
+}
+
+#else
+#define core_sched_imbalance_scan(sgs, src_cpu, dst_cpu)
+static inline s64 core_sched_imbalance_improvement(int src_cpu, int dst_cpu,
+ struct task_struct *p)
+{
+ return 0;
+}
+#endif /* CONFIG_SCHED_CORE */
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -8345,7 +8492,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
else
load = source_load(i, load_idx);

- sgs->group_load += load;
+ core_sched_imbalance_scan(sgs, i, env->dst_cpu);
+
sgs->group_util += cpu_util(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

--
2.20.1


From a11084f84de9c174f36cf2701ba5bbe1546e45f5 Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Wed, 28 Aug 2019 11:22:43 -0700
Subject: [PATCH 2/2] sched: load balance core imbalanced load

If moving mismatched core scheduling load can reduce load imbalance
more than regular load balancing, move the mismatched load instead.

On regular load balancing, also skip moving a task that could increase
load mismatch.

Move only one mismatched task at a time to reduce load disturbance.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3d6a6482553..69939c977797 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7412,6 +7412,11 @@ struct lb_env {
enum fbq_type fbq_type;
enum group_type src_grp_type;
struct list_head tasks;
+#ifdef CONFIG_SCHED_CORE
+ int imbl_cpu;
+ struct task_group *imbl_tg;
+ s64 imbl_load;
+#endif
};

/*
@@ -7560,6 +7565,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}

+#ifdef CONFIG_SCHED_CORE
+ /* Don't migrate if we increase core imbalance */
+ if (core_sched_imbalance_improvement(env->src_cpu, env->dst_cpu, p) < 0)
+ return 0;
+#endif
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;

@@ -8533,6 +8544,14 @@ static inline void update_sg_lb_stats(struct lb_env *env,

sgs->group_no_capacity = group_is_overloaded(env, sgs);
sgs->group_type = group_classify(group, sgs);
+
+#ifdef CONFIG_SCHED_CORE
+ if (sgs->imbl_load > env->imbl_load) {
+ env->imbl_cpu = sgs->imbl_cpu;
+ env->imbl_tg = sgs->imbl_tg;
+ env->imbl_load = sgs->imbl_load;
+ }
+#endif
}

/**
@@ -9066,6 +9085,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
unsigned long busiest_load = 0, busiest_capacity = 1;
int i;

+#ifdef CONFIG_SCHED_CORE
+ if (env->imbl_load > env->imbalance) {
+ env->imbalance = cpu_avg_load_per_task(env->imbl_cpu);
+ return cpu_rq(env->imbl_cpu);
+ } else {
+ env->imbl_load = 0;
+ }
+#endif
+
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
unsigned long capacity, wl;
enum fbq_type rt;
--
2.20.1

2019-09-08 05:36:32

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/4/19 6:44 PM, Julien Desfossez wrote:


>@@ -3853,7 +3880,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> goto done;
> }
>
>- if (!is_idle_task(p))
>+ if (!is_force_idle_task(p))

Should this be if (!is_core_idle_task(p))
instead?

> occ++;
>


Tim

2019-09-10 14:35:01

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 29-Aug-2019 04:38:21 PM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 10:30:51AM -0400, Phil Auld wrote:
> > I think, though, that you were basically agreeing with me that the current
> > core scheduler does not close the holes, or am I reading that wrong.
>
> Agreed; the missing bits for L1TF are ugly but doable (I've actually
> done them before, Tim has that _somewhere_), but I've not seen a
> 'workable' solution for MDS yet.

Following the discussion we had yesterday at LPC, after we have agreed
on a solution for fixing the current fairness issue, we will post the
v4. We will then work on prototyping the other synchronisation points
(syscalls, interrupts and VMEXIT) to evaluate the overhead in various
use-cases.

Depending on the use-case, we know the performance overhead may be
heavier than just disabling SMT, but the benchmarks we have seen so far
indicate that there are valid cases for core scheduling. Core scheduling
will continue to be unused by default, but with it, we will have the
option to tune the system to be both secure and faster than disabling
SMT for those cases.

Thanks,

Julien

P.S: I think the branch that contains the VMEXIT handling is here
https://github.com/pdxChen/gang/commits/sched_1.23-base

2019-09-11 14:07:18

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

Hi Tim & Julien,

On Fri, Sep 06, 2019 at 11:30:20AM -0700, Tim Chen wrote:
> On 8/7/19 10:10 AM, Tim Chen wrote:
>
> > 3) Load balancing between CPU cores
> > -----------------------------------
> > Say if one CPU core's sibling threads get forced idled
> > a lot as it has mostly incompatible tasks between the siblings,
> > moving the incompatible load to other cores and pulling
> > compatible load to the core could help CPU utilization.
> >
> > So just considering the load of a task is not enough during
> > load balancing, task compatibility also needs to be considered.
> > Peter has put in mechanisms to balance compatible tasks between
> > CPU thread siblings, but not across cores.
> >
> > Status:
> > I have not seen patches on this issue. This issue could lead to
> > large variance in workload performance based on your luck
> > in placing the workload among the cores.
> >
>
> I've made an attempt in the following two patches to address
> the load balancing of mismatched load between the siblings.
>
> It is applied on top of Aaron's patches:
> - sched: Fix incorrect rq tagged as forced idle
> - wrapper for cfs_rq->min_vruntime
> https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
> - core vruntime comparison
> https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/

So both of you are working on top of my 2 patches that deal with the
fairness issue, but I had the feeling Tim's alternative patches[1] are
simpler than mine and achieves the same result(after the force idle tag
fix), so unless there is something I missed, I think we should go with
the simpler one?

[1]: https://lore.kernel.org/lkml/[email protected]/

2019-09-11 16:22:05

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/11/19 7:02 AM, Aaron Lu wrote:
> Hi Tim & Julien,
>
> On Fri, Sep 06, 2019 at 11:30:20AM -0700, Tim Chen wrote:
>> On 8/7/19 10:10 AM, Tim Chen wrote:
>>
>>> 3) Load balancing between CPU cores
>>> -----------------------------------
>>> Say if one CPU core's sibling threads get forced idled
>>> a lot as it has mostly incompatible tasks between the siblings,
>>> moving the incompatible load to other cores and pulling
>>> compatible load to the core could help CPU utilization.
>>>
>>> So just considering the load of a task is not enough during
>>> load balancing, task compatibility also needs to be considered.
>>> Peter has put in mechanisms to balance compatible tasks between
>>> CPU thread siblings, but not across cores.
>>>
>>> Status:
>>> I have not seen patches on this issue. This issue could lead to
>>> large variance in workload performance based on your luck
>>> in placing the workload among the cores.
>>>
>>
>> I've made an attempt in the following two patches to address
>> the load balancing of mismatched load between the siblings.
>>
>> It is applied on top of Aaron's patches:
>> - sched: Fix incorrect rq tagged as forced idle
>> - wrapper for cfs_rq->min_vruntime
>> https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
>> - core vruntime comparison
>> https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
>
> So both of you are working on top of my 2 patches that deal with the
> fairness issue, but I had the feeling Tim's alternative patches[1] are
> simpler than mine and achieves the same result(after the force idle tag

I think Julien's result show that my patches did not do as well as
your patches for fairness. Aubrey did some other testing with the same
conclusion. So I think keeping the forced idle time balanced is not
enough for maintaining fairness.

I would love to see if my load balancing patches help with your workload.

Tim

> fix), so unless there is something I missed, I think we should go with
> the simpler one?
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
>

2019-09-11 16:53:14

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> > So both of you are working on top of my 2 patches that deal with the
> > fairness issue, but I had the feeling Tim's alternative patches[1] are
> > simpler than mine and achieves the same result(after the force idle tag
>
> I think Julien's result show that my patches did not do as well as
> your patches for fairness. Aubrey did some other testing with the same
> conclusion. So I think keeping the forced idle time balanced is not
> enough for maintaining fairness.
>
There are two main issues - the vruntime comparison issue and the
forced idle issue. The coresched_idle thread patch addresses
the forced idle issue, as the scheduler no longer overloads the idle
thread for forcing idle. If I understand correctly, Tim's patch
also tries to fix the forced idle issue. On top of fixing the forced
idle issue, we also need to fix the vruntime comparison issue,
and I think that's where Aaron's patch helps.

I think comparing parent's runtime also will have issues once
the task group has a lot more threads with different running
patterns. One example is a task group with a lot of active threads
and a thread with fairly less activity. So when this less active
thread is competing with a thread in another group, there is a
chance that it loses continuously for a while until the other
group catches up on its vruntime.

As discussed during LPC, probably start thinking along the lines
of global vruntime or core wide vruntime to fix the vruntime
comparison issue?
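
To make the comparison concrete, here is a small user-space sketch of
the kind of normalized cross-runqueue comparison being discussed - toy
types and names only, not Aaron's actual patch, which uses one rq's
min_vruntime as the base reference:

#include <stdio.h>

typedef unsigned long long u64;

struct toy_cfs_rq { u64 min_vruntime; };
struct toy_task   { u64 vruntime; struct toy_cfs_rq *cfs_rq; };

/* compare tasks on sibling runqueues by how far their vruntime is
 * ahead of their own queue's min_vruntime, instead of comparing raw
 * vruntime values that advance independently on each sibling */
static int core_prio_less(const struct toy_task *a, const struct toy_task *b)
{
	u64 delta_a = a->vruntime - a->cfs_rq->min_vruntime;
	u64 delta_b = b->vruntime - b->cfs_rq->min_vruntime;

	return delta_a < delta_b;
}

int main(void)
{
	struct toy_cfs_rq rq0 = { .min_vruntime = 1000000 };
	struct toy_cfs_rq rq1 = { .min_vruntime = 9000000 };
	struct toy_task a = { .vruntime = 1005000, .cfs_rq = &rq0 };
	struct toy_task b = { .vruntime = 9001000, .cfs_rq = &rq1 };

	/* a raw vruntime comparison would always favour rq0's task here */
	printf("a runs before b? %d\n", core_prio_less(&a, &b));
	return 0;
}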

Thanks,
Vineeth

2019-09-12 12:40:51

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Sep 11, 2019 at 12:47:34PM -0400, Vineeth Remanan Pillai wrote:
> > > So both of you are working on top of my 2 patches that deal with the
> > > fairness issue, but I had the feeling Tim's alternative patches[1] are
> > > simpler than mine and achieves the same result(after the force idle tag
> >
> > I think Julien's result show that my patches did not do as well as
> > your patches for fairness. Aubrey did some other testing with the same
> > conclusion. So I think keeping the forced idle time balanced is not
> > enough for maintaining fairness.
> >
> There are two main issues - the vruntime comparison issue and the
> forced idle issue. The coresched_idle thread patch addresses
> the forced idle issue, as the scheduler no longer overloads the idle
> thread for forcing idle. If I understand correctly, Tim's patch
> also tries to fix the forced idle issue. On top of fixing the forced

Er... I don't think so. Tim's patch is meant to solve the fairness issue,
like mine; it doesn't attempt to address the forced idle issue.

> idle issue, we also need to fix the vruntime comparison issue,
> and I think that's where Aaron's patch helps.
>
> I think comparing parent's runtime also will have issues once
> the task group has a lot more threads with different running
> patterns. One example is a task group with a lot of active threads
> and a thread with fairly less activity. So when this less active
> thread is competing with a thread in another group, there is a
> chance that it loses continuously for a while until the other
> group catches up on its vruntime.

I actually think this is expected behaviour.

Without core scheduling, when deciding which task to run, we will first
decide which "se" to run from the CPU's root level cfs runqueue and then
go downwards. Let's call the chosen se on the root level cfs runqueue
the winner se. Then with core scheduling, we will also need to compare the
two winner "se"s of each hyperthread and choose the core wide winner "se".

>
> As discussed during LPC, probably start thinking along the lines
> of global vruntime or core wide vruntime to fix the vruntime
> comparison issue?

core wide vruntime makes sense when there are multiple tasks of
different cgroups queued on the same core. e.g. when there are two
tasks of cgroupA and one task of cgroupB are queued on the same core,
assume cgroupA's one task is on one hyperthread and its other task is on
the other hyperthread with cgroupB's task. With my current
implementation or Tim's, cgroupA will get more time than cgroupB. If we
maintain core wide vruntime for cgroupA and cgroupB, we should be able
to maintain fairness between cgroups on this core. Tim propose to solve
this problem by doing some kind of load balancing if I'm not mistaken, I
haven't taken a look at this yet.

2019-09-12 17:47:18

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> On 9/11/19 7:02 AM, Aaron Lu wrote:
> > Hi Tim & Julien,
> >
> > On Fri, Sep 06, 2019 at 11:30:20AM -0700, Tim Chen wrote:
> >> On 8/7/19 10:10 AM, Tim Chen wrote:
> >>
> >>> 3) Load balancing between CPU cores
> >>> -----------------------------------
> >>> Say if one CPU core's sibling threads get forced idled
> >>> a lot as it has mostly incompatible tasks between the siblings,
> >>> moving the incompatible load to other cores and pulling
> >>> compatible load to the core could help CPU utilization.
> >>>
> >>> So just considering the load of a task is not enough during
> >>> load balancing, task compatibility also needs to be considered.
> >>> Peter has put in mechanisms to balance compatible tasks between
> >>> CPU thread siblings, but not across cores.
> >>>
> >>> Status:
> >>> I have not seen patches on this issue. This issue could lead to
> >>> large variance in workload performance based on your luck
> >>> in placing the workload among the cores.
> >>>
> >>
> >> I've made an attempt in the following two patches to address
> >> the load balancing of mismatched load between the siblings.
> >>
> >> It is applied on top of Aaron's patches:
> >> - sched: Fix incorrect rq tagged as forced idle
> >> - wrapper for cfs_rq->min_vruntime
> >> https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
> >> - core vruntime comparison
> >> https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
> >
> > So both of you are working on top of my 2 patches that deal with the
> > fairness issue, but I had the feeling Tim's alternative patches[1] are
> > simpler than mine and achieves the same result(after the force idle tag
>
> I think Julien's result show that my patches did not do as well as
> your patches for fairness. Aubrey did some other testing with the same
> conclusion. So I think keeping the forced idle time balanced is not
> enough for maintaining fairness.

Well, I have done following tests:
1 Julien's test script: https://paste.debian.net/plainh/834cf45c
2 start two tagged will-it-scale/page_fault1, see how each performs;
3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git

They all show your patchset performs equally well...And consider what
the patch does, I think they are really doing the same thing in
different ways.

2019-09-12 19:02:40

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/12/19 5:04 AM, Aaron Lu wrote:

> Well, I have done following tests:
> 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> 2 start two tagged will-it-scale/page_fault1, see how each performs;
> 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
>
> They all show your patchset performs equally well...And consider what
> the patch does, I think they are really doing the same thing in
> different ways.
>

Aaron,

The new feature of my new patches attempts to load balance between cores,
and remove the imbalance of cgroup load on a core that causes forced idle.
The previous patches, by contrast, aim for fairness of a cgroup between sibling threads,
so I think the goals are kind of orthogonal and complementary.

The premise is this: say cgroup1 is occupying 50% of the cpu on cpu thread 1
and 25% of the cpu on cpu thread 2; that means we have a 25% cpu imbalance
and the cpu is force idled 25% of the time. So ideally we need to move
12.5% of cgroup1 load from cpu thread 1 to sibling thread 2, so that they
both run cgroup1 load at 37.5% without causing
any forced idle time. Otherwise we will try to move 25% of cgroup1
load from cpu thread 1 to another core that has cgroup1 load to match.

This load balance is done in the regular load balance paths.
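
Spelling out that arithmetic as a tiny standalone program (percentages
stand in for the per-cgroup load_avg values the patches actually use):

#include <stdio.h>

static double mismatch(double t1, double t2)
{
	return t1 > t2 ? t1 - t2 : t2 - t1;
}

int main(void)
{
	double t1 = 50.0, t2 = 25.0;	/* cgroup1 load on the two siblings */

	printf("before: mismatch %.1f%% (forced idle)\n", mismatch(t1, t2));

	/* move half of the excess from thread 1 to its sibling */
	double moved = mismatch(t1, t2) / 2.0;	/* 12.5% */
	t1 -= moved;
	t2 += moved;

	printf("after: %.1f%% vs %.1f%%, mismatch %.1f%%\n",
	       t1, t2, mismatch(t1, t2));
	return 0;
}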

Previously for v3, only sched_core_balance made an attempt to pull a cookie task, and only
in the idle balance path. So if the cpu is kept busy, the cgroup load imbalance
between sibling threads could last a long time. And the thread fairness
patches for v3 don't help to balance load for such cases.

The new patches take into consideration the actual amount of load imbalance
of the same group between sibling threads when selecting a task to pull,
and they also prevent task migrations that create
more load imbalance. So hopefully this feature will help when we have
more cores and need load balance across the cores. This tries to help
even cgroup workload between threads to minimize forced idle time, and also
even out load across cores.

In your test, how many cores are on your machine and how many threads did
each page_fault1 spawn off?

Tim

2019-09-12 19:31:19

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/12/19 5:35 AM, Aaron Lu wrote:
> On Wed, Sep 11, 2019 at 12:47:34PM -0400, Vineeth Remanan Pillai wrote:

>
> core wide vruntime makes sense when there are multiple tasks of
> different cgroups queued on the same core. e.g. when there are two
> tasks of cgroupA and one task of cgroupB are queued on the same core,
> assume cgroupA's one task is on one hyperthread and its other task is on
> the other hyperthread with cgroupB's task. With my current
> implementation or Tim's, cgroupA will get more time than cgroupB.

I think that's expected because cgroup A has two tasks and cgroup B
has one task, so cgroup A should get twice the cpu time of cgroup B
to maintain fairness.

> If we
> maintain core wide vruntime for cgroupA and cgroupB, we should be able
> to maintain fairness between cgroups on this core.

I don't think the right thing to do is to give cgroupA and cgroupB equal
time on a core. The time they get should still depend on their
load weight. The better thing to do is to move one task from cgroupA
to another core, that has only one cgroupA task so it can be paired up
with that lonely cgroupA task. This will eliminate the forced idle time
for cgroupA both on the current core and also the migrated core.

> Tim propose to solve
> this problem by doing some kind of load balancing if I'm not mistaken, I
> haven't taken a look at this yet.
>

My new patchset is trying to solve a different problem. It is
not trying to maintain fairness between cgroup on a core, but tries to
even out the load of a cgroup between threads, and even out general
load between cores. This will minimize the forced idle time.

The fairness between cgroup relies still on
proper vruntime accounting and proper comparison of vruntime between
threads. So for now, I am still using Aaron's patchset for this purpose
as it has better fairness property than my other proposed patchsets
for fairness purpose.

With just Aaron's current patchset we may have a lot of forced idle time
due to the uneven distribution of tasks of different cgroup among the
threads and cores, even though scheduling fairness is maintained.
My new patches try to remove those forced idle time by moving the
tasks around, to minimize cgroup unevenness between sibling threads
and general load unevenness between the CPUs.

Tim

2019-09-13 01:20:57

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Sep 12, 2019 at 8:04 PM Aaron Lu <[email protected]> wrote:
>
> On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> > On 9/11/19 7:02 AM, Aaron Lu wrote:
> > I think Julien's result show that my patches did not do as well as
> > your patches for fairness. Aubrey did some other testing with the same
> > conclusion. So I think keeping the forced idle time balanced is not
> > enough for maintaining fairness.
>
> Well, I have done following tests:
> 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> 2 start two tagged will-it-scale/page_fault1, see how each performs;
> 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
>
> They all show your patchset performs equally well...And consider what
> the patch does, I think they are really doing the same thing in
> different ways.

It looks like we are not on the same page, if you don't mind, can both of
you rebase your patchset onto v5.3-rc8 and provide a public branch so I
can fetch and test it at least by my benchmark?

Thanks,
-Aubrey

2019-09-13 14:03:56

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Sep 12, 2019 at 10:05:43AM -0700, Tim Chen wrote:
> On 9/12/19 5:04 AM, Aaron Lu wrote:
>
> > Well, I have done following tests:
> > 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> > 2 start two tagged will-it-scale/page_fault1, see how each performs;
> > 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
> >
> > They all show your patchset performs equally well...And consider what
> > the patch does, I think they are really doing the same thing in
> > different ways.
> >
>
> Aaron,
>
> The new feature of my new patches attempts to load balance between cores,
> and remove the imbalance of cgroup load on a core that causes forced idle.
> The previous patches, by contrast, aim for fairness of a cgroup between sibling threads,
> so I think the goals are kind of orthogonal and complementary.
>
> The premise is this: say cgroup1 is occupying 50% of the cpu on cpu thread 1
> and 25% of the cpu on cpu thread 2; that means we have a 25% cpu imbalance
> and the cpu is force idled 25% of the time. So ideally we need to move
> 12.5% of cgroup1 load from cpu thread 1 to sibling thread 2, so that they
> both run cgroup1 load at 37.5% without causing
> any forced idle time. Otherwise we will try to move 25% of cgroup1
> load from cpu thread 1 to another core that has cgroup1 load to match.
>
> This load balance is done in the regular load balance paths.
>
> Previously for v3, only sched_core_balance made an attempt to pull a cookie task, and only
> in the idle balance path. So if the cpu is kept busy, the cgroup load imbalance
> between sibling threads could last a long time. And the thread fairness
> patches for v3 don't help to balance load for such cases.
>
> The new patches take into consideration the actual amount of load imbalance
> of the same group between sibling threads when selecting a task to pull,
> and they also prevent task migrations that create
> more load imbalance. So hopefully this feature will help when we have
> more cores and need load balance across the cores. This tries to help
> even cgroup workload between threads to minimize forced idle time, and also
> even out load across cores.

Will take a look at your new patches, thanks for the explanation.

> In your test, how many cores are on your machine and how many threads did
> each page_fault1 spawn off?

The test VM has 16 cores and 32 threads.
I created 2 tagged cgroups to run page_fault1 and each page_fault1 has
16 processes, like this:
$ ./src/will-it-scale/page_fault1_processes -t 16 -s 60

2019-09-13 19:04:30

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Sep 12, 2019 at 10:29:13AM -0700, Tim Chen wrote:
> On 9/12/19 5:35 AM, Aaron Lu wrote:
> > On Wed, Sep 11, 2019 at 12:47:34PM -0400, Vineeth Remanan Pillai wrote:
>
> >
> > core wide vruntime makes sense when there are multiple tasks of
> > different cgroups queued on the same core. e.g. when there are two
> > tasks of cgroupA and one task of cgroupB are queued on the same core,
> > assume cgroupA's one task is on one hyperthread and its other task is on
> > the other hyperthread with cgroupB's task. With my current
> > implementation or Tim's, cgroupA will get more time than cgroupB.
>
> I think that's expected because cgroup A has two tasks and cgroup B
> has one task, so cgroup A should get twice the cpu time of cgroup B
> to maintain fairness.

Like you said below, the ideal run time for each cgroup should depend on
their individual weight. The fact cgroupA has two tasks doesn't mean it
has twice the weight. Both cgroups can have the same cpu.share settings
and then, the more tasks a cgroup has, the less weight it can get for the
cgroup's per-cpu se.

I now realize one thing that's different between your idle_allowance
implementation and my core_vruntime implementation. In your
implementation, the idle_allowance is absolute time, while vruntime is
adjusted by the se's weight; that's probably one area where your
implementation can make things less fair than mine.
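
A quick standalone illustration of that difference (toy code; 1024 is
used as the nice-0 weight for simplicity, the real kernel scales this
further):

#include <stdio.h>

#define TOY_NICE_0_WEIGHT 1024ULL

/* same idea as CFS: vruntime advances by delta_exec scaled inversely
 * with the entity's weight, so heavier entities accumulate vruntime
 * more slowly for the same wall-clock runtime */
static unsigned long long vruntime_delta(unsigned long long delta_exec_ns,
					 unsigned long long weight)
{
	return delta_exec_ns * TOY_NICE_0_WEIGHT / weight;
}

int main(void)
{
	unsigned long long ran = 10000000ULL;	/* 10ms of real runtime */

	printf("weight 1024: vruntime += %llu ns\n", vruntime_delta(ran, 1024));
	printf("weight 2048: vruntime += %llu ns\n", vruntime_delta(ran, 2048));
	/* an absolute idle allowance would charge both the same 10ms */
	return 0;
}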

> > If we
> > maintain core wide vruntime for cgroupA and cgroupB, we should be able
> > to maintain fairness between cgroups on this core.
>
> I don't think the right thing to do is to give cgroupA and cgroupB equal
> time on a core. The time they get should still depend on their
> load weight.

Agree.

> The better thing to do is to move one task from cgroupA to another core,
> that has only one cgroupA task so it can be paired up
> with that lonely cgroupA task. This will eliminate the forced idle time
> for cgroupA both on the current core and also the migrated core.

I'm not sure if this is always possible.

Say on a 16cores/32threads machine, there are 3 cgroups, each has 16 cpu
intensive tasks, will it be possible to make things perfectly balanced?

Don't get me wrong, I think this kind of load balancing is good and
needed, but I'm not sure if we can always make things perfectly
balanced. And if not, do we care about those few cores where cgroup tasks are
not balanced, and then, do we need to implement the core_wide cgroup
fairness functionality, or do we not care since those cores are supposed
to be few and it isn't a big deal?

> > Tim propose to solve
> > this problem by doing some kind of load balancing if I'm not mistaken, I
> > haven't taken a look at this yet.
> >
>
> My new patchset is trying to solve a different problem. It is
> not trying to maintain fairness between cgroup on a core, but tries to
> even out the load of a cgroup between threads, and even out general
> load between cores. This will minimize the forced idle time.

Understood.

>
> The fairness between cgroup relies still on
> proper vruntime accounting and proper comparison of vruntime between
> threads. So for now, I am still using Aaron's patchset for this purpose
> as it has better fairness property than my other proposed patchsets
> for fairness purpose.
>
> With just Aaron's current patchset we may have a lot of forced idle time
> due to the uneven distribution of tasks of different cgroup among the
> threads and cores, even though scheduling fairness is maintained.
> My new patches try to remove those forced idle time by moving the
> tasks around, to minimize cgroup unevenness between sibling threads
> and general load unevenness between the CPUs.

Yes I think this is definitely a good thing to do.

2019-09-14 01:36:04

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/13/19 7:15 AM, Aaron Lu wrote:
> On Thu, Sep 12, 2019 at 10:29:13AM -0700, Tim Chen wrote:

>
>> The better thing to do is to move one task from cgroupA to another core,
>> that has only one cgroupA task so it can be paired up
>> with that lonely cgroupA task. This will eliminate the forced idle time
>> for cgropuA both on current core and also the migrated core.
>
> I'm not sure if this is always possible.

During update_sg_lb_stats, we can scan for opportunities where pulling a task
from a source cpu in the sched group to the target dest cpu can reduce the forced idle imbalance.
And we also prevent task migrations that increase forced idle imbalance.

With those policies in place, we may not achieve perfect balance, but at least
we will load balance in the right direction to lower forced idle imbalance.

>
> Say on a 16cores/32threads machine, there are 3 cgroups, each has 16 cpu
> intensive tasks, will it be possible to make things perfectly balanced?
>
> Don't get me wrong, I think this kind of load balancing is good and
> needed, but I'm not sure if we can always make things perfectly
> balanced. And if not, do we care about those few cores where cgroup tasks are
> not balanced, and then, do we need to implement the core_wide cgroup
> fairness functionality, or do we not care since those cores are supposed
> to be few and it isn't a big deal?

Yes, we still need core wide fairness for tasks. Load balancing is to
move tasks around so we will have less imbalance of cgroup tasks in a core
that results in forced idle time. Once they are in place, we still need
to maintain fairness in a core.

Tim

2019-09-15 16:54:14

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, Sep 13, 2019 at 07:12:52AM +0800, Aubrey Li wrote:
> On Thu, Sep 12, 2019 at 8:04 PM Aaron Lu <[email protected]> wrote:
> >
> > On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> > > On 9/11/19 7:02 AM, Aaron Lu wrote:
> > > I think Julien's result show that my patches did not do as well as
> > > your patches for fairness. Aubrey did some other testing with the same
> > > conclusion. So I think keeping the forced idle time balanced is not
> > > enough for maintaining fairness.
> >
> > Well, I have done following tests:
> > 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> > 2 start two tagged will-it-scale/page_fault1, see how each performs;
> > 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
> >
> > They all show your patchset performs equally well...And consider what
> > the patch does, I think they are really doing the same thing in
> > different ways.
>
> It looks like we are not on the same page, if you don't mind, can both of
> you rebase your patchset onto v5.3-rc8 and provide a public branch so I
> can fetch and test it at least by my benchmark?

I'm using the following branch as base which is v5.1.5 based:
https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-test

And I have pushed Tim's branch to:
https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim

Mine:
https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime

The two branches both have two patches I have sent previously:
https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/
Although it has some potential performance loss as pointed out by
Vineeth, I haven't got time to rework it yet.

2019-09-18 07:26:48

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu <[email protected]> wrote:
>
> On Fri, Sep 13, 2019 at 07:12:52AM +0800, Aubrey Li wrote:
> > On Thu, Sep 12, 2019 at 8:04 PM Aaron Lu <[email protected]> wrote:
> > >
> > > On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> > > > On 9/11/19 7:02 AM, Aaron Lu wrote:
> > > > I think Julien's result show that my patches did not do as well as
> > > > your patches for fairness. Aubrey did some other testing with the same
> > > > conclusion. So I think keeping the forced idle time balanced is not
> > > > enough for maintaining fairness.
> > >
> > > Well, I have done following tests:
> > > 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> > > 2 start two tagged will-it-scale/page_fault1, see how each performs;
> > > 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
> > >
> > > They all show your patchset performs equally well...And consider what
> > > the patch does, I think they are really doing the same thing in
> > > different ways.
> >
> > It looks like we are not on the same page, if you don't mind, can both of
> > you rebase your patchset onto v5.3-rc8 and provide a public branch so I
> > can fetch and test it at least by my benchmark?
>
> I'm using the following branch as base which is v5.1.5 based:
> https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-test
>
> And I have pushed Tim's branch to:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
>
> Mine:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
>
> The two branches both have two patches I have sent previously:
> https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/
> Although it has some potential performance loss as pointed out by
> Vineeth, I haven't got time to rework it yet.

In terms of these two branches, we tested two cases:

1) 32 AVX threads and 32 mysql threads on one core(2 HT)
2) 192 AVX threads and 192 mysql threads on 96 cores(192 HTs)

For case 1), we saw the two branches are on par

Branch: coresched-v3-v5.1.5-test-core_vruntime
- Avg throughput: 1865.62 (std: 20.6%)
- Avg latency: 26.43 (std: 8.3%)

Branch: coresched-v3-v5.1.5-test-tim
- Avg throughput: 1804.88 (std: 20.1%)
- Avg latency: 29.78 (std: 11.8%)

For case 2), we saw core vruntime performs better than counting forced
idle time

Branch: coresched-v3-v5.1.5-test-core_vruntime
- Avg throughput: 5528.56 (std: 44.2%)
- Avg latency: 165.99 (std: 45.2%)

Branch: coresched-v3-v5.1.5-test-tim
- Avg throughput: 3842.33 (std: 35.1%)
- Avg latency: 306.99 (std: 72.9%)

As Aaron pointed out, vruntime is scaled by the se's weight, which could be a reason
for the difference.

So should we go with core vruntime approach?
Or Tim - do you want to improve forced idle time approach?

Thanks,
-Aubrey

2019-09-18 22:59:50

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/17/19 6:33 PM, Aubrey Li wrote:
> On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu <[email protected]> wrote:

>>
>> And I have pushed Tim's branch to:
>> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
>>
>> Mine:
>> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime


Aubrey,

Thanks for testing with your set up.

I think the test that's of interest is to see my load balancing added on top
of Aaron's fairness patch, instead of using my previous version of
forced idle approach in coresched-v3-v5.1.5-test-tim branch.

I've added my two load balance patches on top of Aaron's patches
in coresched-v3-v5.1.5-test-core_vruntime branch and put it in

https://github.com/pdxChen/gang/tree/coresched-v3-v5.1.5-test-core_vruntime-lb

>
> As Aaron pointed out, vruntime is scaled by the se's weight, which could be a reason
> for the difference.
>
> So should we go with core vruntime approach?
> Or Tim - do you want to improve forced idle time approach?
>

I hope to improve the forced idle time later. But for now let's see if
additional load balance logic can help remove cgroup mismatch
and improve performance, on top of Aaron's fairness patches.

Thanks.

Tim


2019-09-18 23:14:01

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/10/19 7:27 AM, Julien Desfossez wrote:
> On 29-Aug-2019 04:38:21 PM, Peter Zijlstra wrote:
>> On Thu, Aug 29, 2019 at 10:30:51AM -0400, Phil Auld wrote:
>>> I think, though, that you were basically agreeing with me that the current
>>> core scheduler does not close the holes, or am I reading that wrong.
>>
>> Agreed; the missing bits for L1TF are ugly but doable (I've actually
>> done them before, Tim has that _somewhere_), but I've not seen a
>> 'workable' solution for MDS yet.
>

The L1TF problem is a much bigger one for HT than MDS. It is relatively
easy for a Rogue VM to sniff L1 cached memory locations. While for MDS,
it is quite difficult for the attacker to associate data in the cpu buffers
with specific memory to make the sniffed data useful.

Even if we don't have a complete solution yet for MDS HT vulnerability,
it is worthwhile to plug the L1TF hole for HT first with core scheduler,
as L1TF is much more exploitable.

Tim

> Following the discussion we had yesterday at LPC, after we have agreed
> on a solution for fixing the current fairness issue, we will post the
> v4. We will then work on prototyping the other synchronisation points
> (syscalls, interrupts and VMEXIT) to evaluate the overhead in various
> use-cases.
>
> Depending on the use-case, we know the performance overhead may be
> heavier than just disabling SMT, but the benchmarks we have seen so far
> indicate that there are valid cases for core scheduling. Core scheduling
> will continue to be unused by default, but with it, we will have the
> option to tune the system to be both secure and faster than disabling
> SMT for those cases.
>
> Thanks,
>
> Julien
>
> P.S: I think the branch that contains the VMEXIT handling is here
> https://github.com/pdxChen/gang/commits/sched_1.23-base
>

2019-09-18 23:25:10

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/4/19 6:44 PM, Julien Desfossez wrote:

> +
> +static void coresched_idle_worker_fini(struct rq *rq)
> +{
> + if (rq->core_idle_task) {
> + kthread_stop(rq->core_idle_task);
> + rq->core_idle_task = NULL;
> + }

During testing, I have found rq->core_idle_task being accessed as a
NULL pointer from other cpus (other than the cpu executing the stop_machine
function) when you toggle cpu.tag of the cgroup.
Doing locking here is tricky because the rq lock is being
transitioned from the core lock to the per-runqueue lock. As a fix,
I made coresched_idle_worker_fini a no-op, so that it does not
NULL out core_idle_task.
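
For reference, a minimal toy sketch of that workaround (not the actual kernel
code; the struct below is only a stand-in): the fini step becomes a no-op so a
concurrent reader on another cpu never sees a NULL core_idle_task.

#include <stddef.h>

struct toy_rq {
	void *core_idle_task;	/* stand-in for the idle worker kthread pointer */
};

static void coresched_idle_worker_fini(struct toy_rq *rq)
{
	/* intentionally empty: do not stop the worker or clear the pointer here */
	(void)rq;
}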

Tim

2019-09-18 23:42:36

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Sep 19, 2019 at 4:41 AM Tim Chen <[email protected]> wrote:
>
> On 9/17/19 6:33 PM, Aubrey Li wrote:
> > On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu <[email protected]> wrote:
>
> >>
> >> And I have pushed Tim's branch to:
> >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >>
> >> Mine:
> >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
>
>
> Aubrey,
>
> Thanks for testing with your set up.
>
> I think the test that's of interest is to see my load balancing added on top
> of Aaron's fairness patch, instead of using my previous version of
> forced idle approach in coresched-v3-v5.1.5-test-tim branch.
>

I'm trying to figure out a way to solve fairness only (not including task
placement).
So @Vineeth - if everyone is okay with Aaron's fairness patch, maybe
we should have a v4?

Thanks,
-Aubrey

2019-09-26 09:07:44

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Sat, Sep 7, 2019 at 2:30 AM Tim Chen <[email protected]> wrote:
> +static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
> + int src_sibling, int dst_sibling,
> + struct task_group *tg, u64 task_load)
> +{
> + struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
> + s64 excess, deficit, old_mismatch, new_mismatch;
> +
> + if (src_cpu == dst_cpu)
> + return -1;
> +
> + /* XXX SMT4 will require additional logic */
> +
> + se = tg->se[src_cpu];
> + se_sibling = tg->se[src_sibling];
> +
> + excess = se->avg.load_avg - se_sibling->avg.load_avg;
> + if (src_sibling == dst_cpu) {
> + old_mismatch = abs(excess);
> + new_mismatch = abs(excess - 2*task_load);
> + return old_mismatch - new_mismatch;
> + }
> +
> + dst_se = tg->se[dst_cpu];
> + dst_se_sibling = tg->se[dst_sibling];
> + deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
> +
> + old_mismatch = abs(excess) + abs(deficit);
> + new_mismatch = abs(excess - (s64) task_load) +
> + abs(deficit + (s64) task_load);

If I understood correctly, these formulas assume that the task
being moved to the destination matches the destination's core cookie. So if
the task does not match the dst's core cookie and still has to stay
in the runqueue, then the formula is no longer correct.

> /**
> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -8345,7 +8492,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> else
> load = source_load(i, load_idx);
>
> - sgs->group_load += load;

Why is this load update line removed?

> + core_sched_imbalance_scan(sgs, i, env->dst_cpu);
> +
> sgs->group_util += cpu_util(i);
> sgs->sum_nr_running += rq->cfs.h_nr_running;
>

Thanks,
-Aubrey

2019-09-26 09:48:54

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 9/24/19 7:40 PM, Aubrey Li wrote:
> On Sat, Sep 7, 2019 at 2:30 AM Tim Chen <[email protected]> wrote:
>> +static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
>> + int src_sibling, int dst_sibling,
>> + struct task_group *tg, u64 task_load)
>> +{
>> + struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
>> + s64 excess, deficit, old_mismatch, new_mismatch;
>> +
>> + if (src_cpu == dst_cpu)
>> + return -1;
>> +
>> + /* XXX SMT4 will require additional logic */
>> +
>> + se = tg->se[src_cpu];
>> + se_sibling = tg->se[src_sibling];
>> +
>> + excess = se->avg.load_avg - se_sibling->avg.load_avg;
>> + if (src_sibling == dst_cpu) {
>> + old_mismatch = abs(excess);
>> + new_mismatch = abs(excess - 2*task_load);
>> + return old_mismatch - new_mismatch;
>> + }
>> +
>> + dst_se = tg->se[dst_cpu];
>> + dst_se_sibling = tg->se[dst_sibling];
>> + deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
>> +
>> + old_mismatch = abs(excess) + abs(deficit);
>> + new_mismatch = abs(excess - (s64) task_load) +
>> + abs(deficit + (s64) task_load);
>
> If I understood correctly, these formulas made an assumption that the task
> being moved to the destination is matched the destination's core cookie.

That's not the case. We do not need to match the destination's core cookie, as that
may change after context switches. It needs to reduce the load mismatch with
the destination CPU's sibling for that cgroup.
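
As a small worked example of the delta being computed above (illustrative
numbers only, and assuming this branch also returns old_mismatch -
new_mismatch like the first branch shown):

  excess = 100, deficit = -50, task_load = 30 (src_sibling != dst_cpu)
  old_mismatch = |100| + |-50|           = 150
  new_mismatch = |100 - 30| + |-50 + 30| =  90
  delta        = 150 - 90                =  60  -> positive, so the move reduces the mismatch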

> so if
> the task is not matched with dst's core cookie and still have to stay
> in the runqueue
> then the formula becomes not correct.
>
>> /**
>> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>> * @env: The load balancing environment.
>> @@ -8345,7 +8492,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> else
>> load = source_load(i, load_idx);
>>
>> - sgs->group_load += load;
>
> Why is this load update line removed?

This was removed accidentally. Should be restored.

>
>> + core_sched_imbalance_scan(sgs, i, env->dst_cpu);
>> +
>> sgs->group_util += cpu_util(i);
>> sgs->sum_nr_running += rq->cfs.h_nr_running;
>>
>


Thanks.

Tim

2019-09-26 10:05:03

by Aubrey Li

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Sep 26, 2019 at 1:24 AM Tim Chen <[email protected]> wrote:
>
> On 9/24/19 7:40 PM, Aubrey Li wrote:
> > On Sat, Sep 7, 2019 at 2:30 AM Tim Chen <[email protected]> wrote:
> >> +static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
> >> + int src_sibling, int dst_sibling,
> >> + struct task_group *tg, u64 task_load)
> >> +{
> >> + struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
> >> + s64 excess, deficit, old_mismatch, new_mismatch;
> >> +
> >> + if (src_cpu == dst_cpu)
> >> + return -1;
> >> +
> >> + /* XXX SMT4 will require additional logic */
> >> +
> >> + se = tg->se[src_cpu];
> >> + se_sibling = tg->se[src_sibling];
> >> +
> >> + excess = se->avg.load_avg - se_sibling->avg.load_avg;
> >> + if (src_sibling == dst_cpu) {
> >> + old_mismatch = abs(excess);
> >> + new_mismatch = abs(excess - 2*task_load);
> >> + return old_mismatch - new_mismatch;
> >> + }
> >> +
> >> + dst_se = tg->se[dst_cpu];
> >> + dst_se_sibling = tg->se[dst_sibling];
> >> + deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
> >> +
> >> + old_mismatch = abs(excess) + abs(deficit);
> >> + new_mismatch = abs(excess - (s64) task_load) +
> >> + abs(deficit + (s64) task_load);
> >
> > If I understood correctly, these formulas made an assumption that the task
> > being moved to the destination is matched the destination's core cookie.
>
> That's not the case. We do not need to match the destination's core cookie,

I actually meant destination core's core cookie.

> as that may change after context switches. It needs to reduce the load mismatch
> with the destination CPU's sibling for that cgroup.

So the new_mismatch calculation is not always correct, especially when there
are more cgroups and more core cookies on the system.

Thanks,
-Aubrey

2019-09-30 11:54:34

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Sep 12, 2019 at 8:35 AM Aaron Lu <[email protected]> wrote:
> >
> > I think comparing parent's runtime also will have issues once
> > the task group has a lot more threads with different running
> > patterns. One example is a task group with lot of active threads
> > and a thread with fairly less activity. So when this less active
> > thread is competing with a thread in another group, there is a
> > chance that it loses continuously for a while until the other
> > group catches up on its vruntime.
>
> I actually think this is expected behaviour.
>
> Without core scheduling, when deciding which task to run, we will first
> decide which "se" to run from the CPU's root level cfs runqueue and then
> go downwards. Let's call the chosen se on the root level cfs runqueue
> the winner se. Then with core scheduling, we will also need compare the
> two winner "se"s of each hyperthread and choose the core wide winner "se".
>
Sorry, I misunderstood the fix and I did not initially see the core wide
min_vruntime that you tried to maintain in the rq->core. This approach
seems reasonable. I think we can fix the potential starvation that you
mentioned in the comment by adjusting for the difference in all the children
cfs_rqs when we set the min_vruntime in rq->core. Since we take the lock for
both the queues, it should be doable and I am trying to see how we can best
do that.

> >
> > As discussed during LPC, probably start thinking along the lines
> > of global vruntime or core wide vruntime to fix the vruntime
> > comparison issue?
>
> core wide vruntime makes sense when there are multiple tasks of
> different cgroups queued on the same core. e.g. when there are two
> tasks of cgroupA and one task of cgroupB are queued on the same core,
> assume cgroupA's one task is on one hyperthread and its other task is on
> the other hyperthread with cgroupB's task. With my current
> implementation or Tim's, cgroupA will get more time than cgroupB. If we
> maintain core wide vruntime for cgroupA and cgroupB, we should be able
> to maintain fairness between cgroups on this core. Tim propose to solve
> this problem by doing some kind of load balancing if I'm not mistaken, I
> haven't taken a look at this yet.
I think your fix comes close to maintaining a core wide vruntime, as you
now have a single min_vruntime to compare across the siblings in the core.
To make the fix complete, we might need to adjust the whole tree's
min_vruntime, and I think it's doable.

Thanks,
Vineeth

2019-09-30 14:37:14

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Sep 18, 2019 at 6:16 PM Aubrey Li <[email protected]> wrote:
>
> On Thu, Sep 19, 2019 at 4:41 AM Tim Chen <[email protected]> wrote:
> >
> > On 9/17/19 6:33 PM, Aubrey Li wrote:
> > > On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu <[email protected]> wrote:
> >
> > >>
> > >> And I have pushed Tim's branch to:
> > >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > >>
> > >> Mine:
> > >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
> >
> >
> > Aubrey,
> >
> > Thanks for testing with your set up.
> >
> > I think the test that's of interest is to see my load balancing added on top
> > of Aaron's fairness patch, instead of using my previous version of
> > forced idle approach in coresched-v3-v5.1.5-test-tim branch.
> >
>
> I'm trying to figure out a way to solve fairness only(not include task
> placement),
> So @Vineeth - if everyone is okay with Aaron's fairness patch, maybe
> we should have a v4?
>
Yes, I think we can move to v4 with Aaron's fairness fix and potentially
Tim's load balancing fixes. I am working on some improvements to Aaron's
fixes and shall post the changes after some testing. Basically, what I am
trying to do is to propagate the min_vruntime change down to all the cfs_rqs
and individual se's when we update cfs_rq(rq->core)->min_vruntime. So,
we can make sure that the rqs stay in sync and starvation does not happen.

If everything goes well, we shall also post the v4 towards the end of this
week. We would be testing Tim's load balancing patches in an
over-committed VM scenario to observe the effect of the fix.

Thanks,
Vineeth

2019-09-30 15:24:49

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> I've made an attempt in the following two patches to address
> the load balancing of mismatched load between the siblings.
>
> It is applied on top of Aaron's patches:
> - sched: Fix incorrect rq tagged as forced idle
> - wrapper for cfs_rq->min_vruntime
> https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
> - core vruntime comparison
> https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
>
> I will love Julien, Aaron and others to try it out. Suggestions
> to tune it is welcomed.

Just letting you know that I will be testing your load balancing patches
this week along with the changes Vineeth is currently doing. I didn't
test them before because I was focused on single-threaded and pinned
micro-benchmarks, but I am back on scaling tests now, so it will be
interesting to see.

Thanks,

Julien

2019-10-02 21:54:36

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Mon, Sep 30, 2019 at 7:53 AM Vineeth Remanan Pillai
<[email protected]> wrote:
>
> >
> Sorry, I misunderstood the fix and I did not initially see the core wide
> min_vruntime that you tried to maintain in the rq->core. This approach
> seems reasonable. I think we can fix the potential starvation that you
> mentioned in the comment by adjusting for the difference in all the children
> cfs_rq when we set the minvruntime in rq->core. Since we take the lock for
> both the queues, it should be doable and I am trying to see how we can best
> do that.
>
Attaching herewith the 2 patches I was working on in preparation for v4.

Patch 1 is an improvement of patch 2 of Aaron where I am propagating the
vruntime changes to the whole tree.
Patch 2 is an improvement for patch 3 of Aaron where we do resched_curr
only when the sibling is forced idle.

Micro benchmarks seem good. Will be doing a larger set of tests and hopefully
posting v4 by the end of the week. Please let me know what you think of these
patches (patch 1 is on top of Aaron's patch 2, patch 2 replaces Aaron's patch 3).

Thanks,
Vineeth

[PATCH 1/2] sched/fair: propagate the min_vruntime change to the whole rq tree

When we adjust the min_vruntime of rq->core, we need to propagate
that down the tree so as to not cause starvation of existing tasks
based on previous vruntime.
---
kernel/sched/fair.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59cb01a1563b..e8dd78a8c54d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -476,6 +476,23 @@ static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
return cfs_rq->min_vruntime;
}

+static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
+{
+ struct sched_entity *se, *next;
+
+ if (!cfs_rq)
+ return;
+
+ cfs_rq->min_vruntime -= delta;
+ rbtree_postorder_for_each_entry_safe(se, next,
+ &cfs_rq->tasks_timeline.rb_root, run_node) {
+ if (se->vruntime > delta)
+ se->vruntime -= delta;
+ if (se->my_q)
+ coresched_adjust_vruntime(se->my_q, delta);
+ }
+}
+
static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
{
struct cfs_rq *cfs_rq_core;
@@ -487,8 +504,11 @@ static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
return;

cfs_rq_core = core_cfs_rq(cfs_rq);
- cfs_rq_core->min_vruntime = max(cfs_rq_core->min_vruntime,
- cfs_rq->min_vruntime);
+ if (cfs_rq_core != cfs_rq &&
+ cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
+ u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
+ coresched_adjust_vruntime(cfs_rq_core, delta);
+ }
}

bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
--
2.17.1

[PATCH 2/2] sched/fair: Wake up forced idle siblings if needed

If a cpu has only one task and if it has used up its timeslice,
then we should try to wake up the sibling to give the forced idle
thread a chance.
We do that by triggering schedule which will IPI the sibling if
the task in the sibling wins the priority check.
---
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8dd78a8c54d..ba4d929abae6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4165,6 +4165,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_min_vruntime(cfs_rq);
}

+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+ return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+ sched_slice(cfs_rq_of(se), se);
+}
+
/*
* Preempt the current task with a newly woken task if needed:
*/
@@ -10052,6 +10059,39 @@ static void rq_offline_fair(struct rq *rq)

#endif /* CONFIG_SMP */

+#ifdef CONFIG_SCHED_CORE
+/*
+ * If runqueue has only one task which used up its slice and
+ * if the sibling is forced idle, then trigger schedule
+ * to give forced idle task a chance.
+ */
+static void resched_forceidle(struct rq *rq, struct sched_entity *se)
+{
+ int cpu = cpu_of(rq), sibling_cpu;
+ if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+ return;
+
+ for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+ struct rq *sibling_rq;
+ if (sibling_cpu == cpu)
+ continue;
+ if (cpu_is_offline(sibling_cpu))
+ continue;
+
+ sibling_rq = cpu_rq(sibling_cpu);
+ if (sibling_rq->core_forceidle) {
+ resched_curr(rq);
+ break;
+ }
+ }
+}
+#else
+static inline void resched_forceidle(struct rq *rq, struct sched_entity *se)
+{
+}
+#endif
+
+
/*
* scheduler tick hitting a task of our scheduling class.
*
@@ -10075,6 +10115,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+
+ if (sched_core_enabled(rq))
+ resched_forceidle(rq, &curr->se);
}

/*
--
2.17.1

2019-10-10 13:55:24

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Wed, Oct 02, 2019 at 04:48:14PM -0400, Vineeth Remanan Pillai wrote:
> On Mon, Sep 30, 2019 at 7:53 AM Vineeth Remanan Pillai
> <[email protected]> wrote:
> >
> > >
> > Sorry, I misunderstood the fix and I did not initially see the core wide
> > min_vruntime that you tried to maintain in the rq->core. This approach
> > seems reasonable. I think we can fix the potential starvation that you
> > mentioned in the comment by adjusting for the difference in all the children
> > cfs_rq when we set the minvruntime in rq->core. Since we take the lock for
> > both the queues, it should be doable and I am trying to see how we can best
> > do that.
> >
> Attaching here with, the 2 patches I was working on in preparation of v4.
>
> Patch 1 is an improvement of patch 2 of Aaron where I am propagating the
> vruntime changes to the whole tree.

I didn't see why we need to do this.

We only need to have the root level sched entities' vruntime become core
wide since we will compare vruntime for them across hyperthreads. For
sched entities on sub cfs_rqs, we never (at least, not now) compare their
vruntime outside their cfs_rqs.

Thanks,
Aaron

> Patch 2 is an improvement for patch 3 of Aaron where we do resched_curr
> only when the sibling is forced idle.

2019-10-10 14:31:22

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> I didn't see why we need do this.
>
> We only need to have the root level sched entities' vruntime become core
> wide since we will compare vruntime for them across hyperthreads. For
> sched entities on sub cfs_rqs, we never(at least, not now) compare their
> vruntime outside their cfs_rqs.
>
The reason we need to do this is that new tasks that get created will
have a vruntime based on the new min_vruntime while old tasks will have
theirs based on the old min_vruntime, and this can cause starvation
depending on how you set the min_vruntime. With this new patch, we normalize
the whole tree so that new tasks and old tasks compare with the same
min_vruntime.

Thanks,
Vineeth

2019-10-11 07:34:35

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Thu, Oct 10, 2019 at 10:29:47AM -0400, Vineeth Remanan Pillai wrote:
> > I didn't see why we need do this.
> >
> > We only need to have the root level sched entities' vruntime become core
> > wide since we will compare vruntime for them across hyperthreads. For
> > sched entities on sub cfs_rqs, we never(at least, not now) compare their
> > vruntime outside their cfs_rqs.
> >
> The reason we need to do this is because, new tasks that gets created will
> have a vruntime based on the new min_vruntime and old tasks will have it
> based on the old min_vruntime

I think this is expected behaviour.

> and it can cause starvation based on how
> you set the min_vruntime.

Care to elaborate the starvation problem?

> With this new patch, we normalize the whole
> tree so that new tasks and old tasks compare with the same min_vruntime.

Again, what's the point of normalizing sched entities' vruntime in
sub-cfs_rqs? Their vruntime comparisons only happen inside their own
cfs_rq; we don't do cross-CPU vruntime comparison for them.

2019-10-11 11:33:39

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> > The reason we need to do this is because, new tasks that gets created will
> > have a vruntime based on the new min_vruntime and old tasks will have it
> > based on the old min_vruntime
>
> I think this is expected behaviour.
>
I don't think this is the expected behaviour. If we hadn't changed the root
cfs->min_vruntime for the core rq, then it would have been the expected
behaviour. But now, we are updating the core rq's root cfs min_vruntime
without propagating the change down the tree. To explain, consider
this example based on your patch. Let cpu 1 and 2 be siblings, and let rq(cpu1)
be the core rq. Let rq1->cfs->min_vruntime=1000 and rq2->cfs->min_vruntime=2000.
So in update_core_cfs_rq_min_vruntime, you update rq1->cfs->min_vruntime
to 2000 because that is the max. New tasks enqueued on rq1 then start with
a vruntime of 2000 while the tasks already in that runqueue are still based
on the old min_vruntime (1000). So the new tasks get enqueued somewhere to
the right of the tree and have to wait until the already existing tasks catch
up their vruntime to 2000. This is what I meant by starvation. This happens
every time we update the core rq's cfs->min_vruntime. Hope this clarifies.
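
To put the same example as a tiny runnable toy (not kernel code; it only
mimics how place_entity() seeds a new entity from the runqueue's
min_vruntime):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t old_task_vruntime = 1000;	/* queued before min_vruntime was bumped */
	uint64_t rq1_min_vruntime  = 2000;	/* after taking the max of the siblings  */
	uint64_t new_task_vruntime = rq1_min_vruntime;	/* new task seeded from min_vruntime */

	printf("new task starts ~%llu vruntime units to the right of the old ones\n",
	       (unsigned long long)(new_task_vruntime - old_task_vruntime));
	return 0;
}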

> > and it can cause starvation based on how
> > you set the min_vruntime.
>
> Care to elaborate the starvation problem?

Explained above.

> Again, what's the point of normalizing sched entities' vruntime in
> sub-cfs_rqs? Their vruntime comparisons only happen inside their own
> cfs_rq, we don't do cross CPU vruntime comparison for them.

As I mentioned above, this is to avoid the starvation case. Even though we are
not doing cross cfs_rq comparison, the whole tree's vruntime is based on the
root cfs->min_vruntime and we will have an imbalance if we change the root
cfs->min_vruntime without updating down the tree.

Thanks,
Vineeth

2019-10-11 12:05:57

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, Oct 11, 2019 at 07:32:48AM -0400, Vineeth Remanan Pillai wrote:
> > > The reason we need to do this is because, new tasks that gets created will
> > > have a vruntime based on the new min_vruntime and old tasks will have it
> > > based on the old min_vruntime
> >
> > I think this is expected behaviour.
> >
> I don't think this is the expected behavior. If we hadn't changed the root
> cfs->min_vruntime for the core rq, then it would have been the expected
> behaviour. But now, we are updating the core rq's root cfs, min_vruntime
> without changing the the vruntime down to the tree. To explain, consider
> this example based on your patch. Let cpu 1 and 2 be siblings. And let rq(cpu1)
> be the core rq. Let rq1->cfs->min_vruntime=1000 and rq2->cfs->min_vruntime=2000.
> So in update_core_cfs_rq_min_vruntime, you update rq1->cfs->min_vruntime
> to 2000 because that is the max. So new tasks enqueued on rq1 starts with
> vruntime of 2000 while the tasks in that runqueue are still based on the old
> min_vruntime(1000). So the new tasks gets enqueued some where to the right
> of the tree and has to wait until already existing tasks catch up the
> vruntime to
> 2000. This is what I meant by starvation. This happens always when we update
> the core rq's cfs->min_vruntime. Hope this clarifies.

Thanks for the clarification.

Yes, this is the initialization issue I mentioned before when core
scheduling is initially enabled. rq1's vruntime is bumped the first time
update_core_cfs_rq_min_vruntime() is called and if there are already
some tasks queued, new tasks queued on rq1 will be starved to some extent.

Agree that this needs a fix. But we shouldn't need to do this afterwards.

So do I understand correctly that patch1 is meant to solve the
initialization issue?

2019-10-11 12:12:15

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

> Thanks for the clarification.
>
> Yes, this is the initialization issue I mentioned before when core
> scheduling is initially enabled. rq1's vruntime is bumped the first time
> update_core_cfs_rq_min_vruntime() is called and if there are already
> some tasks queued, new tasks queued on rq1 will be starved to some extent.
>
> Agree that this needs fix. But we shouldn't need do this afterwards.
>
> So do I understand correctly that patch1 is meant to solve the
> initialization issue?

I think we need this update logic even after initialization. I mean, the core
runqueue's min_vruntime can get updated every time it changes with respect to
the sibling's min_vruntime. So, whenever this update happens, we would need to
propagate the changes down the tree, right? Please let me know if I am
visualizing it wrong.

Thanks,
Vineeth

2019-10-12 04:12:32

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, Oct 11, 2019 at 08:10:30AM -0400, Vineeth Remanan Pillai wrote:
> > Thanks for the clarification.
> >
> > Yes, this is the initialization issue I mentioned before when core
> > scheduling is initially enabled. rq1's vruntime is bumped the first time
> > update_core_cfs_rq_min_vruntime() is called and if there are already
> > some tasks queued, new tasks queued on rq1 will be starved to some extent.
> >
> > Agree that this needs fix. But we shouldn't need do this afterwards.
> >
> > So do I understand correctly that patch1 is meant to solve the
> > initialization issue?
>
> I think we need this update logic even after initialization. I mean, core
> runqueue's min_vruntime can get updated every time when the core
> runqueue's min_vruntime changes with respect to the sibling's min_vruntime.
> So, whenever this update happens, we would need to propagate the changes
> down the tree right? Please let me know if I am visualizing it wrong.

I don't think we need to do the normalization afterwards and it appears
we are on the same page regarding core wide vruntime.

The intent of my patch is to treat all the root level sched entities of
the two siblings as if they are in a single cfs_rq of the core. With a
core wide min_vruntime, the core scheduler can decide which sched entity
to run next. And the individual sched entity's vruntime shouldn't be
changed based on the change of the core wide min_vruntime, or fairness can
be hurt (if we add or reduce the vruntime of a sched entity, its credit will
change).

The weird thing about my patch is, the min_vruntime is often increased;
it doesn't point to the smallest value as in a traditional cfs_rq. This
probably can be changed to follow the tradition; I don't quite remember
why I did this and will need to check this some time later.

All those sub cfs_rq sched entities are not interesting, because once
we have decided which sched entity in the root level cfs_rq should run next,
we can then pick the final next task from there (using the usual way). In
other words, to make the scheduler choose the correct candidate for the core,
we only need to worry about sched entities on both CPUs' root level cfs_rqs.
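
To illustrate (a minimal standalone sketch, not the posted patch): with a
single core wide min_vruntime as the common reference, picking between the
two siblings' root level entities reduces to the usual wrap-safe signed
compare, while entities inside sub cfs_rqs keep being compared only within
their own cfs_rq.

#include <stdint.h>
#include <stdbool.h>

/* same wrap-safe comparison trick as the kernel's entity_before() */
static bool root_se_before(uint64_t vruntime_a, uint64_t vruntime_b)
{
	return (int64_t)(vruntime_a - vruntime_b) < 0;
}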

Does this make sense?

2019-10-13 12:45:46

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Fri, Oct 11, 2019 at 11:55 PM Aaron Lu <[email protected]> wrote:

>
> I don't think we need do the normalization afterwrads and it appears
> we are on the same page regarding core wide vruntime.
>
> The intent of my patch is to treat all the root level sched entities of
> the two siblings as if they are in a single cfs_rq of the core. With a
> core wide min_vruntime, the core scheduler can decide which sched entity
> to run next. And the individual sched entity's vruntime shouldn't be
> changed based on the change of core wide min_vruntime, or faireness can
> hurt(if we add or reduce vruntime of a sched entity, its credit will
> change).
>
Ok, I think I get it now. I see that your first patch actually wraps all the
places where min_vruntime is accessed. So yes, the tree vruntime update is
needed only one time. From then on, since we use the wrapper
cfs_rq_min_vruntime(), both runqueues would self-adjust based on the core
wide min_vruntime. Also, by virtue of min_vruntime staying min from there on,
the tree update logic will not be called more than once. So I think the
changes are safe. I will do some profiling to make sure that it is actually
called only once.

> The weird thing about my patch is, the min_vruntime is often increased,
> it doesn't point to the smallest value as in a traditional cfs_rq. This
> probabaly can be changed to follow the tradition, I don't quite remember
> why I did this, will need to check this some time later.

Yeah, I noticed this. In my patch, I had already accounted for this and changed
to min() instead of max(), which seems more logical since min_vruntime should
be the minimum of both run queues.

> All those sub cfs_rq's sched entities are not interesting. Because once
> we decided which sched entity in the root level cfs_rq should run next,
> we can then pick the final next task from there(using the usual way). In
> other words, to make scheduler choose the correct candidate for the core,
> we only need worry about sched entities on both CPU's root level cfs_rqs.
>
Understood. The only reason I did the normalization is to always keep both
runqueues under one min_vruntime. And as long as we use cfs_rq_min_vruntime()
from then on, we wouldn't be calling the balancing logic any more.

> Does this make sense?

Sure, thanks for the clarification.

Thanks,
Vineeth

2019-10-14 10:01:03

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Sun, Oct 13, 2019 at 08:44:32AM -0400, Vineeth Remanan Pillai wrote:
> On Fri, Oct 11, 2019 at 11:55 PM Aaron Lu <[email protected]> wrote:
>
> >
> > I don't think we need do the normalization afterwrads and it appears
> > we are on the same page regarding core wide vruntime.

Should be "we are not on the same page..."

[...]
> > The weird thing about my patch is, the min_vruntime is often increased,
> > it doesn't point to the smallest value as in a traditional cfs_rq. This
> > probabaly can be changed to follow the tradition, I don't quite remember
> > why I did this, will need to check this some time later.
>
> Yeah, I noticed this. In my patch, I had already accounted for this and changed
> to min() instead of max() which is more logical that min_vruntime should be the
> minimum of both the run queue.

I now remembered why I used max().

Assume rq1's and rq2's min_vruntime are both at 2000 and the core wide
min_vruntime is also 2000. Also assume both runqueues are empty at the
moment. Then task t1 is queued to rq1 and runs for a long time while rq2
stays empty. rq1's min_vruntime will be incremented all the time while
the core wide min_vruntime stays at 2000 if min() is used. Then when
another task gets queued to rq2, it will get a really large unfair boost
by using a much smaller min_vruntime as its base.

To fix this, either max() is used as is done in my patch, or rq2's
min_vruntime is adjusted to be the same as rq1's on each
update_core_cfs_min_vruntime() when rq2 is found empty, and then min() is
used to get the core wide min_vruntime. It doesn't look worth the trouble
to use min().
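
As a toy sketch of that choice (illustrative only, not the posted patch):

#include <stdint.h>

/* core wide value tracked with max(): an rq that stayed empty while its
 * sibling ran does not drag the core value down and hand a newly queued
 * task an unfairly small base */
static uint64_t core_wide_min_vruntime(uint64_t core_min, uint64_t rq_min)
{
	return core_min > rq_min ? core_min : rq_min;
}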

2019-10-21 12:33:42

by Vineeth Remanan Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Mon, Oct 14, 2019 at 5:57 AM Aaron Lu <[email protected]> wrote:
>
> I now remembered why I used max().
>
> Assume rq1 and rq2's min_vruntime are both at 2000 and the core wide
> min_vruntime is also 2000. Also assume both runqueues are empty at the
> moment. Then task t1 is queued to rq1 and runs for a long time while rq2
> keeps empty. rq1's min_vruntime will be incremented all the time while
> the core wide min_vruntime stays at 2000 if min() is used. Then when
> another task gets queued to rq2, it will get really large unfair boost
> by using a much smaller min_vruntime as its base.
>
> To fix this, either max() is used as is done in my patch, or adjust
> rq2's min_vruntime to be the same as rq1's on each
> update_core_cfs_min_vruntime() when rq2 is found empty and then use
> min() to get the core wide min_vruntime. Looks not worth the trouble to
> use min().

Understood. I think this is a special case where one runqueue is empty
and hence the min_vruntime of the core should match the progressing vruntime
of the active runqueue. If we use max as the core wide min_vruntime, I think
we may hit starvation elsewhere. One quick example I can think of is during
force idle. When a sibling is forced idle, and a new task gets enqueued
in the forced idle runqueue, it would inherit the max vruntime and would
starve until the other tasks in the forced idle sibling catch up. While this
might be okay, we are deviating from the concept that all new tasks inherit
the min_vruntime of the cpu (the core, in our case). I have not tested deeply
to see if there are any assumptions which may fail if we use max.

The modified patch actually takes care of syncing the min_vruntime across
the siblings so that the core wide min_vruntime and the per cpu min_vruntime
always stay in sync.

Thanks,
Vineeth

2019-10-29 10:20:27

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> I'm using the following branch as base which is v5.1.5 based:
> https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-
> test
>
> And I have pushed Tim's branch to:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
>
> Mine:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> core_vruntime
>
Hello,

As anticipated, I've been trying to follow the development of this
feature and, in the meantime, I have done some benchmarks.

I actually have a lot of data (and am planning for more), so I am
sending a few emails, each one with a subset of the numbers in it,
instead of just one, which would be beyond giant! :-)

I'll put, in this first one, some background and some common
information, e.g., about the benchmarking platform and configurations,
and on how to read and interpret the data that will follow.

It's quite hard to come up with a concise summary, and sometimes it's
even tricky to identify consolidated trends. There are also things that
look weird and, although I double checked my methodology, I can't
exclude that glitches or errors may have occurred. For each of the
benchmarks, I have at least some information about what the
configuration was when it was run, and also some monitoring and perf
data. So, if interested, just ask and we'll see what we can dig out.

And in any case, I have the procedure for running these benchmarks
fairly decently (although not completely) automated. So if we see
things that look really, really weird, I can rerun (perhaps with a
different configuration, more monitoring, etc).

For each benchmark, I'll "dump" the results, with just some comments
about the things that I find more relevant/interesting. Then, if we
want, we can look at them and analyze them together.
For each experiment, I do have some limited amount of tracing and
debugging information still available, in case it could be useful. And,
as said, I can always rerun.

I can also provide, quite easily, different-looking tables, e.g., a
different set of columns, different baselines, etc. Just say what you
think would be the most interesting to see and, most likely, it will
be possible to do it.

Oh, and I'll upload text files whose contents will be identical to the
emails in this space:

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/

In case tables are rendered better in a browser than in a MUA.

Thanks and Regards,
Dario
---

Code:
1) Linux 5.1.5 (commit id 835365932f0dc25468840753e071c05ad6abc76f)
2) https://github.com/digitalocean/linux-coresched/tree/vpillai/coresched-v3-v5.1.5-test
3) https://github.com/aaronlu/linux/tree/coresched-v3-v5.1.5-test-core_vruntime
4) https://github.com/aaronlu/linux/tree/coresched-v3-v5.1.5-test-tim

Benchmarking suite:
- MMTests: https://github.com/gormanm/mmtests
- Tweaked to deal with running benchmarks in VMs. Still working on
upstreaming that to Mel (a WIP is available here:
https://github.com/dfaggioli/mmtests/tree/bench-virt )

Benchmarking host:
- CPU: 1 socket, 4 cores, 2 threads
- RAM: 32 GB
- distro: openSUSE Tumbleweed
- HW Bugs Mitigations: fully disabled
- Filesys: XFS

VMs:
- vCPUs: either 8 or 4
- distro: openSUSE Tumbleweed
- Kernel: 5.1.16
- HW Bugs Mitigations: fully disabled
- Filesys: XFS

Benchmarks:
- STREAM : pure memory benchmark (various kind of mem-ops done
in parallel). Parallelism is NR_CPUS/2 tasks
- Kernbench : builds a kernel, with varying number of compile
jobs. HT is, in general, known to help, as it lets
us do "more parallel" builds
- Hackbench : communication (via pipes, in this case) between
group of processes. As we deal with _groups_ of
tasks, we're already in saturation with 1 group,
hence we expect HyperThreading disabled
configurations to suffer
- mutilate : load generator for memcached, with high request
rate;
- netperf-unix : two communicating tasks. Without any pinning
(neither at the host nor at the guest level), we
expect HT to play a role. In fact, depending on
where the two tasks are scheduled (i.e., whether on
two threads of the same core, or not) performance may
vary
- sysbenchcpu : the process-based CPU stressing workload of sysbench
- sysbenchthread : the thread-based CPU stressing workload of sysbench
- sysbench : the database workload

This is kind of a legend for the columns you will see in the tables.

- v-* : vanilla, i.e., benchmarks were run on code _without_ any
core-scheduling patch applied (see 1 in 'Code' section above)
- *BM-* : baremetal, i.e., benchmarks were run on the host, without
any VM running or anything
- *VM-* : Virtual Machine, i.e., benchmarks were run inside a VM, with
the following characteristics:
- *VM- : benchmarks were run in a VM with 8 vCPUs. That was the
only VM running in the system
- *VM-v4 : benchmarks were run in a VM with 4 vCPUs. That was the
only VM running in the system
- *VMx2 : benchmark were run in a VM with 8 vCPUs, and there was
another VM running, also with 8 vCPUS, generating CPU,
memory and IO stress load for about 600%
- *-csc-* : benchmarks were run with Core scheduling v3 patch
series (see 2 in 'Code' section above)
- *-csc_stallfix-* : benchmarks were run with Core scheduling v3 and
the 'stallfix' feature enabled
- *-csc_vruntime-* : benchmarks were run with Core scheduling v3 + the
vruntime patches (see 3 in 'Code' section above)
- *-csc_tim-* : benchmarks were run with Core scheduling v3 +
Tim's patches (see 4 in 'Code' section above)
- *-noHT : benchmarks were run with HyperThreading Disabled
- *-HT : benchmarks were run with Hyperthreading enabled

So, for instance, the column BM-noHT shows data from a run done on
baremetal, with HyperThreading disabled. The column v-VM-HT shows data
from a run done in a 8 vCPUs VM, with HyperThreading enabled, and no
core-scheduling patches applied. The column VM-csc_vruntime-HT shows
data from a run done in a 8 vCPUs VM with core-scheduling v3 patches +
the vruntime patches applied. The column VM-v4-HT shows data from a run
done in a 4 vCPUs VM, core-scheduling patches were applied but not used
(the vCPUs of the VM weren't tagged). The column VMx2-csc_vruntime-HT
shows data from a run done in a 8 vCPUs VM, core-scheduling v3 + the
vruntime patches were applied and the vCPUs of the VM tagged, while there was
another (untagged) VM in the system, trying to introduce ~600% load
(CPU, memory and IO, via stress-ng). Etc.

See the 'Appendix' at the bottom of this email for a comprehensive
list of all the combinations (or at least I think it is comprehensive...
I hope I haven't missed any :-) ).

In all tables, percent increase and decrease are always relative to the
first column. Whether lower or higher values are better is already
taken care of.
Basically, when we see -x.yz%, it always means performance is worse
than the baseline, and the absolute value (i.e., x.yz) tells
you by how much.
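
As a quick worked example of how to read these numbers, take the 'MB/sec
copy' row of the overloaded (VMx2) STREAM group further down: the v-VMx2-HT
baseline does 33514.96 MB/sec and VMx2-csc_vruntime-HT does 29143.88 MB/sec,
and (33514.96 - 29143.88) / 33514.96 = 13.04%, which is exactly the -13.04%
shown in that column.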

If, for instance, we want to compare HT and non HT, on baremetal, we
check the BM-HT and BM-noHT columns.
If we want to compare v3 + vruntime patches against no HyperThreading,
when the system is overloaded, we look at VMx2-noHT and VMx2-
csc_vruntime-HT columns and check by how much they deviate from the
baseline (i.e., which one regresses more). For comparing the various
core scheduling solutions, we can check by how much each one is either
better or worse than baseline. And so on...

The most relevant comparisons, IMO, are:
- the various core scheduling solutions against their respective HT
baseline. This, in fact, tells us what people will experience if they
start using core scheduling on these workloads
- the various core scheduling solutions against their respective noHT
baseline. This, in fact, tells us whether or not core scheduling is
effective, for the given workload, or if it would just be better to
disable HyperThreading
- the overhead introduced by the core scheduling patches, when they are
not used (i.e., v-BM-HT against BM-HT, or v-VM-HT against VM-HT). This,
in fact, tells us what happens to *everyone*, including the ones that
do not want core scheduling and will keep it disabled, if we merge it

Note that the overhead, so far, has been evaluated only for the -csc
case, i.e., when patches from point 2 in 'Code' above are applied, but
tasks/vCPUs are not tagged, and hence core scheduling is not really
used,

Anyway, let's get to the point where I give you some data already! :-D
:-D

STREAM
======

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-1_stream.txt

v v BM BM BM BM BM BM
BM-HT BM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
MB/sec copy 33827.50 ( 0.00%) 33654.32 ( -0.51%) 33683.34 ( -0.43%) 33819.30 ( -0.02%) 33830.88 ( 0.01%) 33731.02 ( -0.29%) 33573.76 ( -0.75%) 33292.76 ( -1.58%)
MB/sec scale 22762.02 ( 0.00%) 22524.00 ( -1.05%) 22416.54 ( -1.52%) 22444.16 ( -1.40%) 22652.56 ( -0.48%) 22462.80 ( -1.31%) 22461.90 ( -1.32%) 22670.84 ( -0.40%)
MB/sec add 26141.76 ( 0.00%) 26241.42 ( 0.38%) 26559.40 ( 1.60%) 26365.36 ( 0.86%) 26607.10 ( 1.78%) 26384.50 ( 0.93%) 26117.78 ( -0.09%) 26192.12 ( 0.19%)
MB/sec triad 26522.46 ( 0.00%) 26555.26 ( 0.12%) 26499.62 ( -0.09%) 26373.26 ( -0.56%) 26667.32 ( 0.55%) 26642.70 ( 0.45%) 26505.38 ( -0.06%) 26409.60 ( -0.43%)
v v VM VM VM VM VM VM
VM-HT VM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
MB/sec copy 34559.32 ( 0.00%) 34153.30 ( -1.17%) 34236.64 ( -0.93%) 33724.38 ( -2.42%) 33535.60 ( -2.96%) 33534.10 ( -2.97%) 33469.70 ( -3.15%) 33873.18 ( -1.99%)
MB/sec scale 22556.18 ( 0.00%) 22834.88 ( 1.24%) 22733.12 ( 0.78%) 23010.46 ( 2.01%) 22480.60 ( -0.34%) 22552.94 ( -0.01%) 22756.50 ( 0.89%) 22434.96 ( -0.54%)
MB/sec add 26209.70 ( 0.00%) 26640.08 ( 1.64%) 26692.54 ( 1.84%) 26747.40 ( 2.05%) 26358.20 ( 0.57%) 26353.50 ( 0.55%) 26686.62 ( 1.82%) 26256.50 ( 0.18%)
MB/sec triad 26521.80 ( 0.00%) 26490.26 ( -0.12%) 26598.66 ( 0.29%) 26466.30 ( -0.21%) 26560.48 ( 0.15%) 26496.30 ( -0.10%) 26609.10 ( 0.33%) 26450.68 ( -0.27%)
v v VM VM VM VM VM VM
VM-v4-HT VM-v4-noHT v4-HT v4-noHT v4-csc-HT v4-csc_stallfix-HT v4-csc_tim-HT v4-csc_vruntime-HT
MB/sec copy 32257.48 ( 0.00%) 32504.18 ( 0.76%) 32375.66 ( 0.37%) 32261.98 ( 0.01%) 31940.84 ( -0.98%) 32070.88 ( -0.58%) 31926.80 ( -1.03%) 31882.18 ( -1.16%)
MB/sec scale 19806.46 ( 0.00%) 20281.18 ( 2.40%) 20266.80 ( 2.32%) 20075.46 ( 1.36%) 19847.66 ( 0.21%) 20119.00 ( 1.58%) 19899.84 ( 0.47%) 20060.48 ( 1.28%)
MB/sec add 22178.58 ( 0.00%) 22426.92 ( 1.12%) 22185.54 ( 0.03%) 22153.52 ( -0.11%) 21975.80 ( -0.91%) 22097.72 ( -0.36%) 21827.66 ( -1.58%) 22068.04 ( -0.50%)
MB/sec triad 22149.10 ( 0.00%) 22200.54 ( 0.23%) 22142.10 ( -0.03%) 21933.04 ( -0.98%) 21898.50 ( -1.13%) 22160.64 ( 0.05%) 22003.40 ( -0.66%) 21951.16 ( -0.89%)
v v VMx2 VMx2 VMx2 VMx2 VMx2 VMx2
VMx2-HT VMx2-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
MB/sec copy 33514.96 ( 0.00%) 24740.70 ( -26.18%) 30410.96 ( -9.26%) 22157.24 ( -33.89%) 29552.60 ( -11.82%) 29374.78 ( -12.35%) 28717.38 ( -14.31%) 29143.88 ( -13.04%)
MB/sec scale 22605.74 ( 0.00%) 15473.56 ( -31.55%) 19051.76 ( -15.72%) 15278.64 ( -32.41%) 19246.98 ( -14.86%) 19081.04 ( -15.59%) 18747.60 ( -17.07%) 18776.02 ( -16.94%)
MB/sec add 26249.56 ( 0.00%) 18559.92 ( -29.29%) 21143.90 ( -19.45%) 18664.30 ( -28.90%) 21236.00 ( -19.10%) 21067.40 ( -19.74%) 20878.78 ( -20.46%) 21266.92 ( -18.98%)
MB/sec triad 26290.16 ( 0.00%) 19274.10 ( -26.69%) 20573.62 ( -21.74%) 17631.52 ( -32.93%) 21066.94 ( -19.87%) 20975.04 ( -20.22%) 20944.56 ( -20.33%) 20942.18 ( -20.34%)

So STREAM, at least in this configuration, is not (as might have
been expected) really sensitive to HyperThreading. In fact, in most
cases, both when run on baremetal and in VMs, HT and noHT results are
pretty much the same. When core scheduling is used, things do not
look bad at all to me, although results are, most of the time, only
marginally worse.

Do check, however, the overloaded case. There, disabling HT has quite a
big impact, and core scheduling does a rather good job in restoring
good performance.

From the overhead point of view, the situation does not look too bad
either. In fact, in the first three groups of measurements, the overhead
introduced by having the core scheduling patches in is acceptable (there
are actually cases where they seem to do more good than harm! :-P).
However, when the system is overloaded, despite there not being any
tagged task, numbers look pretty bad. It seems that, for instance, of
the 13.04% performance drop between v-VMx2-HT and VMx2-csc_vruntime-HT,
9.26% comes from overhead (as that's there already in VMx2-HT)!!

Something to investigate better, I guess...


Appendix

* v-BM-HT : no coresched patch applied, baremetal, HyperThreading enabled
* v-BM-noHT : no coresched patch applied, baremetal, Hyperthreading disabled
* v-VM-HT : no coresched patch applied, 8 vCPUs VM, HyperThreading enabled
* v-VM-noHT : no coresched patch applied, 8 vCPUs VM, Hyperthreading disabled
* v-VM-v4-HT : no coresched patch applied, 4 vCPUs VM, HyperThreading enabled
* v-VM-v4-noHT : no coresched patch applied, 4 vCPUs VM, Hyperthreading disabled
* v-VMx2-HT : no coresched patch applied, 8 vCPUs VM + 600% stress overhead, HyperThreading enabled
* v-VMx2-noHT : no coresched patch applied, 8 vCPUs VM + 600% stress overhead, Hyperthreading disabled

* BM-HT : baremetal, HyperThreading enabled
* BM-noHT : baremetal, Hyperthreading disabled
* BM-csc-HT : baremetal, coresched-v3 (Hyperthreading enabled, of course)
* BM-csc_stallfix-HT : baremetal, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* BM-csc_tim-HT : baremetal, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* BM-csc_vruntime-HT : baremetal, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VM-HT : 8 vCPUs VM, HyperThreading enabled
* VM-noHT : 8 vCPUs VM, Hyperthreading disabled
* VM-csc-HT : 8 vCPUs VM, coresched-v3 (Hyperthreading enabled, of course)
* VM-csc_stallfix-HT : 8 vCPUs VM, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VM-csc_tim-HT : 8 vCPUs VM, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VM-csc_vruntime-HT : 8 vCPUs VM, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VM-v4-HT : 4 vCPUs VM, HyperThreading enabled
* VM-v4-noHT : 4 vCPUs VM, Hyperthreading disabled
* VM-v4-csc-HT : 4 vCPUs VM, coresched-v3 (Hyperthreading enabled, of course)
* VM-v4-csc_stallfix-HT : 4 vCPUs VM, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VM-v4-csc_tim-HT : 4 vCPUs VM, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VM-v4-csc_vruntime-HT : 4 vCPUs VM, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VMx2-HT : 8 vCPUs VM + 600% stress overhead, HyperThreading enabled
* VMx2-noHT : 8 vCPUs VM + 600% stress overhead, Hyperthreading disabled
* VMx2-csc-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 (Hyperthreading enabled, of course)
* VMx2-csc_stallfix-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VMx2-csc_tim-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VMx2-csc_vruntime-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-10-29 10:23:59

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> >
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> >
> Hello,
>
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
>
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
>
HACKBENCH-PROCESS-PIPES
=======================

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-2_hackbench.txt

v v BM BM BM BM BM BM
BM-HT BM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 0.4372 ( 0.00%) 0.6714 * -53.57%* 0.4242 ( 2.97%) 0.6654 * -52.20%* 1.1748 *-168.71%* 1.1666 *-166.83%* 1.1496 *-162.95%* 1.3696 *-213.27%*
Amean 3 1.1466 ( 0.00%) 2.0284 * -76.91%* 1.1700 ( -2.04%) 2.0320 * -77.22%* 6.8484 *-497.28%* 6.7128 *-485.45%* 6.8068 *-493.65%* 7.9702 *-595.12%*
Amean 5 2.0140 ( 0.00%) 2.7834 * -38.20%* 2.0116 ( 0.12%) 2.6688 * -32.51%* 11.9494 *-493.32%* 11.9450 *-493.10%* 12.0904 *-500.32%* 14.4190 *-615.94%*
Amean 7 2.5396 ( 0.00%) 3.3064 * -30.19%* 2.6002 ( -2.39%) 2.9074 * -14.48%* 16.0142 *-530.58%* 16.3418 *-543.48%* 17.0896 *-572.92%* 20.2050 *-695.60%*
Amean 12 3.2930 ( 0.00%) 2.9830 ( 9.41%) 3.4226 ( -3.94%) 2.8522 ( 13.39%) 28.1482 *-754.79%* 27.3916 *-731.81%* 28.5938 *-768.32%* 34.6184 *-951.27%*
Amean 18 3.9452 ( 0.00%) 4.0806 ( -3.43%) 3.7950 ( 3.81%) 3.8938 ( 1.30%) 41.7120 *-957.28%* 42.1062 *-967.28%* 44.2136 *-1020.69%* 51.0200 *-1193.22%*
Amean 24 4.1618 ( 0.00%) 4.8466 * -16.45%* 4.2258 ( -1.54%) 5.0720 * -21.87%* 56.8598 *-1266.23%* 57.3568 *-1278.17%* 61.2660 *-1372.10%* 68.2538 *-1540.01%*
Amean 30 4.7790 ( 0.00%) 5.9726 * -24.98%* 4.8756 ( -2.02%) 5.9434 * -24.36%* 72.1602 *-1409.94%* 75.8036 *-1486.18%* 80.0066 *-1574.13%* 81.3768 *-1602.80%*
Amean 32 5.1680 ( 0.00%) 6.6000 * -27.71%* 5.2004 ( -0.63%) 6.5490 * -26.72%* 78.1974 *-1413.11%* 82.9812 *-1505.67%* 87.7340 *-1597.64%* 85.6876 *-1558.04%*
Stddev 1 0.0173 ( 0.00%) 0.0624 (-259.99%) 0.0129 ( 25.54%) 0.0451 (-160.36%) 0.0727 (-319.12%) 0.0954 (-450.05%) 0.0545 (-214.16%) 0.0960 (-453.74%)
Stddev 3 0.0471 ( 0.00%) 0.3035 (-544.20%) 0.0547 ( -16.12%) 0.2745 (-482.52%) 0.2029 (-330.60%) 0.1102 (-133.97%) 0.2391 (-407.38%) 0.1864 (-295.62%)
Stddev 5 0.1492 ( 0.00%) 0.4178 (-180.05%) 0.1419 ( 4.88%) 0.2569 ( -72.22%) 0.3878 (-159.96%) 0.2584 ( -73.22%) 0.4259 (-185.47%) 0.3340 (-123.89%)
Stddev 7 0.2077 ( 0.00%) 0.2941 ( -41.59%) 0.2281 ( -9.85%) 0.2049 ( 1.33%) 0.3922 ( -88.86%) 0.8178 (-293.78%) 0.4064 ( -95.67%) 0.6127 (-195.02%)
Stddev 12 0.5560 ( 0.00%) 0.2038 ( 63.34%) 0.2113 ( 62.00%) 0.2490 ( 55.21%) 1.0797 ( -94.21%) 0.7564 ( -36.06%) 1.0225 ( -83.91%) 1.5233 (-174.01%)
Stddev 18 0.3556 ( 0.00%) 0.3110 ( 12.55%) 0.3054 ( 14.11%) 0.1265 ( 64.43%) 1.0258 (-188.47%) 1.4386 (-304.53%) 1.1818 (-232.33%) 2.6710 (-651.08%)
Stddev 24 0.1844 ( 0.00%) 0.1614 ( 12.46%) 0.1135 ( 38.46%) 0.3679 ( -99.54%) 2.0997 (-1038.84%) 1.1493 (-523.36%) 0.5214 (-182.79%) 2.9229 (-1485.35%)
Stddev 30 0.1420 ( 0.00%) 0.0875 ( 38.37%) 0.1799 ( -26.66%) 0.1076 ( 24.26%) 4.5079 (-3073.51%) 1.5704 (-1005.58%) 1.4054 (-889.42%) 4.5743 (-3120.26%)
Stddev 32 0.2184 ( 0.00%) 0.2427 ( -11.11%) 0.3143 ( -43.92%) 0.3517 ( -61.01%) 3.7345 (-1609.81%) 1.3564 (-521.00%) 2.1822 (-899.11%) 4.0896 (-1772.40%)
v v VM VM VM VM VM VM
VM-HT VM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 0.6762 ( 0.00%) 1.6824 *-148.80%* 0.6630 ( 1.95%) 1.5706 *-132.27%* 0.6600 ( 2.40%) 0.6514 ( 3.67%) 0.6780 ( -0.27%) 0.6964 * -2.99%*
Amean 3 1.7130 ( 0.00%) 2.2806 * -33.13%* 1.8074 * -5.51%* 2.1882 * -27.74%* 1.7992 ( -5.03%) 1.7896 * -4.47%* 1.7760 ( -3.68%) 1.8264 * -6.62%*
Amean 5 2.8688 ( 0.00%) 2.8048 ( 2.23%) 2.5880 * 9.79%* 2.8474 ( 0.75%) 2.5486 * 11.16%* 2.7896 ( 2.76%) 2.4402 * 14.94%* 2.6020 * 9.30%*
Amean 7 3.3432 ( 0.00%) 3.3434 ( -0.01%) 3.4564 ( -3.39%) 3.4704 ( -3.80%) 3.2424 ( 3.02%) 3.5620 ( -6.54%) 3.0352 ( 9.21%) 3.0874 ( 7.65%)
Amean 12 5.8936 ( 0.00%) 4.9968 * 15.22%* 5.1560 * 12.52%* 5.0670 * 14.03%* 4.2722 * 27.51%* 5.6570 ( 4.01%) 4.8270 * 18.10%* 4.2514 * 27.86%*
Amean 18 6.3938 ( 0.00%) 7.0542 ( -10.33%) 6.4900 ( -1.50%) 6.9682 ( -8.98%) 5.6478 ( 11.67%) 7.4324 ( -16.24%) 6.3160 ( 1.22%) 6.5124 ( -1.85%)
Amean 24 7.6278 ( 0.00%) 9.5096 * -24.67%* 7.0062 * 8.15%* 9.0278 * -18.35%* 7.5650 ( 0.82%) 8.2604 ( -8.29%) 8.5372 ( -11.92%) 7.7008 ( -0.96%)
Amean 30 10.5534 ( 0.00%) 10.9456 ( -3.72%) 11.4470 ( -8.47%) 11.1330 ( -5.49%) 8.8486 ( 16.15%) 10.8508 ( -2.82%) 12.6182 ( -19.57%) 10.5836 ( -0.29%)
Amean 32 11.6024 ( 0.00%) 12.6052 ( -8.64%) 10.8236 ( 6.71%) 11.2156 ( 3.33%) 9.0654 * 21.87%* 11.5074 ( 0.82%) 10.3592 ( 10.72%) 11.7174 ( -0.99%)
Stddev 1 0.0143 ( 0.00%) 0.3700 (-2480.08%) 0.0261 ( -82.29%) 0.1206 (-741.16%) 0.0444 (-209.74%) 0.0270 ( -88.42%) 0.0238 ( -65.73%) 0.0180 ( -25.17%)
Stddev 3 0.0473 ( 0.00%) 0.1384 (-192.63%) 0.0739 ( -56.38%) 0.0714 ( -51.01%) 0.0931 ( -96.94%) 0.0517 ( -9.28%) 0.0960 (-103.02%) 0.0553 ( -17.04%)
Stddev 5 0.1236 ( 0.00%) 0.2251 ( -82.10%) 0.1607 ( -29.98%) 0.1464 ( -18.41%) 0.1319 ( -6.69%) 0.1842 ( -48.96%) 0.0647 ( 47.66%) 0.1977 ( -59.92%)
Stddev 7 0.3597 ( 0.00%) 0.1105 ( 69.27%) 0.2288 ( 36.38%) 0.1335 ( 62.90%) 0.3131 ( 12.94%) 0.2958 ( 17.76%) 0.1581 ( 56.05%) 0.2361 ( 34.37%)
Stddev 12 0.4677 ( 0.00%) 0.0846 ( 81.91%) 0.6319 ( -35.11%) 0.1526 ( 67.37%) 0.3591 ( 23.23%) 0.6898 ( -47.48%) 0.4853 ( -3.76%) 0.2963 ( 36.64%)
Stddev 18 1.2289 ( 0.00%) 0.1849 ( 84.95%) 1.1160 ( 9.18%) 0.2497 ( 79.68%) 0.9843 ( 19.90%) 0.9542 ( 22.35%) 0.6621 ( 46.12%) 0.7668 ( 37.60%)
Stddev 24 0.5202 ( 0.00%) 0.9344 ( -79.60%) 0.1940 ( 62.71%) 0.4706 ( 9.53%) 0.8362 ( -60.73%) 1.0819 (-107.96%) 1.0229 ( -96.63%) 0.8881 ( -70.72%)
Stddev 30 2.1557 ( 0.00%) 0.7984 ( 62.96%) 1.2499 ( 42.02%) 0.9804 ( 54.52%) 0.4846 ( 77.52%) 1.2901 ( 40.15%) 1.5532 ( 27.95%) 1.2932 ( 40.01%)
Stddev 32 2.2255 ( 0.00%) 1.1321 ( 49.13%) 2.2380 ( -0.56%) 0.2127 ( 90.44%) 0.3654 ( 83.58%) 1.5727 ( 29.33%) 2.3291 ( -4.66%) 2.0936 ( 5.93%)
v v VM VM VM VM VM VM
VM-v4-HT VM-v4-noHT v4-HT v4-noHT v4-csc-HT v4-csc_stallfix-HT v4-csc_tim-HT v4-csc_vruntime-HT
Amean 1 1.2194 ( 0.00%) 1.1974 ( 1.80%) 1.0308 * 15.47%* 1.1054 * 9.35%* 1.0522 * 13.71%* 1.2290 ( -0.79%) 1.1392 * 6.58%* 1.0524 * 13.70%*
Amean 3 2.2568 ( 0.00%) 2.0588 ( 8.77%) 2.0708 ( 8.24%) 2.2352 ( 0.96%) 2.1808 ( 3.37%) 2.2820 ( -1.12%) 2.1682 ( 3.93%) 2.2598 ( -0.13%)
Amean 5 2.9848 ( 0.00%) 2.9912 ( -0.21%) 2.4938 * 16.45%* 2.4634 * 17.47%* 2.6890 ( 9.91%) 2.8908 ( 3.15%) 2.8636 ( 4.06%) 2.4158 * 19.06%*
Amean 7 3.4500 ( 0.00%) 3.2538 ( 5.69%) 3.3646 ( 2.48%) 3.2666 ( 5.32%) 3.0800 ( 10.72%) 4.2206 ( -22.34%) 3.1016 ( 10.10%) 3.2186 ( 6.71%)
Amean 12 6.0414 ( 0.00%) 5.0624 ( 16.20%) 5.1276 ( 15.13%) 5.1066 ( 15.47%) 4.7728 * 21.00%* 5.5068 ( 8.85%) 4.7544 * 21.30%* 5.8920 ( 2.47%)
Amean 16 7.5510 ( 0.00%) 7.6888 ( -1.82%) 6.9732 ( 7.65%) 5.9098 * 21.73%* 6.5542 * 13.20%* 6.9492 ( 7.97%) 6.4372 * 14.75%* 6.1968 * 17.93%*
Stddev 1 0.0786 ( 0.00%) 0.1166 ( -48.34%) 0.1762 (-124.09%) 0.0712 ( 9.45%) 0.1541 ( -95.99%) 0.0814 ( -3.55%) 0.0452 ( 42.57%) 0.1817 (-131.05%)
Stddev 3 0.2220 ( 0.00%) 0.1887 ( 15.03%) 0.2174 ( 2.07%) 0.1368 ( 38.37%) 0.1928 ( 13.17%) 0.4342 ( -95.57%) 0.2353 ( -5.97%) 0.1753 ( 21.06%)
Stddev 5 0.4586 ( 0.00%) 0.4689 ( -2.23%) 0.3202 ( 30.19%) 0.2666 ( 41.88%) 0.3108 ( 32.24%) 0.2876 ( 37.29%) 0.4067 ( 11.33%) 0.3010 ( 34.37%)
Stddev 7 0.3769 ( 0.00%) 0.5242 ( -39.09%) 0.2523 ( 33.06%) 0.2498 ( 33.71%) 0.4527 ( -20.13%) 1.5089 (-300.36%) 0.3998 ( -6.09%) 0.4018 ( -6.62%)
Stddev 12 1.0604 ( 0.00%) 0.5194 ( 51.02%) 0.5646 ( 46.75%) 0.5255 ( 50.45%) 0.4872 ( 54.06%) 0.3447 ( 67.49%) 0.5492 ( 48.21%) 0.5487 ( 48.26%)
Stddev 16 0.6245 ( 0.00%) 0.5220 ( 16.40%) 0.6914 ( -10.71%) 0.6984 ( -11.83%) 0.3543 ( 43.27%) 0.5137 ( 17.74%) 0.5083 ( 18.61%) 1.0744 ( -72.04%)
v v VMx2 VMx2 VMx2 VMx2 VMx2 VMx2
VMx2-HT VMx2-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 0.6780 ( 0.00%) 2.1522 *-217.43%* 1.6790 *-147.64%* 2.2440 *-230.97%* 15.6090 *-2202.21%* 9.1920 *-1255.75%* 2.0732 *-205.78%* 2.2390 *-230.24%*
Amean 3 1.8452 ( 0.00%) 3.2456 * -75.89%* 2.5214 * -36.65%* 3.3944 * -83.96%* 13.7258 *-643.87%* 9.7654 *-429.23%* 2.6668 * -44.53%* 2.9044 * -57.40%*
Amean 5 2.8564 ( 0.00%) 4.2548 * -48.96%* 3.2500 * -13.78%* 4.3958 * -53.89%* 11.6808 *-308.93%* 8.4724 *-196.61%* 3.4000 * -19.03%* 3.5298 * -23.58%*
Amean 7 3.3712 ( 0.00%) 5.1056 * -51.45%* 4.1636 * -23.50%* 4.9874 * -47.94%* 14.0516 (-316.81%) 9.1454 *-171.28%* 4.1396 * -22.79%* 4.3570 * -29.24%*
Amean 12 4.8134 ( 0.00%) 7.4516 * -54.81%* 6.1254 * -27.26%* 7.3488 * -52.67%* 13.7680 *-186.03%* 13.7400 *-185.45%* 6.0424 * -25.53%* 6.3404 * -31.72%*
Amean 18 6.1980 ( 0.00%) 10.4126 * -68.00%* 8.5942 * -38.66%* 9.7554 * -57.40%* 10.7234 * -73.01%* 13.5214 *-118.16%* 8.6628 * -39.77%* 8.5910 * -38.61%*
Amean 24 8.2116 ( 0.00%) 12.7570 * -55.35%* 10.9386 * -33.21%* 12.9032 * -57.13%* 17.8492 *-117.37%* 14.4432 * -75.89%* 10.7488 * -30.90%* 11.1930 * -36.31%*
Amean 30 9.3264 ( 0.00%) 15.3704 * -64.81%* 13.6630 * -46.50%* 15.4806 * -65.99%* 16.9272 * -81.50%* 18.4184 * -97.49%* 13.6174 * -46.01%* 13.5026 * -44.78%*
Amean 32 10.1954 ( 0.00%) 16.5494 * -62.32%* 14.7450 * -44.62%* 16.4650 * -61.49%* 13.9146 * -36.48%* 17.5790 * -72.42%* 14.6072 * -43.27%* 14.0346 * -37.66%*
Stddev 1 0.0129 ( 0.00%) 0.2665 (-1959.36%) 0.2262 (-1647.54%) 0.0703 (-443.26%) 9.5450 (-73651.40%) 6.3226 (-48752.57%) 0.1736 (-1241.41%) 0.3070 (-2272.09%)
Stddev 3 0.1156 ( 0.00%) 0.3653 (-215.89%) 0.1388 ( -20.00%) 0.2067 ( -78.77%) 3.3781 (-2821.39%) 3.2241 (-2688.20%) 0.0970 ( 16.10%) 0.1247 ( -7.87%)
Stddev 5 0.0817 ( 0.00%) 0.2968 (-263.19%) 0.0938 ( -14.79%) 0.1367 ( -67.25%) 3.1273 (-3726.47%) 3.9175 (-4693.34%) 0.1155 ( -41.31%) 0.1864 (-128.10%)
Stddev 7 0.1190 ( 0.00%) 0.1337 ( -12.31%) 0.3136 (-163.48%) 0.2034 ( -70.91%) 17.1856 (-14336.96%) 2.8676 (-2308.94%) 0.0794 ( 33.32%) 0.1362 ( -14.42%)
Stddev 12 0.5237 ( 0.00%) 0.2929 ( 44.08%) 0.1386 ( 73.53%) 0.2330 ( 55.51%) 7.1869 (-1272.26%) 8.0085 (-1429.13%) 0.2535 ( 51.59%) 0.1441 ( 72.49%)
Stddev 18 0.4791 ( 0.00%) 0.1528 ( 68.10%) 0.4254 ( 11.21%) 0.2953 ( 38.37%) 2.8492 (-494.71%) 4.3767 (-813.54%) 0.3205 ( 33.10%) 0.3094 ( 35.43%)
Stddev 24 1.1591 ( 0.00%) 0.2616 ( 77.43%) 0.2446 ( 78.90%) 0.4363 ( 62.36%) 5.3380 (-360.53%) 3.2433 (-179.81%) 0.2704 ( 76.67%) 0.4402 ( 62.02%)
Stddev 30 0.6561 ( 0.00%) 0.4008 ( 38.91%) 0.2670 ( 59.31%) 0.1451 ( 77.88%) 4.5765 (-597.54%) 4.2380 (-545.95%) 0.4768 ( 27.33%) 0.2693 ( 58.96%)
Stddev 32 1.2197 ( 0.00%) 0.4459 ( 63.44%) 0.5496 ( 54.94%) 0.4179 ( 65.74%) 1.3553 ( -11.12%) 4.3537 (-256.96%) 0.5602 ( 54.07%) 0.5312 ( 56.45%)

So, the situation looks pretty *terrible* on baremetal. Well, something to
investigate, I guess.

In VMs, on the other hand, things don't look too bad. It is a bit weird,
though, that the benchmark seems to be sensitive to HT when run on
baremetal, but not so much when run in VMs. Among the various core
scheduling implementations/patches, plain v3 seems to have issues with
this workload, while Tim's and vruntime are better.

It is confirmed that, in virtualization scenarios, under
overcommitment, core scheduling is more effective than disabling
HyperThreading, at least with either Tim's or the vruntime patches.

Overhead-wise, it is a hard call. Numbers vary a lot, and I think each
group of measurements needs to be looked at carefully to come up with a
sensible analysis.

Overhead is there, that's for sure, in the baremetal case. Also, it
looks more severe when HT is disabled (e.g., compare v-BM-noHT with BM-
noHT).

In virt cases, it's really a hard call. E.g., when a VM with 4 vCPUs
is used, core scheduling seems to be able to make the system
significantly faster! :-O
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-10-29 10:29:02

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> >
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> >
> Hello,
>
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
>
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
>
SYSBENCHCPU
===========

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-4_sysbenchcpu.txt

v v BM BM BM BM BM BM
BM-HT BM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 22.55 ( 0.00%) 22.55 * -0.03%* 22.55 ( 0.00%) 22.55 ( -0.02%) 22.56 * -0.04%* 22.56 * -0.03%* 22.55 ( 0.01%) 22.55 ( -0.03%)
Amean 3 7.68 ( 0.00%) 7.69 * -0.07%* 7.68 ( -0.02%) 7.69 * -0.09%* 7.69 * -0.07%* 7.68 ( -0.02%) 7.68 ( -0.02%) 7.68 ( -0.02%)
Amean 5 4.80 ( 0.00%) 5.79 * -20.64%* 4.80 ( 0.00%) 5.79 * -20.73%* 4.81 * -0.24%* 4.80 * -0.18%* 4.81 * -0.24%* 4.81 * -0.24%*
Amean 7 3.59 ( 0.00%) 5.79 * -61.24%* 3.59 ( 0.00%) 5.79 * -61.20%* 3.59 ( -0.08%) 3.60 * -0.16%* 3.62 * -0.92%* 3.61 * -0.52%*
Amean 12 3.18 ( 0.00%) 5.79 * -81.79%* 3.19 ( -0.13%) 5.79 * -81.92%* 3.20 ( -0.49%) 3.19 * -0.27%* 3.27 * -2.78%* 3.41 * -7.04%*
Amean 16 3.19 ( 0.00%) 5.79 * -81.29%* 3.20 ( -0.13%) 5.79 * -81.38%* 3.24 ( -1.52%) 3.20 ( -0.31%) 3.27 * -2.51%* 3.40 * -6.40%*
Stddev 1 0.00 ( 0.00%) 0.01 ( -41.42%) 0.01 ( -82.57%) 0.01 (-194.39%) 0.01 ( -82.57%) 0.01 ( -41.42%) 0.01 (-158.20%) 0.01 (-269.68%)
Stddev 3 0.00 ( 0.00%) 0.01 ( -99.00%) 0.00 ( -99.00%) 0.00 ( -99.00%) 0.01 ( -99.00%) 0.00 ( -99.00%) 0.00 ( -99.00%) 0.00 ( -99.00%)
Stddev 5 0.01 ( 0.00%) 0.02 (-302.08%) 0.01 ( 0.00%) 0.02 (-258.24%) 0.00 ( 8.71%) 0.01 ( -0.00%) 0.01 ( -41.42%) 0.00 ( 8.71%)
Stddev 7 0.00 ( 0.00%) 0.02 ( -99.00%) 0.00 ( 0.00%) 0.02 ( -99.00%) 0.00 ( -99.00%) 0.01 ( -99.00%) 0.01 ( -99.00%) 0.01 ( -99.00%)
Stddev 12 0.01 ( 0.00%) 0.02 (-125.32%) 0.01 ( -54.42%) 0.02 (-103.81%) 0.03 (-321.54%) 0.01 ( -20.89%) 0.03 (-300.00%) 0.03 (-330.56%)
Stddev 16 0.01 ( 0.00%) 0.02 ( -3.28%) 0.01 ( 4.55%) 0.02 ( -44.53%) 0.10 (-591.05%) 0.02 ( -21.11%) 0.02 ( -44.53%) 0.04 (-145.85%)
v v VM VM VM VM VM VM
VM-HT VM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 24.38 ( 0.00%) 22.57 * 7.41%* 22.56 * 7.45%* 22.56 * 7.44%* 22.56 * 7.44%* 22.94 * 5.91%* 22.56 * 7.44%* 22.57 * 7.43%*
Amean 3 8.12 ( 0.00%) 7.70 * 5.21%* 7.68 * 5.45%* 7.68 * 5.40%* 7.69 * 5.36%* 7.72 * 4.99%* 7.68 * 5.43%* 7.68 * 5.40%*
Amean 5 4.96 ( 0.00%) 5.85 * -17.88%* 4.79 * 3.43%* 5.85 * -17.91%* 4.81 * 3.02%* 5.09 * -2.68%* 4.82 * 2.94%* 4.80 * 3.25%*
Amean 7 3.62 ( 0.00%) 5.85 * -61.72%* 3.59 * 0.91%* 5.87 * -62.04%* 3.69 ( -1.82%) 3.94 * -8.76%* 3.62 ( 0.12%) 3.60 * 0.59%*
Amean 12 3.18 ( 0.00%) 5.89 * -84.93%* 3.19 ( -0.09%) 5.83 * -83.18%* 3.26 * -2.51%* 3.40 * -6.64%* 3.28 * -3.10%* 3.31 * -3.86%*
Amean 16 3.18 ( 0.00%) 5.86 * -83.89%* 3.18 ( 0.04%) 5.85 * -83.85%* 3.28 * -3.10%* 3.45 * -8.30%* 3.29 * -3.28%* 3.31 * -3.90%*
Stddev 1 0.02 ( 0.00%) 0.01 ( 47.21%) 0.01 ( 66.12%) 0.01 ( 53.84%) 0.01 ( 55.65%) 0.19 (-1012.08%) 0.01 ( 53.84%) 0.01 ( 53.84%)
Stddev 3 0.00 ( 0.00%) 0.02 (-254.96%) 0.00 ( 100.00%) 0.01 ( -61.25%) 0.00 ( 0.00%) 0.06 (-1116.55%) 0.00 ( 22.54%) 0.01 ( -9.54%)
Stddev 5 0.00 ( 0.00%) 0.07 (-1637.81%) 0.00 ( -0.00%) 0.06 (-1587.21%) 0.01 (-255.90%) 0.18 (-4785.35%) 0.01 (-236.65%) 0.01 ( -52.75%)
Stddev 7 0.00 ( 0.00%) 0.10 (-20807867647333972.00%) 0.01 (-1575931584692200.00%) 0.08 (-16887732666966734.00%) 0.15 (-30988869061711636.00%) 0.12 (-24303774169159900.00%) 0.01 (-1640281598669292.00%) 0.01 (-2228703820443910.00%)
Stddev 12 0.01 ( 0.00%) 0.12 (-2069.49%) 0.01 ( -41.42%) 0.08 (-1441.64%) 0.08 (-1396.11%) 0.09 (-1516.07%) 0.03 (-458.27%) 0.03 (-403.32%)
Stddev 16 0.01 ( 0.00%) 0.07 (-1295.23%) 0.00 ( 8.71%) 0.07 (-1235.42%) 0.11 (-1867.23%) 0.26 (-4790.98%) 0.04 (-711.38%) 0.05 (-800.00%)
v v VM VM VM VM VM VM
VM-v4-HT VM-v4-noHT v4-HT v4-noHT v4-csc-HT v4-csc_stallfix-HT v4-csc_tim-HT v4-csc_vruntime-HT
Amean 1 24.37 ( 0.00%) 22.76 * 6.62%* 22.56 * 7.43%* 22.57 * 7.41%* 22.57 * 7.41%* 22.86 * 6.21%* 22.56 * 7.44%* 22.56 * 7.44%*
Amean 3 8.12 ( 0.00%) 7.73 * 4.82%* 7.68 * 5.40%* 7.68 * 5.40%* 7.69 * 5.33%* 7.70 * 5.24%* 7.68 * 5.42%* 7.68 * 5.38%*
Amean 5 6.11 ( 0.00%) 5.90 * 3.53%* 5.79 * 5.35%* 5.78 * 5.51%* 5.79 * 5.33%* 6.07 * 0.77%* 5.78 * 5.44%* 5.77 * 5.58%*
Amean 7 6.11 ( 0.00%) 5.92 * 3.11%* 5.79 * 5.33%* 5.77 * 5.59%* 5.80 * 5.16%* 6.07 * 0.70%* 5.78 * 5.42%* 5.77 * 5.54%*
Amean 8 6.11 ( 0.00%) 5.91 * 3.30%* 5.78 * 5.33%* 5.77 * 5.52%* 5.81 * 4.93%* 6.09 ( 0.28%) 5.78 * 5.45%* 5.77 * 5.52%*
Stddev 1 0.01 ( 0.00%) 0.01 ( 27.45%) 0.01 ( 27.45%) 0.01 ( 17.28%) 0.01 ( -57.28%) 0.08 (-765.72%) 0.01 ( 39.30%) 0.01 ( 27.45%)
Stddev 3 0.00 ( 0.00%) 0.03 (-616.47%) 0.00 ( -29.10%) 0.00 ( -29.10%) 0.01 (-138.05%) 0.01 (-108.17%) 0.00 ( 0.00%) 0.01 ( -41.42%)
Stddev 5 0.01 ( 0.00%) 0.03 (-336.77%) 0.02 (-128.71%) 0.01 ( -75.41%) 0.01 ( -70.97%) 0.02 (-128.71%) 0.01 ( -54.42%) 0.01 ( -59.33%)
Stddev 7 0.01 ( 0.00%) 0.04 (-214.79%) 0.02 ( -82.57%) 0.01 ( 3.08%) 0.01 ( -10.10%) 0.02 ( -52.75%) 0.01 ( 3.08%) 0.01 ( -1.50%)
Stddev 8 0.01 ( 0.00%) 0.04 (-329.84%) 0.02 (-122.54%) 0.01 ( -11.27%) 0.03 (-228.78%) 0.07 (-649.92%) 0.01 ( -38.01%) 0.01 ( -11.27%)
v v VMx2 VMx2 VMx2 VMx2 VMx2 VMx2
VMx2-HT VMx2-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 22.58 ( 0.00%) 23.41 * -3.67%* 25.22 * -11.70%* 23.36 * -3.48%* 31.57 * -39.84%* 29.30 * -29.78%* 31.82 * -40.93%* 25.60 * -13.38%*
Amean 3 7.69 ( 0.00%) 11.03 * -43.47%* 8.81 * -14.66%* 11.19 * -45.60%* 8.64 * -12.34%* 10.68 * -38.88%* 10.10 * -31.43%* 9.05 * -17.75%*
Amean 5 4.80 ( 0.00%) 11.04 *-130.24%* 5.66 * -17.96%* 11.35 *-136.61%* 6.66 * -38.81%* 7.92 * -65.18%* 6.27 * -30.65%* 6.61 * -37.83%*
Amean 7 3.58 ( 0.00%) 10.94 *-205.34%* 5.44 * -51.79%* 11.53 *-221.93%* 6.39 * -78.35%* 6.23 * -73.96%* 5.65 * -57.70%* 6.07 * -69.30%*
Amean 12 3.18 ( 0.00%) 10.95 *-243.74%* 5.52 * -73.44%* 11.31 *-255.09%* 7.19 *-125.93%* 9.43 *-196.05%* 5.62 * -76.54%* 6.09 * -91.16%*
Amean 16 3.18 ( 0.00%) 11.10 *-248.65%* 5.43 * -70.56%* 11.35 *-256.55%* 6.84 *-115.04%* 7.19 *-125.90%* 5.50 * -72.76%* 6.13 * -92.64%*
Stddev 1 0.01 ( 0.00%) 0.14 (-1403.85%) 0.16 (-1575.99%) 0.09 (-866.82%) 1.41 (-14685.15%) 1.84 (-19279.11%) 1.96 (-20463.33%) 0.38 (-3929.30%)
Stddev 3 0.00 ( 0.00%) 0.51 (-10380.94%) 0.14 (-2749.21%) 0.68 (-13757.63%) 0.30 (-6133.94%) 0.80 (-16343.24%) 0.12 (-2332.69%) 0.16 (-3086.85%)
Stddev 5 0.01 ( 0.00%) 0.30 (-5443.92%) 0.23 (-4280.45%) 0.58 (-10722.66%) 0.21 (-3888.73%) 2.41 (-45040.69%) 0.12 (-2113.22%) 0.16 (-2867.88%)
Stddev 7 0.00 ( 0.00%) 0.40 (-8009.13%) 0.08 (-1480.51%) 0.47 (-9518.94%) 1.04 (-21241.56%) 0.51 (-10409.80%) 0.22 (-4319.28%) 0.25 (-5017.81%)
Stddev 12 0.01 ( 0.00%) 0.53 (-9750.38%) 0.16 (-2879.09%) 0.66 (-12313.77%) 1.29 (-24020.36%) 3.97 (-74080.08%) 0.25 (-4570.12%) 0.32 (-5817.77%)
Stddev 16 0.00 ( 0.00%) 0.46 (-9428.17%) 0.25 (-5124.17%) 0.75 (-15173.77%) 0.78 (-15851.43%) 1.38 (-28142.45%) 0.35 (-7094.72%) 0.21 (-4230.36%)

This is even better than kernbench! :-)

Basically, both on baremetal and in VMs, noHT costs us around -81.29%.
With core scheduling, the worst we get is -6.40%.

And that's the non virt-overcommitted case. In the overcommitted one,
noHT brings us down to -248.65% (i.e., we are ~250% slower; it's not
that we're going back in time :-P). Core scheduling contains the damage
to either -72.76% (Tim's patches) or -92.64% (vruntime), while plain v3
is, again, worse than both.
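
(For reference, by HT/noHT I just mean runs with SMT enabled/disabled;
one way to switch, which works at runtime on these kernels, is the
sketch below. This is a general note, not necessarily exactly how these
particular runs were set up.)

  # Switch between the "HT" and "noHT" configurations without rebooting
  # (the sysfs SMT control exists since v4.19, so also on v5.1.5):
  echo off > /sys/devices/system/cpu/smt/control   # "noHT" runs
  echo on  > /sys/devices/system/cpu/smt/control   # "HT" runs
  cat /sys/devices/system/cpu/smt/active           # check current state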

Overhead looks similar to other cases already discussed.

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-10-29 10:31:06

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> >
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> >
> Hello,
>
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
>
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
>
SYSBENCH
========

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-6_sysbench.txt

v v BM BM BM BM BM BM
BM-HT BM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 1 235.81 ( 0.00%) 221.49 ( -6.07%) 245.28 ( 4.01%) 230.53 ( -2.24%) 241.40 ( 2.37%) 225.00 ( -4.58%) 225.50 ( -4.37%) 202.38 ( -14.18%)
Hmean 4 273.77 ( 0.00%) 290.01 ( 5.93%) 292.47 ( 6.83%) 261.76 ( -4.39%) 287.58 ( 5.04%) 281.30 ( 2.75%) 274.21 ( 0.16%) 271.91 ( -0.68%)
Hmean 7 346.60 ( 0.00%) 315.58 ( -8.95%) 345.38 ( -0.35%) 349.29 ( 0.78%) 363.76 ( 4.95%) 349.09 ( 0.72%) 355.69 ( 2.62%) 336.69 ( -2.86%)
Hmean 8 343.17 ( 0.00%) 353.73 ( 3.08%) 409.04 ( 19.19%) 411.31 ( 19.86%) 406.77 ( 18.53%) 306.33 ( -10.74%) 393.70 ( 14.72%) 342.73 ( -0.13%)
Stddev 1 44.93 ( 0.00%) 50.07 ( -11.44%) 25.05 ( 44.24%) 39.22 ( 12.71%) 26.27 ( 41.54%) 42.77 ( 4.81%) 43.63 ( 2.90%) 62.88 ( -39.95%)
Stddev 4 16.03 ( 0.00%) 23.37 ( -45.76%) 23.77 ( -48.25%) 22.40 ( -39.69%) 18.63 ( -16.19%) 14.37 ( 10.35%) 9.34 ( 41.72%) 25.21 ( -57.23%)
Stddev 7 22.88 ( 0.00%) 37.54 ( -64.07%) 26.57 ( -16.16%) 38.50 ( -68.26%) 59.14 (-158.51%) 26.73 ( -16.83%) 24.58 ( -7.43%) 32.93 ( -43.94%)
Stddev 8 36.74 ( 0.00%) 36.60 ( 0.39%) 102.82 (-179.86%) 93.56 (-154.65%) 77.33 (-110.47%) 36.16 ( 1.58%) 44.27 ( -20.50%) 35.15 ( 4.33%)
v v VM VM VM VM VM VM
VM-HT VM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 1 215.16 ( 0.00%) 225.80 * 4.95%* 205.74 ( -4.38%) 200.61 ( -6.76%) 169.70 ( -21.13%) 168.84 ( -21.53%) 157.27 ( -26.91%) 162.94 ( -24.27%)
Hmean 4 163.44 ( 0.00%) 189.82 * 16.14%* 164.54 ( 0.67%) 148.47 ( -9.16%) 40.62 * -75.15%* 53.14 * -67.49%* 129.51 * -20.76%* 158.99 ( -2.72%)
Hmean 7 162.74 ( 0.00%) 185.79 ( 14.17%) 211.79 ( 30.14%) 186.92 ( 14.86%) 28.02 * -82.78%* 34.32 * -78.91%* 130.01 ( -20.11%) 145.47 ( -10.61%)
Hmean 8 240.19 ( 0.00%) 192.24 ( -19.96%) 192.87 * -19.70%* 194.01 ( -19.23%) 16.92 * -92.95%* 30.55 * -87.28%* 150.51 * -37.34%* 147.67 * -38.52%*
Stddev 1 1.80 ( 0.00%) 4.14 (-129.80%) 6.04 (-234.90%) 24.54 (-1261.18%) 61.54 (-3313.19%) 62.92 (-3389.48%) 55.91 (-3000.67%) 58.29 (-3132.75%)
Stddev 4 6.33 ( 0.00%) 14.22 (-124.51%) 7.04 ( -11.07%) 13.77 (-117.30%) 13.23 (-108.83%) 5.07 ( 19.92%) 15.73 (-148.38%) 14.04 (-121.61%)
Stddev 7 24.70 ( 0.00%) 37.50 ( -51.78%) 35.59 ( -44.07%) 29.96 ( -21.26%) 20.07 ( 18.77%) 21.06 ( 14.74%) 24.97 ( -1.07%) 30.52 ( -23.52%)
Stddev 8 23.23 ( 0.00%) 41.13 ( -77.03%) 27.30 ( -17.49%) 41.67 ( -79.35%) 3.75 ( 83.88%) 12.62 ( 45.68%) 55.87 (-140.49%) 37.12 ( -59.76%)
v v VM VM VM VM VM VM
VM-v4-HT VM-v4-noHT v4-HT v4-noHT v4-csc-HT v4-csc_stallfix-HT v4-csc_tim-HT v4-csc_vruntime-HT
Hmean 1 216.12 ( 0.00%) 310.43 ( 43.63%) 168.80 ( -21.90%) 196.04 ( -9.29%) 168.27 ( -22.14%) 188.53 ( -12.77%) 176.57 ( -18.30%) 180.12 ( -16.66%)
Hmean 3 161.91 ( 0.00%) 160.80 ( -0.69%) 160.33 ( -0.97%) 175.36 ( 8.31%) 51.27 * -68.34%* 52.18 * -67.77%* 137.95 ( -14.80%) 166.12 ( 2.60%)
Hmean 4 156.44 ( 0.00%) 196.25 * 25.45%* 165.19 ( 5.60%) 199.78 * 27.71%* 50.67 * -67.61%* 40.42 * -74.16%* 175.03 ( 11.88%) 172.66 * 10.37%*
Stddev 1 4.67 ( 0.00%) 100.18 (-2043.42%) 50.61 (-982.87%) 141.33 (-2923.68%) 69.08 (-1377.96%) 70.09 (-1399.51%) 47.32 (-912.42%) 42.67 (-813.02%)
Stddev 3 12.62 ( 0.00%) 8.76 ( 30.60%) 13.18 ( -4.38%) 9.42 ( 25.41%) 3.57 ( 71.72%) 28.47 (-125.51%) 21.08 ( -67.00%) 19.37 ( -53.48%)
Stddev 4 8.52 ( 0.00%) 8.54 ( -0.17%) 3.09 ( 63.70%) 13.84 ( -62.42%) 24.33 (-185.50%) 9.30 ( -9.10%) 16.60 ( -94.73%) 9.64 ( -13.07%)
v v VMx2 VMx2 VMx2 VMx2 VMx2 VMx2
VMx2-HT VMx2-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 1 168.87 ( 0.00%) 154.79 ( -8.34%) 190.28 ( 12.68%) 151.18 ( -10.48%) 136.73 ( -19.03%) 36.63 * -78.31%* 83.21 ( -50.73%) 124.42 ( -26.32%)
Hmean 4 163.65 ( 0.00%) 87.90 * -46.29%* 119.37 * -27.06%* 87.94 * -46.26%* 26.96 * -83.53%* 24.08 * -85.29%* 54.15 * -66.91%* 63.80 * -61.01%*
Hmean 7 181.60 ( 0.00%) 89.10 * -50.93%* 148.16 ( -18.41%) 75.71 * -58.31%* 16.98 * -90.65%* 23.92 * -86.83%* 57.28 * -68.46%* 66.10 * -63.60%*
Hmean 8 198.98 ( 0.00%) 94.24 * -52.64%* 141.96 ( -28.65%) 96.62 * -51.44%* 23.22 * -88.33%* 29.24 * -85.30%* 80.10 * -59.74%* 80.36 * -59.61%*
Stddev 1 61.59 ( 0.00%) 44.71 ( 27.41%) 52.14 ( 15.33%) 44.61 ( 27.56%) 90.12 ( -46.32%) 42.53 ( 30.94%) 38.32 ( 37.79%) 43.03 ( 30.12%)
Stddev 4 8.65 ( 0.00%) 21.74 (-151.41%) 21.18 (-144.98%) 22.51 (-160.27%) 19.72 (-128.07%) 4.38 ( 49.38%) 2.68 ( 68.95%) 14.40 ( -66.55%)
Stddev 7 17.94 ( 0.00%) 15.14 ( 15.62%) 29.94 ( -66.88%) 17.30 ( 3.54%) 26.23 ( -46.20%) 5.17 ( 71.17%) 15.98 ( 10.95%) 18.43 ( -2.72%)
Stddev 8 38.45 ( 0.00%) 19.68 ( 48.82%) 44.14 ( -14.78%) 28.84 ( 25.00%) 10.64 ( 72.33%) 11.65 ( 69.71%) 22.63 ( 41.15%) 16.00 ( 58.39%)

Core scheduling does not seem to be able to handle sysbench well, yet.
In this case, things are not too bad on baremetal (and the best
performing coresched variant is again the one with Tim's patches).

But things go bad when running the benchmark in VMs, where core
scheduling almost always loses against no HyperThreading, even in the
overcommitted case. For the virt cases, it's also not straightforward
to tell which set of patches is best, as some runs are in favour of
Tim's, some others of vruntime's.

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-10-29 10:32:11

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> >
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> >
> Hello,
>
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
>
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
>
MEMCACHED/MUTILATE
==================

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-7_mutilate.txt

v v BM BM BM BM BM BM
BM-HT BM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 1 60672.24 ( 0.00%) 60476.27 ( -0.32%) 61303.65 ( 1.04%) 61353.05 ( 1.12%) 57026.54 * -6.01%* 56874.19 * -6.26%* 56781.04 * -6.41%* 56642.48 * -6.64%*
Hmean 3 149695.40 ( 0.00%) 131084.79 * -12.43%* 150802.12 ( 0.74%) 131058.25 * -12.45%* 138260.22 * -7.64%* 138002.31 * -7.81%* 136907.04 * -8.54%* 138317.36 * -7.60%*
Hmean 5 198656.98 ( 0.00%) 181719.88 * -8.53%* 196429.59 ( -1.12%) 186468.57 * -6.14%* 171612.43 * -13.61%* 176464.24 * -11.17%* 162578.64 * -18.16%* 170196.51 * -14.33%*
Hmean 7 180858.07 ( 0.00%) 187088.17 * 3.44%* 181549.92 ( 0.38%) 189813.97 * 4.95%* 166898.82 * -7.72%* 164724.43 * -8.92%* 143102.58 * -20.88%* 128900.19 * -28.73%*
Hmean 8 157176.91 ( 0.00%) 190533.25 * 21.22%* 159795.04 ( 1.67%) 188249.31 * 19.77%* 168399.58 * 7.14%* 169700.06 * 7.97%* 137152.66 * -12.74%* 123355.03 * -21.52%*
Stddev 1 551.36 ( 0.00%) 52.18 ( 90.54%) 306.31 ( 44.44%) 385.01 ( 30.17%) 353.37 ( 35.91%) 350.76 ( 36.38%) 476.56 ( 13.57%) 279.80 ( 49.25%)
Stddev 3 1206.26 ( 0.00%) 2368.98 ( -96.39%) 652.07 ( 45.94%) 1300.67 ( -7.83%) 1849.15 ( -53.30%) 2434.55 (-101.83%) 1063.43 ( 11.84%) 1825.98 ( -51.38%)
Stddev 5 3843.30 ( 0.00%) 122.92 ( 96.80%) 2883.77 ( 24.97%) 481.11 ( 87.48%) 4099.56 ( -6.67%) 4424.91 ( -15.13%) 596.25 ( 84.49%) 1345.78 ( 64.98%)
Stddev 7 2990.97 ( 0.00%) 1645.75 ( 44.98%) 6567.23 (-119.57%) 5422.57 ( -81.30%) 3191.68 ( -6.71%) 2438.38 ( 18.48%) 1211.94 ( 59.48%) 705.34 ( 76.42%)
Stddev 8 3637.12 ( 0.00%) 1490.36 ( 59.02%) 3386.42 ( 6.89%) 2437.43 ( 32.98%) 1056.02 ( 70.97%) 1391.07 ( 61.75%) 1488.57 ( 59.07%) 774.85 ( 78.70%)
v v VM VM VM VM VM VM
VM-HT VM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 1 42580.67 ( 0.00%) 53440.15 * 25.50%* 54293.03 * 27.51%* 52771.89 * 23.93%* 53689.85 * 26.09%* 53804.35 * 26.36%* 53730.74 * 26.19%* 53902.75 * 26.59%*
Hmean 3 115732.63 ( 0.00%) 70651.66 * -38.95%* 125537.18 * 8.47%* 70738.43 * -38.88%* 126041.27 * 8.91%* 126387.30 * 9.21%* 127285.87 * 9.98%* 126816.33 * 9.58%*
Hmean 5 176633.72 ( 0.00%) 22349.68 * -87.35%* 182113.02 * 3.10%* 19850.93 * -88.76%* 137910.40 * -21.92%* 180934.44 * 2.43%* 175009.60 ( -0.92%) 174758.21 ( -1.06%)
Hmean 7 182512.86 ( 0.00%) 48728.14 * -73.30%* 186272.46 * 2.06%* 49015.69 * -73.14%* 157398.75 * -13.76%* 184989.00 ( 1.36%) 172307.74 * -5.59%* 173203.24 * -5.10%*
Hmean 8 192244.37 ( 0.00%) 65283.21 * -66.04%* 195435.08 ( 1.66%) 63616.87 * -66.91%* 176288.98 * -8.30%* 192413.50 ( 0.09%) 188699.07 ( -1.84%) 185870.91 * -3.32%*
Stddev 1 337.79 ( 0.00%) 523.30 ( -54.92%) 22.90 ( 93.22%) 353.55 ( -4.67%) 196.63 ( 41.79%) 470.55 ( -39.30%) 414.86 ( -22.82%) 190.50 ( 43.60%)
Stddev 3 1945.52 ( 0.00%) 420.36 ( 78.39%) 1365.07 ( 29.84%) 954.82 ( 50.92%) 1136.12 ( 41.60%) 1674.98 ( 13.91%) 1539.07 ( 20.89%) 1129.09 ( 41.96%)
Stddev 5 964.02 ( 0.00%) 2121.27 (-120.05%) 3232.44 (-235.31%) 2284.37 (-136.96%) 8374.66 (-768.73%) 3156.06 (-227.39%) 948.17 ( 1.64%) 2700.91 (-180.17%)
Stddev 7 1028.64 ( 0.00%) 1740.95 ( -69.25%) 2860.04 (-178.04%) 3366.68 (-227.29%) 3360.79 (-226.72%) 4762.47 (-362.99%) 2968.56 (-188.59%) 1429.88 ( -39.01%)
Stddev 8 3794.15 ( 0.00%) 1893.82 ( 50.09%) 565.06 ( 85.11%) 893.04 ( 76.46%) 3201.43 ( 15.62%) 1931.66 ( 49.09%) 360.93 ( 90.49%) 1945.37 ( 48.73%)
v v VM VM VM VM VM VM
VM-v4-HT VM-v4-noHT v4-HT v4-noHT v4-csc-HT v4-csc_stallfix-HT v4-csc_tim-HT v4-csc_vruntime-HT
Hmean 1 44216.52 ( 0.00%) 53968.94 * 22.06%* 54078.28 * 22.30%* 54290.57 * 22.78%* 53552.08 * 21.11%* 51342.78 * 16.12%* 53967.09 * 22.05%* 53790.48 * 21.65%*
Hmean 3 105120.19 ( 0.00%) 120769.59 * 14.89%* 121460.04 * 15.54%* 117410.09 * 11.69%* 120296.87 * 14.44%* 109089.25 ( 3.78%) 123133.70 * 17.14%* 123614.82 * 17.59%*
Hmean 4 144618.05 ( 0.00%) 167039.86 * 15.50%* 178922.55 * 23.72%* 175641.36 * 21.45%* 185338.19 * 28.16%* 148152.02 ( 2.44%) 179709.39 * 24.26%* 179754.74 * 24.30%*
Stddev 1 144.30 ( 0.00%) 101.12 ( 29.93%) 420.02 (-191.08%) 294.43 (-104.04%) 300.29 (-108.10%) 330.58 (-129.10%) 353.69 (-145.11%) 142.58 ( 1.19%)
Stddev 3 3300.71 ( 0.00%) 905.78 ( 72.56%) 2537.34 ( 23.13%) 3289.78 ( 0.33%) 1113.35 ( 66.27%) 2685.20 ( 18.65%) 8425.99 (-155.28%) 2280.72 ( 30.90%)
Stddev 4 1828.49 ( 0.00%) 9851.42 (-438.77%) 2779.22 ( -52.00%) 3312.07 ( -81.14%) 7392.50 (-304.30%) 6090.54 (-233.09%) 3320.77 ( -81.61%) 4836.89 (-164.53%)
v v VMx2 VMx2 VMx2 VMx2 VMx2 VMx2
VMx2-HT VMx2-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 1 53045.85 ( 0.00%) 35310.51 * -33.43%* 38949.82 * -26.57%* 35500.55 * -33.08%* 49350.06 * -6.97%* 35372.44 * -33.32%* 40556.30 * -23.54%* 42238.57 * -20.37%*
Hmean 3 124865.12 ( 0.00%) 7997.67 * -93.59%* 57204.70 * -54.19%* 8373.67 * -93.29%* 56781.47 * -54.53%* 38892.18 * -68.85%* 51207.79 * -58.99%* 46151.33 * -63.04%*
Hmean 5 178848.31 ( 0.00%) 4984.40 * -97.21%* 31431.45 * -82.43%* 5101.29 * -97.15%* 41204.02 * -76.96%* 24592.26 * -86.25%* 33541.47 * -81.25%* 28762.54 * -83.92%*
Hmean 7 187708.49 ( 0.00%) 12945.53 * -93.10%* 57937.51 * -69.13%* 12898.91 * -93.13%* 89180.74 * -52.49%* 48390.34 * -74.22%* 54198.33 * -71.13%* 50261.68 * -73.22%*
Hmean 8 199929.38 ( 0.00%) 27516.10 * -86.24%* 77578.88 * -61.20%* 27608.20 * -86.19%* 117871.45 * -41.04%* 65719.85 * -67.13%* 76669.78 * -61.65%* 67983.03 * -66.00%*
Stddev 1 404.57 ( 0.00%) 708.20 ( -75.05%) 102.16 ( 74.75%) 258.18 ( 36.19%) 132.11 ( 67.35%) 1145.71 (-183.19%) 740.95 ( -83.14%) 91.67 ( 77.34%)
Stddev 3 1867.11 ( 0.00%) 729.15 ( 60.95%) 2529.52 ( -35.48%) 1212.31 ( 35.07%) 1765.46 ( 5.44%) 2698.45 ( -44.53%) 376.23 ( 79.85%) 2108.90 ( -12.95%)
Stddev 5 1556.91 ( 0.00%) 193.16 ( 87.59%) 3307.78 (-112.46%) 207.87 ( 86.65%) 1401.20 ( 10.00%) 1945.52 ( -24.96%) 1821.44 ( -16.99%) 2166.87 ( -39.18%)
Stddev 7 3622.13 ( 0.00%) 1690.01 ( 53.34%) 5268.81 ( -45.46%) 690.12 ( 80.95%) 8337.77 (-130.19%) 3737.07 ( -3.17%) 1900.15 ( 47.54%) 3388.45 ( 6.45%)
Stddev 8 1913.74 ( 0.00%) 1460.07 ( 23.71%) 2068.35 ( -8.08%) 1750.51 ( 8.53%) 4229.08 (-120.99%) 3477.66 ( -81.72%) 3318.68 ( -73.41%) 1338.76 ( 30.04%)

Mutilate brings us back to the situation where core scheduling does
badly on baremetal, but fine in VMs. The benchmark seems to be much
more sensitive to HyperThreading when run inside virtual machines, and
core scheduling guarantees better results than disabling
HyperThreading in such cases. Interestingly, this time the effect is
more evident in the *non* overcommitted scenario.

In fact, with one 8 vCPUs VM, v-VM-noHT is -87.35% and
VM-csc_vruntime-HT is only -1.06%, which is really good. In the
overcommitted case, v-VMx2-noHT reaches -97.21% while
VMx2-csc_vruntime-HT is -83.92%. So, yeah, better, but not as much
better as before.

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-10-29 10:35:13

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> >
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> >
> Hello,
>
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
>
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
>
NETPERF-UNIX
============

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-7_mutilate.txt

v v BM BM BM BM BM BM
BM-HT BM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 64 984.93 ( 0.00%) 1011.24 ( 2.67%) 1000.82 ( 1.61%) 1001.98 ( 1.73%) 947.61 ( -3.79%) 823.10 * -16.43%* 789.89 * -19.80%* 928.11 * -5.77%*
Hmean 256 3683.78 ( 0.00%) 3763.52 ( 2.16%) 3788.96 ( 2.86%) 3725.17 ( 1.12%) 1254.25 * -65.95%* 1261.92 * -65.74%* 1264.02 * -65.69%* 1260.30 * -65.79%*
Hmean 2048 7928.28 ( 0.00%) 7845.97 ( -1.04%) 7911.88 ( -0.21%) 7809.11 * -1.50%* 5334.65 * -32.71%* 5340.97 * -32.63%* 5337.93 * -32.67%* 5394.57 * -31.96%*
Hmean 8192 8134.23 ( 0.00%) 8096.88 ( -0.46%) 8258.84 ( 1.53%) 8076.36 ( -0.71%) 5374.33 * -33.93%* 5394.12 * -33.69%* 5504.00 * -32.34%* 5447.92 * -33.02%*
Stddev 64 30.46 ( 0.00%) 31.07 ( -2.00%) 15.36 ( 49.58%) 41.51 ( -36.26%) 54.71 ( -79.62%) 121.84 (-299.98%) 81.75 (-168.35%) 33.75 ( -10.79%)
Stddev 256 102.26 ( 0.00%) 104.86 ( -2.55%) 107.90 ( -5.52%) 116.17 ( -13.61%) 3.32 ( 96.75%) 5.79 ( 94.34%) 14.69 ( 85.63%) 19.37 ( 81.06%)
Stddev 2048 94.73 ( 0.00%) 48.12 ( 49.21%) 137.50 ( -45.15%) 43.30 ( 54.29%) 58.65 ( 38.08%) 50.97 ( 46.20%) 57.43 ( 39.38%) 36.39 ( 61.58%)
Stddev 8192 172.77 ( 0.00%) 48.68 ( 71.83%) 261.65 ( -51.45%) 76.30 ( 55.83%) 40.65 ( 76.47%) 27.61 ( 84.02%) 65.26 ( 62.23%) 31.66 ( 81.68%)
v v VM VM VM VM VM VM
VM-HT VM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 64 516.67 ( 0.00%) 585.50 ( 13.32%) 592.54 ( 14.68%) 582.70 ( 12.78%) 591.32 ( 14.45%) 546.68 ( 5.81%) 602.58 ( 16.63%) 729.10 * 41.12%*
Hmean 256 1070.01 ( 0.00%) 1306.95 * 22.14%* 1193.89 * 11.58%* 1271.17 * 18.80%* 1362.78 * 27.36%* 1171.49 ( 9.48%) 1335.98 * 24.86%* 1248.79 * 16.71%*
Hmean 2048 5002.14 ( 0.00%) 6865.72 * 37.26%* 5569.42 ( 11.34%) 5074.48 ( 1.45%) 5849.11 ( 16.93%) 4745.97 ( -5.12%) 6330.87 ( 26.56%) 6418.51 ( 28.32%)
Hmean 8192 5116.24 ( 0.00%) 7494.15 * 46.48%* 6960.17 * 36.04%* 6009.31 ( 17.46%) 6114.30 ( 19.51%) 5226.13 ( 2.15%) 6316.40 ( 23.46%) 7924.66 * 54.89%*
Stddev 64 81.30 ( 0.00%) 139.96 ( -72.15%) 162.38 ( -99.72%) 113.29 ( -39.35%) 150.34 ( -84.91%) 162.05 ( -99.32%) 163.74 (-101.39%) 47.94 ( 41.04%)
Stddev 256 64.89 ( 0.00%) 130.57 (-101.20%) 115.02 ( -77.23%) 120.49 ( -85.68%) 106.63 ( -64.31%) 140.18 (-116.01%) 118.29 ( -82.28%) 133.66 (-105.96%)
Stddev 2048 779.45 ( 0.00%) 767.31 ( 1.56%) 1869.81 (-139.89%) 1265.84 ( -62.40%) 1249.59 ( -60.32%) 506.63 ( 35.00%) 1427.22 ( -83.11%) 2296.29 (-194.60%)
Stddev 8192 942.88 ( 0.00%) 2559.80 (-171.49%) 1207.26 ( -28.04%) 1776.59 ( -88.42%) 1405.17 ( -49.03%) 1379.48 ( -46.30%) 2565.86 (-172.13%) 1066.06 ( -13.06%)
v v VM VM VM VM VM VM
VM-v4-HT VM-v4-noHT v4-HT v4-noHT v4-csc-HT v4-csc_stallfix-HT v4-csc_tim-HT v4-csc_vruntime-HT
Hmean 64 626.51 ( 0.00%) 535.18 ( -14.58%) 610.07 ( -2.62%) 509.04 ( -18.75%) 552.16 ( -11.87%) 471.44 * -24.75%* 484.50 * -22.67%* 488.32 * -22.06%*
Hmean 256 999.57 ( 0.00%) 1159.65 * 16.02%* 1209.25 * 20.98%* 1217.94 * 21.85%* 1196.13 * 19.66%* 1286.01 * 28.66%* 1154.57 * 15.51%* 1238.40 * 23.89%*
Hmean 2048 3882.52 ( 0.00%) 4483.92 ( 15.49%) 4969.62 * 28.00%* 4910.98 * 26.49%* 4646.33 * 19.67%* 5247.76 * 35.16%* 4515.47 * 16.30%* 5096.38 * 31.26%*
Hmean 8192 4086.48 ( 0.00%) 4935.20 ( 20.77%) 4711.62 * 15.30%* 5067.04 ( 24.00%) 5887.99 * 44.08%* 5360.18 * 31.17%* 5847.04 * 43.08%* 5990.50 * 46.59%*
Stddev 64 134.26 ( 0.00%) 117.95 ( 12.15%) 67.29 ( 49.88%) 88.91 ( 33.78%) 78.73 ( 41.36%) 22.34 ( 83.36%) 45.36 ( 66.22%) 64.62 ( 51.87%)
Stddev 256 36.69 ( 0.00%) 32.94 ( 10.22%) 93.79 (-155.60%) 52.76 ( -43.79%) 72.03 ( -96.31%) 94.69 (-158.06%) 32.05 ( 12.65%) 62.31 ( -69.82%)
Stddev 2048 64.82 ( 0.00%) 785.16 (-1111.23%) 1086.67 (-1576.36%) 863.76 (-1232.48%) 552.05 (-751.62%) 597.66 (-821.99%) 300.86 (-364.12%) 1057.90 (-1531.98%)
Stddev 8192 248.29 ( 0.00%) 1345.78 (-442.02%) 636.92 (-156.53%) 1497.43 (-503.10%) 1528.17 (-515.48%) 204.43 ( 17.66%) 788.65 (-217.63%) 1380.18 (-455.88%)
v v VMx2 VMx2 VMx2 VMx2 VMx2 VMx2
VMx2-HT VMx2-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Hmean 64 575.39 ( 0.00%) 230.57 * -59.93%* 525.97 ( -8.59%) 241.09 * -58.10%* 671.83 ( 16.76%) 574.64 ( -0.13%) 676.93 ( 17.65%) 713.81 ( 24.06%)
Hmean 256 1243.12 ( 0.00%) 679.95 * -45.30%* 1262.76 ( 1.58%) 646.82 * -47.97%* 1607.80 * 29.34%* 1297.86 ( 4.40%) 1573.09 * 26.54%* 1244.30 ( 0.09%)
Hmean 2048 4448.89 ( 0.00%) 3020.71 * -32.10%* 4460.65 ( 0.26%) 3342.89 * -24.86%* 7086.92 * 59.30%* 4544.81 ( 2.16%) 4209.05 ( -5.39%) 4346.58 ( -2.30%)
Hmean 8192 5539.82 ( 0.00%) 3118.35 * -43.71%* 4003.50 * -27.73%* 2931.43 * -47.08%* 6069.77 ( 9.57%) 5571.91 ( 0.58%) 5143.26 ( -7.16%) 4245.56 * -23.36%*
Stddev 64 128.48 ( 0.00%) 33.63 ( 73.82%) 111.27 ( 13.39%) 64.39 ( 49.89%) 79.47 ( 38.14%) 189.98 ( -47.87%) 89.67 ( 30.20%) 135.17 ( -5.21%)
Stddev 256 191.78 ( 0.00%) 252.00 ( -31.40%) 225.33 ( -17.49%) 123.00 ( 35.86%) 183.12 ( 4.52%) 231.23 ( -20.57%) 161.13 ( 15.98%) 66.95 ( 65.09%)
Stddev 2048 463.85 ( 0.00%) 1364.71 (-194.21%) 1390.20 (-199.71%) 382.49 ( 17.54%) 1271.89 (-174.20%) 1058.81 (-128.27%) 602.49 ( -29.89%) 595.07 ( -28.29%)
Stddev 8192 1230.77 ( 0.00%) 2402.19 ( -95.18%) 567.52 ( 53.89%) 511.30 ( 58.46%) 2551.77 (-107.33%) 1065.61 ( 13.42%) 1070.69 ( 13.01%) 512.77 ( 58.34%)

As in many other instances: baremetal is suffering with core
scheduling. VMs are doing reasonably well, especially when there is
overcommit.

---

Ok, that's it for now... Any comments, discussion, feedback, etc., are
more than welcome.

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-10-29 17:41:34

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> >
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> >
> Hello,
>
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
>
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
>
KERNBENCH
=========

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-3_kernbench.txt

v v BM BM BM BM BM BM
BM-HT BM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean elsp-2 200.07 ( 0.00%) 196.88 * 1.60%* 199.93 ( 0.07%) 196.89 * 1.59%* 200.86 * -0.39%* 200.83 * -0.38%* 199.47 * 0.30%* 200.45 * -0.19%*
Amean elsp-4 115.56 ( 0.00%) 108.64 * 5.99%* 115.12 ( 0.39%) 108.72 * 5.92%* 118.17 * -2.25%* 116.92 * -1.18%* 115.92 ( -0.31%) 115.86 ( -0.25%)
Amean elsp-8 84.72 ( 0.00%) 110.77 * -30.75%* 84.19 ( 0.62%) 111.03 * -31.06%* 84.78 ( -0.07%) 84.63 ( 0.11%) 89.09 * -5.16%* 90.21 * -6.48%*
Amean elsp-16 85.06 ( 0.00%) 113.63 * -33.59%* 85.33 ( -0.32%) 113.83 * -33.81%* 85.95 * -1.04%* 85.73 * -0.78%* 90.20 * -6.04%* 90.46 * -6.35%*
Stddev elsp-2 0.11 ( 0.00%) 0.05 ( 59.33%) 0.43 (-278.63%) 0.05 ( 60.30%) 0.20 ( -75.43%) 0.15 ( -28.90%) 0.16 ( -40.87%) 0.08 ( 26.69%)
Stddev elsp-4 0.54 ( 0.00%) 0.37 ( 30.80%) 0.02 ( 96.11%) 0.09 ( 83.05%) 1.10 (-105.52%) 0.24 ( 55.54%) 0.10 ( 81.29%) 0.26 ( 50.67%)
Stddev elsp-8 0.82 ( 0.00%) 0.25 ( 69.66%) 0.28 ( 65.58%) 0.27 ( 66.75%) 0.30 ( 63.64%) 0.07 ( 92.05%) 0.09 ( 88.92%) 0.19 ( 77.18%)
Stddev elsp-16 0.07 ( 0.00%) 0.32 (-375.21%) 0.41 (-502.93%) 0.19 (-176.54%) 0.22 (-219.28%) 0.21 (-208.51%) 0.31 (-358.10%) 0.09 ( -32.49%)
v v VM VM VM VM VM VM
VM-HT VM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean elsp-2 229.61 ( 0.00%) 202.26 * 11.91%* 205.30 * 10.59%* 201.14 * 12.40%* 207.43 * 9.66%* 207.30 * 9.72%* 205.92 * 10.32%* 206.69 * 9.98%*
Amean elsp-4 128.32 ( 0.00%) 124.33 * 3.11%* 116.84 * 8.95%* 124.66 * 2.85%* 121.86 * 5.04%* 140.87 * -9.78%* 118.28 * 7.83%* 122.58 * 4.48%*
Amean elsp-8 87.33 ( 0.00%) 118.45 * -35.63%* 85.52 ( 2.07%) 118.61 * -35.81%* 96.92 * -10.98%* 110.31 * -26.31%* 88.69 ( -1.55%) 88.49 ( -1.32%)
Amean elsp-16 87.00 ( 0.00%) 116.17 * -33.52%* 86.41 ( 0.68%) 116.43 * -33.82%* 103.24 * -18.67%* 90.77 * -4.33%* 88.89 * -2.17%* 89.35 * -2.70%*
Stddev elsp-2 0.24 ( 0.00%) 1.90 (-702.44%) 0.39 ( -63.41%) 0.45 ( -91.41%) 1.78 (-650.49%) 1.31 (-454.67%) 0.22 ( 8.97%) 1.09 (-362.56%)
Stddev elsp-4 0.10 ( 0.00%) 0.56 (-484.51%) 0.16 ( -63.75%) 0.60 (-524.50%) 0.47 (-392.73%) 2.53 (-2556.69%) 0.37 (-288.05%) 0.19 ( -96.77%)
Stddev elsp-8 1.48 ( 0.00%) 0.28 ( 81.02%) 0.08 ( 94.57%) 0.19 ( 87.46%) 1.25 ( 15.23%) 2.84 ( -92.18%) 1.23 ( 16.49%) 0.58 ( 60.60%)
Stddev elsp-16 0.62 ( 0.00%) 0.33 ( 46.43%) 0.07 ( 89.43%) 0.50 ( 20.24%) 0.54 ( 12.75%) 0.93 ( -49.52%) 0.44 ( 29.54%) 0.29 ( 53.95%)
v v VM VM VM VM VM VM
VM-v4-HT VM-v4-noHT v4-HT v4-noHT v4-csc-HT v4-csc_stallfix-HT v4-csc_tim-HT v4-csc_vruntime-HT
Amean elsp-2 227.42 ( 0.00%) 202.75 * 10.85%* 201.44 * 11.42%* 201.05 * 11.60%* 207.88 * 8.59%* 210.35 * 7.51%* 204.07 * 10.27%* 204.16 * 10.23%*
Amean elsp-4 124.46 ( 0.00%) 128.74 * -3.44%* 111.03 * 10.79%* 110.48 * 11.23%* 116.46 * 6.43%* 132.08 * -6.13%* 112.10 * 9.93%* 112.44 * 9.66%*
Amean elsp-8 127.12 ( 0.00%) 114.00 * 10.32%* 113.37 * 10.82%* 112.50 * 11.50%* 118.62 * 6.69%* 135.04 * -6.23%* 114.19 * 10.18%* 114.36 * 10.04%*
Stddev elsp-2 0.16 ( 0.00%) 0.05 ( 68.44%) 0.23 ( -45.93%) 0.32 (-101.53%) 0.89 (-471.21%) 0.86 (-452.61%) 0.13 ( 18.49%) 0.08 ( 51.98%)
Stddev elsp-4 0.09 ( 0.00%) 0.90 (-958.30%) 0.17 ( -95.69%) 0.40 (-364.45%) 1.33 (-1462.83%) 0.61 (-621.84%) 0.20 (-134.86%) 0.06 ( 28.48%)
Stddev elsp-8 0.14 ( 0.00%) 0.10 ( 29.71%) 0.40 (-181.91%) 0.05 ( 67.30%) 1.34 (-857.21%) 0.43 (-206.99%) 0.09 ( 39.30%) 0.12 ( 12.79%)
v v VMx2 VMx2 VMx2 VMx2 VMx2 VMx2
VMx2-HT VMx2-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean elsp-2 206.25 ( 0.00%) 338.78 * -64.26%* 296.38 * -43.70%* 335.66 * -62.75%* 327.38 * -58.73%* 443.46 *-115.01%* 317.41 * -53.90%* 295.89 * -43.47%*
Amean elsp-4 117.69 ( 0.00%) 267.38 *-127.19%* 174.91 * -48.62%* 263.22 *-123.66%* 201.69 * -71.37%* 317.24 *-169.56%* 193.92 * -64.77%* 193.04 * -64.02%*
Amean elsp-8 85.93 ( 0.00%) 225.41 *-162.31%* 146.21 * -70.14%* 221.96 *-158.29%* 188.12 *-118.91%* 203.79 *-137.15%* 154.66 * -79.98%* 162.26 * -88.82%*
Amean elsp-16 86.82 ( 0.00%) 221.32 *-154.92%* 141.53 * -63.02%* 217.27 *-150.27%* 180.06 *-107.41%* 190.36 *-119.27%* 143.76 * -65.59%* 156.84 * -80.65%*
Stddev elsp-2 1.01 ( 0.00%) 1.11 ( -10.37%) 2.24 (-121.88%) 1.36 ( -35.11%) 22.86 (-2164.27%) 24.78 (-2353.59%) 2.84 (-180.79%) 0.91 ( 10.16%)
Stddev elsp-4 0.18 ( 0.00%) 4.55 (-2398.38%) 0.62 (-239.73%) 3.15 (-1630.91%) 3.00 (-1551.21%) 10.85 (-5864.08%) 5.87 (-3124.85%) 3.79 (-1984.93%)
Stddev elsp-8 0.42 ( 0.00%) 1.13 (-167.05%) 0.44 ( -4.45%) 2.91 (-590.61%) 4.16 (-885.58%) 6.85 (-1524.29%) 0.78 ( -85.34%) 1.48 (-252.23%)
Stddev elsp-16 0.41 ( 0.00%) 2.40 (-478.92%) 2.21 (-432.16%) 3.27 (-688.15%) 9.50 (-2189.54%) 3.39 (-717.66%) 2.00 (-383.08%) 0.48 ( -15.60%)


So, if building kernels were the only thing that people do with
computers, we could say we're done, and go have beers! :-)

In fact, here, core scheduling is doing well on baremetal. E.g., look
at what happens, in the BM- group of measurements, when the number of
jobs is higher than 8, which is how many CPUs we have.

It's actually doing quite fine in Virt too. Among the various variants
(sorry), plain v3, even with stallfix on, is the one performing worst.
Tim's patches seem to me to be the best looking set of numbers for
this workload.

Furthermore, there's basically no overhead --actually, there are
speedups-- as long as there is no virt-overcommitment. In fact, in the
VMx2 group of measurements, even just applying the core scheduling v3
patches makes the VMs noticeably slower.

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-10-29 17:43:30

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> >
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> >
> Hello,
>
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
>
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
>
SYSBENCHTHREAD
==============

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-5_sysbenchthread.txt

v v BM BM BM BM BM BM
BM-HT BM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 1.12 ( 0.00%) 1.11 ( 0.38%) 1.14 * -2.43%* 1.15 * -2.81%* 8.49 *-659.72%* 8.43 *-654.99%* 8.42 *-653.96%* 8.38 *-649.74%*
Amean 3 1.11 ( 0.00%) 1.11 ( 0.00%) 1.14 * -2.70%* 1.14 * -2.96%* 8.35 *-651.29%* 8.42 *-657.20%* 8.34 *-650.77%* 8.41 *-656.56%*
Amean 5 1.11 ( 0.00%) 1.11 ( 0.00%) 1.14 * -2.70%* 1.14 * -2.82%* 8.34 *-649.81%* 8.41 *-655.46%* 8.32 *-647.75%* 8.34 *-649.42%*
Amean 7 1.12 ( 0.00%) 1.11 ( 0.38%) 1.14 * -2.43%* 1.15 * -2.69%* 8.31 *-643.48%* 8.40 *-652.05%* 8.41 *-653.20%* 8.33 *-645.65%*
Amean 12 1.12 ( 0.00%) 1.12 ( 0.00%) 1.14 * -2.43%* 1.14 * -2.56%* 8.38 *-651.34%* 8.41 *-654.16%* 8.32 *-645.71%* 8.38 *-651.09%*
Amean 16 1.11 ( 0.00%) 1.11 ( -0.13%) 1.15 * -3.08%* 1.14 * -2.57%* 8.42 *-656.61%* 8.36 *-651.60%* 8.31 *-646.73%* 8.39 *-654.30%*
Stddev 1 0.01 ( 0.00%) 0.01 ( 49.47%) 0.01 ( 64.27%) 0.01 ( 39.86%) 0.08 (-414.47%) 0.16 (-976.34%) 0.01 ( 36.42%) 0.09 (-525.69%)
Stddev 3 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.01 (-108.17%) 0.13 (-3211.60%) 0.10 (-2531.86%) 0.06 (-1515.55%) 0.10 (-2467.10%)
Stddev 5 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.01 ( -9.54%) 0.12 (-2271.92%) 0.10 (-1917.42%) 0.09 (-1700.00%) 0.09 (-1725.38%)
Stddev 7 0.01 ( 0.00%) 0.01 ( 20.53%) 0.01 ( 43.80%) 0.01 ( 0.00%) 0.13 (-1255.69%) 0.10 (-988.21%) 0.03 (-247.93%) 0.08 (-762.68%)
Stddev 12 0.01 ( 0.00%) 0.01 ( -24.03%) 0.00 ( 37.98%) 0.01 ( 0.00%) 0.13 (-1577.68%) 0.10 (-1210.31%) 0.10 (-1140.97%) 0.06 (-644.73%)
Stddev 16 0.00 ( 0.00%) 0.01 ( -9.54%) 0.01 ( -54.92%) 0.00 ( 22.54%) 0.15 (-3032.73%) 0.12 (-2401.20%) 0.11 (-2206.51%) 0.07 (-1304.28%)
v v VM VM VM VM VM VM
VM-HT VM-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 1.43 ( 0.00%) 1.15 * 19.56%* 1.15 * 19.36%* 1.19 * 16.67%* 1.16 * 19.16%* 1.17 * 18.46%* 1.16 * 18.66%* 1.16 * 18.86%*
Amean 3 1.43 ( 0.00%) 1.16 * 19.16%* 1.16 * 19.26%* 1.17 * 18.46%* 1.16 * 19.16%* 1.21 * 15.17%* 1.16 * 18.86%* 1.21 * 15.27%*
Amean 5 1.43 ( 0.00%) 1.15 * 19.28%* 1.18 * 17.78%* 1.17 * 18.08%* 1.16 * 19.18%* 1.17 * 18.18%* 1.16 * 19.08%* 1.18 * 17.58%*
Amean 7 1.43 ( 0.00%) 1.16 * 19.26%* 1.16 * 18.86%* 1.17 * 18.56%* 1.16 * 19.16%* 1.19 * 16.57%* 1.15 * 19.36%* 1.15 * 19.36%*
Amean 12 1.44 ( 0.00%) 1.16 * 19.44%* 1.16 * 19.44%* 1.17 * 18.75%* 1.16 * 19.25%* 1.18 * 18.06%* 1.16 * 19.15%* 1.16 * 19.44%*
Amean 16 1.46 ( 0.00%) 1.16 * 20.37%* 1.16 * 20.37%* 1.16 * 20.47%* 1.16 * 20.37%* 1.17 * 19.59%* 1.17 * 20.08%* 1.16 * 20.37%*
Stddev 1 0.00 ( 0.00%) 0.00 ( 0.00%) 0.01 ( -41.42%) 0.05 (-1314.21%) 0.00 ( -29.10%) 0.00 ( -29.10%) 0.01 (-236.65%) 0.02 (-316.33%)
Stddev 3 0.00 ( 0.00%) 0.00 ( -29.10%) 0.01 (-108.17%) 0.03 (-773.69%) 0.01 (-100.00%) 0.09 (-2324.18%) 0.01 (-138.05%) 0.06 (-1378.74%)
Stddev 5 0.00 ( 0.00%) 0.01 ( -99.00%) 0.05 ( -99.00%) 0.02 ( -99.00%) 0.01 ( -99.00%) 0.01 ( -99.00%) 0.01 ( -99.00%) 0.04 ( -99.00%)
Stddev 7 0.00 ( 0.00%) 0.01 ( -41.42%) 0.01 (-138.05%) 0.02 (-468.62%) 0.01 (-100.00%) 0.06 (-1600.00%) 0.01 (-108.17%) 0.01 ( -41.42%)
Stddev 12 0.00 ( 0.00%) 0.00 ( 100.00%) 0.00 ( 100.00%) 0.03 (-11031521092846084.00%) 0.00 (-2034518927425100.00%) 0.01 (-4169523056347680.00%) 0.01 (-2228703820443891.00%) 0.00 ( 100.00%)
Stddev 16 0.05 ( 0.00%) 0.00 ( 92.31%) 0.00 ( 92.31%) 0.00 ( 100.00%) 0.00 ( 92.31%) 0.01 ( 80.64%) 0.01 ( 89.12%) 0.00 ( 92.31%)
v v VM VM VM VM VM VM
VM-v4-HT VM-v4-noHT v4-HT v4-noHT v4-csc-HT v4-csc_stallfix-HT v4-csc_tim-HT v4-csc_vruntime-HT
Amean 1 1.43 ( 0.00%) 1.17 * 18.15%* 1.15 * 19.64%* 1.15 * 19.44%* 1.16 * 19.14%* 1.17 * 18.34%* 1.16 * 18.94%* 1.16 * 18.94%*
Amean 3 1.43 ( 0.00%) 1.17 * 18.33%* 1.17 * 18.73%* 1.15 * 19.52%* 1.15 * 19.52%* 1.19 * 16.93%* 1.16 * 19.12%* 1.16 * 19.32%*
Amean 5 1.43 ( 0.00%) 1.18 * 17.45%* 1.15 * 19.44%* 1.15 * 19.54%* 1.17 * 18.15%* 1.18 * 17.55%* 1.15 * 19.44%* 1.16 * 19.04%*
Amean 7 1.43 ( 0.00%) 1.18 * 17.53%* 1.16 * 19.32%* 1.15 * 19.72%* 1.17 * 18.33%* 1.17 * 18.13%* 1.16 * 18.92%* 1.15 * 19.62%*
Amean 8 1.43 ( 0.00%) 1.17 * 18.46%* 1.16 * 19.16%* 1.16 * 18.96%* 1.17 * 18.46%* 1.18 * 17.86%* 1.15 * 19.36%* 1.16 * 19.06%*
Stddev 1 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 22.54%) 0.01 ( -9.54%) 0.01 (-119.09%) 0.01 ( -67.33%) 0.01 (-149.00%) 0.01 ( -84.39%)
Stddev 3 0.01 ( 0.00%) 0.01 ( 12.29%) 0.02 (-105.69%) 0.01 ( 0.00%) 0.01 ( 0.00%) 0.05 (-515.82%) 0.01 ( -27.10%) 0.01 ( -41.42%)
Stddev 5 0.00 ( 0.00%) 0.04 (-700.00%) 0.01 ( -61.25%) 0.01 ( -54.92%) 0.04 (-717.31%) 0.01 ( -84.39%) 0.01 ( -61.25%) 0.01 ( -18.32%)
Stddev 7 0.01 ( 0.00%) 0.04 (-390.68%) 0.01 ( 3.92%) 0.00 ( 51.96%) 0.04 (-366.58%) 0.01 ( 0.00%) 0.02 (-151.15%) 0.01 ( 3.92%)
Stddev 8 0.00 ( 0.00%) 0.00 ( -29.10%) 0.01 (-151.66%) 0.02 (-383.05%) 0.05 (-1100.00%) 0.01 (-108.17%) 0.01 ( -41.42%) 0.01 (-138.05%)
v v VMx2 VMx2 VMx2 VMx2 VMx2 VMx2
VMx2-HT VMx2-noHT HT noHT csc-HT csc_stallfix-HT csc_tim-HT csc_vruntime-HT
Amean 1 1.16 ( 0.00%) 1.22 * -5.80%* 1.54 * -32.84%* 1.22 * -5.31%* 2.83 (-144.32%) 1.35 * -16.42%* 1.69 * -46.30%* 1.35 * -16.91%*
Amean 3 1.16 ( 0.00%) 1.24 * -6.90%* 1.56 * -34.11%* 1.24 * -6.53%* 1.89 * -63.05%* 1.95 * -68.35%* 1.67 * -43.84%* 1.33 * -14.41%*
Amean 5 1.17 ( 0.00%) 1.21 * -3.79%* 1.52 * -30.48%* 1.23 * -5.75%* 1.43 * -22.89%* 1.85 ( -58.51%) 1.71 * -46.88%* 1.35 * -15.91%*
Amean 7 1.15 ( 0.00%) 1.23 * -6.81%* 1.54 * -33.29%* 1.24 * -7.05%* 1.68 * -45.79%* 1.39 * -20.79%* 2.09 * -80.69%* 1.30 * -12.62%*
Amean 12 1.16 ( 0.00%) 1.25 * -7.25%* 1.69 * -45.09%* 1.22 * -5.28%* 1.72 * -47.54%* 1.52 * -31.08%* 2.08 * -78.87%* 1.32 * -13.14%*
Amean 16 1.16 ( 0.00%) 1.23 * -6.14%* 1.68 * -44.23%* 1.22 * -4.67%* 1.80 * -54.91%* 1.37 * -17.44%* 2.18 * -87.10%* 1.28 * -9.95%*
Stddev 1 0.01 ( 0.00%) 0.05 (-569.58%) 0.14 (-1693.97%) 0.02 (-169.26%) 3.19 (-42121.82%) 0.25 (-3271.57%) 0.26 (-3362.06%) 0.09 (-1072.60%)
Stddev 3 0.01 ( 0.00%) 0.06 (-664.85%) 0.06 (-630.95%) 0.03 (-259.56%) 0.84 (-10177.51%) 0.98 (-11852.56%) 0.45 (-5404.28%) 0.08 (-845.29%)
Stddev 5 0.00 ( 0.00%) 0.02 (-393.96%) 0.17 (-3371.31%) 0.07 (-1423.81%) 0.32 (-6361.11%) 1.04 (-21129.08%) 0.22 (-4383.53%) 0.06 (-1033.14%)
Stddev 7 0.01 ( 0.00%) 0.07 (-1234.79%) 0.20 (-3727.10%) 0.08 (-1483.25%) 0.57 (-10594.70%) 0.19 (-3455.98%) 0.30 (-5454.88%) 0.08 (-1400.56%)
Stddev 12 0.00 ( 0.00%) 0.08 (-1641.84%) 0.16 (-3078.68%) 0.03 (-524.50%) 0.50 (-10161.87%) 0.08 (-1634.36%) 0.32 (-6423.80%) 0.07 (-1232.67%)
Stddev 16 0.00 ( 0.00%) 0.05 (-996.36%) 0.20 (-3958.82%) 0.06 (-1045.43%) 0.60 (-12116.06%) 0.23 (-4642.99%) 0.23 (-4594.04%) 0.05 (-842.34%)

This is all quite bizarre. Judging from these numbers, the benchmark
seems to be not sensitive to HT at all (while, e.g., sysbenchcpu was)
when on baremetal. In VMs, at least in most cases, things are
significantly *faster* with HT off. I know that is something that can
happen; I just was not expecting it from this workload.

Also, core scheduling is a total disaster on baremetal. And its
behavior in VMs is an anti-pattern too.

So, I guess I'll go and try to see if I did something wrong when
configuring or running this particular benchmark. If that is not the
case, then core scheduling has a serious issue when dealing with
threads rather than with processes, I would say.

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-10-29 20:36:06

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 29-Oct-2019 10:20:57 AM, Dario Faggioli wrote:
> > Hello,
> >
> > As anticipated, I've been trying to follow the development of this
> > feature and, in the meantime, I have done some benchmarks.
> >
> > I actually have a lot of data (and am planning for more), so I am
> > sending a few emails, each one with a subset of the numbers in it,
> > instead than just one which would be beyond giant! :-)
> >

Hi Dario,

Thank you for this comprehensive set of tests and analyses!

It confirms the trend we are seeing for the VM cases. Basically, when
the CPUs are overcommitted, core scheduling helps compared to noHT. But
when we have I/O in the mix (sysbench-oltp), then it becomes a bit less
clear; it depends on whether the CPU is still overcommitted or not.
About the 2nd VM that is generating the background noise: is it enough
to fill up the disk queues, or is its disk throughput somewhat limited?
Have you compared the results with the disk noise disabled?

Our approach for bare-metal tests is a bit different: we are
constraining a set of processes to only a limited set of CPUs (see the
sketch below), but I like your approach because it pushes more
processes against the whole system. And I have no explanation for why
sysbench thread vs process is so different.
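
(To be concrete, by "constraining" I mean something along the lines of
the sketch below; the cgroup path and the CPU list are only an
illustration, not our exact setup.)

  # Illustration only: confine the benchmark to CPUs 0-3 with a v1 cpuset.
  mkdir -p /sys/fs/cgroup/cpuset/bench
  echo 0-3 > /sys/fs/cgroup/cpuset/bench/cpuset.cpus
  echo 0   > /sys/fs/cgroup/cpuset/bench/cpuset.mems
  echo $$  > /sys/fs/cgroup/cpuset/bench/tasks
  # ...then launch the benchmark from this shell.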

And it also confirms that core scheduling has trouble scaling with the
number of threads: it works pretty well in VMs, because the number of
threads is limited by the number of vcpus, but the bare-metal cases
show a major scaling issue (which is not too surprising).

I am curious: for the tagging in KVM, do you move all the vcpus into
the same cgroup before tagging? Did you leave the emulator threads
untagged at all times?

For the overhead (without tagging), have you tried bisecting the
patchset to see which patch introduces the overhead? It is more than I
had in mind.

And for the cases where core scheduling improves the performance
compared to the baseline numbers, could it be related to frequency
scaling (more work to do means a higher chance of running at a higher
frequency)?

We are almost ready to send the v4 patchset (most likely tomorrow); it
has been rebased on v5.3.5, so stay tuned and ready for another set of
tests ;-)

Thanks,

Julien

2019-10-29 22:11:49

by Julien Desfossez

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 18-Sep-2019 01:40:58 PM, Tim Chen wrote:
> I think the test that's of interest is to see my load balancing added on top
> of Aaron's fairness patch, instead of using my previous version of
> forced idle approach in coresched-v3-v5.1.5-test-tim branch.
>
> I've added my two load balance patches on top of Aaron's patches
> in coresched-v3-v5.1.5-test-core_vruntime branch and put it in
>
> https://github.com/pdxChen/gang/tree/coresched-v3-v5.1.5-test-core_vruntime-lb

We have been trying to benchmark with the load balancer patches and have
experienced some hard lockups with the saturated test cases, but we
don't have traces for now.

Since we are mostly focused on testing the rebased v4 currently, we will
post it without those patches and then we can try to debug more.

Thanks,

Julien

2019-11-01 21:43:43

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 10/29/19 1:40 PM, Julien Desfossez wrote:
> On 18-Sep-2019 01:40:58 PM, Tim Chen wrote:
>> I think the test that's of interest is to see my load balancing added on top
>> of Aaron's fairness patch, instead of using my previous version of
>> forced idle approach in coresched-v3-v5.1.5-test-tim branch.
>>
>> I've added my two load balance patches on top of Aaron's patches
>> in coresched-v3-v5.1.5-test-core_vruntime branch and put it in
>>
>> https://github.com/pdxChen/gang/tree/coresched-v3-v5.1.5-test-core_vruntime-lb
>
> We have been trying to benchmark with the load balancer patches and have
> experienced some hard lockups with the saturated test cases, but we
> don't have traces for now.
>
> Since we are mostly focused on testing the rebased v4 currently, we will
> post it without those patches and then we can try to debug more.
>

Aubrey has been experimenting with a couple of patches that try to
move tasks to a CPU with a matching cookie on wakeup and load balance.
They are much simpler than mine and got better performance on the
workload he was testing. He'll rebase those patches on v4 and post
them.

Tim

2019-11-15 16:33:11

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Tue, 2019-10-29 at 16:34 -0400, Julien Desfossez wrote:
> On 29-Oct-2019 10:20:57 AM, Dario Faggioli wrote:
> > > Hello,
> > >
> > > As anticipated, I've been trying to follow the development of this
> > > feature and, in the meantime, I have done some benchmarks.
>
> Hi Dario,
>
Hi!

> Thank you for this comprehensive set of tests and analyses!
>
Sure. And sorry for replying so late. I was travelling (for speaking
about core scheduling and virtualization at KVMForum :-P) and, after
that, had some catching up to do.

> It confirms the trend we are seeing for the VM cases. Basically when
> the CPUs are overcommitted, core scheduling helps compared to noHT.
>
Yep.

> But when we have I/O in the mix (sysbench-oltp), then it becomes a bit
> less clear, it depends if the CPU is still overcommitted or not. About
> the 2nd VM that is doing the background noise, is it enough to fill up
> the disk queues or is its disk throughput somewhat limited? Have you
> compared the results if you disable the disk noise?
>
There was some IO, but it was mostly CPU noise. Anyway, sure, I can
repeat the experiments with different kinds of noise. TBH, I also have
other ideas for different setups. And of course, I'll work on v4 now.

> Our approach for bare-metal tests is a bit different, we are
> constraining a set of processes only on a limited set of cpus, but I
> like your approach because it pushes a larger number of processes
> against the whole system.
>
Yes, and this time I deliberately chose a small system, to avoid
having NUMA effects, etc. But I'm working toward running the evaluation
on a bigger box.

> I am curious, for the tagging in KVM, do you move all the vcpus into
> the same cgroup before tagging? Did you leave the emulator threads
> untagged at all times?
>
So, for this round, yes, all the vcpus of the VM were put in the same
cgroup, and then I set the tag for it.

All the other threads that libvirt creates were left out of that group
(and were, hence, untagged). I did a few manual runs with _all_ the
tasks related to a VM in a tagged cgroup, but I did not see much
difference (that's why the numbers for these runs are not reported).

The VM did not have any virtual topology defined.
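
For reference, a minimal sketch of that setup could look like the
program below. It assumes a cgroup-v1 cpu controller mounted at
/sys/fs/cgroup/cpu and that the tagging knob is the cpu.tag file from
the cgroup tagging patch of this series; the "vm0" group name and
passing the vCPU TIDs on the command line are purely illustrative.

/* tag_vm.c - minimal sketch: move a VM's vCPU threads into one tagged
 * cgroup. Assumed (not taken from this thread): cgroup-v1 cpu
 * controller mounted at /sys/fs/cgroup/cpu and a "cpu.tag" file from
 * the cgroup tagging patch. Usage: ./tag_vm <vcpu-tid>...
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(int argc, char **argv)
{
        int i;

        /* One group for the whole VM; the emulator threads are not
         * moved, so they stay untagged, as described above.
         */
        mkdir("/sys/fs/cgroup/cpu/vm0", 0755);

        for (i = 1; i < argc; i++)
                write_str("/sys/fs/cgroup/cpu/vm0/tasks", argv[i]);

        /* Tagging the group gives all its tasks a shared core cookie. */
        write_str("/sys/fs/cgroup/cpu/vm0/cpu.tag", "1");

        return 0;
}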

And in fact, one thing that I want to try is to put pairs of vcpus in
the same cgroup, tag it, and define a virtual HT topology for the VM
(i.e., expose the two vcpus that share a tagged cgroup to the guest as
threads of the same core).
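
A rough sketch of the host-side part of that experiment, under the same
assumptions about the cgroup mount point and the cpu.tag file as in the
sketch above, could be something like this (the matching threads=2
guest topology would be set in the VM definition and is not shown):

/* pair_vcpus.c - minimal sketch: one tagged cgroup per pair of vCPUs,
 * i.e. per virtual core, so only sibling vCPUs may end up sharing a
 * physical core. Usage: ./pair_vcpus <vcpu-tid>... with the TIDs in
 * guest vCPU order and an even count. Paths are assumptions, as above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(int argc, char **argv)
{
        char dir[128], file[192];
        int n, pairs = (argc - 1) / 2;

        for (n = 0; n < pairs; n++) {
                /* vCPUs 2n and 2n+1 form virtual core n. */
                snprintf(dir, sizeof(dir),
                         "/sys/fs/cgroup/cpu/vm0-core%d", n);
                mkdir(dir, 0755);

                snprintf(file, sizeof(file), "%s/tasks", dir);
                write_str(file, argv[1 + 2 * n]);
                write_str(file, argv[2 + 2 * n]);

                /* Each tagged group gets its own cookie, so only the
                 * two vCPUs of a virtual core can share a real core.
                 */
                snprintf(file, sizeof(file), "%s/cpu.tag", dir);
                write_str(file, "1");
        }

        return 0;
}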

> For the overhead (without tagging), have you tried bisecting the
> patchset to see which patch introduces the overhead? It is more than I
> had in mind.
>
Yes, there is definitely something weird. Well, in the meantime, I
improved my automated procedure for running the benchmarks. I'll rerun
on v4. And I'll do a bisect if the overhead is still that big.

> And for the cases where core scheduling improves the performance
> compared to the baseline numbers, could it be related to frequency
> scaling (more work to do means a higher chance of running at a higher
> frequency)?
>
The governor was 'performance' during all the experiments. But yes,
since it's intel_pstate that is in charge, the frequency can still vary,
and something like what you suggest may indeed be happening, I think.
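
One way to check that, sketched below assuming the standard cpufreq
sysfs layout, would be to sample the per-CPU frequency during the runs
and compare the averages between the configurations:

/* freq_sample.c - minimal sketch: print scaling_cur_freq (kHz) for the
 * first N CPUs once per second; run it alongside a benchmark and
 * compare the averages across configurations. Stop with Ctrl-C.
 * Usage: ./freq_sample <ncpus>
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int cpu, ncpus = argc > 1 ? atoi(argv[1]) : 1;
        char path[128], buf[64];

        for (;;) {
                for (cpu = 0; cpu < ncpus; cpu++) {
                        snprintf(path, sizeof(path),
                                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq",
                                 cpu);
                        FILE *f = fopen(path, "r");

                        if (f && fgets(buf, sizeof(buf), f))
                                printf("cpu%d %s", cpu, buf);
                        if (f)
                                fclose(f);
                }
                sleep(1);
        }
        return 0;
}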

> We are almost ready to send the v4 patchset (most likely tomorrow); it
> has been rebased on v5.3.5, so stay tuned and ready for another set of
> tests ;-)
>
Already on it. :-)

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

