2020-10-20 18:18:13

by Joel Fernandes

Subject: [PATCH v8 -tip 00/26] Core scheduling

Eighth iteration of the Core-Scheduling feature.

Core scheduling is a feature that allows only trusted tasks to run
concurrently on cpus sharing compute resources (e.g. hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks
without requiring SMT to be disabled (which has a significant impact on
performance in some situations). Core scheduling (as of v7) mitigates
user-space to user-space attacks and user-to-kernel attacks when one of
the siblings enters the kernel via interrupts or system calls.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a tag
is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides security concerns, this feature can
also be beneficial for RT and performance applications where we want to
control how tasks make use of SMT dynamically.

This iteration focuses on the following:
- Redesigned API.
- Rework of Kernel Protection feature based on Thomas's entry work.
- Rework of hotplug fixes.
- Address review comments in v7

Joel: Both a CGroup interface and a per-task interface via prctl(2) are
provided for configuring core sharing. More details are in the documentation
patch. Kselftests are provided to verify the correctness/rules of the
interface.
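
As a rough illustration of the per-task path, a userspace sketch might look
like the one below. The prctl command name and value are assumptions for
illustration only (the real definitions are added to include/uapi/linux/prctl.h
by the interface patch in this series, and the exact semantics are in the
documentation patch):

	/* Sketch: share a core-scheduling cookie with another task. */
	#include <stdio.h>
	#include <sys/prctl.h>
	#include <sys/types.h>

	#ifndef PR_SCHED_CORE_SHARE
	#define PR_SCHED_CORE_SHARE 59	/* assumed value, for illustration only */
	#endif

	int main(void)
	{
		pid_t pid = 1234;	/* hypothetical target task */

		/* After this, the calling task and 'pid' share a cookie and
		 * may run concurrently on SMT siblings of the same core. */
		if (prctl(PR_SCHED_CORE_SHARE, pid, 0, 0, 0) == -1)
			perror("prctl(PR_SCHED_CORE_SHARE)");

		return 0;
	}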

Julien: TPCC tests showed improvements with core-scheduling. With kernel
protection enabled, it does not show any regression compared to nosmt.
Possibly ASI will improve performance for those who choose kernel protection
(it can be toggled through the sched_core_protect_kernel sysctl). Results:
v8                               average     stdev         diff
baseline (SMT on)               1197.272     44.78312824
core sched (kernel protect)      412.9895    45.42734343   -65.51%
core sched (no kernel protect)   686.6515    71.77756931   -42.65%
nosmt                            408.667     39.39042872   -65.87%

v8 is rebased on tip/master.

Future work
===========
- Load balancing/Migration fixes for core scheduling.
With v6, load balancing is partially coresched-aware, but has some
issues w.r.t. process/taskgroup weights:
https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
- Core scheduling test framework: kselftests, torture tests etc

Changes in v8
=============
- New interface/API implementation
- Joel
- Revised kernel protection patch
- Joel
- Revised Hotplug fixes
- Joel
- Minor bug fixes and address review comments
- Vineeth

Changes in v7
=============
- Kernel protection from untrusted usermode tasks
- Joel, Vineeth
- Fix for hotplug crashes and hangs
- Joel, Vineeth

Changes in v6
=============
- Documentation
- Joel
- Pause siblings on entering nmi/irq/softirq
- Joel, Vineeth
- Fix for RCU crash
- Joel
- Fix for a crash in pick_next_task
- Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
- Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
- Joel, Vineeth

Changes in v5
=============
- Fixes for cgroup/process tagging during corner cases like cgroup
destroy, tasks moving across cgroups, etc.
- Tim Chen
- Coresched aware task migrations
- Aubrey Li
- Other minor stability fixes.

Changes in v4
=============
- Implement a core-wide min_vruntime for vruntime comparison of tasks
across cpus in a core.
- Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
- Aaron Lu

Changes in v3
=============
- Fixes the issue of a sibling picking up an incompatible task
- Aaron Lu
- Vineeth Pillai
- Julien Desfossez
- Fixes the issue of starving threads due to forced idle
- Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
- Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
- Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
- Aaron Lu

Changes in v2
=============
- Fixes for a couple of NULL pointer dereference crashes
- Subhra Mazumdar
- Tim Chen
- Improves priority comparison logic for processes on different cpus
- Peter Zijlstra
- Aaron Lu
- Fixes a hard lockup in rq locking
- Vineeth Pillai
- Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
- Vineeth Pillai
- Julien Desfossez
- Fix for 32bit build
- Aubrey Li

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (13):
sched/fair: Snapshot the min_vruntime of CPUs on force idle
arch/x86: Add a new TIF flag for untrusted tasks
kernel/entry: Add support for core-wide protection of kernel-mode
entry/idle: Enter and exit kernel protection during idle entry and
exit
sched: Split the cookie and setup per-task cookie on fork
sched: Add a per-thread core scheduling interface
sched: Add a second-level tag for nested CGroup usecase
sched: Release references to the per-task cookie on exit
sched: Handle task addition to CGroup
sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
kselftest: Add tests for core-sched interface
sched: Move core-scheduler interfacing code to a new file
Documentation: Add core scheduling documentation

Peter Zijlstra (10):
sched: Wrap rq::lock access
sched: Introduce sched_class::pick_task()
sched: Core-wide rq->lock
sched/fair: Add a few assertions
sched: Basic tracking of matching tasks
sched: Add core wide task selection and scheduling.
sched: Trivial forced-newidle balancer
irq_work: Cleanup
sched: cgroup tagging interface for core scheduling
sched: Debug bits...

Vineeth Pillai (2):
sched/fair: Fix forced idle sibling starvation corner case
entry/kvm: Protect the kernel when entering from guest

.../admin-guide/hw-vuln/core-scheduling.rst | 312 +++++
Documentation/admin-guide/hw-vuln/index.rst | 1 +
.../admin-guide/kernel-parameters.txt | 7 +
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/kvm/x86.c | 3 +
drivers/gpu/drm/i915/i915_request.c | 4 +-
include/linux/entry-common.h | 20 +-
include/linux/entry-kvm.h | 12 +
include/linux/irq_work.h | 33 +-
include/linux/irqflags.h | 4 +-
include/linux/sched.h | 27 +-
include/uapi/linux/prctl.h | 3 +
kernel/Kconfig.preempt | 6 +
kernel/bpf/stackmap.c | 2 +-
kernel/entry/common.c | 25 +-
kernel/entry/kvm.c | 13 +
kernel/fork.c | 1 +
kernel/irq_work.c | 18 +-
kernel/printk/printk.c | 6 +-
kernel/rcu/tree.c | 3 +-
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 1135 ++++++++++++++++-
kernel/sched/coretag.c | 468 +++++++
kernel/sched/cpuacct.c | 12 +-
kernel/sched/deadline.c | 34 +-
kernel/sched/debug.c | 8 +-
kernel/sched/fair.c | 272 ++--
kernel/sched/idle.c | 24 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 22 +-
kernel/sched/sched.h | 302 ++++-
kernel/sched/stop_task.c | 13 +-
kernel/sched/topology.c | 4 +-
kernel/sys.c | 3 +
kernel/time/tick-sched.c | 6 +-
kernel/trace/bpf_trace.c | 2 +-
tools/include/uapi/linux/prctl.h | 3 +
tools/testing/selftests/sched/.gitignore | 1 +
tools/testing/selftests/sched/Makefile | 14 +
tools/testing/selftests/sched/config | 1 +
.../testing/selftests/sched/test_coresched.c | 840 ++++++++++++
41 files changed, 3437 insertions(+), 232 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.29.0.rc1.297.gfa9743e501-goog


2020-10-20 18:18:18

by Joel Fernandes

Subject: [PATCH v8 -tip 01/26] sched: Wrap rq::lock access

From: Peter Zijlstra <[email protected]>

In preparation of playing games with rq->lock, abstract the thing
using an accessor.
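
For context, the accessor is trivial at this point (see the sched.h hunk
below); the point of introducing it is that a later patch in this series can
redirect it to a core-wide lock when core scheduling is enabled. A rough
sketch of where this is headed (field names as used later in the series):

	static inline raw_spinlock_t *rq_lockp(struct rq *rq)
	{
		if (sched_core_enabled(rq))
			return &rq->core->__lock;

		return &rq->__lock;
	}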

Tested-by: Julien Desfossez <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/core.c | 46 +++++++++---------
kernel/sched/cpuacct.c | 12 ++---
kernel/sched/deadline.c | 18 +++----
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 38 +++++++--------
kernel/sched/idle.c | 4 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 8 +--
kernel/sched/sched.h | 105 +++++++++++++++++++++-------------------
kernel/sched/topology.c | 4 +-
10 files changed, 122 insertions(+), 119 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2003a7d5ab5..97181b3d12eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,12 +186,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)

for (;;) {
rq = task_rq(p);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
rq_pin_lock(rq, rf);
return rq;
}
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));

while (unlikely(task_on_rq_migrating(p)))
cpu_relax();
@@ -210,7 +210,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
for (;;) {
raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
rq = task_rq(p);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
/*
* move_queued_task() task_rq_lock()
*
@@ -232,7 +232,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
rq_pin_lock(rq, rf);
return rq;
}
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);

while (unlikely(task_on_rq_migrating(p)))
@@ -302,7 +302,7 @@ void update_rq_clock(struct rq *rq)
{
s64 delta;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
@@ -611,7 +611,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (test_tsk_need_resched(curr))
return;
@@ -635,10 +635,10 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
resched_curr(rq);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}

#ifdef CONFIG_SMP
@@ -1137,7 +1137,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
struct uclamp_bucket *bucket;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

/* Update task effective clamp */
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -1177,7 +1177,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
unsigned int bkt_clamp;
unsigned int rq_clamp;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

/*
* If sched_uclamp_used was enabled after task @p was enqueued,
@@ -1733,7 +1733,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
struct task_struct *p, int new_cpu)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

deactivate_task(rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, new_cpu);
@@ -1845,7 +1845,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
* Because __kthread_bind() calls this on blocked tasks without
* holding rq->lock.
*/
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
}
if (running)
@@ -1982,7 +1982,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
* task_rq_lock().
*/
WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
- lockdep_is_held(&task_rq(p)->lock)));
+ lockdep_is_held(rq_lockp(task_rq(p)))));
#endif
/*
* Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -2493,7 +2493,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
{
int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (p->sched_contributes_to_load)
rq->nr_uninterruptible--;
@@ -3495,10 +3495,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
* do an early lockdep release here:
*/
rq_unpin_lock(rq, rf);
- spin_release(&rq->lock.dep_map, _THIS_IP_);
+ spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
#ifdef CONFIG_DEBUG_SPINLOCK
/* this is a valid case when another task releases the spinlock */
- rq->lock.owner = next;
+ rq_lockp(rq)->owner = next;
#endif
}

@@ -3509,8 +3509,8 @@ static inline void finish_lock_switch(struct rq *rq)
* fix up the runqueue lock - which gets 'carried over' from
* prev into current:
*/
- spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
- raw_spin_unlock_irq(&rq->lock);
+ spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+ raw_spin_unlock_irq(rq_lockp(rq));
}

/*
@@ -3660,7 +3660,7 @@ static void __balance_callback(struct rq *rq)
void (*func)(struct rq *rq);
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
head = rq->balance_callback;
rq->balance_callback = NULL;
while (head) {
@@ -3671,7 +3671,7 @@ static void __balance_callback(struct rq *rq)

func(rq);
}
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}

static inline void balance_callback(struct rq *rq)
@@ -6521,7 +6521,7 @@ void init_idle(struct task_struct *idle, int cpu)
__sched_fork(0, idle);

raw_spin_lock_irqsave(&idle->pi_lock, flags);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));

idle->state = TASK_RUNNING;
idle->se.exec_start = sched_clock();
@@ -6559,7 +6559,7 @@ void init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP
idle->on_cpu = 1;
#endif
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&idle->pi_lock, flags);

/* Set the preempt count _outside_ the spinlocks! */
@@ -7131,7 +7131,7 @@ void __init sched_init(void)
struct rq *rq;

rq = cpu_rq(i);
- raw_spin_lock_init(&rq->lock);
+ raw_spin_lock_init(&rq->__lock);
rq->nr_running = 0;
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 941c28cf9738..38c1a68e91f0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -112,7 +112,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
/*
* Take rq->lock to make 64-bit read safe on 32-bit platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

if (index == CPUACCT_STAT_NSTATS) {
@@ -126,7 +126,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
}

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif

return data;
@@ -141,14 +141,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
/*
* Take rq->lock to make 64-bit write safe on 32-bit platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
cpuusage->usages[i] = val;

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif
}

@@ -253,13 +253,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
* Take rq->lock to make 64-bit read safe on 32-bit
* platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

seq_printf(m, " %llu", cpuusage->usages[index]);

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif
}
seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6d93f4518734..814ec49502b1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -119,7 +119,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -132,7 +132,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->running_bw -= dl_bw;
SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
if (dl_rq->running_bw > old)
@@ -146,7 +146,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->this_bw += dl_bw;
SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
}
@@ -156,7 +156,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->this_bw -= dl_bw;
SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
if (dl_rq->this_bw > old)
@@ -966,7 +966,7 @@ static int start_dl_timer(struct task_struct *p)
ktime_t now, act;
s64 delta;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

/*
* We want the timer to fire at the deadline, but considering
@@ -1076,9 +1076,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
* If the runqueue is no longer available, migrate the
* task elsewhere. This necessarily changes rq.
*/
- lockdep_unpin_lock(&rq->lock, rf.cookie);
+ lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
rq = dl_task_offline_migration(rq, p);
- rf.cookie = lockdep_pin_lock(&rq->lock);
+ rf.cookie = lockdep_pin_lock(rq_lockp(rq));
update_rq_clock(rq);

/*
@@ -1727,7 +1727,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
* from try_to_wake_up(). Hence, p->pi_lock is locked, but
* rq->lock is not... So, lock it
*/
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
if (p->dl.dl_non_contending) {
sub_running_bw(&p->dl, &rq->dl);
p->dl.dl_non_contending = 0;
@@ -1742,7 +1742,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
put_task_struct(p);
}
sub_rq_bw(&p->dl, &rq->dl);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 0655524700d2..c8fee8d9dfd4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -551,7 +551,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "exec_clock",
SPLIT_NS(cfs_rq->exec_clock));

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (rb_first_cached(&cfs_rq->tasks_timeline))
MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
last = __pick_last_entity(cfs_rq);
@@ -559,7 +559,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
max_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "MIN_vruntime",
SPLIT_NS(MIN_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa4c6227cd6d..dbd9368a959d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1101,7 +1101,7 @@ struct numa_group {
static struct numa_group *deref_task_numa_group(struct task_struct *p)
{
return rcu_dereference_check(p->numa_group, p == current ||
- (lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+ (lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
}

static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5291,7 +5291,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
{
struct task_group *tg;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5310,7 +5310,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
{
struct task_group *tg;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6772,7 +6772,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
* rq->lock and can modify state directly.
*/
- lockdep_assert_held(&task_rq(p)->lock);
+ lockdep_assert_held(rq_lockp(task_rq(p)));
detach_entity_cfs_rq(&p->se);

} else {
@@ -7400,7 +7400,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
{
s64 delta;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

if (p->sched_class != &fair_sched_class)
return 0;
@@ -7498,7 +7498,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
int tsk_cache_hot;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

/*
* We do not migrate tasks that are:
@@ -7576,7 +7576,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
*/
static void detach_task(struct task_struct *p, struct lb_env *env)
{
- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, env->dst_cpu);
@@ -7592,7 +7592,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
{
struct task_struct *p;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

list_for_each_entry_reverse(p,
&env->src_rq->cfs_tasks, se.group_node) {
@@ -7628,7 +7628,7 @@ static int detach_tasks(struct lb_env *env)
struct task_struct *p;
int detached = 0;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

if (env->imbalance <= 0)
return 0;
@@ -7750,7 +7750,7 @@ static int detach_tasks(struct lb_env *env)
*/
static void attach_task(struct rq *rq, struct task_struct *p)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

BUG_ON(task_rq(p) != rq);
activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9684,7 +9684,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (need_active_balance(&env)) {
unsigned long flags;

- raw_spin_lock_irqsave(&busiest->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(busiest), flags);

/*
* Don't kick the active_load_balance_cpu_stop,
@@ -9692,7 +9692,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* moved to this_cpu:
*/
if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
- raw_spin_unlock_irqrestore(&busiest->lock,
+ raw_spin_unlock_irqrestore(rq_lockp(busiest),
flags);
env.flags |= LBF_ALL_PINNED;
goto out_one_pinned;
@@ -9708,7 +9708,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
busiest->push_cpu = this_cpu;
active_balance = 1;
}
- raw_spin_unlock_irqrestore(&busiest->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);

if (active_balance) {
stop_one_cpu_nowait(cpu_of(busiest),
@@ -10460,7 +10460,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
time_before(jiffies, READ_ONCE(nohz.next_blocked)))
return;

- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));
/*
* This CPU is going to be idle and blocked load of idle CPUs
* need to be updated. Run the ilb locally as it is a good
@@ -10469,7 +10469,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
*/
if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
kick_ilb(NOHZ_STATS_KICK);
- raw_spin_lock(&this_rq->lock);
+ raw_spin_lock(rq_lockp(this_rq));
}

#else /* !CONFIG_NO_HZ_COMMON */
@@ -10535,7 +10535,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
goto out;
}

- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));

update_blocked_averages(this_cpu);
rcu_read_lock();
@@ -10573,7 +10573,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}
rcu_read_unlock();

- raw_spin_lock(&this_rq->lock);
+ raw_spin_lock(rq_lockp(this_rq));

if (curr_cost > this_rq->max_idle_balance_cost)
this_rq->max_idle_balance_cost = curr_cost;
@@ -11049,9 +11049,9 @@ void unregister_fair_sched_group(struct task_group *tg)

rq = cpu_rq(cpu);

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}
}

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f324dc36fc43..8ce6e80352cf 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -421,10 +421,10 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
static void
dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
{
- raw_spin_unlock_irq(&rq->lock);
+ raw_spin_unlock_irq(rq_lockp(rq));
printk(KERN_ERR "bad: scheduling from the idle thread!\n");
dump_stack();
- raw_spin_lock_irq(&rq->lock);
+ raw_spin_lock_irq(rq_lockp(rq));
}

/*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 795e43e02afc..e850bd71a8ce 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -141,7 +141,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)

static inline u64 rq_clock_pelt(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f215eea6a966..e57fca05b660 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -887,7 +887,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
if (skip)
continue;

- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
update_rq_clock(rq);

if (rt_rq->rt_time) {
@@ -925,7 +925,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)

if (enqueue)
sched_rt_rq_enqueue(rt_rq);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -2094,9 +2094,9 @@ void rto_push_irq_work_func(struct irq_work *work)
* When it gets updated, a check is made if a push is possible.
*/
if (has_pushable_tasks(rq)) {
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
push_rt_tasks(rq);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

raw_spin_lock(&rd->rto_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df80bfcea92e..587ebabebaff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -894,7 +894,7 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
*/
struct rq {
/* runqueue lock: */
- raw_spinlock_t lock;
+ raw_spinlock_t __lock;

/*
* nr_running and cpu_load should be in the same cacheline because
@@ -1075,6 +1075,10 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+ return &rq->__lock;
+}

#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);
@@ -1142,7 +1146,7 @@ static inline void assert_clock_updated(struct rq *rq)

static inline u64 rq_clock(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock;
@@ -1150,7 +1154,7 @@ static inline u64 rq_clock(struct rq *rq)

static inline u64 rq_clock_task(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock_task;
@@ -1176,7 +1180,7 @@ static inline u64 rq_clock_thermal(struct rq *rq)

static inline void rq_clock_skip_update(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
rq->clock_update_flags |= RQCF_REQ_SKIP;
}

@@ -1186,7 +1190,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
*/
static inline void rq_clock_cancel_skipupdate(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
rq->clock_update_flags &= ~RQCF_REQ_SKIP;
}

@@ -1215,7 +1219,7 @@ struct rq_flags {
*/
static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
- rf->cookie = lockdep_pin_lock(&rq->lock);
+ rf->cookie = lockdep_pin_lock(rq_lockp(rq));

#ifdef CONFIG_SCHED_DEBUG
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1230,12 +1234,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
rf->clock_update_flags = RQCF_UPDATED;
#endif

- lockdep_unpin_lock(&rq->lock, rf->cookie);
+ lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
}

static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
{
- lockdep_repin_lock(&rq->lock, rf->cookie);
+ lockdep_repin_lock(rq_lockp(rq), rf->cookie);

#ifdef CONFIG_SCHED_DEBUG
/*
@@ -1256,7 +1260,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static inline void
@@ -1265,7 +1269,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
__releases(p->pi_lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}

@@ -1273,7 +1277,7 @@ static inline void
rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock_irqsave(&rq->lock, rf->flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
rq_pin_lock(rq, rf);
}

@@ -1281,7 +1285,7 @@ static inline void
rq_lock_irq(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock_irq(&rq->lock);
+ raw_spin_lock_irq(rq_lockp(rq));
rq_pin_lock(rq, rf);
}

@@ -1289,7 +1293,7 @@ static inline void
rq_lock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
rq_pin_lock(rq, rf);
}

@@ -1297,7 +1301,7 @@ static inline void
rq_relock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
rq_repin_lock(rq, rf);
}

@@ -1306,7 +1310,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
}

static inline void
@@ -1314,7 +1318,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock_irq(&rq->lock);
+ raw_spin_unlock_irq(rq_lockp(rq));
}

static inline void
@@ -1322,7 +1326,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static inline struct rq *
@@ -1387,7 +1391,7 @@ queue_balance_callback(struct rq *rq,
struct callback_head *head,
void (*func)(struct rq *rq))
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (unlikely(head->next))
return;
@@ -2091,7 +2095,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));
double_rq_lock(this_rq, busiest);

return 1;
@@ -2110,20 +2114,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
- int ret = 0;
-
- if (unlikely(!raw_spin_trylock(&busiest->lock))) {
- if (busiest < this_rq) {
- raw_spin_unlock(&this_rq->lock);
- raw_spin_lock(&busiest->lock);
- raw_spin_lock_nested(&this_rq->lock,
- SINGLE_DEPTH_NESTING);
- ret = 1;
- } else
- raw_spin_lock_nested(&busiest->lock,
- SINGLE_DEPTH_NESTING);
+ if (rq_lockp(this_rq) == rq_lockp(busiest))
+ return 0;
+
+ if (likely(raw_spin_trylock(rq_lockp(busiest))))
+ return 0;
+
+ if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+ raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+ return 0;
}
- return ret;
+
+ raw_spin_unlock(rq_lockp(this_rq));
+ raw_spin_lock(rq_lockp(busiest));
+ raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+ return 1;
}

#endif /* CONFIG_PREEMPTION */
@@ -2133,11 +2139,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
*/
static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
- if (unlikely(!irqs_disabled())) {
- /* printk() doesn't work well under rq->lock */
- raw_spin_unlock(&this_rq->lock);
- BUG_ON(1);
- }
+ lockdep_assert_irqs_disabled();

return _double_lock_balance(this_rq, busiest);
}
@@ -2145,8 +2147,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
__releases(busiest->lock)
{
- raw_spin_unlock(&busiest->lock);
- lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+ if (rq_lockp(this_rq) != rq_lockp(busiest))
+ raw_spin_unlock(rq_lockp(busiest));
+ lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
}

static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2187,16 +2190,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
__acquires(rq2->lock)
{
BUG_ON(!irqs_disabled());
- if (rq1 == rq2) {
- raw_spin_lock(&rq1->lock);
+ if (rq_lockp(rq1) == rq_lockp(rq2)) {
+ raw_spin_lock(rq_lockp(rq1));
__acquire(rq2->lock); /* Fake it out ;) */
} else {
- if (rq1 < rq2) {
- raw_spin_lock(&rq1->lock);
- raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+ if (rq_lockp(rq1) < rq_lockp(rq2)) {
+ raw_spin_lock(rq_lockp(rq1));
+ raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
} else {
- raw_spin_lock(&rq2->lock);
- raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+ raw_spin_lock(rq_lockp(rq2));
+ raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
}
}
}
@@ -2211,9 +2214,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
__releases(rq1->lock)
__releases(rq2->lock)
{
- raw_spin_unlock(&rq1->lock);
- if (rq1 != rq2)
- raw_spin_unlock(&rq2->lock);
+ raw_spin_unlock(rq_lockp(rq1));
+ if (rq_lockp(rq1) != rq_lockp(rq2))
+ raw_spin_unlock(rq_lockp(rq2));
else
__release(rq2->lock);
}
@@ -2236,7 +2239,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
BUG_ON(!irqs_disabled());
BUG_ON(rq1 != rq2);
- raw_spin_lock(&rq1->lock);
+ raw_spin_lock(rq_lockp(rq1));
__acquire(rq2->lock); /* Fake it out ;) */
}

@@ -2251,7 +2254,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
__releases(rq2->lock)
{
BUG_ON(rq1 != rq2);
- raw_spin_unlock(&rq1->lock);
+ raw_spin_unlock(rq_lockp(rq1));
__release(rq2->lock);
}

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dd7770226086..eeb9aca1c853 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -454,7 +454,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
struct root_domain *old_rd = NULL;
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);

if (rq->rd) {
old_rd = rq->rd;
@@ -480,7 +480,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);

- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);

if (old_rd)
call_rcu(&old_rd->rcu, free_rootdomain);
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-20 18:19:00

by Joel Fernandes

Subject: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

From: Peter Zijlstra <[email protected]>

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which logical
CPU does the reschedule).

During a CPU hotplug event, schedule would be called with the hotplugged
CPU not in the cpumask. So use for_each_cpu(_wrap)_or to include the
current cpu in the task pick loop.

There are multiple loops in pick_next_task that iterate over CPUs in
smt_mask. During a hotplug event, a sibling could be removed from the
smt_mask while pick_next_task is running, so we cannot trust the mask
across the different loops; this can confuse the logic. Add retry logic
in case smt_mask changes between the loops.
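
For orientation, the core-wide selection added below boils down to roughly
the following shape (a simplified sketch, not the actual code; the real
pick_next_task() hunk also handles the sequence counters, offline CPUs and
the restart when a higher-priority incompatible task is found):

	/* Sketch: pick a compatible task for every SMT sibling of the core. */
	for_each_class(class) {
		for_each_cpu_wrap(i, smt_mask, cpu) {
			p = pick_task(cpu_rq(i), class, max); /* honours core cookie */
			if (!p)
				continue;

			cpu_rq(i)->core_pick = p;

			if (!max || !cookie_match(max, p)) {
				/* New highest-priority task on the core: wipe
				 * earlier sibling picks and redo this class. */
				max = p;
			}
		}
	}
	/* Finally, IPI every sibling whose current task differs from its pick. */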

Tested-by: Julien Desfossez <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/core.c | 301 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 6 +-
2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a032f481c6e6..12030b77bd6d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4533,7 +4533,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
* Pick up the highest-prio task:
*/
static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
@@ -4574,6 +4574,294 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}

#ifdef CONFIG_SCHED_CORE
+static inline bool is_task_rq_idle(struct task_struct *t)
+{
+ return (task_rq(t)->idle == t);
+}
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+ return is_task_rq_idle(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+ if (is_task_rq_idle(a) || is_task_rq_idle(b))
+ return true;
+
+ return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ * rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+ struct task_struct *class_pick, *cookie_pick;
+ unsigned long cookie = rq->core->core_cookie;
+
+ class_pick = class->pick_task(rq);
+ if (!class_pick)
+ return NULL;
+
+ if (!cookie) {
+ /*
+ * If class_pick is tagged, return it only if it has
+ * higher priority than max.
+ */
+ if (max && class_pick->core_cookie &&
+ prio_less(class_pick, max))
+ return idle_sched_class.pick_task(rq);
+
+ return class_pick;
+ }
+
+ /*
+ * If class_pick is idle or matches cookie, return early.
+ */
+ if (cookie_equals(class_pick, cookie))
+ return class_pick;
+
+ cookie_pick = sched_core_find(rq, cookie);
+
+ /*
+ * If class > max && class > cookie, it is the highest priority task on
+ * the core (so far) and it must be selected, otherwise we must go with
+ * the cookie pick in order to satisfy the constraint.
+ */
+ if (prio_less(cookie_pick, class_pick) &&
+ (!max || prio_less(max, class_pick)))
+ return class_pick;
+
+ return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ struct task_struct *next, *max = NULL;
+ const struct sched_class *class;
+ const struct cpumask *smt_mask;
+ bool need_sync;
+ int i, j, cpu;
+
+ if (!sched_core_enabled(rq))
+ return __pick_next_task(rq, prev, rf);
+
+ cpu = cpu_of(rq);
+
+ /* Stopper task is switching into idle, no need core-wide selection. */
+ if (cpu_is_offline(cpu)) {
+ /*
+ * Reset core_pick so that we don't enter the fastpath when
+ * coming online. core_pick would already be migrated to
+ * another cpu during offline.
+ */
+ rq->core_pick = NULL;
+ return __pick_next_task(rq, prev, rf);
+ }
+
+ /*
+ * If there were no {en,de}queues since we picked (IOW, the task
+ * pointers are all still valid), and we haven't scheduled the last
+ * pick yet, do so now.
+ *
+ * rq->core_pick can be NULL if no selection was made for a CPU because
+ * it was either offline or went offline during a sibling's core-wide
+ * selection. In this case, do a core-wide selection.
+ */
+ if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+ rq->core->core_pick_seq != rq->core_sched_seq &&
+ rq->core_pick) {
+ WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+ next = rq->core_pick;
+ if (next != prev) {
+ put_prev_task(rq, prev);
+ set_next_task(rq, next);
+ }
+
+ rq->core_pick = NULL;
+ return next;
+ }
+
+ put_prev_task_balance(rq, prev, rf);
+
+ smt_mask = cpu_smt_mask(cpu);
+
+ /*
+ * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+ *
+ * @task_seq guards the task state ({en,de}queues)
+ * @pick_seq is the @task_seq we did a selection on
+ * @sched_seq is the @pick_seq we scheduled
+ *
+ * However, preemptions can cause multiple picks on the same task set.
+ * 'Fix' this by also increasing @task_seq for every pick.
+ */
+ rq->core->core_task_seq++;
+ need_sync = !!rq->core->core_cookie;
+
+ /* reset state */
+ rq->core->core_cookie = 0UL;
+ for_each_cpu(i, smt_mask) {
+ struct rq *rq_i = cpu_rq(i);
+
+ rq_i->core_pick = NULL;
+
+ if (rq_i->core_forceidle) {
+ need_sync = true;
+ rq_i->core_forceidle = false;
+ }
+
+ if (i != cpu)
+ update_rq_clock(rq_i);
+ }
+
+ /*
+ * Try and select tasks for each sibling in descending sched_class
+ * order.
+ */
+ for_each_class(class) {
+again:
+ for_each_cpu_wrap(i, smt_mask, cpu) {
+ struct rq *rq_i = cpu_rq(i);
+ struct task_struct *p;
+
+ if (rq_i->core_pick)
+ continue;
+
+ /*
+ * If this sibling doesn't yet have a suitable task to
+ * run; ask for the most eligible task, given the
+ * highest priority task already selected for this
+ * core.
+ */
+ p = pick_task(rq_i, class, max);
+ if (!p) {
+ /*
+ * If there are no cookies, we don't need to
+ * bother with the other siblings.
+ * If the rest of the core is not running a tagged
+ * task, i.e. need_sync == 0, and the current CPU
+ * which called into the schedule() loop does not
+ * have any tasks for this class, skip selecting for
+ * other siblings since there's no point. We don't skip
+ * for RT/DL because that could make CFS force-idle RT.
+ */
+ if (i == cpu && !need_sync && class == &fair_sched_class)
+ goto next_class;
+
+ continue;
+ }
+
+ /*
+ * Optimize the 'normal' case where there aren't any
+ * cookies and we don't need to sync up.
+ */
+ if (i == cpu && !need_sync && !p->core_cookie) {
+ next = p;
+ goto done;
+ }
+
+ rq_i->core_pick = p;
+
+ /*
+ * If this new candidate is of higher priority than the
+ * previous; and they're incompatible; we need to wipe
+ * the slate and start over. pick_task makes sure that
+ * p's priority is more than max if it doesn't match
+ * max's cookie.
+ *
+ * NOTE: this is a linear max-filter and is thus bounded
+ * in execution time.
+ */
+ if (!max || !cookie_match(max, p)) {
+ struct task_struct *old_max = max;
+
+ rq->core->core_cookie = p->core_cookie;
+ max = p;
+
+ if (old_max) {
+ for_each_cpu(j, smt_mask) {
+ if (j == i)
+ continue;
+
+ cpu_rq(j)->core_pick = NULL;
+ }
+ goto again;
+ } else {
+ /*
+ * Once we select a task for a cpu, we
+ * should not be doing an unconstrained
+ * pick because it might starve a task
+ * on a forced idle cpu.
+ */
+ need_sync = true;
+ }
+
+ }
+ }
+next_class:;
+ }
+
+ rq->core->core_pick_seq = rq->core->core_task_seq;
+ next = rq->core_pick;
+ rq->core_sched_seq = rq->core->core_pick_seq;
+
+ /* Something should have been selected for current CPU */
+ WARN_ON_ONCE(!next);
+
+ /*
+ * Reschedule siblings
+ *
+ * NOTE: L1TF -- at this point we're no longer running the old task and
+ * sending an IPI (below) ensures the sibling will no longer be running
+ * their task. This ensures there is no inter-sibling overlap between
+ * non-matching user state.
+ */
+ for_each_cpu(i, smt_mask) {
+ struct rq *rq_i = cpu_rq(i);
+
+ /*
+ * An online sibling might have gone offline before a task
+ * could be picked for it, or it might be offline but later
+ * happen to come online, but it's too late and nothing was
+ * picked for it. That's Ok - it will pick tasks for itself,
+ * so ignore it.
+ */
+ if (!rq_i->core_pick)
+ continue;
+
+ if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
+ rq_i->core_forceidle = true;
+
+ if (i == cpu) {
+ rq_i->core_pick = NULL;
+ continue;
+ }
+
+ /* Did we break L1TF mitigation requirements? */
+ WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+
+ if (rq_i->curr == rq_i->core_pick) {
+ rq_i->core_pick = NULL;
+ continue;
+ }
+
+ resched_curr(rq_i);
+ }
+
+done:
+ set_next_task(rq, next);
+ return next;
+}

static inline void sched_core_cpu_starting(unsigned int cpu)
{
@@ -4608,6 +4896,12 @@ static inline void sched_core_cpu_starting(unsigned int cpu)

static inline void sched_core_cpu_starting(unsigned int cpu) {}

+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ return __pick_next_task(rq, prev, rf);
+}
+
#endif /* CONFIG_SCHED_CORE */

/*
@@ -7446,7 +7740,12 @@ void __init sched_init(void)

#ifdef CONFIG_SCHED_CORE
rq->core = NULL;
+ rq->core_pick = NULL;
rq->core_enabled = 0;
+ rq->core_tree = RB_ROOT;
+ rq->core_forceidle = false;
+
+ rq->core_cookie = 0UL;
#endif
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4964453591c3..2b6e0bf61720 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1052,11 +1052,16 @@ struct rq {
#ifdef CONFIG_SCHED_CORE
/* per rq */
struct rq *core;
+ struct task_struct *core_pick;
unsigned int core_enabled;
+ unsigned int core_sched_seq;
struct rb_root core_tree;
+ unsigned char core_forceidle;

/* shared state */
unsigned int core_task_seq;
+ unsigned int core_pick_seq;
+ unsigned long core_cookie;
#endif
};

@@ -1936,7 +1941,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)

static inline void set_next_task(struct rq *rq, struct task_struct *next)
{
- WARN_ON_ONCE(rq->curr != next);
next->sched_class->set_next_task(rq, next, false);
}

--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-20 18:20:05

by Joel Fernandes

Subject: [PATCH v8 -tip 15/26] entry/kvm: Protect the kernel when entering from guest

From: Vineeth Pillai <[email protected]>

Similar to how user-to-kernel mode transitions are protected in earlier
patches, protect the entry into the kernel from guest mode as well.

Tested-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Pillai <[email protected]>
---
arch/x86/kvm/x86.c | 3 +++
include/linux/entry-kvm.h | 12 ++++++++++++
kernel/entry/kvm.c | 13 +++++++++++++
3 files changed, 28 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce856e0ece84..05a281f3ef28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8540,6 +8540,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
smp_mb__after_srcu_read_unlock();

+ kvm_exit_to_guest_mode(vcpu);
+
/*
* This handles the case where a posted interrupt was
* notified with kvm_vcpu_kick.
@@ -8633,6 +8635,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
}

+ kvm_enter_from_guest_mode(vcpu);
local_irq_enable();
preempt_enable();

diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 0cef17afb41a..32aabb7f3e6d 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -77,4 +77,16 @@ static inline bool xfer_to_guest_mode_work_pending(void)
}
#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */

+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from guest.
+ * @vcpu: Pointer to the current VCPU data
+ */
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * @vcpu: Pointer to the current VCPU data
+ */
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu);
+
#endif
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index eb1a8a4c867c..b0b7facf4374 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -49,3 +49,16 @@ int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
return xfer_to_guest_mode_work(vcpu, ti_work);
}
EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
+
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu)
+{
+ sched_core_unsafe_enter();
+}
+EXPORT_SYMBOL_GPL(kvm_enter_from_guest_mode);
+
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu)
+{
+ sched_core_unsafe_exit();
+ sched_core_wait_till_safe(XFER_TO_GUEST_MODE_WORK);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_to_guest_mode);
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-20 18:20:05

by Joel Fernandes

Subject: [PATCH v8 -tip 22/26] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG

This will be used by kselftest to verify the CGroup cookie value that is
set by the CGroup interface.
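
As an illustration of how a test might consume this node, a minimal sketch is
below; the cgroup mount point and file name ("cpu.core_group_cookie") are
assumptions based on this patch, and the real tests live in
tools/testing/selftests/sched/test_coresched.c:

	/* Sketch: read the effective group cookie of a cgroup (path assumed). */
	#include <stdio.h>

	static unsigned long long read_group_cookie(const char *cgroup_path)
	{
		char path[512];
		unsigned long long cookie = 0;
		FILE *f;

		snprintf(path, sizeof(path), "%s/cpu.core_group_cookie", cgroup_path);
		f = fopen(path, "r");
		if (!f)
			return 0;

		if (fscanf(f, "%llu", &cookie) != 1)
			cookie = 0;

		fclose(f);
		return cookie;
	}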

Tested-by: Julien Desfossez <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1321c26a8385..b3afbba5abe1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9520,6 +9520,13 @@ static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct c
return tg->core_tag_color;
}

+#ifdef CONFIG_SCHED_DEBUG
+static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ return cpu_core_get_group_cookie(css_tg(css));
+}
+#endif
+
struct write_core_tag {
struct cgroup_subsys_state *css;
unsigned long cookie;
@@ -9695,6 +9702,14 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+ /* Read the effective cookie (color+tag) of the group. */
+ {
+ .name = "core_group_cookie",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_group_cookie_read_u64,
+ },
+#endif
#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
@@ -9882,6 +9897,14 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+ /* Read the effective cookie (color+tag) of the group. */
+ {
+ .name = "core_group_cookie",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_group_cookie_read_u64,
+ },
+#endif
#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-20 18:20:06

by Joel Fernandes

Subject: [PATCH v8 -tip 21/26] sched: Handle task addition to CGroup

Due to earlier patches, the old way of computing a task's cookie when it
is added to a CGroup is outdated. Update it by fetching the group's
cookie using the new helpers.

Tested-by: Julien Desfossez <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61e1dcf11000..1321c26a8385 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8505,6 +8505,9 @@ void sched_offline_group(struct task_group *tg)
spin_unlock_irqrestore(&task_group_lock, flags);
}

+#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg);
+
static void sched_change_group(struct task_struct *tsk, int type)
{
struct task_group *tg;
@@ -8519,11 +8522,13 @@ static void sched_change_group(struct task_struct *tsk, int type)
tg = autogroup_task_group(tsk, tg);

#ifdef CONFIG_SCHED_CORE
- if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
- tsk->core_cookie = 0UL;
+ if (tsk->core_group_cookie) {
+ tsk->core_group_cookie = 0UL;
+ tsk->core_cookie &= ~SCHED_CORE_GROUP_COOKIE_MASK;
+ }

- if (tg->core_tagged /* && !tsk->core_cookie ? */)
- tsk->core_cookie = (unsigned long)tg;
+ tsk->core_group_cookie = cpu_core_get_group_cookie(tg);
+ tsk->core_cookie |= tsk->core_group_cookie;
#endif

tsk->sched_task_group = tg;
@@ -9471,7 +9476,7 @@ static unsigned long cpu_core_get_group_cookie(struct task_group *tg)

if (tg->core_tagged) {
unsigned long cookie = ((unsigned long)tg << 8) | color;
- cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+ cookie &= SCHED_CORE_GROUP_COOKIE_MASK;
return cookie;
}
}
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-20 18:20:12

by Joel Fernandes

Subject: [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase

Google has a usecase where the first-level tag on a CGroup is not
sufficient. So, a patch has been carried for years where a second tag is added
which is writeable by unprivileged users.

Google uses DAC controls to make the 'tag' settable only by root while
the second-level 'color' can be changed by anyone. The actual names that
Google uses are different, but the concept is the same.

The hierarchy looks like:

         Root group
         /        \
        A          B     (These are created by the root daemon - borglet).
       / \          \
      C   D          E   (These are created by AppEngine within the container).

The reason why Google has two parts is that AppEngine wants to allow a subset of
subcgroups within a parent tagged cgroup to share execution. Think of these
subcgroups as belonging to the same customer or project. Because these subcgroups
are created by AppEngine, they are not tracked by borglet (the root daemon),
so borglet won't have a chance to set a color for them. That's where the
'color' file comes from. The color can be set by AppEngine, and once set, the
normal tasks within the subcgroup are not able to overwrite it. This is
enforced by promoting the permissions of the color file in cgroupfs.

The 'color' is an 8-bit value, allowing for up to 256 unique colors. IMHO, having
more CGroups than that sounds like a scalability issue, so this suffices.
We steal the lower 8 bits of the cookie to store the color.
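
To make the bit layout concrete, the packing described above amounts to
roughly the following (a sketch mirroring cpu_core_get_group_cookie() in the
diff below; the mask keeps the group part in the lower half of the word so
the upper bits stay free for the per-task cookie):

	/* Sketch: group cookie = (tg pointer << 8) | color, truncated. */
	#define GROUP_COOKIE_MASK	((1UL << (sizeof(unsigned long) * 4)) - 1)

	static unsigned long make_group_cookie(unsigned long tg_addr, unsigned int color)
	{
		unsigned long cookie = (tg_addr << 8) | (color & 0xff);

		/* The lower 8 bits carry the color; the upper half of the
		 * word is reserved for the per-task cookie, so truncate. */
		return cookie & GROUP_COOKIE_MASK;
	}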

Tested-by: Julien Desfossez <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 181 +++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 3 +-
2 files changed, 158 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a0678614a056..42aa811eab14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8522,7 +8522,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
tsk->core_cookie = 0UL;

- if (tg->tagged /* && !tsk->core_cookie ? */)
+ if (tg->core_tagged /* && !tsk->core_cookie ? */)
tsk->core_cookie = (unsigned long)tg;
#endif

@@ -8623,9 +8623,9 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
#ifdef CONFIG_SCHED_CORE
struct task_group *tg = css_tg(css);

- if (tg->tagged) {
+ if (tg->core_tagged) {
sched_core_put();
- tg->tagged = 0;
+ tg->core_tagged = 0;
}
#endif
}
@@ -9228,7 +9228,7 @@ void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool gr

if (sched_core_enqueued(p)) {
sched_core_dequeue(task_rq(p), p);
- if (!p->core_task_cookie)
+ if (!p->core_cookie)
return;
}

@@ -9448,41 +9448,100 @@ int sched_core_share_pid(pid_t pid)
}

/* CGroup interface */
+
+/*
+ * Helper to get the cookie in a hierarchy.
+ * The cookie is a combination of a tag and color. Any ancestor
+ * can have a tag/color. tag is the first-level cookie setting
+ * with color being the second. Atmost one color and one tag is
+ * allowed.
+ */
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
+{
+ unsigned long color = 0;
+
+ if (!tg)
+ return 0;
+
+ for (; tg; tg = tg->parent) {
+ if (tg->core_tag_color) {
+ WARN_ON_ONCE(color);
+ color = tg->core_tag_color;
+ }
+
+ if (tg->core_tagged) {
+ unsigned long cookie = ((unsigned long)tg << 8) | color;
+ cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+ return cookie;
+ }
+ }
+
+ return 0;
+}
+
+/* Determine if any group in @tg's children are tagged or colored. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
+ bool check_color)
+{
+ struct task_group *child;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(child, &tg->children, siblings) {
+ if ((child->core_tagged && check_tag) ||
+ (child->core_tag_color && check_color)) {
+ rcu_read_unlock();
+ return true;
+ }
+
+ rcu_read_unlock();
+ return cpu_core_check_descendants(child, check_tag, check_color);
+ }
+
+ rcu_read_unlock();
+ return false;
+}
+
static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct task_group *tg = css_tg(css);

- return !!tg->tagged;
+ return !!tg->core_tagged;
+}
+
+static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+
+ return tg->core_tag_color;
}

struct write_core_tag {
struct cgroup_subsys_state *css;
- int val;
+ unsigned long cookie;
};

static int __sched_write_tag(void *data)
{
struct write_core_tag *tag = (struct write_core_tag *) data;
- struct cgroup_subsys_state *css = tag->css;
- int val = tag->val;
- struct task_group *tg = css_tg(tag->css);
- struct css_task_iter it;
struct task_struct *p;
+ struct cgroup_subsys_state *css;

- tg->tagged = !!val;
+ rcu_read_lock();
+ css_for_each_descendant_pre(css, tag->css) {
+ struct css_task_iter it;

- css_task_iter_start(css, 0, &it);
- /*
- * Note: css_task_iter_next will skip dying tasks.
- * There could still be dying tasks left in the core queue
- * when we set cgroup tag to 0 when the loop is done below.
- */
- while ((p = css_task_iter_next(&it))) {
- unsigned long cookie = !!val ? (unsigned long)tg : 0UL;
+ css_task_iter_start(css, 0, &it);
+ /*
+ * Note: css_task_iter_next will skip dying tasks.
+ * There could still be dying tasks left in the core queue
+ * when we set cgroup tag to 0 when the loop is done below.
+ */
+ while ((p = css_task_iter_next(&it)))
+ sched_core_tag_requeue(p, tag->cookie, true /* group */);

- sched_core_tag_requeue(p, cookie, true /* group */);
+ css_task_iter_end(&it);
}
- css_task_iter_end(&it);
+ rcu_read_unlock();

return 0;
}
@@ -9498,20 +9557,80 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
if (!static_branch_likely(&sched_smt_present))
return -EINVAL;

- if (tg->tagged == !!val)
+ if (!tg->core_tagged && val) {
+ /* Tag is being set. Check ancestors and descendants. */
+ if (cpu_core_get_group_cookie(tg) ||
+ cpu_core_check_descendants(tg, true /* tag */, true /* color */))
+ return -EBUSY;
+ } else if (tg->core_tagged && !val) {
+ /* Tag is being reset. Check descendants. */
+ if (cpu_core_check_descendants(tg, true /* tag */, true /* color */))
+ return -EBUSY;
+ } else {
return 0;
+ }

if (!!val)
sched_core_get();

wtag.css = css;
- wtag.val = val;
+ wtag.cookie = (unsigned long)tg << 8; /* Reserve lower 8 bits for color. */
+
+ /* Truncate the upper 32-bits - those are used by the per-task cookie. */
+ wtag.cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+
+ tg->core_tagged = val;
+
stop_machine(__sched_write_tag, (void *) &wtag, NULL);
if (!val)
sched_core_put();

return 0;
}
+
+static int cpu_core_tag_color_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct task_group *tg = css_tg(css);
+ struct write_core_tag wtag;
+ u64 cookie;
+
+ if (val > 255)
+ return -ERANGE;
+
+ if (!static_branch_likely(&sched_smt_present))
+ return -EINVAL;
+
+ cookie = cpu_core_get_group_cookie(tg);
+ /* Can't set color if nothing in the ancestors were tagged. */
+ if (!cookie)
+ return -EINVAL;
+
+ /*
+ * Something in the ancestors already colors us. Can't change the color
+ * at this level.
+ */
+ if (!tg->core_tag_color && (cookie & 255))
+ return -EINVAL;
+
+ /*
+ * Check if any descendants are colored. If so, we can't recolor them.
+ * Don't need to check if descendants are tagged, since we don't allow
+ * tagging when already tagged.
+ */
+ if (cpu_core_check_descendants(tg, false /* tag */, true /* color */))
+ return -EINVAL;
+
+ cookie &= ~255;
+ cookie |= val;
+ wtag.css = css;
+ wtag.cookie = cookie;
+ tg->core_tag_color = val;
+
+ stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+
+ return 0;
+}
#endif

static struct cftype cpu_legacy_files[] = {
@@ -9552,11 +9671,17 @@ static struct cftype cpu_legacy_files[] = {
#endif
#ifdef CONFIG_SCHED_CORE
{
- .name = "tag",
+ .name = "core_tag",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_core_tag_read_u64,
.write_u64 = cpu_core_tag_write_u64,
},
+ {
+ .name = "core_tag_color",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_tag_color_read_u64,
+ .write_u64 = cpu_core_tag_color_write_u64,
+ },
#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
@@ -9733,11 +9858,17 @@ static struct cftype cpu_files[] = {
#endif
#ifdef CONFIG_SCHED_CORE
{
- .name = "tag",
+ .name = "core_tag",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_core_tag_read_u64,
.write_u64 = cpu_core_tag_write_u64,
},
+ {
+ .name = "core_tag_color",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_tag_color_read_u64,
+ .write_u64 = cpu_core_tag_color_write_u64,
+ },
#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 661569ee4650..aebeb91c4a0f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -385,7 +385,8 @@ struct task_group {
struct cgroup_subsys_state css;

#ifdef CONFIG_SCHED_CORE
- int tagged;
+ int core_tagged;
+ u8 core_tag_color;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-20 18:21:18

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation

Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai <[email protected]>
Tested-by: Julien Desfossez <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
.../admin-guide/hw-vuln/core-scheduling.rst | 312 ++++++++++++++++++
Documentation/admin-guide/hw-vuln/index.rst | 1 +
2 files changed, 313 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..eacafbb8fa3f
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,312 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, it is not guaranteed that performance will
+always improve, though that is seen to be the case with a number of real-world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case, though not
+always, as synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged. This
+results in all the CGroup's tasks being allowed to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.tag, it is not possible to set this
+ for any descendant of the tagged group. For finer grained control, the
+ ``cpu.tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+ a core with kernel threads and untagged system threads. For this reason,
+ if a group has ``cpu.tag`` of 0, it is considered to be trusted.
+
+* ``cpu.tag_color``
+For finer-grained control over core sharing, a color can also be set in
+addition to the tag. This allows further control of core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie` which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.tag`` writable only by root, while
+``cpu.tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+ Root group
+ / \
+ A B (These are created by the root daemon - borglet).
+ / \ \
+ C D E (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
+daemon called borglet. borglet then tags each of these groups with the ``cpu.tag``
+file. The job itself can create additional child CGroups which are colored by
+the container's AppEngine with the ``cpu.tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants to
+allow a subset of child CGroups within a tagged parent CGroup to be co-scheduled on a
+core while not being co-scheduled with other child CGroups. Think of these
+child CGroups as belonging to the same customer or project. Because these
+child CGroups are created by AppEngine, they are not tracked by borglet (the
+root daemon), therefore borglet won't have a chance to set a color for them.
+That's where the ``cpu.tag_color`` file comes in. A color could be set by AppEngine,
+and once set, the normal tasks within the subcgroup would not be able to
+overwrite it. This is enforced by promoting the permission of the
+``cpu.tag_color`` file in cgroupfs.
+
+The color is an 8-bit value allowing for up to 256 unique colors.
+
+.. note:: Once a CGroup is colored, none of its descendants can be re-colored. Also,
+ coloring of a CGroup is possible only if either the group or one of its
+ ancestors was tagged via the ``cpu.tag`` file.
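+
+As a rough illustration, the two files could be driven from a small C helper
+like the one below. The cgroup mount point and group names are only examples
+taken from the hierarchy above and will differ between systems::
+
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <string.h>
+    #include <unistd.h>
+
+    /* Write a short string into a cgroup control file. */
+    static int write_cg_file(const char *path, const char *val)
+    {
+        int fd = open(path, O_WRONLY);
+
+        if (fd < 0 || write(fd, val, strlen(val)) < 0) {
+            perror(path);
+            if (fd >= 0)
+                close(fd);
+            return -1;
+        }
+        return close(fd);
+    }
+
+    int main(void)
+    {
+        /* Tag container A (done by the root daemon)... */
+        write_cg_file("/sys/fs/cgroup/cpu/A/cpu.tag", "1");
+        /* ...then color child CGroup C inside it (done by AppEngine). */
+        write_cg_file("/sys/fs/cgroup/cpu/A/C/cpu.tag_color", "7");
+        return 0;
+    }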
+
+prctl interface
+###############
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` is available to a process to request
+sharing a core with another process. For example, consider 2 processes ``P1``
+and ``P2`` with PIDs 100 and 200. If process ``P1`` calls
+``prctl(PR_SCHED_CORE_SHARE, 200)``, the kernel makes ``P1`` share a core with ``P2``.
+The kernel performs ptrace access mode checks before granting the request.
+
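+For illustration, a minimal sketch of such a request is shown below. It assumes
+that the uapi headers from this patch series, which define
+``PR_SCHED_CORE_SHARE``, are installed; the PID is the one from the example
+above::
+
+    #include <stdio.h>
+    #include <sys/prctl.h>
+    #include <sys/types.h>
+
+    int main(void)
+    {
+        pid_t pid = 200;  /* PID of P2 in the example above. */
+
+        /* Ask the kernel to let the calling task (P1) share a core with P2. */
+        if (prctl(PR_SCHED_CORE_SHARE, (unsigned long)pid) != 0)
+            perror("prctl(PR_SCHED_CORE_SHARE)");
+        return 0;
+    }
+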
+.. note:: This operation is not commutative. P1 calling
+ ``prctl(PR_SCHED_CORE_SHARE, pidof(P2))`` is not the same as P2 calling the
+ same for P1. The former case is P1 joining P2's group of processes
+ (which P2 would have joined with ``prctl(2)`` prior to P1's ``prctl(2)``).
+
+.. note:: The core-sharing granted with prctl(2) will be subject to
+ core-sharing restrictions specified by the CGroup interface. For example
+ if P1 and P2 are a part of 2 different tagged CGroups, then they will
+ not share a core even if a prctl(2) call is made. This is analogous
+ to how affinities are set using the cpuset interface.
+
+It is important to note that, on a ``CLONE_THREAD`` ``clone(2)`` syscall, the child
+will be assigned the same tag as its parent and thus be allowed to share a core
+with it. This design choice is because, for the security usecase, a
+``CLONE_THREAD`` child can access its parent's address space anyway, so there's
+no point in not allowing them to share a core. If a different behavior is
+desired, the child thread can call ``prctl(2)`` as needed. This behavior is
+specific to the ``prctl(2)`` interface. For the CGroup interface, the child of a
+fork always shares a core with its parent. On the other hand, if a parent
+was previously tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the
+child will receive a unique tag.
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, in that it trusts everything.
+
+During a ``schedule()`` event on any sibling of a core, the highest priority task for
+that core is picked and assigned to the sibling calling ``schedule()``, if that
+sibling has the task enqueued. For the rest of the siblings in the core, the
+highest priority task with the same cookie is selected if there is one runnable
+in their individual run queues. If a task with the same cookie is not available,
+the idle task is selected. The idle task is globally trusted.
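+
+As a deliberately simplified illustration of just this selection rule - it
+ignores scheduling classes, the core-wide vruntime comparison, locking and the
+optimizations in the real implementation - a standalone toy model might look
+like::
+
+    #include <stdio.h>
+
+    struct task { const char *name; int prio; unsigned long cookie; };
+
+    #define NR_SIBLINGS 2
+    #define RQ_LEN      2
+
+    static const struct task idle = { "idle", -1, 0 };
+
+    /* Toy per-sibling runqueues: a higher prio value means higher priority. */
+    static struct task rq[NR_SIBLINGS][RQ_LEN] = {
+        { { "A-task", 2, 0xA }, { "A-task2", 1, 0xA } },
+        { { "B-task", 3, 0xB }, { NULL, 0, 0 } },
+    };
+
+    int main(void)
+    {
+        const struct task *max = &idle, *pick;
+        int cpu, i;
+
+        /* 1) Find the highest priority runnable task on the whole core. */
+        for (cpu = 0; cpu < NR_SIBLINGS; cpu++)
+            for (i = 0; i < RQ_LEN; i++)
+                if (rq[cpu][i].name && rq[cpu][i].prio > max->prio)
+                    max = &rq[cpu][i];
+
+        /* 2) Each sibling runs its best task with a matching cookie,
+         *    or runs idle if it has none. */
+        for (cpu = 0; cpu < NR_SIBLINGS; cpu++) {
+            pick = &idle;
+            for (i = 0; i < RQ_LEN; i++)
+                if (rq[cpu][i].name && rq[cpu][i].cookie == max->cookie &&
+                    rq[cpu][i].prio > pick->prio)
+                    pick = &rq[cpu][i];
+            printf("sibling %d runs %s\n", cpu, pick->name);
+        }
+        return 0;
+    }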
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+siblings for whom a new task was selected. Siblings, on receiving the IPI, will
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a `forced idle` state. I.e., it may
+have tasks on its own runqueue to run; however, it will still have to run idle.
+More on this in the next section.
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core. However,
+it is possible that some runqueues had tasks that were incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core-wide highest priority
+task. If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (the idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+ HT1 (attack) HT2 (victim)
+ A idle -> user space user space -> idle
+ B idle -> user space guest -> idle
+ C idle -> guest user space -> idle
+ D idle -> guest guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or VMEXIT
+in the case of guests. At best, this would only leak some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Kernel protection from untrusted tasks
+--------------------------------------
+The scheduler on its own cannot protect the kernel executing concurrently with
+an untrusted task in a core. This is because the scheduler is unaware of
+interrupts/syscalls at scheduling time. To mitigate this, an IPI is sent to
+siblings on kernel entry (syscall and IRQ). This IPI forces the sibling to enter
+kernel mode and wait before returning to user space until all siblings of the
+core have left kernel mode. This process is also known as stunning. For good
+performance, an IPI is sent to a sibling only if it is running a tagged
+task. If a sibling is running a kernel thread or is idle, no IPI is sent.
+
+The kernel protection feature can be turned off on the kernel command line by
+passing ``sched_core_protect_kernel=0``.
+
+Other alternative ideas discussed for kernel protection are listed below just
+for completeness. They all have limitations:
+
+1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
+#########################################################################################
+By changing the interrupt affinities to a designated safe-CPU which runs
+only trusted tasks, IRQ data can be protected. One issue is this involves
+giving up a full CPU core of the system to run safe tasks. Another is that,
+per-cpu interrupts such as the local timer interrupt cannot have their
+affinity changed. also, sensitive timer callbacks such as the random entropy timer
+can run in softirq on return from these interrupts and expose sensitive
+data. In the future, that could be mitigated by forcing softirqs into threaded
+mode by utilizing a mechanism similar to ``CONFIG_PREEMPT_RT``.
+
+Yet another issue with this is that, for multiqueue devices with managed
+interrupts, the IRQ affinities cannot be changed; however, it could be
+possible to force a reduced number of queues, which would in turn allow
+shielding one or two CPUs from such interrupts and queue handling for the price
+of indirection.
+
+2. Running IRQs as threaded-IRQs
+################################
+This would result in forcing IRQs into the scheduler, which would then provide
+the process-context mitigation. However, not all interrupts can be threaded.
+Also, this does nothing about syscall entries.
+
+3. Kernel Address Space Isolation
+#################################
+System calls could run in a much more restricted address space which is
+guaranteed not to leak any sensitive data. There are practical limitations in
+implementing this - the main concern being how to decide on an address space
+that is guaranteed to not have any sensitive data.
+
+4. Limited cookie-based protection
+##################################
+On a system call, change the cookie to the system trusted cookie and initiate a
+schedule event. This would be better than pausing all the siblings for the
+entire duration of the system call, but it would still be a huge hit to
+performance.
+
+Trust model
+-----------
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above mentioned interfaces to
+communicate them. In other words, all tasks have a default cookie value of 0
+and are considered system-wide trusted. The stunning of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+Limitations
+-----------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks run
+concurrently, or the kernel could be running concurrently with a task not trusted
+by the kernel.
+
+1. IPI processing delays
+########################
+Core scheduling selects only trusted tasks to run together. An IPI is used to notify
+the siblings to switch to the new task. But there could be hardware delays in
+receiving the IPI on some architectures (on x86, this has not been observed). This may
+cause an attacker task to start running on a CPU before its siblings receive the
+IPI. Even though the cache is flushed on entry to user mode, victim tasks on siblings
+may populate data in the cache and micro-architectural buffers after the attacker
+starts to run, and this is a possibility for a data leak.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+##########
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+###########
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT (Extended Page Tables).
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then the guest-to-guest attacks would go away.
+Or it could be a system admin policy which considers guest-to-guest attacks as
+a guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+Use cases
+---------
+The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
+ that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that need to be scheduled
+ together could also be realized using core scheduling. One example is vCPUs of
+ a VM.
+
+Future work
+-----------
+Skipping per-HT mitigations if task is trusted
+##############################################
+If core scheduling is enabled, by default all tasks trust each other as
+mentioned above. In such a scenario, it may be desirable to skip the same-HT
+mitigations on return to the trusted user mode to improve performance.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 21710f8609fe..361ccbbd9e54 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
multihit.rst
special-register-buffer-data-sampling.rst
l1d_flush.rst
+ core-scheduling.rst
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-20 19:17:30

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation

Hi Joel,

On 10/19/20 6:43 PM, Joel Fernandes (Google) wrote:
> Document the usecases, design and interfaces for core scheduling.
>
> Co-developed-by: Vineeth Pillai <[email protected]>
> Tested-by: Julien Desfossez <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
> .../admin-guide/hw-vuln/core-scheduling.rst | 312 ++++++++++++++++++
> Documentation/admin-guide/hw-vuln/index.rst | 1 +
> 2 files changed, 313 insertions(+)
> create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
>
> diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> new file mode 100644
> index 000000000000..eacafbb8fa3f
> --- /dev/null
> +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> @@ -0,0 +1,312 @@
> +Core Scheduling
> +***************
> +Core scheduling support allows userspace to define groups of tasks that can
> +share a core. These groups can be specified either for security usecases (one
> +group of tasks don't trust another), or for performance usecases (some
> +workloads may benefit from running on the same core as they don't need the same
> +hardware resources of the shared core).
> +
> +Security usecase
> +----------------
> +A cross-HT attack involves the attacker and victim running on different
> +Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
> +Without core scheduling, the only full mitigation of cross-HT attacks is to
> +disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
> +by ensuring that trusted tasks can share a core. This increase in core sharing
> +can improvement performance, however it is not guaranteed that performance will
> +always improve, though that is seen to be the case with a number of real world
> +workloads. In theory, core scheduling aims to perform at least as good as when
> +Hyper Threading is disabled. In practise, this is mostly the case though not
> +always: as synchronizing scheduling decisions across 2 or more CPUs in a core
> +involves additional overhead - especially when the system is lightly loaded
> +(``total_threads <= N/2``).

N is number of CPUs?

> +
> +Usage
> +-----
> +Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
> +Using this feature, userspace defines groups of tasks that trust each other.
> +The core scheduler uses this information to make sure that tasks that do not
> +trust each other will never run simultaneously on a core, while doing its best
> +to satisfy the system's scheduling requirements.
> +
> +There are 2 ways to use core-scheduling:
> +
> +CGroup
> +######
> +Core scheduling adds additional files to the CPU controller CGroup:
> +
> +* ``cpu.tag``
> +Writing ``1`` into this file results in all tasks in the group get tagged. This

getting
or being

> +results in all the CGroup's tasks allowed to run concurrently on a core's
> +hyperthreads (also called siblings).
> +
> +The file being a value of ``0`` means the tag state of the CGroup is inheritted

inherited

> +from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
> +group is tagged.
> +
> +.. note:: Once a CGroup is tagged via cpu.tag, it is not possible to set this
> + for any descendant of the tagged group. For finer grained control, the
> + ``cpu.tag_color`` file described next may be used.
> +
> +.. note:: When a CGroup is not tagged, all the tasks within the group can share
> + a core with kernel threads and untagged system threads. For this reason,
> + if a group has ``cpu.tag`` of 0, it is considered to be trusted.
> +
> +* ``cpu.tag_color``
> +For finer grained control over core sharing, a color can also be set in
> +addition to the tag. This allows to further control core sharing between child
> +CGroups within an already tagged CGroup. The color and the tag are both used to
> +generate a `cookie` which is used by the scheduler to identify the group.
> +
> +Upto 256 different colors can be set (0-255) by writing into this file.

Up to

> +
> +A sample real-world usage of this file follows:
> +
> +Google uses DAC controls to make ``cpu.tag`` writeable only by root and the

$search tells me "writable".

> +``cpu.tag_color`` can be changed by anyone.
> +
> +The hierarchy looks like this:
> +::
> + Root group
> + / \
> + A B (These are created by the root daemon - borglet).
> + / \ \
> + C D E (These are created by AppEngine within the container).
> +
> +A and B are containers for 2 different jobs or apps that are created by a root
> +daemon called borglet. borglet then tags each of these group with the ``cpu.tag``
> +file. The job itself can create additional child CGroups which are colored by
> +the container's AppEngine with the ``cpu.tag_color`` file.
> +
> +The reason why Google uses this 2-level tagging system is that AppEngine wants to
> +allow a subset of child CGroups within a tagged parent CGroup to be co-scheduled on a
> +core while not being co-scheduled with other child CGroups. Think of these
> +child CGroups as belonging to the same customer or project. Because these
> +child CGroups are created by AppEngine, they are not tracked by borglet (the
> +root daemon), therefore borglet won't have a chance to set a color for them.
> +That's where cpu.tag_color file comes in. A color could be set by AppEngine,
> +and once set, the normal tasks within the subcgroup would not be able to
> +overwrite it. This is enforced by promoting the permission of the
> +``cpu.tag_color`` file in cgroupfs.
> +
> +The color is an 8-bit value allowing for upto 256 unique colors.

up to

> +
> +.. note:: Once a CGroup is colored, none of its descendants can be re-colored. Also
> + coloring of a CGroup is possible only if either the group or one of its
> + ancestors were tagged via the ``cpu.tag`` file.

was

> +
> +prctl interface
> +###############
> +A ``prtcl(2)`` command ``PR_SCHED_CORE_SHARE`` is available to a process to request
> +sharing a core with another process. For example, consider 2 processes ``P1``
> +and ``P2`` with PIDs 100 and 200. If process ``P1`` calls
> +``prctl(PR_SCHED_CORE_SHARE, 200)``, the kernel makes ``P1`` share a core with ``P2``.
> +The kernel performs ptrace access mode checks before granting the request.
> +
> +.. note:: This operation is not commutative. P1 calling
> + ``prctl(PR_SCHED_CORE_SHARE, pidof(P2)`` is not the same as P2 calling the
> + same for P1. The former case is P1 joining P2's group of processes
> + (which P2 would have joined with ``prctl(2)`` prior to P1's ``prctl(2)``).
> +
> +.. note:: The core-sharing granted with prctl(2) will be subject to
> + core-sharing restrictions specified by the CGroup interface. For example
> + if P1 and P2 are a part of 2 different tagged CGroups, then they will
> + not share a core even if a prctl(2) call is made. This is analogous
> + to how affinities are set using the cpuset interface.
> +
> +It is important to note that, on a ``CLONE_THREAD`` ``clone(2)`` syscall, the child
> +will be assigned the same tag as its parent and thus be allowed to share a core
> +with them. is design choice is because, for the security usecase, a

^^^ missing subject ...

> +``CLONE_THREAD`` child can access its parent's address space anyway, so there's
> +no point in not allowing them to share a core. If a different behavior is
> +desired, the child thread can call ``prctl(2)`` as needed. This behavior is
> +specific to the ``prctl(2)`` interface. For the CGroup interface, the child of a
> +fork always share's a core with its parent's. On the other hand, if a parent

shares with its parent's what? "parent's" is possessive.


> +was previously tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the
> +child will receive a unique tag.
> +
> +Design/Implementation
> +---------------------
> +Each task that is tagged is assigned a cookie internally in the kernel. As
> +mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
> +each other and share a core.
> +
> +The basic idea is that, every schedule event tries to select tasks for all the
> +siblings of a core such that all the selected tasks running on a core are
> +trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
> +The idle task is considered special, in that it trusts every thing.

or everything.

> +
> +During a ``schedule()`` event on any sibling of a core, the highest priority task for
> +that core is picked and assigned to the sibling calling ``schedule()`` if it has it

too many it it
They are not the same "it,", are they?

> +enqueued. For rest of the siblings in the core, highest priority task with the
> +same cookie is selected if there is one runnable in their individual run
> +queues. If a task with same cookie is not available, the idle task is selected.
> +Idle task is globally trusted.
> +
> +Once a task has been selected for all the siblings in the core, an IPI is sent to
> +siblings for whom a new task was selected. Siblings on receiving the IPI, will

no comma^

> +switch to the new task immediately. If an idle task is selected for a sibling,
> +then the sibling is considered to be in a `forced idle` state. i.e., it may

I.e.,

> +have tasks on its on runqueue to run, however it will still have to run idle.
> +More on this in the next section.
> +
> +Forced-idling of tasks
> +----------------------
> +The scheduler tries its best to find tasks that trust each other such that all
> +tasks selected to be scheduled are of the highest priority in a core. However,
> +it is possible that some runqueues had tasks that were incompatibile with the

incompatible

> +highest priority ones in the core. Favoring security over fairness, one or more
> +siblings could be forced to select a lower priority task if the highest
> +priority task is not trusted with respect to the core wide highest priority
> +task. If a sibling does not have a trusted task to run, it will be forced idle
> +by the scheduler(idle thread is scheduled to run).

scheduler (idle

> +
> +When the highest priorty task is selected to run, a reschedule-IPI is sent to

priority

> +the sibling to force it into idle. This results in 4 cases which need to be
> +considered depending on whether a VM or a regular usermode process was running
> +on either HT::
> +
> + HT1 (attack) HT2 (victim)
> + A idle -> user space user space -> idle
> + B idle -> user space guest -> idle
> + C idle -> guest user space -> idle
> + D idle -> guest guest -> idle
> +
> +Note that for better performance, we do not wait for the destination CPU
> +(victim) to enter idle mode. This is because the sending of the IPI would bring
> +the destination CPU immediately into kernel mode from user space, or VMEXIT
> +in the case of guests. At best, this would only leak some scheduler metadata

drop one space ^^

> +which may not be worth protecting. It is also possible that the IPI is received
> +too late on some architectures, but this has not been observed in the case of
> +x86.
> +
> +Kernel protection from untrusted tasks
> +--------------------------------------
> +The scheduler on its own cannot protect the kernel executing concurrently with
> +an untrusted task in a core. This is because the scheduler is unaware of
> +interrupts/syscalls at scheduling time. To mitigate this, an IPI is sent to
> +siblings on kernel entry (syscall and IRQ). This IPI forces the sibling to enter
> +kernel mode and wait before returning to user until all siblings of the
> +core have left kernel mode. This process is also known as stunning. For good
> +performance, an IPI is sent only to a sibling only if it is running a tagged
> +task. If a sibling is running a kernel thread or is idle, no IPI is sent.
> +
> +The kernel protection feature can be turned off on the kernel command line by
> +passing ``sched_core_protect_kernel=0``.
> +
> +Other alternative ideas discussed for kernel protection are listed below just
> +for completeness. They all have limitations:
> +
> +1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
> +#########################################################################################
> +By changing the interrupt affinities to a designated safe-CPU which runs
> +only trusted tasks, IRQ data can be protected. One issue is this involves
> +giving up a full CPU core of the system to run safe tasks. Another is that,
> +per-cpu interrupts such as the local timer interrupt cannot have their
> +affinity changed. also, sensitive timer callbacks such as the random entropy timer

Also,

> +can run in softirq on return from these interrupts and expose sensitive
> +data. In the future, that could be mitigated by forcing softirqs into threaded
> +mode by utilizing a mechanism similar to ``CONFIG_PREEMPT_RT``.
> +
> +Yet another issue with this is, for multiqueue devices with managed
> +interrupts, the IRQ affinities cannot be changed however it could be

changed; however,

> +possible to force a reduced number of queues which would in turn allow to
> +shield one or two CPUs from such interrupts and queue handling for the price
> +of indirection.
> +
> +2. Running IRQs as threaded-IRQs
> +################################
> +This would result in forcing IRQs into the scheduler which would then provide
> +the process-context mitigation. However, not all interrupts can be threaded.
> +Also this does nothing about syscall entries.
> +
> +3. Kernel Address Space Isolation
> +#################################
> +System calls could run in a much restricted address space which is
> +guarenteed not to leak any sensitive data. There are practical limitation in

guaranteed limitations in


> +implementing this - the main concern being how to decide on an address space
> +that is guarenteed to not have any sensitive data.

guaranteed

> +
> +4. Limited cookie-based protection
> +##################################
> +On a system call, change the cookie to the system trusted cookie and initiate a
> +schedule event. This would be better than pausing all the siblings during the
> +entire duration for the system call, but still would be a huge hit to the
> +performance.
> +
> +Trust model
> +-----------
> +Core scheduling maintains trust relationships amongst groups of tasks by
> +assigning the tag of them with the same cookie value.
> +When a system with core scheduling boots, all tasks are considered to trust
> +each other. This is because the core scheduler does not have information about
> +trust relationships until userspace uses the above mentioned interfaces, to
> +communicate them. In other words, all tasks have a default cookie value of 0.
> +and are considered system-wide trusted. The stunning of siblings running
> +cookie-0 tasks is also avoided.
> +
> +Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
> +within such groups are considered to trust each other, but do not trust those
> +outside. Tasks outside the group also don't trust tasks within.
> +
> +Limitations
> +-----------
> +Core scheduling tries to guarentee that only trusted tasks run concurrently on a

guarantee

> +core. But there could be small window of time during which untrusted tasks run
> +concurrently or kernel could be running concurrently with a task not trusted by
> +kernel.
> +
> +1. IPI processing delays
> +########################
> +Core scheduling selects only trusted tasks to run together. IPI is used to notify
> +the siblings to switch to the new task. But there could be hardware delays in
> +receiving of the IPI on some arch (on x86, this has not been observed). This may
> +cause an attacker task to start running on a cpu before its siblings receive the

CPU

> +IPI. Even though cache is flushed on entry to user mode, victim tasks on siblings
> +may populate data in the cache and micro acrhitectural buffers after the attacker

architectural

> +starts to run and this is a possibility for data leak.
> +
> +Open cross-HT issues that core scheduling does not solve
> +--------------------------------------------------------
> +1. For MDS
> +##########
> +Core scheduling cannot protect against MDS attacks between an HT running in
> +user mode and another running in kernel mode. Even though both HTs run tasks
> +which trust each other, kernel memory is still considered untrusted. Such
> +attacks are possible for any combination of sibling CPU modes (host or guest mode).
> +
> +2. For L1TF
> +###########
> +Core scheduling cannot protect against a L1TF guest attackers exploiting a

an attacker


> +guest or host victim. This is because the guest attacker can craft invalid
> +PTEs which are not inverted due to a vulnerable guest kernel. The only
> +solution is to disable EPT.

huh? what is EPT? where is it documented/discussed?

> +
> +For both MDS and L1TF, if the guest vCPU is configured to not trust each
> +other (by tagging separately), then the guest to guest attacks would go away.
> +Or it could be a system admin policy which considers guest to guest attacks as
> +a guest problem.
> +
> +Another approach to resolve these would be to make every untrusted task on the
> +system to not trust every other untrusted task. While this could reduce
> +parallelism of the untrusted tasks, it would still solve the above issues while
> +allowing system processes (trusted tasks) to share a core.
> +
> +Use cases
> +---------
> +The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
> +with SMT enabled. There are other use cases where this feature could be used:
> +
> +- Isolating tasks that needs a whole core: Examples include realtime tasks, tasks
> + that uses SIMD instructions etc.
> +- Gang scheduling: Requirements for a group of tasks that needs to be scheduled
> + together could also be realized using core scheduling. One example is vcpus of

vCPUs

> + a VM.
> +
> +Future work
> +-----------
> +Skipping per-HT mitigations if task is trusted
> +##############################################
> +If core scheduling is enabled, by default all tasks trust each other as
> +mentioned above. In such scenario, it may be desirable to skip the same-HT
> +mitigations on return to the trusted user-mode to improve performance.

> diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
> index 21710f8609fe..361ccbbd9e54 100644
> --- a/Documentation/admin-guide/hw-vuln/index.rst
> +++ b/Documentation/admin-guide/hw-vuln/index.rst
> @@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
> multihit.rst
> special-register-buffer-data-sampling.rst
> l1d_flush.rst
> + core-scheduling.rst

Might be an indentation problem there? Can't tell for sure.


thanks.

2020-10-23 18:21:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra <[email protected]>
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).

This:

>
> During a CPU hotplug event, schedule would be called with the hotplugged
> CPU not in the cpumask. So use for_each_cpu(_wrap)_or to include the
> current cpu in the task pick loop.
>
> There are multiple loops in pick_next_task that iterate over CPUs in
> smt_mask. During a hotplug event, sibling could be removed from the
> smt_mask while pick_next_task is running. So we cannot trust the mask
> across the different loops. This can confuse the logic. Add a retry logic
> if smt_mask changes between the loops.

isn't entirely accurate anymore, is it?

2020-10-23 18:36:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> + /*
> + * If this sibling doesn't yet have a suitable task to
> + * run; ask for the most elegible task, given the
> + * highest priority task already selected for this
> + * core.
> + */
> + p = pick_task(rq_i, class, max);
> + if (!p) {
> + /*
> + * If there weren't no cookies; we don't need to
> + * bother with the other siblings.
> + * If the rest of the core is not running a tagged
> + * task, i.e. need_sync == 0, and the current CPU
> + * which called into the schedule() loop does not
> + * have any tasks for this class, skip selecting for
> + * other siblings since there's no point. We don't skip
> + * for RT/DL because that could make CFS force-idle RT.
> + */
> + if (i == cpu && !need_sync && class == &fair_sched_class)
> + goto next_class;
> +
> + continue;
> + }

I'm failing to understand the class == &fair_sched_class bit.

IIRC the condition is such that the core doesn't have a cookie (we don't
need to sync the threads) so we'll only do a pick for our local CPU.

That should be invariant of class.

2020-10-23 18:36:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Fri, Oct 23, 2020 at 03:51:29PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > + /*
> > + * If this sibling doesn't yet have a suitable task to
> > + * run; ask for the most elegible task, given the
> > + * highest priority task already selected for this
> > + * core.
> > + */
> > + p = pick_task(rq_i, class, max);
> > + if (!p) {
> > + /*
> > + * If there weren't no cookies; we don't need to
> > + * bother with the other siblings.
> > + * If the rest of the core is not running a tagged
> > + * task, i.e. need_sync == 0, and the current CPU
> > + * which called into the schedule() loop does not
> > + * have any tasks for this class, skip selecting for
> > + * other siblings since there's no point. We don't skip
> > + * for RT/DL because that could make CFS force-idle RT.
> > + */
> > + if (i == cpu && !need_sync && class == &fair_sched_class)
> > + goto next_class;
> > +
> > + continue;
> > + }
>
> I'm failing to understand the class == &fair_sched_class bit.
>
> IIRC the condition is such that the core doesn't have a cookie (we don't
> need to sync the threads) so we'll only do a pick for our local CPU.
>
> That should be invariant of class.

That is; it should be the exact counterpart of this bit:

> + /*
> + * Optimize the 'normal' case where there aren't any
> + * cookies and we don't need to sync up.
> + */
> + if (i == cpu && !need_sync && !p->core_cookie) {
> + next = p;
> + goto done;
> + }

If there is no task found in this class, try the next class, if there
is, we done.

2020-10-23 19:04:33

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Fri, Oct 23, 2020 at 03:54:00PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 03:51:29PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > > + /*
> > > + * If this sibling doesn't yet have a suitable task to
> > > + * run; ask for the most elegible task, given the
> > > + * highest priority task already selected for this
> > > + * core.
> > > + */
> > > + p = pick_task(rq_i, class, max);
> > > + if (!p) {
> > > + /*
> > > + * If there weren't no cookies; we don't need to
> > > + * bother with the other siblings.
> > > + * If the rest of the core is not running a tagged
> > > + * task, i.e. need_sync == 0, and the current CPU
> > > + * which called into the schedule() loop does not
> > > + * have any tasks for this class, skip selecting for
> > > + * other siblings since there's no point. We don't skip
> > > + * for RT/DL because that could make CFS force-idle RT.
> > > + */
> > > + if (i == cpu && !need_sync && class == &fair_sched_class)
> > > + goto next_class;
> > > +
> > > + continue;
> > > + }
> >
> > I'm failing to understand the class == &fair_sched_class bit.

The last line in the comment explains it "We don't skip for RT/DL because
that could make CFS force-idle RT.".

Even if need_sync == false, we need to go look at other CPUs (non-local
CPUs) to see if they could be running RT.

Say the RQs in a particular core look like this:
Let CFS1 and CFS2 be 2 tagged CFS tasks. Let RT1 be an untagged RT task.

rq0 rq1
CFS1 (tagged) RT1 (not tag)
CFS2 (tagged)

Say schedule() runs on rq0. Now, it will enter the above loop and
pick_task(RT) will return NULL for 'p'. It will enter the above if() block
and see that need_sync == false and will skip RT entirely.

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
rq0 rq1
CFS1 IDLE

When it should have selected:
rq0 r1
IDLE RT

I saw this issue on real-world usecases in ChromeOS where an RT task gets
constantly force-idled and breaks RT. The "class == &fair_sched_class" bit
cures it.

> > > + * for RT/DL because that could make CFS force-idle RT.
> > IIRC the condition is such that the core doesn't have a cookie (we don't
> > need to sync the threads) so we'll only do a pick for our local CPU.
> >
> > That should be invariant of class.
>
> That is; it should be the exact counterpart of this bit:
>
> > + /*
> > + * Optimize the 'normal' case where there aren't any
> > + * cookies and we don't need to sync up.
> > + */
> > + if (i == cpu && !need_sync && !p->core_cookie) {
> > + next = p;
> > + goto done;
> > + }
>
> If there is no task found in this class, try the next class, if there
> is, we done.

That's Ok. But we cannot skip RT class on other CPUs.

thanks,

- Joel

2020-10-23 19:08:03

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Fri, Oct 23, 2020 at 05:05:44PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra <[email protected]>
> >
> > Instead of only selecting a local task, select a task for all SMT
> > siblings for every reschedule on the core (irrespective which logical
> > CPU does the reschedule).
>
> This:
>
> >
> > During a CPU hotplug event, schedule would be called with the hotplugged
> > CPU not in the cpumask. So use for_each_cpu(_wrap)_or to include the
> > current cpu in the task pick loop.
> >
> > There are multiple loops in pick_next_task that iterate over CPUs in
> > smt_mask. During a hotplug event, sibling could be removed from the
> > smt_mask while pick_next_task is running. So we cannot trust the mask
> > across the different loops. This can confuse the logic. Add a retry logic
> > if smt_mask changes between the loops.
>
> isn't entirely accurate anymore, is it?

Yes you are right, we need to delete this bit from the changelog. :-(. I'll
go do that.

thanks,

- Joel

2020-10-23 19:31:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Fri, Oct 23, 2020 at 01:57:24PM -0400, Joel Fernandes wrote:
> On Fri, Oct 23, 2020 at 03:54:00PM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 23, 2020 at 03:51:29PM +0200, Peter Zijlstra wrote:
> > > On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > > > + /*
> > > > + * If this sibling doesn't yet have a suitable task to
> > > > + * run; ask for the most elegible task, given the
> > > > + * highest priority task already selected for this
> > > > + * core.
> > > > + */
> > > > + p = pick_task(rq_i, class, max);
> > > > + if (!p) {
> > > > + /*
> > > > + * If there weren't no cookies; we don't need to
> > > > + * bother with the other siblings.
> > > > + * If the rest of the core is not running a tagged
> > > > + * task, i.e. need_sync == 0, and the current CPU
> > > > + * which called into the schedule() loop does not
> > > > + * have any tasks for this class, skip selecting for
> > > > + * other siblings since there's no point. We don't skip
> > > > + * for RT/DL because that could make CFS force-idle RT.
> > > > + */
> > > > + if (i == cpu && !need_sync && class == &fair_sched_class)
> > > > + goto next_class;
> > > > +
> > > > + continue;
> > > > + }
> > >
> > > I'm failing to understand the class == &fair_sched_class bit.
>
> The last line in the comment explains it "We don't skip for RT/DL because
> that could make CFS force-idle RT.".

Well, yes, but it does not explain how this can come about, now does it.

> Even if need_sync == false, we need to go look at other CPUs (non-local
> CPUs) to see if they could be running RT.
>
> Say the RQs in a particular core look like this:
> Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.
>
> rq0 rq1
> CFS1 (tagged) RT1 (not tag)
> CFS2 (tagged)
>
> Say schedule() runs on rq0. Now, it will enter the above loop and
> pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> and see that need_sync == false and will skip RT entirely.
>
> The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> rq0 rq1
> CFS1 IDLE
>
> When it should have selected:
> rq0 r1
> IDLE RT
>
> I saw this issue on real-world usecases in ChromeOS where an RT task gets
> constantly force-idled and breaks RT. The "class == &fair_sched_class" bit
> cures it.

Ah, I see. The thing is, this loses the optimization for a bunch of
valid (and arguably common) scenarios. The problem is that the moment we
end up selecting a task with a cookie we've invalidated the premise
under which we ended up with the selected task.

How about this then?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4709,6 +4709,7 @@ pick_next_task(struct rq *rq, struct tas
need_sync = !!rq->core->core_cookie;

/* reset state */
+reset:
rq->core->core_cookie = 0UL;
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
@@ -4748,14 +4749,8 @@ pick_next_task(struct rq *rq, struct tas
/*
* If there weren't no cookies; we don't need to
* bother with the other siblings.
- * If the rest of the core is not running a tagged
- * task, i.e. need_sync == 0, and the current CPU
- * which called into the schedule() loop does not
- * have any tasks for this class, skip selecting for
- * other siblings since there's no point. We don't skip
- * for RT/DL because that could make CFS force-idle RT.
*/
- if (i == cpu && !need_sync && !p->core_cookie)
+ if (i == cpu && !need_sync)
goto next_class;

continue;
@@ -4765,7 +4760,17 @@ pick_next_task(struct rq *rq, struct tas
* Optimize the 'normal' case where there aren't any
* cookies and we don't need to sync up.
*/
- if (i == cpu && !need_sync && !p->core_cookie) {
+ if (i == cpu && !need_sync) {
+ if (p->core_cookie) {
+ /*
+ * This optimization is only valid as
+ * long as there are no cookies
+ * involved.
+ */
+ need_sync = true;
+ goto reset;
+ }
+
next = p;
goto done;
}
@@ -4805,7 +4810,6 @@ pick_next_task(struct rq *rq, struct tas
*/
need_sync = true;
}
-
}
}
next_class:;

2020-10-24 00:34:41

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 01:57:24PM -0400, Joel Fernandes wrote:
> > On Fri, Oct 23, 2020 at 03:54:00PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 23, 2020 at 03:51:29PM +0200, Peter Zijlstra wrote:
> > > > On Mon, Oct 19, 2020 at 09:43:16PM -0400, Joel Fernandes (Google) wrote:
> > > > > + /*
> > > > > + * If this sibling doesn't yet have a suitable task to
> > > > > + * run; ask for the most elegible task, given the
> > > > > + * highest priority task already selected for this
> > > > > + * core.
> > > > > + */
> > > > > + p = pick_task(rq_i, class, max);
> > > > > + if (!p) {
> > > > > + /*
> > > > > + * If there weren't no cookies; we don't need to
> > > > > + * bother with the other siblings.
> > > > > + * If the rest of the core is not running a tagged
> > > > > + * task, i.e. need_sync == 0, and the current CPU
> > > > > + * which called into the schedule() loop does not
> > > > > + * have any tasks for this class, skip selecting for
> > > > > + * other siblings since there's no point. We don't skip
> > > > > + * for RT/DL because that could make CFS force-idle RT.
> > > > > + */
> > > > > + if (i == cpu && !need_sync && class == &fair_sched_class)
> > > > > + goto next_class;
> > > > > +
> > > > > + continue;
> > > > > + }
> > > >
> > > > I'm failing to understand the class == &fair_sched_class bit.
> >
> > The last line in the comment explains it "We don't skip for RT/DL because
> > that could make CFS force-idle RT.".
>
> Well, yes, but it does not explain how this can come about, now does it.

Sorry, I should have made it a separate commit with the below explanation. Oh
well, live and learn!

> > Even if need_sync == false, we need to go look at other CPUs (non-local
> > CPUs) to see if they could be running RT.
> >
> > Say the RQs in a particular core look like this:
> > Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.
> >
> > rq0 rq1
> > CFS1 (tagged) RT1 (not tag)
> > CFS2 (tagged)
> >
> > Say schedule() runs on rq0. Now, it will enter the above loop and
> > pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> > and see that need_sync == false and will skip RT entirely.
> >
> > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > rq0 rq1
> > CFS1 IDLE
> >
> > When it should have selected:
> > rq0 r1
> > IDLE RT
> >
> > I saw this issue on real-world usecases in ChromeOS where an RT task gets
> > constantly force-idled and breaks RT. The "class == &fair_sched_class" bit
> > cures it.
>
> Ah, I see. The thing is, this looses the optimization for a bunch of
> valid (and arguably common) scenarios. The problem is that the moment we
> end up selecting a task with a cookie we've invalidated the premise
> under which we ended up with the selected task.
>
> How about this then?

This does look better. It makes sense and I think it will work. I will look
more into it and also test it.

BTW, as a further optimization in the future, isn't it better for the
schedule() loop on 1 HT to select for all HTs *even if* need_sync == false to
begin with? I.e., when no cookied tasks are runnable.

That way the pick loop in schedule() running on other HTs can directly pick
what was pre-selected for it via:
	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
	    rq->core->core_pick_seq != rq->core_sched_seq &&
	    rq->core_pick)
.. which I think is more efficient. It's just a thought and may not be worth doing.
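
A rough sketch of that sibling fast path (the fields come from the snippet
above; the surrounding function structure and the trailing bookkeeping are
assumptions, not code from the series):

	/* In the sibling's pick_next_task(): if the core-wide selection is
	 * still current and this rq has not yet consumed its pre-selected
	 * pick, use it directly instead of redoing the core-wide selection. */
	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
	    rq->core->core_pick_seq != rq->core_sched_seq &&
	    rq->core_pick) {
		next = rq->core_pick;
		rq->core_sched_seq = rq->core->core_pick_seq;
		rq->core_pick = NULL;
		return next;
	}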

thanks,

- Joel


> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4709,6 +4709,7 @@ pick_next_task(struct rq *rq, struct tas
> need_sync = !!rq->core->core_cookie;
>
> /* reset state */
> +reset:
> rq->core->core_cookie = 0UL;
> for_each_cpu(i, smt_mask) {
> struct rq *rq_i = cpu_rq(i);
> @@ -4748,14 +4749,8 @@ pick_next_task(struct rq *rq, struct tas
> /*
> * If there weren't no cookies; we don't need to
> * bother with the other siblings.
> - * If the rest of the core is not running a tagged
> - * task, i.e. need_sync == 0, and the current CPU
> - * which called into the schedule() loop does not
> - * have any tasks for this class, skip selecting for
> - * other siblings since there's no point. We don't skip
> - * for RT/DL because that could make CFS force-idle RT.
> */
> - if (i == cpu && !need_sync && !p->core_cookie)
> + if (i == cpu && !need_sync)
> goto next_class;
>
> continue;
> @@ -4765,7 +4760,17 @@ pick_next_task(struct rq *rq, struct tas
> * Optimize the 'normal' case where there aren't any
> * cookies and we don't need to sync up.
> */
> - if (i == cpu && !need_sync && !p->core_cookie) {
> + if (i == cpu && !need_sync) {
> + if (p->core_cookie) {
> + /*
> + * This optimization is only valid as
> + * long as there are no cookies
> + * involved.
> + */
> + need_sync = true;
> + goto reset;
> + }
> +
> next = p;
> goto done;
> }
> @@ -4805,7 +4810,6 @@ pick_next_task(struct rq *rq, struct tas
> */
> need_sync = true;
> }
> -
> }
> }
> next_class:;
>

2020-10-26 10:16:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:

> > How about this then?
>
> This does look better. It makes sense and I think it will work. I will look
> more into it and also test it.

Hummm... Looking at it again I wonder if I can make something like the
below work.

(depends on the next patch that pulls core_forceidle into core-wide
state)

That would retain the CFS-cgroup optimization as well, for as long as
there are no cookies around.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4691,8 +4691,6 @@ pick_next_task(struct rq *rq, struct tas
return next;
}

- put_prev_task_balance(rq, prev, rf);
-
smt_mask = cpu_smt_mask(cpu);

/*
@@ -4707,14 +4705,25 @@ pick_next_task(struct rq *rq, struct tas
*/
rq->core->core_task_seq++;
need_sync = !!rq->core->core_cookie;
-
- /* reset state */
-reset:
- rq->core->core_cookie = 0UL;
if (rq->core->core_forceidle) {
need_sync = true;
rq->core->core_forceidle = false;
}
+
+ if (!need_sync) {
+ next = __pick_next_task(rq, prev, rf);
+ if (!next->core_cookie) {
+ rq->core_pick = NULL;
+ return next;
+ }
+ put_prev_task(next);
+ need_sync = true;
+ } else {
+ put_prev_task_balance(rq, prev, rf);
+ }
+
+ /* reset state */
+ rq->core->core_cookie = 0UL;
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);

@@ -4744,35 +4752,8 @@ pick_next_task(struct rq *rq, struct tas
* core.
*/
p = pick_task(rq_i, class, max);
- if (!p) {
- /*
- * If there weren't no cookies; we don't need to
- * bother with the other siblings.
- */
- if (i == cpu && !need_sync)
- goto next_class;
-
+ if (!p)
continue;
- }
-
- /*
- * Optimize the 'normal' case where there aren't any
- * cookies and we don't need to sync up.
- */
- if (i == cpu && !need_sync) {
- if (p->core_cookie) {
- /*
- * This optimization is only valid as
- * long as there are no cookies
- * involved.
- */
- need_sync = true;
- goto reset;
- }
-
- next = p;
- goto done;
- }

rq_i->core_pick = p;

2020-10-26 11:52:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> BTW, as further optimization in the future, isn't it better for the
> schedule() loop on 1 HT to select for all HT *even if* need_sync == false to
> begin with? i.e. no cookied tasks are runnable.
>
> That way the pick loop in schedule() running on other HTs can directly pick
> what was pre-selected for it via:
> if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> rq->core->core_pick_seq != rq->core_sched_seq &&
> rq->core_pick)
> .. which I think is more efficient. Its just a thought and may not be worth doing.

I'm not sure that works. Imagine a sibling doing a wakeup (or sleep)
just after you've done your core wide pick. Then it will have to repick and
you end up having to do 2*nr_smt picks instead of 2 picks.


2020-10-28 16:04:03

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Mon, Oct 26, 2020 at 09:28:14AM +0100, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> > BTW, as further optimization in the future, isn't it better for the
> > schedule() loop on 1 HT to select for all HT *even if* need_sync == false to
> > begin with? i.e. no cookied tasks are runnable.
> >
> > That way the pick loop in schedule() running on other HTs can directly pick
> > what was pre-selected for it via:
> > if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> > rq->core->core_pick_seq != rq->core_sched_seq &&
> > rq->core_pick)
> > .. which I think is more efficient. Its just a thought and may not be worth doing.
>
> I'm not sure that works. Imagine a sibling doing a wakeup (or sleep)
> just after you done your core wide pick. Then it will have to repick and
> you end up with having to do 2*nr_smt picks instead of 2 picks.

For a workload that has mostly runnable tasks (not doing a lot of wakeup /
sleep), maybe it makes sense.

Also if you have only cookied tasks and they are doing wake up / sleep, then
you have 2*nr_smt_picks anyway as the core picks constantly get invalidated,
AFAICS.

I guess in the current code, the assumptions are:
1. Most tasks are not cookied tasks
2. They can wake up and sleep a lot

I guess those are OK assumptions, but maybe we could document them.
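
A minimal sketch of such a comment (the wording is illustrative only, not
taken from any posted patch):

	/*
	 * Assumptions behind the single-CPU fast path:
	 *
	 * 1) Most runnable tasks are untagged (p->core_cookie == 0), so the
	 *    unconstrained pick is the common case.
	 *
	 * 2) Tasks wake up and sleep frequently, so pre-selecting picks for
	 *    all siblings up front would often be invalidated (and redone)
	 *    before the siblings get to consume them.
	 */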

thanks,

- Joel

2020-10-30 13:29:09

by Ning, Hongyu

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 00/26] Core scheduling

On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> Eighth iteration of the Core-Scheduling feature.
>
> Core scheduling is a feature that allows only trusted tasks to run
> concurrently on cpus sharing compute resources (eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). Core scheduling (as of v7) mitigates
> user-space to user-space attacks and user to kernel attack when one of
> the siblings enters the kernel via interrupts or system call.
>
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When a tag
> is enabled in a cgroup and a task from that cgroup is running on a
> hardware thread, the scheduler ensures that only idle or trusted tasks
> run on the other sibling(s). Besides security concerns, this feature can
> also be beneficial for RT and performance applications where we want to
> control how tasks make use of SMT dynamically.
>
> This iteration focuses on the the following stuff:
> - Redesigned API.
> - Rework of Kernel Protection feature based on Thomas's entry work.
> - Rework of hotplug fixes.
> - Address review comments in v7
>
> Joel: Both a CGroup and Per-task interface via prctl(2) are provided for
> configuring core sharing. More details are provided in documentation patch.
> Kselftests are provided to verify the correctness/rules of the interface.
>
> Julien: TPCC tests showed improvements with core-scheduling. With kernel
> protection enabled, it does not show any regression. Possibly ASI will improve
> the performance for those who choose kernel protection (can be toggled through
> sched_core_protect_kernel sysctl). Results:
> v8 average stdev diff
> baseline (SMT on) 1197.272 44.78312824
> core sched ( kernel protect) 412.9895 45.42734343 -65.51%
> core sched (no kernel protect) 686.6515 71.77756931 -42.65%
> nosmt 408.667 39.39042872 -65.87%
>
> v8 is rebased on tip/master.
>
> Future work
> ===========
> - Load balancing/Migration fixes for core scheduling.
> With v6, Load balancing is partially coresched aware, but has some
> issues w.r.t process/taskgroup weights:
> https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
> - Core scheduling test framework: kselftests, torture tests etc
>
> Changes in v8
> =============
> - New interface/API implementation
> - Joel
> - Revised kernel protection patch
> - Joel
> - Revised Hotplug fixes
> - Joel
> - Minor bug fixes and address review comments
> - Vineeth
>

> create mode 100644 tools/testing/selftests/sched/config
> create mode 100644 tools/testing/selftests/sched/test_coresched.c
>

Adding 4 workloads test results for Core Scheduling v8:

- kernel under test: coresched community v8 from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched-v5.9
- workloads:
-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
-- D. will-it-scale context_switch via pipe (192 threads)
- test machine setup:
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
- test results:
-- workload A, no obvious performance drop in cs_on:
+----------------------+------+----------------------+------------------------+
| | ** | sysbench cpu * 192 | sysbench mysql * 192 |
+======================+======+======================+========================+
| cgroup | ** | cg_sysbench_cpu_0 | cg_sysbench_mysql_0 |
+----------------------+------+----------------------+------------------------+
| record_item | ** | Tput_avg (events/s) | Tput_avg (events/s) |
+----------------------+------+----------------------+------------------------+
| coresched_normalized | ** | 1.01 | 0.87 |
+----------------------+------+----------------------+------------------------+
| default_normalized | ** | 1 | 1 |
+----------------------+------+----------------------+------------------------+
| smtoff_normalized | ** | 0.59 | 0.82 |
+----------------------+------+----------------------+------------------------+

-- workload B, no obvious performance drop in cs_on:
+----------------------+------+----------------------+------------------------+
| | ** | sysbench cpu * 192 | sysbench cpu * 192 |
+======================+======+======================+========================+
| cgroup | ** | cg_sysbench_cpu_0 | cg_sysbench_cpu_1 |
+----------------------+------+----------------------+------------------------+
| record_item | ** | Tput_avg (events/s) | Tput_avg (events/s) |
+----------------------+------+----------------------+------------------------+
| coresched_normalized | ** | 1.01 | 0.98 |
+----------------------+------+----------------------+------------------------+
| default_normalized | ** | 1 | 1 |
+----------------------+------+----------------------+------------------------+
| smtoff_normalized | ** | 0.6 | 0.6 |
+----------------------+------+----------------------+------------------------+

-- workload C, known performance drop in cs_on since Core Scheduling v6:
+----------------------+------+---------------------------+---------------------------+
| | ** | uperf netperf TCP * 192 | uperf netperf UDP * 192 |
+======================+======+===========================+===========================+
| cgroup | ** | cg_uperf | cg_uperf |
+----------------------+------+---------------------------+---------------------------+
| record_item | ** | Tput_avg (Gb/s) | Tput_avg (Gb/s) |
+----------------------+------+---------------------------+---------------------------+
| coresched_normalized | ** | 0.46 | 0.48 |
+----------------------+------+---------------------------+---------------------------+
| default_normalized | ** | 1 | 1 |
+----------------------+------+---------------------------+---------------------------+
| smtoff_normalized | ** | 0.82 | 0.79 |
+----------------------+------+---------------------------+---------------------------+

-- workload D, new added syscall workload, performance drop in cs_on:
+----------------------+------+-------------------------------+
| | ** | will-it-scale * 192 |
| | | (pipe based context_switch) |
+======================+======+===============================+
| cgroup | ** | cg_will-it-scale |
+----------------------+------+-------------------------------+
| record_item | ** | threads_avg |
+----------------------+------+-------------------------------+
| coresched_normalized | ** | 0.2 |
+----------------------+------+-------------------------------+
| default_normalized | ** | 1 |
+----------------------+------+-------------------------------+
| smtoff_normalized | ** | 0.89 |
+----------------------+------+-------------------------------+

comments: per internal analysis, we suspect a huge amount of spin_lock contention in cs_on, which may lead to a significant performance drop

- notes on test results record_item:
* coresched_normalized: smton, cs enabled, test result normalized by default value
* default_normalized: smton, cs disabled, test result normalized by default value
* smtoff_normalized: smtoff, test result normalized by default value

2020-10-31 00:46:18

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase

On Mon, Oct 19, 2020 at 6:45 PM Joel Fernandes (Google)
<[email protected]> wrote:
>
> +static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
> +{
> + unsigned long color = 0;
> +
> + if (!tg)
> + return 0;
> +
> + for (; tg; tg = tg->parent) {
> + if (tg->core_tag_color) {
> + WARN_ON_ONCE(color);
> + color = tg->core_tag_color;
> + }
> +
> + if (tg->core_tagged) {
> + unsigned long cookie = ((unsigned long)tg << 8) | color;
> + cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
> + return cookie;
> + }
> + }
> +
> + return 0;
> +}

I'm a bit wary of how core_task_cookie and core_group_cookie are
truncated to the lower half of their bits and combined into the
overall core_cookie. Now that core_group_cookie is further losing 8
bits to color, that leaves (in the case of 32 bit unsigned long) only
8 bits to uniquely identify the group contribution to the cookie.
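
To make the truncation concrete, below is a small stand-alone program
modelling the 32-bit case (the task_group address and color values are made
up for illustration):

	#include <stdio.h>

	int main(void)
	{
		/* Model a 32-bit unsigned long with unsigned int. */
		unsigned int tg    = 0x8134a6c0u;  /* hypothetical task_group address */
		unsigned int color = 0x5u;

		unsigned int cookie = (tg << 8) | color;
		/* Keep half of the bits, as cpu_core_get_group_cookie() does. */
		cookie &= (1u << (sizeof(unsigned int) * 4)) - 1;

		/* Prints 0xc005: 8 bits of color plus only the low 8 bits of tg. */
		printf("cookie = %#x\n", cookie);
		return 0;
	}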

Also, I agree that 256 colors is likely adequate, but it would be nice
to avoid this restriction.

I'd like to propose the following alternative, which involves creating
a new struct to represent the core cookie:

struct core_cookie {
	unsigned long task_cookie;
	unsigned long group_cookie;
	unsigned long color;
	/* can be further extended with arbitrary fields */

	struct rb_node node;
	refcount_t refcnt;
};

struct rb_root core_cookies; /* (sorted), all active core_cookies */
seqlock_t core_cookies_lock; /* protects against removal/addition to core_cookies */

struct task_struct {
	...
	unsigned long core_cookie; /* (struct core_cookie *) */
}

A given task stores the address of a core_cookie struct in its
core_cookie field. When we reconfigure a task's
color/task_cookie/group_cookie, we can first look for an existing
core_cookie that matches those settings, or create a new one.
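
A hedged sketch of what the find-or-create lookup could look like (it assumes
the refcount field is named refcnt, orders entries lexicographically, and
leaves out the locking, which would need core_cookies_lock):

	static struct core_cookie *
	core_cookie_find_or_create(unsigned long task_cookie,
				   unsigned long group_cookie, unsigned long color)
	{
		struct rb_node **link = &core_cookies.rb_node, *parent = NULL;
		struct core_cookie *ck;

		while (*link) {
			parent = *link;
			ck = rb_entry(parent, struct core_cookie, node);

			if (task_cookie == ck->task_cookie &&
			    group_cookie == ck->group_cookie && color == ck->color) {
				refcount_inc(&ck->refcnt);	/* reuse existing cookie */
				return ck;
			}

			/* Order by (task_cookie, group_cookie, color). */
			if (task_cookie < ck->task_cookie ||
			    (task_cookie == ck->task_cookie &&
			     (group_cookie < ck->group_cookie ||
			      (group_cookie == ck->group_cookie && color < ck->color))))
				link = &parent->rb_left;
			else
				link = &parent->rb_right;
		}

		ck = kzalloc(sizeof(*ck), GFP_KERNEL);
		if (!ck)
			return NULL;

		ck->task_cookie = task_cookie;
		ck->group_cookie = group_cookie;
		ck->color = color;
		refcount_set(&ck->refcnt, 1);

		rb_link_node(&ck->node, parent, link);
		rb_insert_color(&ck->node, &core_cookies);
		return ck;
	}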

2020-11-03 02:56:56

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase

On Fri, Oct 30, 2020 at 05:42:12PM -0700, Josh Don wrote:
> On Mon, Oct 19, 2020 at 6:45 PM Joel Fernandes (Google)
> <[email protected]> wrote:
> >
> > +static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
> > +{
> > + unsigned long color = 0;
> > +
> > + if (!tg)
> > + return 0;
> > +
> > + for (; tg; tg = tg->parent) {
> > + if (tg->core_tag_color) {
> > + WARN_ON_ONCE(color);
> > + color = tg->core_tag_color;
> > + }
> > +
> > + if (tg->core_tagged) {
> > + unsigned long cookie = ((unsigned long)tg << 8) | color;
> > + cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
> > + return cookie;
> > + }
> > + }
> > +
> > + return 0;
> > +}
>
> I'm a bit wary of how core_task_cookie and core_group_cookie are
> truncated to the lower half of their bits and combined into the
> overall core_cookie. Now that core_group_cookie is further losing 8
> bits to color, that leaves (in the case of 32 bit unsigned long) only
> 8 bits to uniquely identify the group contribution to the cookie.
>
> Also, I agree that 256 colors is likely adequate, but it would be nice
> to avoid this restriction.
>
> I'd like to propose the following alternative, which involves creating
> a new struct to represent the core cookie:
>
> struct core_cookie {
> unsigned long task_cookie;
> unsigned long group_cookie;
> unsigned long color;
> /* can be further extended with arbitrary fields */
>
> struct rb_node node;
> refcount_t;
> };
>
> struct rb_root core_cookies; /* (sorted), all active core_cookies */
> seqlock_t core_cookies_lock; /* protects against removal/addition to
> core_cookies */
>
> struct task_struct {
> ...
> unsigned long core_cookie; /* (struct core_cookie *) */
> }
>
> A given task stores the address of a core_cookie struct in its
> core_cookie field. When we reconfigure a task's
> color/task_cookie/group_cookie, we can first look for an existing
> core_cookie that matches those settings, or create a new one.

Josh,

This sounds good to me.

Just to mention one thing, for stuff like the following, you'll have to write
functions that can do greater-than, less-than operations, etc.

static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
{
	if (a->core_cookie < b->core_cookie)
		return true;

	if (a->core_cookie > b->core_cookie)
		return false;

And pretty much everywhere you do null-checks on core_cookie, or access
core_cookie for any other reasons.
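
For example, with core_cookie holding a pointer, the less-than comparison
might become something like the sketch below (field names follow the proposal
above; the exact ordering is an illustrative choice):

	static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
	{
		struct core_cookie *ca = (struct core_cookie *)a->core_cookie;
		struct core_cookie *cb = (struct core_cookie *)b->core_cookie;

		/* A zero/NULL cookie sorts before any real cookie. */
		if (!ca || !cb)
			return !ca && cb;

		if (ca->task_cookie != cb->task_cookie)
			return ca->task_cookie < cb->task_cookie;
		if (ca->group_cookie != cb->group_cookie)
			return ca->group_cookie < cb->group_cookie;
		return ca->color < cb->color;
	}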

Also there are kselftests that need trivial modifications to pass with the new
changes you propose.

Looking forward to the patch for this improvement, and thanks.

thanks,

- Joel


> > More information about attacks:
> > For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
> > data to either host or guest attackers. For L1TF, it is possible to leak
> > to guest attackers. There is no possible mitigation involving flushing
> > of buffers to avoid this since the execution of attacker and victims
> > happen concurrently on 2 or more HTs.
> >
> > Cc: Julien Desfossez <[email protected]>
> > Cc: Tim Chen <[email protected]>
> > Cc: Aaron Lu <[email protected]>
> > Cc: Aubrey Li <[email protected]>
> > Cc: Tim Chen <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Co-developed-by: Vineeth Pillai <[email protected]>
> > Tested-by: Julien Desfossez <[email protected]>
> > Signed-off-by: Vineeth Pillai <[email protected]>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > ---
> > .../admin-guide/kernel-parameters.txt | 7 +
> > include/linux/entry-common.h | 2 +-
> > include/linux/sched.h | 12 +
> > kernel/entry/common.c | 25 +-
> > kernel/sched/core.c | 229 ++++++++++++++++++
> > kernel/sched/sched.h | 3 +
> > 6 files changed, 275 insertions(+), 3 deletions(-)
> >

2020-11-05 15:54:34

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase

On Wed, Oct 28, 2020 at 02:23:02PM +0800, Li, Aubrey wrote:
> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
> > Google has a usecase where the first level tag to tag a CGroup is not
> > sufficient. So, a patch is carried for years where a second tag is added which
> > is writeable by unprivileged users.
> >
> > Google uses DAC controls to make the 'tag' possible to set only by root while
> > the second-level 'color' can be changed by anyone. The actual names that
> > Google uses is different, but the concept is the same.
> >
> > The hierarchy looks like:
> >
> >         Root group
> >         /        \
> >        A          B     (These are created by the root daemon - borglet).
> >       / \          \
> >      C   D          E   (These are created by AppEngine within the container).
> >
> > The reason why Google has two parts is that AppEngine wants to allow a subset of
> > subcgroups within a parent tagged cgroup sharing execution. Think of these
> > subcgroups belong to the same customer or project. Because these subcgroups are
> > created by AppEngine, they are not tracked by borglet (the root daemon),
> > therefore borglet won't have a chance to set a color for them. That's where
> > 'color' file comes from. Color could be set by AppEngine, and once set, the
> > normal tasks within the subcgroup would not be able to overwrite it. This is
> > enforced by promoting the permission of the color file in cgroupfs.
> >
> > The 'color' is a 8-bit value allowing for upto 256 unique colors. IMHO, having
> > more than these many CGroups sounds like a scalability issue so this suffices.
> > We steal the lower 8-bits of the cookie to set the color.
> >
>
> So when color = 0, tasks in group A C D can run together on the HTs in same core,
> And if I set the color of taskC in group C = 1, then taskC has a different cookie
> from taskA and taskD, so in terms of taskA, what's the difference between taskC
> and [taskB or taskE]? The color breaks the relationship that C belongs to A.

C does still belong to A in the sense that A cannot share with B, which implies
C can never share with B. Setting C's color does not change that fact. So
coloring is irrelevant in your question.

Sure, A cannot share with C either after coloring, but that's irrelevant and
not the point of doing the coloring.

thanks,

- Joel

2020-11-05 18:54:38

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Mon, Oct 26, 2020 at 10:31:31AM +0100, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> > On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:
>
> > > How about this then?
> >
> > This does look better. It makes sense and I think it will work. I will look
> > more into it and also test it.
>
> Hummm... Looking at it again I wonder if I can make something like the
> below work.
>
> (depends on the next patch that pulls core_forceidle into core-wide
> state)
>
> That would retain the CFS-cgroup optimization as well, for as long as
> there's no cookies around.
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4691,8 +4691,6 @@ pick_next_task(struct rq *rq, struct tas
> return next;
> }
>
> - put_prev_task_balance(rq, prev, rf);
> -
> smt_mask = cpu_smt_mask(cpu);
>
> /*
> @@ -4707,14 +4705,25 @@ pick_next_task(struct rq *rq, struct tas
> */
> rq->core->core_task_seq++;
> need_sync = !!rq->core->core_cookie;
> -
> - /* reset state */
> -reset:
> - rq->core->core_cookie = 0UL;
> if (rq->core->core_forceidle) {
> need_sync = true;
> rq->core->core_forceidle = false;
> }
> +
> + if (!need_sync) {
> + next = __pick_next_task(rq, prev, rf);

This could end up triggering pick_next_task_fair's newidle balancing;

> + if (!next->core_cookie) {
> + rq->core_pick = NULL;
> + return next;
> + }

.. only to realize here, after pick_next_task_fair(), that we have to put_prev
the task back as it has a cookie, but the effect of newidle balancing cannot
be reverted.

Would that be a problem, since the newly pulled task might be incompatible and
it would have been better to leave it alone?

TBH, this is a drastic change and we've done a lot of testing with the
current code and it's looking good. I'm a little scared of changing it right
now and introducing a regression. Can we maybe do this after the existing
patches are upstream?

thanks,

- Joel


> + put_prev_task(next);
> + need_sync = true;
> + } else {
> + put_prev_task_balance(rq, prev, rf);
> + }
> +
> + /* reset state */
> + rq->core->core_cookie = 0UL;
> for_each_cpu(i, smt_mask) {
> struct rq *rq_i = cpu_rq(i);
>
> @@ -4744,35 +4752,8 @@ pick_next_task(struct rq *rq, struct tas
> * core.
> */
> p = pick_task(rq_i, class, max);
> - if (!p) {
> - /*
> - * If there weren't no cookies; we don't need to
> - * bother with the other siblings.
> - */
> - if (i == cpu && !need_sync)
> - goto next_class;
> -
> + if (!p)
> continue;
> - }
> -
> - /*
> - * Optimize the 'normal' case where there aren't any
> - * cookies and we don't need to sync up.
> - */
> - if (i == cpu && !need_sync) {
> - if (p->core_cookie) {
> - /*
> - * This optimization is only valid as
> - * long as there are no cookies
> - * involved.
> - */
> - need_sync = true;
> - goto reset;
> - }
> -
> - next = p;
> - goto done;
> - }
>
> rq_i->core_pick = p;
>

2020-11-05 22:09:30

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 06/26] sched: Add core wide task selection and scheduling.

On Thu, Nov 05, 2020 at 01:50:19PM -0500, Joel Fernandes wrote:
> On Mon, Oct 26, 2020 at 10:31:31AM +0100, Peter Zijlstra wrote:
> > On Fri, Oct 23, 2020 at 05:31:18PM -0400, Joel Fernandes wrote:
> > > On Fri, Oct 23, 2020 at 09:26:54PM +0200, Peter Zijlstra wrote:
> >
> > > > How about this then?
> > >
> > > This does look better. It makes sense and I think it will work. I will look
> > > more into it and also test it.
> >
> > Hummm... Looking at it again I wonder if I can make something like the
> > below work.
> >
> > (depends on the next patch that pulls core_forceidle into core-wide
> > state)
> >
> > That would retain the CFS-cgroup optimization as well, for as long as
> > there's no cookies around.
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4691,8 +4691,6 @@ pick_next_task(struct rq *rq, struct tas
> > return next;
> > }
> >
> > - put_prev_task_balance(rq, prev, rf);
> > -
> > smt_mask = cpu_smt_mask(cpu);
> >
> > /*
> > @@ -4707,14 +4705,25 @@ pick_next_task(struct rq *rq, struct tas
> > */
> > rq->core->core_task_seq++;
> > need_sync = !!rq->core->core_cookie;
> > -
> > - /* reset state */
> > -reset:
> > - rq->core->core_cookie = 0UL;
> > if (rq->core->core_forceidle) {
> > need_sync = true;
> > rq->core->core_forceidle = false;
> > }
> > +
> > + if (!need_sync) {
> > + next = __pick_next_task(rq, prev, rf);
>
> This could end up triggering pick_next_task_fair's newidle balancing;
>
> > + if (!next->core_cookie) {
> > + rq->core_pick = NULL;
> > + return next;
> > + }
>
> .. only to realize here that pick_next_task_fair() that we have to put_prev
> the task back as it has a cookie, but the effect of newidle balancing cannot
> be reverted.
>
> Would that be a problem as the newly pulled task might be incompatible and
> would have been better to leave it alone?
>
> TBH, this is a drastic change and we've done a lot of testing with the
> current code and its looking good. I'm a little scared of changing it right
> now and introducing regression. Can we maybe do this after the existing
> patches are upstream?

After sleeping over it, I am trying something like the following. Thoughts?

Basically, I call pick_task() in advance. That's mostly all that's different
from your patch:

---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0ce17aa72694..366e5ed84a63 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5000,28 +5000,34 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
put_prev_task_balance(rq, prev, rf);

smt_mask = cpu_smt_mask(cpu);
-
- /*
- * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
- *
- * @task_seq guards the task state ({en,de}queues)
- * @pick_seq is the @task_seq we did a selection on
- * @sched_seq is the @pick_seq we scheduled
- *
- * However, preemptions can cause multiple picks on the same task set.
- * 'Fix' this by also increasing @task_seq for every pick.
- */
- rq->core->core_task_seq++;
need_sync = !!rq->core->core_cookie;

/* reset state */
-reset:
rq->core->core_cookie = 0UL;
if (rq->core->core_forceidle) {
need_sync = true;
fi_before = true;
rq->core->core_forceidle = false;
}
+
+ /*
+ * Optimize for common case where this CPU has no cookies
+ * and there are no cookied tasks running on siblings.
+ */
+ if (!need_sync) {
+ for_each_class(class) {
+ next = class->pick_task(rq);
+ if (next)
+ break;
+ }
+
+ if (!next->core_cookie) {
+ rq->core_pick = NULL;
+ goto done;
+ }
+ need_sync = true;
+ }
+
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);

@@ -5039,6 +5045,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
}

+ /*
+ * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+ *
+ * @task_seq guards the task state ({en,de}queues)
+ * @pick_seq is the @task_seq we did a selection on
+ * @sched_seq is the @pick_seq we scheduled
+ *
+ * However, preemptions can cause multiple picks on the same task set.
+ * 'Fix' this by also increasing @task_seq for every pick.
+ */
+ rq->core->core_task_seq++;
+
/*
* Try and select tasks for each sibling in decending sched_class
* order.
@@ -5059,40 +5077,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* core.
*/
p = pick_task(rq_i, class, max);
- if (!p) {
- /*
- * If there weren't no cookies; we don't need to
- * bother with the other siblings.
- */
- if (i == cpu && !need_sync)
- goto next_class;
-
+ if (!p)
continue;
- }
-
- /*
- * Optimize the 'normal' case where there aren't any
- * cookies and we don't need to sync up.
- */
- if (i == cpu && !need_sync) {
- if (p->core_cookie) {
- /*
- * This optimization is only valid as
- * long as there are no cookies
- * involved. We may have skipped
- * non-empty higher priority classes on
- * siblings, which are empty on this
- * CPU, so start over.
- */
- need_sync = true;
- goto reset;
- }
-
- next = p;
- trace_printk("unconstrained pick: %s/%d %lx\n",
- next->comm, next->pid, next->core_cookie);
- goto done;
- }

if (!is_task_rq_idle(p))
occ++;

2020-11-06 03:00:51

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 00/26] Core scheduling

On 2020/10/30 21:26, Ning, Hongyu wrote:
> On 2020/10/20 9:43, Joel Fernandes (Google) wrote:
>> Eighth iteration of the Core-Scheduling feature.
>>
>> Core scheduling is a feature that allows only trusted tasks to run
>> concurrently on cpus sharing compute resources (eg: hyperthreads on a
>> core). The goal is to mitigate the core-level side-channel attacks
>> without requiring to disable SMT (which has a significant impact on
>> performance in some situations). Core scheduling (as of v7) mitigates
>> user-space to user-space attacks and user to kernel attack when one of
>> the siblings enters the kernel via interrupts or system call.
>>
>> By default, the feature doesn't change any of the current scheduler
>> behavior. The user decides which tasks can run simultaneously on the
>> same core (for now by having them in the same tagged cgroup). When a tag
>> is enabled in a cgroup and a task from that cgroup is running on a
>> hardware thread, the scheduler ensures that only idle or trusted tasks
>> run on the other sibling(s). Besides security concerns, this feature can
>> also be beneficial for RT and performance applications where we want to
>> control how tasks make use of SMT dynamically.
>>
>> This iteration focuses on the the following stuff:
>> - Redesigned API.
>> - Rework of Kernel Protection feature based on Thomas's entry work.
>> - Rework of hotplug fixes.
>> - Address review comments in v7
>>
>> Joel: Both a CGroup and Per-task interface via prctl(2) are provided for
>> configuring core sharing. More details are provided in documentation patch.
>> Kselftests are provided to verify the correctness/rules of the interface.
>>
>> Julien: TPCC tests showed improvements with core-scheduling. With kernel
>> protection enabled, it does not show any regression. Possibly ASI will improve
>> the performance for those who choose kernel protection (can be toggled through
>> sched_core_protect_kernel sysctl). Results:
>> v8 average stdev diff
>> baseline (SMT on) 1197.272 44.78312824
>> core sched ( kernel protect) 412.9895 45.42734343 -65.51%
>> core sched (no kernel protect) 686.6515 71.77756931 -42.65%
>> nosmt 408.667 39.39042872 -65.87%
>>
>> v8 is rebased on tip/master.
>>
>> Future work
>> ===========
>> - Load balancing/Migration fixes for core scheduling.
>> With v6, Load balancing is partially coresched aware, but has some
>> issues w.r.t process/taskgroup weights:
>> https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
>> - Core scheduling test framework: kselftests, torture tests etc
>>
>> Changes in v8
>> =============
>> - New interface/API implementation
>> - Joel
>> - Revised kernel protection patch
>> - Joel
>> - Revised Hotplug fixes
>> - Joel
>> - Minor bug fixes and address review comments
>> - Vineeth
>>
>
>> create mode 100644 tools/testing/selftests/sched/config
>> create mode 100644 tools/testing/selftests/sched/test_coresched.c
>>
>
> Adding 4 workloads test results for Core Scheduling v8:
>
> - kernel under test: coresched community v8 from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched-v5.9
> - workloads:
> -- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
> -- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
> -- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
> -- D. will-it-scale context_switch via pipe (192 threads)
> - test machine setup:
> CPU(s): 192
> On-line CPU(s) list: 0-191
> Thread(s) per core: 2
> Core(s) per socket: 48
> Socket(s): 2
> NUMA node(s): 4
> - test results:
> -- workload A, no obvious performance drop in cs_on:
> +----------------------+------+----------------------+------------------------+
> | | ** | sysbench cpu * 192 | sysbench mysql * 192 |
> +======================+======+======================+========================+
> | cgroup | ** | cg_sysbench_cpu_0 | cg_sysbench_mysql_0 |
> +----------------------+------+----------------------+------------------------+
> | record_item | ** | Tput_avg (events/s) | Tput_avg (events/s) |
> +----------------------+------+----------------------+------------------------+
> | coresched_normalized | ** | 1.01 | 0.87 |
> +----------------------+------+----------------------+------------------------+
> | default_normalized | ** | 1 | 1 |
> +----------------------+------+----------------------+------------------------+
> | smtoff_normalized | ** | 0.59 | 0.82 |
> +----------------------+------+----------------------+------------------------+
>
> -- workload B, no obvious performance drop in cs_on:
> +----------------------+------+----------------------+------------------------+
> | | ** | sysbench cpu * 192 | sysbench cpu * 192 |
> +======================+======+======================+========================+
> | cgroup | ** | cg_sysbench_cpu_0 | cg_sysbench_cpu_1 |
> +----------------------+------+----------------------+------------------------+
> | record_item | ** | Tput_avg (events/s) | Tput_avg (events/s) |
> +----------------------+------+----------------------+------------------------+
> | coresched_normalized | ** | 1.01 | 0.98 |
> +----------------------+------+----------------------+------------------------+
> | default_normalized | ** | 1 | 1 |
> +----------------------+------+----------------------+------------------------+
> | smtoff_normalized | ** | 0.6 | 0.6 |
> +----------------------+------+----------------------+------------------------+
>
> -- workload C, known performance drop in cs_on since Core Scheduling v6:
> +----------------------+------+---------------------------+---------------------------+
> | | ** | uperf netperf TCP * 192 | uperf netperf UDP * 192 |
> +======================+======+===========================+===========================+
> | cgroup | ** | cg_uperf | cg_uperf |
> +----------------------+------+---------------------------+---------------------------+
> | record_item | ** | Tput_avg (Gb/s) | Tput_avg (Gb/s) |
> +----------------------+------+---------------------------+---------------------------+
> | coresched_normalized | ** | 0.46 | 0.48 |
> +----------------------+------+---------------------------+---------------------------+
> | default_normalized | ** | 1 | 1 |
> +----------------------+------+---------------------------+---------------------------+
> | smtoff_normalized | ** | 0.82 | 0.79 |
> +----------------------+------+---------------------------+---------------------------+

It is known that when coresched is on, uperf offloads softirq service to
ksoftirqd, and the cookie of ksoftirqd is different from the cookie of uperf.
As a result, ksoftirqd could previously run concurrently with uperf, but cannot now.

>
> -- workload D, new added syscall workload, performance drop in cs_on:
> +----------------------+------+-------------------------------+
> | | ** | will-it-scale * 192 |
> | | | (pipe based context_switch) |
> +======================+======+===============================+
> | cgroup | ** | cg_will-it-scale |
> +----------------------+------+-------------------------------+
> | record_item | ** | threads_avg |
> +----------------------+------+-------------------------------+
> | coresched_normalized | ** | 0.2 |
> +----------------------+------+-------------------------------+
> | default_normalized | ** | 1 |
> +----------------------+------+-------------------------------+
> | smtoff_normalized | ** | 0.89 |
> +----------------------+------+-------------------------------+

will-it-scale may be a very extreme case. The story here is,
- On one sibling, a reader/writer gets blocked and tries to schedule another reader/writer in.
- The other sibling tries to wake up the reader/writer.

Both CPUs are acquiring rq->__lock,

So when coresched off, they are two different locks, lock stat(1 second delta) below:

class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&rq->__lock: 210 210 0.10 3.04 180.87 0.86 797 79165021 0.03 20.69 60650198.34 0.77

But when coresched on, they are actually one same lock, lock stat(1 second delta) below:

class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&rq->__lock: 6479459 6484857 0.05 216.46 60829776.85 9.38 8346319 15399739 0.03 95.56 81119515.38 5.27

This nature of core scheduling may degrade the performance of similar workloads with frequent context switching.

Any thoughts?

Thanks,
-Aubrey

2020-11-06 17:57:27

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 00/26] Core scheduling

On Fri, Nov 06, 2020 at 10:58:58AM +0800, Li, Aubrey wrote:

> >
> > -- workload D, new added syscall workload, performance drop in cs_on:
> > +----------------------+------+-------------------------------+
> > | | ** | will-it-scale * 192 |
> > | | | (pipe based context_switch) |
> > +======================+======+===============================+
> > | cgroup | ** | cg_will-it-scale |
> > +----------------------+------+-------------------------------+
> > | record_item | ** | threads_avg |
> > +----------------------+------+-------------------------------+
> > | coresched_normalized | ** | 0.2 |
> > +----------------------+------+-------------------------------+
> > | default_normalized | ** | 1 |
> > +----------------------+------+-------------------------------+
> > | smtoff_normalized | ** | 0.89 |
> > +----------------------+------+-------------------------------+
>
> will-it-scale may be a very extreme case. The story here is,
> - On one sibling reader/writer gets blocked and tries to schedule another reader/writer in.
> - The other sibling tries to wake up reader/writer.
>
> Both CPUs are acquiring rq->__lock,
>
> So when coresched off, they are two different locks, lock stat(1 second delta) below:
>
> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> &rq->__lock: 210 210 0.10 3.04 180.87 0.86 797 79165021 0.03 20.69 60650198.34 0.77
>
> But when coresched on, they are actually one same lock, lock stat(1 second delta) below:
>
> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> &rq->__lock: 6479459 6484857 0.05 216.46 60829776.85 9.38 8346319 15399739 0.03 95.56 81119515.38 5.27
>
> This nature of core scheduling may degrade the performance of similar workloads with frequent context switching.

When core sched is off, is SMT off as well? From the above table, it seems to
be. So even for core sched off, there will be a single lock per physical CPU
core (assuming SMT is also off) right? Or did I miss something?

thanks,

- Joel

2020-11-06 20:57:40

by Joel Fernandes

[permalink] [raw]
Subject: [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling)

All,

I am getting ready to send the next v9 series based on tip/master
branch. Could you please give the below tree a try and report any results in
your testing?
git tree:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
git log:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched

The major changes in this series are the improvements:
(1)
"sched: Make snapshotting of min_vruntime more CGroup-friendly"
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=9a20a6652b3c50fd51faa829f7947004239a04eb

(2)
"sched: Simplify the core pick loop for optimized case"
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=0370117b4fd418cdaaa6b1489bfc14f305691152

And a bug fix:
(1)
"sched: Enqueue task into core queue only after vruntime is updated"
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=401dad5536e7e05d1299d0864e6fc5072029f492

There are also 2 more bug fixes that I squashed-in related to kernel
protection and a crash seen on the tip/master branch.

Hoping to send the series next week out to the list.

Have a great weekend, and Thanks!

- Joel


On Mon, Oct 19, 2020 at 09:43:10PM -0400, Joel Fernandes (Google) wrote:
> Eighth iteration of the Core-Scheduling feature.
>
> Core scheduling is a feature that allows only trusted tasks to run
> concurrently on cpus sharing compute resources (eg: hyperthreads on a
> core). The goal is to mitigate the core-level side-channel attacks
> without requiring to disable SMT (which has a significant impact on
> performance in some situations). Core scheduling (as of v7) mitigates
> user-space to user-space attacks and user to kernel attack when one of
> the siblings enters the kernel via interrupts or system call.
>
> By default, the feature doesn't change any of the current scheduler
> behavior. The user decides which tasks can run simultaneously on the
> same core (for now by having them in the same tagged cgroup). When a tag
> is enabled in a cgroup and a task from that cgroup is running on a
> hardware thread, the scheduler ensures that only idle or trusted tasks
> run on the other sibling(s). Besides security concerns, this feature can
> also be beneficial for RT and performance applications where we want to
> control how tasks make use of SMT dynamically.
>
> This iteration focuses on the the following stuff:
> - Redesigned API.
> - Rework of Kernel Protection feature based on Thomas's entry work.
> - Rework of hotplug fixes.
> - Address review comments in v7
>
> Joel: Both a CGroup and Per-task interface via prctl(2) are provided for
> configuring core sharing. More details are provided in documentation patch.
> Kselftests are provided to verify the correctness/rules of the interface.
>
> Julien: TPCC tests showed improvements with core-scheduling. With kernel
> protection enabled, it does not show any regression. Possibly ASI will improve
> the performance for those who choose kernel protection (can be toggled through
> sched_core_protect_kernel sysctl). Results:
> v8 average stdev diff
> baseline (SMT on) 1197.272 44.78312824
> core sched ( kernel protect) 412.9895 45.42734343 -65.51%
> core sched (no kernel protect) 686.6515 71.77756931 -42.65%
> nosmt 408.667 39.39042872 -65.87%
>
> v8 is rebased on tip/master.
>
> Future work
> ===========
> - Load balancing/Migration fixes for core scheduling.
> With v6, Load balancing is partially coresched aware, but has some
> issues w.r.t process/taskgroup weights:
> https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...
> - Core scheduling test framework: kselftests, torture tests etc
>
> Changes in v8
> =============
> - New interface/API implementation
> - Joel
> - Revised kernel protection patch
> - Joel
> - Revised Hotplug fixes
> - Joel
> - Minor bug fixes and address review comments
> - Vineeth
>
> Changes in v7
> =============
> - Kernel protection from untrusted usermode tasks
> - Joel, Vineeth
> - Fix for hotplug crashes and hangs
> - Joel, Vineeth
>
> Changes in v6
> =============
> - Documentation
> - Joel
> - Pause siblings on entering nmi/irq/softirq
> - Joel, Vineeth
> - Fix for RCU crash
> - Joel
> - Fix for a crash in pick_next_task
> - Yu Chen, Vineeth
> - Minor re-write of core-wide vruntime comparison
> - Aaron Lu
> - Cleanup: Address Review comments
> - Cleanup: Remove hotplug support (for now)
> - Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
> - Joel, Vineeth
>
> Changes in v5
> =============
> - Fixes for cgroup/process tagging during corner cases like cgroup
> destroy, task moving across cgroups etc
> - Tim Chen
> - Coresched aware task migrations
> - Aubrey Li
> - Other minor stability fixes.
>
> Changes in v4
> =============
> - Implement a core wide min_vruntime for vruntime comparison of tasks
> across cpus in a core.
> - Aaron Lu
> - Fixes a typo bug in setting the forced_idle cpu.
> - Aaron Lu
>
> Changes in v3
> =============
> - Fixes the issue of sibling picking up an incompatible task
> - Aaron Lu
> - Vineeth Pillai
> - Julien Desfossez
> - Fixes the issue of starving threads due to forced idle
> - Peter Zijlstra
> - Fixes the refcounting issue when deleting a cgroup with tag
> - Julien Desfossez
> - Fixes a crash during cpu offline/online with coresched enabled
> - Vineeth Pillai
> - Fixes a comparison logic issue in sched_core_find
> - Aaron Lu
>
> Changes in v2
> =============
> - Fixes for couple of NULL pointer dereference crashes
> - Subhra Mazumdar
> - Tim Chen
> - Improves priority comparison logic for process in different cpus
> - Peter Zijlstra
> - Aaron Lu
> - Fixes a hard lockup in rq locking
> - Vineeth Pillai
> - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
> - Vineeth Pillai
> - Julien Desfossez
> - Fix for 32bit build
> - Aubrey Li
>
> Aubrey Li (1):
> sched: migration changes for core scheduling
>
> Joel Fernandes (Google) (13):
> sched/fair: Snapshot the min_vruntime of CPUs on force idle
> arch/x86: Add a new TIF flag for untrusted tasks
> kernel/entry: Add support for core-wide protection of kernel-mode
> entry/idle: Enter and exit kernel protection during idle entry and
> exit
> sched: Split the cookie and setup per-task cookie on fork
> sched: Add a per-thread core scheduling interface
> sched: Add a second-level tag for nested CGroup usecase
> sched: Release references to the per-task cookie on exit
> sched: Handle task addition to CGroup
> sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
> kselftest: Add tests for core-sched interface
> sched: Move core-scheduler interfacing code to a new file
> Documentation: Add core scheduling documentation
>
> Peter Zijlstra (10):
> sched: Wrap rq::lock access
> sched: Introduce sched_class::pick_task()
> sched: Core-wide rq->lock
> sched/fair: Add a few assertions
> sched: Basic tracking of matching tasks
> sched: Add core wide task selection and scheduling.
> sched: Trivial forced-newidle balancer
> irq_work: Cleanup
> sched: cgroup tagging interface for core scheduling
> sched: Debug bits...
>
> Vineeth Pillai (2):
> sched/fair: Fix forced idle sibling starvation corner case
> entry/kvm: Protect the kernel when entering from guest
>
> .../admin-guide/hw-vuln/core-scheduling.rst | 312 +++++
> Documentation/admin-guide/hw-vuln/index.rst | 1 +
> .../admin-guide/kernel-parameters.txt | 7 +
> arch/x86/include/asm/thread_info.h | 2 +
> arch/x86/kvm/x86.c | 3 +
> drivers/gpu/drm/i915/i915_request.c | 4 +-
> include/linux/entry-common.h | 20 +-
> include/linux/entry-kvm.h | 12 +
> include/linux/irq_work.h | 33 +-
> include/linux/irqflags.h | 4 +-
> include/linux/sched.h | 27 +-
> include/uapi/linux/prctl.h | 3 +
> kernel/Kconfig.preempt | 6 +
> kernel/bpf/stackmap.c | 2 +-
> kernel/entry/common.c | 25 +-
> kernel/entry/kvm.c | 13 +
> kernel/fork.c | 1 +
> kernel/irq_work.c | 18 +-
> kernel/printk/printk.c | 6 +-
> kernel/rcu/tree.c | 3 +-
> kernel/sched/Makefile | 1 +
> kernel/sched/core.c | 1135 ++++++++++++++++-
> kernel/sched/coretag.c | 468 +++++++
> kernel/sched/cpuacct.c | 12 +-
> kernel/sched/deadline.c | 34 +-
> kernel/sched/debug.c | 8 +-
> kernel/sched/fair.c | 272 ++--
> kernel/sched/idle.c | 24 +-
> kernel/sched/pelt.h | 2 +-
> kernel/sched/rt.c | 22 +-
> kernel/sched/sched.h | 302 ++++-
> kernel/sched/stop_task.c | 13 +-
> kernel/sched/topology.c | 4 +-
> kernel/sys.c | 3 +
> kernel/time/tick-sched.c | 6 +-
> kernel/trace/bpf_trace.c | 2 +-
> tools/include/uapi/linux/prctl.h | 3 +
> tools/testing/selftests/sched/.gitignore | 1 +
> tools/testing/selftests/sched/Makefile | 14 +
> tools/testing/selftests/sched/config | 1 +
> .../testing/selftests/sched/test_coresched.c | 840 ++++++++++++
> 41 files changed, 3437 insertions(+), 232 deletions(-)
> create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
> create mode 100644 kernel/sched/coretag.c
> create mode 100644 tools/testing/selftests/sched/.gitignore
> create mode 100644 tools/testing/selftests/sched/Makefile
> create mode 100644 tools/testing/selftests/sched/config
> create mode 100644 tools/testing/selftests/sched/test_coresched.c
>
> --
> 2.29.0.rc1.297.gfa9743e501-goog
>

2020-11-09 06:06:24

by Li, Aubrey

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 00/26] Core scheduling

On 2020/11/7 1:54, Joel Fernandes wrote:
> On Fri, Nov 06, 2020 at 10:58:58AM +0800, Li, Aubrey wrote:
>
>>>
>>> -- workload D, new added syscall workload, performance drop in cs_on:
>>> +----------------------+------+-------------------------------+
>>> | | ** | will-it-scale * 192 |
>>> | | | (pipe based context_switch) |
>>> +======================+======+===============================+
>>> | cgroup | ** | cg_will-it-scale |
>>> +----------------------+------+-------------------------------+
>>> | record_item | ** | threads_avg |
>>> +----------------------+------+-------------------------------+
>>> | coresched_normalized | ** | 0.2 |
>>> +----------------------+------+-------------------------------+
>>> | default_normalized | ** | 1 |
>>> +----------------------+------+-------------------------------+
>>> | smtoff_normalized | ** | 0.89 |
>>> +----------------------+------+-------------------------------+
>>
>> will-it-scale may be a very extreme case. The story here is,
>> - On one sibling reader/writer gets blocked and tries to schedule another reader/writer in.
>> - The other sibling tries to wake up reader/writer.
>>
>> Both CPUs are acquiring rq->__lock,
>>
>> So when coresched off, they are two different locks, lock stat(1 second delta) below:
>>
>> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
>> &rq->__lock: 210 210 0.10 3.04 180.87 0.86 797 79165021 0.03 20.69 60650198.34 0.77
>>
>> But when coresched on, they are actually one same lock, lock stat(1 second delta) below:
>>
>> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
>> &rq->__lock: 6479459 6484857 0.05 216.46 60829776.85 9.38 8346319 15399739 0.03 95.56 81119515.38 5.27
>>
>> This nature of core scheduling may degrade the performance of similar workloads with frequent context switching.
>
> When core sched is off, is SMT off as well? From the above table, it seems to
> be. So even for core sched off, there will be a single lock per physical CPU
> core (assuming SMT is also off) right? Or did I miss something?
>

The table includes 3 cases:
- default: SMT on, coresched off
- coresched: SMT on, coresched on
- smtoff: SMT off, coresched off

I was comparing the default (coresched off & SMT on) case with the (coresched
on & SMT on) case.

If SMT is off, then the reader and writer on different cores have different
rq->locks, so the lock contention is not that serious.

class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&rq->__lock: 60 60 0.11 1.92 41.33 0.69 127 67184172 0.03 22.95 33160428.37 0.49

Does this address your concern?

Thanks,
-Aubrey

2020-11-12 16:13:10

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation

On Mon, Oct 19, 2020 at 08:36:25PM -0700, Randy Dunlap wrote:
> Hi Joel,
>
> On 10/19/20 6:43 PM, Joel Fernandes (Google) wrote:
> > Document the usecases, design and interfaces for core scheduling.
> >
> > Co-developed-by: Vineeth Pillai <[email protected]>
> > Tested-by: Julien Desfossez <[email protected]>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>

All fixed as below, updated and thanks!

---8<-----------------------

From: "Joel Fernandes (Google)" <[email protected]>
Subject: [PATCH] Documentation: Add core scheduling documentation

Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai <[email protected]>
Tested-by: Julien Desfossez <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
.../admin-guide/hw-vuln/core-scheduling.rst | 313 ++++++++++++++++++
Documentation/admin-guide/hw-vuln/index.rst | 1 +
2 files changed, 314 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..c7399809c74d
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,313 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks don't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, it is not guaranteed that performance will
+always improve, though that is seen to be the case with a number of real-world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case though not
+always, as synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead, especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This allows all of the CGroup's tasks to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.tag, it is not possible to set this
+ for any descendant of the tagged group. For finer grained control, the
+ ``cpu.tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+ a core with kernel threads and untagged system threads. For this reason,
+ if a group has ``cpu.tag`` of 0, it is considered to be trusted.
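+
+As an illustration only (this sketch is not part of the kernel sources), a
+CGroup can be tagged from a small C program by writing ``1`` into its
+``cpu.tag`` file. The path below assumes the CPU controller is mounted at
+``/sys/fs/cgroup/cpu`` and that a group named ``trusted_group`` already
+exists; adjust it to match the local setup::
+
+  /* Hypothetical example: tag an existing CGroup so that its tasks may
+   * share a core. Requires a kernel built with CONFIG_SCHED_CORE. */
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <unistd.h>
+
+  int main(void)
+  {
+          /* The cgroupfs path is an assumption; it depends on where the
+           * cpu controller hierarchy is mounted on this system. */
+          int fd = open("/sys/fs/cgroup/cpu/trusted_group/cpu.tag", O_WRONLY);
+
+          if (fd < 0) {
+                  perror("open cpu.tag");
+                  return 1;
+          }
+          /* Writing "1" tags the group; its tasks may now run concurrently
+           * on the hyperthreads (siblings) of a core. */
+          if (write(fd, "1", 1) != 1)
+                  perror("write cpu.tag");
+          close(fd);
+          return 0;
+  }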
+
+* ``cpu.tag_color``
+For finer grained control over core sharing, a color can also be set in
+addition to the tag. This allows further control of core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie` which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.tag`` writable only by root, while
+``cpu.tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+ Root group
+ / \
+ A B (These are created by the root daemon - borglet).
+ / \ \
+ C D E (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
+daemon called borglet. borglet then tags each of these groups with the ``cpu.tag``
+file. The job itself can create additional child CGroups which are colored by
+the container's AppEngine with the ``cpu.tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants to
+allow a subset of child CGroups within a tagged parent CGroup to be co-scheduled on a
+core while not being co-scheduled with other child CGroups. Think of these
+child CGroups as belonging to the same customer or project. Because these
+child CGroups are created by AppEngine, they are not tracked by borglet (the
+root daemon), therefore borglet won't have a chance to set a color for them.
+That's where the ``cpu.tag_color`` file comes in. A color could be set by AppEngine,
+and once set, the normal tasks within the subcgroup would not be able to
+overwrite it. This is enforced by promoting the permission of the
+``cpu.tag_color`` file in cgroupfs.
+
+The color is an 8-bit value allowing for up to 256 unique colors.
+
+.. note:: Once a CGroup is colored, none of its descendants can be re-colored. Also
+ coloring of a CGroup is possible only if either the group or one of its
+ ancestors was tagged via the ``cpu.tag`` file.
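+
+To make the two-level scheme above concrete, the following sketch (again
+illustrative only; the cgroupfs paths mirror the A/B/C/D/E hierarchy shown
+earlier and are assumptions) first tags the containers and then colors the
+child groups: C and D receive the same color and may thus share a core with
+each other, while E (under the other container) gets its own color::
+
+  /* Hypothetical example of the two-level tagging scheme described above. */
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <string.h>
+  #include <unistd.h>
+
+  /* Write a small string value into a cgroupfs control file. */
+  static int cg_write(const char *path, const char *val)
+  {
+          int fd = open(path, O_WRONLY);
+
+          if (fd < 0 || write(fd, val, strlen(val)) < 0) {
+                  perror(path);
+                  if (fd >= 0)
+                          close(fd);
+                  return -1;
+          }
+          return close(fd);
+  }
+
+  int main(void)
+  {
+          /* The root daemon tags the two containers (needs root). */
+          cg_write("/sys/fs/cgroup/cpu/A/cpu.tag", "1");
+          cg_write("/sys/fs/cgroup/cpu/B/cpu.tag", "1");
+          /* The container manager colors the child groups: C and D share
+           * a color (and thus may share a core); E gets a different one. */
+          cg_write("/sys/fs/cgroup/cpu/A/C/cpu.tag_color", "10");
+          cg_write("/sys/fs/cgroup/cpu/A/D/cpu.tag_color", "10");
+          cg_write("/sys/fs/cgroup/cpu/B/E/cpu.tag_color", "20");
+          return 0;
+  }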
+
+prctl interface
+###############
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` is available to a process to request
+sharing a core with another process. For example, consider 2 processes ``P1``
+and ``P2`` with PIDs 100 and 200. If process ``P1`` calls
+``prctl(PR_SCHED_CORE_SHARE, 200)``, the kernel makes ``P1`` share a core with ``P2``.
+The kernel performs ptrace access mode checks before granting the request.
+
+.. note:: This operation is not commutative. P1 calling
+ ``prctl(PR_SCHED_CORE_SHARE, pidof(P2))`` is not the same as P2 calling the
+ same for P1. The former case is P1 joining P2's group of processes
+ (which P2 would have joined with ``prctl(2)`` prior to P1's ``prctl(2)``).
+
+.. note:: The core-sharing granted with prctl(2) will be subject to
+ core-sharing restrictions specified by the CGroup interface. For example
+ if P1 and P2 are a part of 2 different tagged CGroups, then they will
+ not share a core even if a prctl(2) call is made. This is analogous
+ to how affinities are set using the cpuset interface.
+
+It is important to note that, on a ``CLONE_THREAD`` ``clone(2)`` syscall, the child
+will be assigned the same tag as its parent and thus be allowed to share a core
+with it. This design choice is because, for the security usecase, a
+``CLONE_THREAD`` child can access its parent's address space anyway, so there's
+no point in not allowing them to share a core. If a different behavior is
+desired, the child thread can call ``prctl(2)`` as needed. This behavior is
+specific to the ``prctl(2)`` interface. For the CGroup interface, the child of a
+fork always shares a core with its parent. On the other hand, if a parent
+was previously tagged via ``prctl(2)`` and does a regular ``fork(2)`` syscall, the
+child will receive a unique tag.
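+
+A minimal usage sketch follows (illustrative only: the ``PR_SCHED_CORE_SHARE``
+definition comes from this patch series and is not in released uapi headers,
+so the fallback value used below is just a placeholder)::
+
+  /* Hypothetical example: ask the kernel to let this process share a core
+   * with the process whose PID is given on the command line. */
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <sys/prctl.h>
+
+  #ifndef PR_SCHED_CORE_SHARE
+  #define PR_SCHED_CORE_SHARE 59  /* placeholder; use the series' uapi value */
+  #endif
+
+  int main(int argc, char *argv[])
+  {
+          if (argc != 2) {
+                  fprintf(stderr, "usage: %s <pid>\n", argv[0]);
+                  return 1;
+          }
+
+          /* The caller (P1) joins the core-sharing group of the target (P2).
+           * The kernel performs ptrace access mode checks before granting. */
+          if (prctl(PR_SCHED_CORE_SHARE, (unsigned long)atol(argv[1]))) {
+                  perror("prctl(PR_SCHED_CORE_SHARE)");
+                  return 1;
+          }
+          return 0;
+  }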
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that, every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, as it trusts everything and everything
+trusts it.
+
+During a schedule() event on any sibling of a core, the highest priority task on
+the sibling's core is picked and assigned to the sibling calling schedule(), if
+the sibling has the task enqueued. For the rest of the siblings in the core,
+the highest priority task with the same cookie is selected if one is runnable
+in their individual run queues. If a task with the same cookie is not available,
+the idle task is selected; the idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+siblings for whom a new task was selected. Siblings on receiving the IPI will
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a `forced idle` state. I.e., it may
+have tasks on its own runqueue to run, however it will still have to run idle.
+More on this in the next section.
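+
+The selection rule can be pictured with the following toy userspace model (a
+simplified sketch, not the kernel's actual pick code): the core-wide highest
+priority task fixes the cookie for the pick, and every other sibling either
+runs its highest priority task with a matching cookie or is forced idle::
+
+  /* Toy model of the core-wide pick described above; illustrative only. */
+  #include <stdio.h>
+
+  #define FORCED_IDLE -1
+
+  struct task {
+          int prio;    /* larger value = higher priority, for this sketch */
+          long cookie; /* tasks with equal cookies trust each other */
+  };
+
+  /* Index of the highest-priority task in rq[] whose cookie matches, or
+   * FORCED_IDLE if no runnable task has that cookie. */
+  static int pick_with_cookie(const struct task *rq, int n, long cookie)
+  {
+          int best = FORCED_IDLE;
+
+          for (int i = 0; i < n; i++) {
+                  if (rq[i].cookie != cookie)
+                          continue;
+                  if (best == FORCED_IDLE || rq[i].prio > rq[best].prio)
+                          best = i;
+          }
+          return best;
+  }
+
+  int main(void)
+  {
+          /* Two siblings of one core, each with its own tiny runqueue. */
+          struct task rq0[] = { { 10, 1 }, { 5, 2 } };
+          struct task rq1[] = { { 8, 2 }, { 3, 1 } };
+
+          /* The core-wide highest priority task is rq0[0] (prio 10, cookie 1);
+           * its cookie constrains what the other sibling may run. */
+          long cookie = rq0[0].cookie;
+
+          /* Sibling 1 picks rq1[1] (cookie 1) even though rq1[0] has higher
+           * priority, because rq1[0]'s cookie does not match; with no
+           * cookie-1 task at all, sibling 1 would be forced idle. */
+          printf("sibling0 picks %d, sibling1 picks %d\n",
+                 pick_with_cookie(rq0, 2, cookie),
+                 pick_with_cookie(rq1, 2, cookie));
+          return 0;
+  }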
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core. However,
+it is possible that some runqueues had tasks that were incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority
+task. If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT::
+
+ HT1 (attack) HT2 (victim)
+ A idle -> user space user space -> idle
+ B idle -> user space guest -> idle
+ C idle -> guest user space -> idle
+ D idle -> guest guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or cause a
+VMEXIT in the case of guests. At best, this would only leak some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Kernel protection from untrusted tasks
+--------------------------------------
+The scheduler on its own cannot protect the kernel executing concurrently with
+an untrusted task in a core. This is because the scheduler is unaware of
+interrupts/syscalls at scheduling time. To mitigate this, an IPI is sent to
+siblings on kernel entry (syscall and IRQ). This IPI forces the sibling to enter
+kernel mode and wait before returning to user mode until all siblings of the
+core have left kernel mode. This process is also known as stunning. For good
+performance, an IPI is sent to a sibling only if it is running a tagged
+task. If a sibling is running a kernel thread or is idle, no IPI is sent.
+
+The kernel protection feature can be turned off on the kernel command line by
+passing ``sched_core_protect_kernel=0``.
+
+Other alternative ideas discussed for kernel protection are listed below just
+for completeness. They all have limitations:
+
+1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
+#########################################################################################
+By changing the interrupt affinities to a designated safe-CPU which runs
+only trusted tasks, IRQ data can be protected. One issue is this involves
+giving up a full CPU core of the system to run safe tasks. Another is that,
+per-cpu interrupts such as the local timer interrupt cannot have their
+affinity changed. Also, sensitive timer callbacks such as the random entropy timer
+can run in softirq on return from these interrupts and expose sensitive
+data. In the future, that could be mitigated by forcing softirqs into threaded
+mode by utilizing a mechanism similar to ``CONFIG_PREEMPT_RT``.
+
+Yet another issue with this is, for multiqueue devices with managed
+interrupts, the IRQ affinities cannot be changed; however it could be
+possible to force a reduced number of queues which would in turn allow
+shielding one or two CPUs from such interrupts and queue handling for the price
+of indirection.
+
+2. Running IRQs as threaded-IRQs
+################################
+This would result in forcing IRQs into the scheduler which would then provide
+the process-context mitigation. However, not all interrupts can be threaded.
+Also this does nothing about syscall entries.
+
+3. Kernel Address Space Isolation
+#################################
+System calls could run in a much restricted address space which is
+guaranteed not to leak any sensitive data. There are practical limitations in
+implementing this - the main concern being how to decide on an address space
+that is guaranteed to not have any sensitive data.
+
+4. Limited cookie-based protection
+##################################
+On a system call, change the cookie to the system trusted cookie and initiate a
+schedule event. This would be better than pausing all the siblings during the
+entire duration for the system call, but still would be a huge hit to the
+performance.
+
+Trust model
+-----------
+Core scheduling maintains trust relationships amongst groups of tasks by
+assigning them the same cookie value.
+When a system with core scheduling boots, all tasks are considered to trust
+each other. This is because the core scheduler does not have information about
+trust relationships until userspace uses the above mentioned interfaces to
+communicate them. In other words, all tasks have a default cookie value of 0
+and are considered system-wide trusted. The stunning of siblings running
+cookie-0 tasks is also avoided.
+
+Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
+within such groups are considered to trust each other, but do not trust those
+outside. Tasks outside the group also don't trust tasks within.
+
+Limitations
+-----------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks run
+concurrently, or the kernel could be running concurrently with a task not trusted
+by the kernel.
+
+1. IPI processing delays
+########################
+Core scheduling selects only trusted tasks to run together. An IPI is used to notify
+the siblings to switch to the new task. But there could be hardware delays in
+receiving the IPI on some architectures (on x86, this has not been observed). This
+may cause an attacker task to start running on a CPU before its siblings receive
+the IPI. Even though the cache is flushed on entry to user mode, victim tasks on
+siblings may populate data in the cache and microarchitectural buffers after the
+attacker starts to run, and this is a possibility for data leak.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+##########
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+###########
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT (Extended Page Tables).
+
+For both MDS and L1TF, if the guest vCPUs are configured to not trust each
+other (by tagging them separately), then the guest-to-guest attacks would go away.
+Or it could be a system admin policy which considers guest-to-guest attacks as
+a guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+Use cases
+---------
+The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
+ that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that need to be scheduled
+ together could also be realized using core scheduling. One example is vCPUs of
+ a VM.
+
+Future work
+-----------
+Skipping per-HT mitigations if task is trusted
+##############################################
+If core scheduling is enabled, by default all tasks trust each other as
+mentioned above. In such a scenario, it may be desirable to skip the same-HT
+mitigations on return to the trusted user-mode to improve performance.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 21710f8609fe..361ccbbd9e54 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
multihit.rst
special-register-buffer-data-sampling.rst
l1d_flush.rst
+ core-scheduling.rst
--
2.29.2.222.g5d2a92d10f8-goog

2020-11-13 09:25:51

by Ning, Hongyu

[permalink] [raw]
Subject: Re: [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling)

On 2020/11/7 4:55, Joel Fernandes wrote:
> All,
>
> I am getting ready to send the next v9 series based on tip/master
> branch. Could you please give the below tree a try and report any results in
> your testing?
> git tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
> git log:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched
>
> The major changes in this series are the improvements:
> (1)
> "sched: Make snapshotting of min_vruntime more CGroup-friendly"
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=9a20a6652b3c50fd51faa829f7947004239a04eb
>
> (2)
> "sched: Simplify the core pick loop for optimized case"
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=0370117b4fd418cdaaa6b1489bfc14f305691152
>
> And a bug fix:
> (1)
> "sched: Enqueue task into core queue only after vruntime is updated"
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=401dad5536e7e05d1299d0864e6fc5072029f492
>
> There are also 2 more bug fixes that I squashed-in related to kernel
> protection and a crash seen on the tip/master branch.
>
> Hoping to send the series next week out to the list.
>
> Have a great weekend, and Thanks!
>
> - Joel
>
>
> On Mon, Oct 19, 2020 at 09:43:10PM -0400, Joel Fernandes (Google) wrote:

Adding 4 workloads test results for core scheduling v9 candidate:

- kernel under test:
-- coresched community v9 candidate from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
-- latest commit: 2e8591a330ff (HEAD -> coresched, origin/coresched) NEW: sched: Add a coresched command line option
-- coresched=on kernel parameter applied
- workloads:
-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
-- D. will-it-scale context_switch via pipe (192 threads)
- test machine setup:
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
- test results, no obvious performance drop compared to community v8 build:
-- workload A:
+----------------------+------+----------------------+------------------------+
| | ** | sysbench cpu * 192 | sysbench cpu * 192 |
+======================+======+======================+========================+
| cgroup | ** | cg_sysbench_cpu_0 | cg_sysbench_cpu_1 |
+----------------------+------+----------------------+------------------------+
| record_item | ** | Tput_avg (events/s) | Tput_avg (events/s) |
+----------------------+------+----------------------+------------------------+
| coresched_normalized | ** | 0.98 | 1.01 |
+----------------------+------+----------------------+------------------------+
| default_normalized | ** | 1 | 1 |
+----------------------+------+----------------------+------------------------+
| smtoff_normalized | ** | 0.59 | 0.6 |
+----------------------+------+----------------------+------------------------+

-- workload B:
+----------------------+------+----------------------+------------------------+
| | ** | sysbench cpu * 192 | sysbench mysql * 192 |
+======================+======+======================+========================+
| cgroup | ** | cg_sysbench_cpu_0 | cg_sysbench_mysql_0 |
+----------------------+------+----------------------+------------------------+
| record_item | ** | Tput_avg (events/s) | Tput_avg (events/s) |
+----------------------+------+----------------------+------------------------+
| coresched_normalized | ** | 1.02 | 0.78 |
+----------------------+------+----------------------+------------------------+
| default_normalized | ** | 1 | 1 |
+----------------------+------+----------------------+------------------------+
| smtoff_normalized | ** | 0.59 | 0.75 |
+----------------------+------+----------------------+------------------------+

-- workload C:
+----------------------+------+---------------------------+---------------------------+
| | ** | uperf netperf TCP * 192 | uperf netperf UDP * 192 |
+======================+======+===========================+===========================+
| cgroup | ** | cg_uperf | cg_uperf |
+----------------------+------+---------------------------+---------------------------+
| record_item | ** | Tput_avg (Gb/s) | Tput_avg (Gb/s) |
+----------------------+------+---------------------------+---------------------------+
| coresched_normalized | ** | 0.65 | 0.67 |
+----------------------+------+---------------------------+---------------------------+
| default_normalized | ** | 1 | 1 |
+----------------------+------+---------------------------+---------------------------+
| smtoff_normalized | ** | 0.83 | 0.91 |
+----------------------+------+---------------------------+---------------------------+

-- workload D:
+----------------------+------+-------------------------------+
| | ** | will-it-scale * 192 |
| | | (pipe based context_switch) |
+======================+======+===============================+
| cgroup | ** | cg_will-it-scale |
+----------------------+------+-------------------------------+
| record_item | ** | threads_avg |
+----------------------+------+-------------------------------+
| coresched_normalized | ** | 0.29 |
+----------------------+------+-------------------------------+
| default_normalized | ** | 1.00 |
+----------------------+------+-------------------------------+
| smtoff_normalized | ** | 0.87 |
+----------------------+------+-------------------------------+

- notes on test results record_item:
* coresched_normalized: smton, cs enabled, test result normalized by default value
* default_normalized: smton, cs disabled, test result normalized by default value
* smtoff_normalized: smtoff, test result normalized by default value


Hongyu

2020-11-13 10:05:50

by Ning, Hongyu

[permalink] [raw]
Subject: Re: [RFT for v9] (Was Re: [PATCH v8 -tip 00/26] Core scheduling)


On 2020/11/13 17:22, Ning, Hongyu wrote:
> On 2020/11/7 4:55, Joel Fernandes wrote:
>> All,
>>
>> I am getting ready to send the next v9 series based on tip/master
>> branch. Could you please give the below tree a try and report any results in
>> your testing?
>> git tree:
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
>> git log:
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched
>>
>> The major changes in this series are the improvements:
>> (1)
>> "sched: Make snapshotting of min_vruntime more CGroup-friendly"
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=9a20a6652b3c50fd51faa829f7947004239a04eb
>>
>> (2)
>> "sched: Simplify the core pick loop for optimized case"
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=0370117b4fd418cdaaa6b1489bfc14f305691152
>>
>> And a bug fix:
>> (1)
>> "sched: Enqueue task into core queue only after vruntime is updated"
>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=coresched-v9-for-test&id=401dad5536e7e05d1299d0864e6fc5072029f492
>>
>> There are also 2 more bug fixes that I squashed-in related to kernel
>> protection and a crash seen on the tip/master branch.
>>
>> Hoping to send the series next week out to the list.
>>
>> Have a great weekend, and Thanks!
>>
>> - Joel
>>
>>
>> On Mon, Oct 19, 2020 at 09:43:10PM -0400, Joel Fernandes (Google) wrote:
>
> Adding 4 workloads test results for core scheduling v9 candidate:
>
> - kernel under test:
> -- coresched community v9 candidate from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
> -- latest commit: 2e8591a330ff (HEAD -> coresched, origin/coresched) NEW: sched: Add a coresched command line option
> -- coresched=on kernel parameter applied
> - workloads:
> -- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
> -- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
> -- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
> -- D. will-it-scale context_switch via pipe (192 threads)
> - test machine setup:
> CPU(s): 192
> On-line CPU(s) list: 0-191
> Thread(s) per core: 2
> Core(s) per socket: 48
> Socket(s): 2
> NUMA node(s): 4
> - test results, no obvious performance drop compared to community v8 build:
> -- workload A:
> +----------------------+------+----------------------+------------------------+
> | | ** | sysbench cpu * 192 | sysbench cpu * 192 |
> +======================+======+======================+========================+
> | cgroup | ** | cg_sysbench_cpu_0 | cg_sysbench_cpu_1 |
> +----------------------+------+----------------------+------------------------+
> | record_item | ** | Tput_avg (events/s) | Tput_avg (events/s) |
> +----------------------+------+----------------------+------------------------+
> | coresched_normalized | ** | 0.98 | 1.01 |
> +----------------------+------+----------------------+------------------------+
> | default_normalized | ** | 1 | 1 |
> +----------------------+------+----------------------+------------------------+
> | smtoff_normalized | ** | 0.59 | 0.6 |
> +----------------------+------+----------------------+------------------------+
>
> -- workload B:
> +----------------------+------+----------------------+------------------------+
> | | ** | sysbench cpu * 192 | sysbench mysql * 192 |
> +======================+======+======================+========================+
> | cgroup | ** | cg_sysbench_cpu_0 | cg_sysbench_mysql_0 |
> +----------------------+------+----------------------+------------------------+
> | record_item | ** | Tput_avg (events/s) | Tput_avg (events/s) |
> +----------------------+------+----------------------+------------------------+
> | coresched_normalized | ** | 1.02 | 0.78 |
> +----------------------+------+----------------------+------------------------+
> | default_normalized | ** | 1 | 1 |
> +----------------------+------+----------------------+------------------------+
> | smtoff_normalized | ** | 0.59 | 0.75 |
> +----------------------+------+----------------------+------------------------+
>
> -- workload C:
> +----------------------+------+---------------------------+---------------------------+
> | | ** | uperf netperf TCP * 192 | uperf netperf UDP * 192 |
> +======================+======+===========================+===========================+
> | cgroup | ** | cg_uperf | cg_uperf |
> +----------------------+------+---------------------------+---------------------------+
> | record_item | ** | Tput_avg (Gb/s) | Tput_avg (Gb/s) |
> +----------------------+------+---------------------------+---------------------------+
> | coresched_normalized | ** | 0.65 | 0.67 |
> +----------------------+------+---------------------------+---------------------------+
> | default_normalized | ** | 1 | 1 |
> +----------------------+------+---------------------------+---------------------------+
> | smtoff_normalized | ** | 0.83 | 0.91 |
> +----------------------+------+---------------------------+---------------------------+
>
> -- workload D:
> +----------------------+------+-------------------------------+
> | | ** | will-it-scale * 192 |
> | | | (pipe based context_switch) |
> +======================+======+===============================+
> | cgroup | ** | cg_will-it-scale |
> +----------------------+------+-------------------------------+
> | record_item | ** | threads_avg |
> +----------------------+------+-------------------------------+
> | coresched_normalized | ** | 0.29 |
> +----------------------+------+-------------------------------+
> | default_normalized | ** | 1.00 |
> +----------------------+------+-------------------------------+
> | smtoff_normalized | ** | 0.87 |
> +----------------------+------+-------------------------------+
>
> - notes on test results record_item:
> * coresched_normalized: smton, cs enabled, test result normalized by default value
> * default_normalized: smton, cs disabled, test result normalized by default value
> * smtoff_normalized: smtoff, test result normalized by default value
>
>
> Hongyu
>

Adding 2 more negative test cases:

- continuously toggle cpu.core_tag during workload runs with cs_on
- continuously toggle the SMT setting via /sys/devices/system/cpu/smt/control during workload runs with cs_on

No kernel panic or platform hang was observed.