2020-08-28 19:53:35

by Julien Desfossez

Subject: [RFC PATCH v7 00/23] Core scheduling v7

Seventh iteration of the Core-Scheduling feature.

(Note that this iteration is a repost of the combined v6 and v6+ series,
and does not include the new interface discussed at LPC).

Core scheduling is a feature that allows only trusted tasks to run
concurrently on CPUs sharing compute resources (e.g. hyperthreads on a
core). The goal is to mitigate core-level side-channel attacks without
requiring SMT to be disabled (which has a significant performance impact
in some situations). Core scheduling (as of v7) mitigates
user-space-to-user-space attacks, and user-to-kernel attacks when one of
the siblings enters the kernel via an interrupt or system call.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now, by putting them in the same tagged cgroup). When a
tag is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides the security benefit, this feature
can also help RT and performance-sensitive applications that want to
control how tasks make use of SMT dynamically.
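
As a rough illustration of the rule being enforced, the sketch below
shows what "trusted" means here: a task may share the core only if it
is the idle task or carries the same cookie as the currently running
tagged task. The helper name is hypothetical; 'core_cookie' stands for
the per-task cookie this series adds, and the real logic lives in the
core-wide pick path later in the series.

  /*
   * Simplified sketch (hypothetical helper, not the actual code):
   * a candidate task may run on a sibling only if it is the idle
   * task or matches the cookie of what the core is running.
   */
  static bool sketch_allowed_on_sibling(struct task_struct *p,
                                        unsigned long core_cookie)
  {
          return is_idle_task(p) || p->core_cookie == core_cookie;
  }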

This iteration focuses on protecting the kernel when it runs alongside
untrusted user-mode tasks on a core. This was a major gap in the
security provided by the core scheduling feature for mitigating
side-channel attacks. In v6, we had a minimal version where the kernel
was protected from untrusted tasks when a sibling entered the kernel via
irq/softirq; system calls were not protected and the implementation had
corner cases. With this iteration, user-mode tasks running on siblings
in a core are paused until the last sibling exits the kernel, which
guarantees that kernel code does not run concurrently with untrusted
user-mode tasks. This protects both syscalls and IRQs against user mode
and guests. It is also based on Thomas's x86/entry branch, which moves
the generic entry code out of arch/.

This iteration also fixes the hotplug crash and hang issues present in
v6. In v6, the previous hotplug fixes were removed because they were
getting complicated. This iteration addresses the root cause of the
hotplug issues and fixes them.

In terms of performance, active VM-based workloads are still less
impacted by core scheduling than by disabling SMT. There are corner
cases, and at this stage, since we are getting closer to merging, we are
actively looking for feedback and reproducers for those corner cases so
we can address them.

Performance tests:
sysbench was used to test the performance of the patch series. We used
an 8-CPU/4-core VM and ran 2 sysbench tests in parallel. Each sysbench
test runs 4 tasks:
sysbench --test=cpu --cpu-max-prime=100000 --num-threads=4 run

We compared the performance results for the combinations below.
The metric is 'events per second':

1. Coresched disabled
sysbench-1/sysbench-2 => 175.7/175.6
2. Coresched enabled, both sysbench tagged
sysbench-1/sysbench-2 => 168.8/165.6
3. Coresched enabled, sysbench-1 tagged and sysbench-2 untagged
sysbench-1/sysbench-2 => 96.4/176.9
4. smt off
sysbench-1/sysbench-2 => 97.9/98.8

When both sysbench instances are tagged, there is a performance drop of
~4%. In the tagged/untagged case, the tagged one suffers because it is
always stalled when the sibling enters the kernel. But this is no worse
than SMT off.

Also, a modified rcutorture was used to heavily stress the kernel to
make sure there are no crashes or instability.

Larger-scale performance tests were performed with the host running the
core scheduling kernel and each VM in its own tag.

With 3 12-vCPU VMs running linpack (CPU-intensive) on a
36-hardware-thread NUMA node (the vCPU-to-hardware-thread ratio goes
from 1:1 to 2:1 when SMT is disabled), the performance impact of core
scheduling is -9.52%, whereas it is -25.41% with nosmt.

With 2 12-vCPU database-server VMs and 192 1-vCPU semi-idle VMs running
on 2 NUMA nodes of 36 hardware threads each (3:1 overcommit), the
performance impact is -47% with core scheduling and -66% with nosmt.

More details on those tests are at the end of this presentation:
https://www.linuxplumbersconf.org/event/7/contributions/648/attachments/555/981/CoreScheduling_LPC2020.pdf

v7 is rebased on 5.9-rc2 (d3ac97072ec5):
https://github.com/digitalocean/linux-coresched/tree/coresched/v7-v5.9-rc

TODO
----
(Mostly same as in v6)
- MAJOR: Core-wide vruntime comparison re-work
  https://lwn.net/ml/linux-kernel/[email protected]/
  https://lwn.net/ml/linux-kernel/[email protected]/
- MAJOR: Decide on the interfaces/API for exposing the feature to userland.
  - LPC 2020 gained consensus on a simple task-level interface as a starter
  - Investigate ideas on overcoming unified-hierarchy limitations with a
    cgroup v2 controller
- MAJOR: Load balancing/migration fixes for core scheduling.
  With v6, load balancing is partially coresched aware, but has some
  issues w.r.t. process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/[email protected]/
- Investigate the source of the overhead even when no tasks are tagged:
  https://lkml.org/lkml/2019/10/29/242
- Core scheduling test framework: kselftests, torture tests, etc.

Changes in v7
-------------
- Kernel protection from untrusted usermode tasks
  - Joel, Vineeth
- Fix for hotplug crashes and hangs
  - Joel, Vineeth

Changes in v6
-------------
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
-------------
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups etc
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
-------------
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
-------------
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

--

Aaron Lu (2):
sched/fair: wrapper for cfs_rq->min_vruntime
sched/fair: core wide cfs task priority comparison

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (6):
irq_work: Add support to detect if work is pending
entry/idle: Add a common function for activites during idle entry/exit
arch/x86: Add a new TIF flag for untrusted tasks
kernel/entry: Add support for core-wide protection of kernel-mode
entry/idle: Enter and exit kernel protection during idle entry and
exit
Documentation: Add documentation on core scheduling

Peter Zijlstra (9):
sched: Wrap rq::lock access
sched: Introduce sched_class::pick_task()
sched: Core-wide rq->lock
sched/fair: Add a few assertions
sched: Basic tracking of matching tasks
sched: Add core wide task selection and scheduling.
sched: Trivial forced-newidle balancer
sched: cgroup tagging interface for core scheduling
sched: Debug bits...

Vineeth Pillai (5):
bitops: Introduce find_next_or_bit
cpumask: Introduce a new iterator for_each_cpu_wrap_or
sched/fair: Fix forced idle sibling starvation corner case
entry/kvm: Protect the kernel when entering from guest
sched/coresched: config option for kernel protection

.../admin-guide/hw-vuln/core-scheduling.rst | 253 ++++
Documentation/admin-guide/hw-vuln/index.rst | 1 +
.../admin-guide/kernel-parameters.txt | 9 +
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/kvm/x86.c | 3 +
include/asm-generic/bitops/find.h | 16 +
include/linux/cpumask.h | 42 +
include/linux/entry-common.h | 22 +
include/linux/entry-kvm.h | 12 +
include/linux/irq_work.h | 1 +
include/linux/sched.h | 21 +-
kernel/Kconfig.preempt | 19 +
kernel/entry/common.c | 90 +-
kernel/entry/kvm.c | 12 +
kernel/irq_work.c | 11 +
kernel/sched/core.c | 1310 +++++++++++++++--
kernel/sched/cpuacct.c | 12 +-
kernel/sched/deadline.c | 34 +-
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 400 +++--
kernel/sched/idle.c | 30 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 22 +-
kernel/sched/sched.h | 253 +++-
kernel/sched/stop_task.c | 13 +-
kernel/sched/topology.c | 4 +-
lib/cpumask.c | 53 +
lib/find_bit.c | 58 +-
28 files changed, 2381 insertions(+), 328 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

--
2.17.1


2020-08-28 19:53:44

by Julien Desfossez

Subject: [RFC PATCH v7 01/23] sched: Wrap rq::lock access

From: Peter Zijlstra <[email protected]>

In preparation for playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/core.c | 46 +++++++++---------
kernel/sched/cpuacct.c | 12 ++---
kernel/sched/deadline.c | 18 +++----
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 38 +++++++--------
kernel/sched/idle.c | 4 +-
kernel/sched/pelt.h | 2 +-
kernel/sched/rt.c | 8 +--
kernel/sched/sched.h | 105 +++++++++++++++++++++-------------------
kernel/sched/topology.c | 4 +-
10 files changed, 122 insertions(+), 119 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8471a0f7eb32..b85d5e56d5fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -185,12 +185,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)

for (;;) {
rq = task_rq(p);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
rq_pin_lock(rq, rf);
return rq;
}
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));

while (unlikely(task_on_rq_migrating(p)))
cpu_relax();
@@ -209,7 +209,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
for (;;) {
raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
rq = task_rq(p);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
/*
* move_queued_task() task_rq_lock()
*
@@ -231,7 +231,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
rq_pin_lock(rq, rf);
return rq;
}
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);

while (unlikely(task_on_rq_migrating(p)))
@@ -301,7 +301,7 @@ void update_rq_clock(struct rq *rq)
{
s64 delta;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
@@ -610,7 +610,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (test_tsk_need_resched(curr))
return;
@@ -634,10 +634,10 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
resched_curr(rq);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}

#ifdef CONFIG_SMP
@@ -1141,7 +1141,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
struct uclamp_bucket *bucket;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

/* Update task effective clamp */
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -1181,7 +1181,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
unsigned int bkt_clamp;
unsigned int rq_clamp;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

/*
* If sched_uclamp_used was enabled after task @p was enqueued,
@@ -1737,7 +1737,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
struct task_struct *p, int new_cpu)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

deactivate_task(rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, new_cpu);
@@ -1849,7 +1849,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
* Because __kthread_bind() calls this on blocked tasks without
* holding rq->lock.
*/
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
}
if (running)
@@ -1986,7 +1986,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
* task_rq_lock().
*/
WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
- lockdep_is_held(&task_rq(p)->lock)));
+ lockdep_is_held(rq_lockp(task_rq(p)))));
#endif
/*
* Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -2497,7 +2497,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
{
int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (p->sched_contributes_to_load)
rq->nr_uninterruptible--;
@@ -3499,10 +3499,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
* do an early lockdep release here:
*/
rq_unpin_lock(rq, rf);
- spin_release(&rq->lock.dep_map, _THIS_IP_);
+ spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
#ifdef CONFIG_DEBUG_SPINLOCK
/* this is a valid case when another task releases the spinlock */
- rq->lock.owner = next;
+ rq_lockp(rq)->owner = next;
#endif
}

@@ -3513,8 +3513,8 @@ static inline void finish_lock_switch(struct rq *rq)
* fix up the runqueue lock - which gets 'carried over' from
* prev into current:
*/
- spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
- raw_spin_unlock_irq(&rq->lock);
+ spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+ raw_spin_unlock_irq(rq_lockp(rq));
}

/*
@@ -3664,7 +3664,7 @@ static void __balance_callback(struct rq *rq)
void (*func)(struct rq *rq);
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
head = rq->balance_callback;
rq->balance_callback = NULL;
while (head) {
@@ -3675,7 +3675,7 @@ static void __balance_callback(struct rq *rq)

func(rq);
}
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}

static inline void balance_callback(struct rq *rq)
@@ -6522,7 +6522,7 @@ void init_idle(struct task_struct *idle, int cpu)
__sched_fork(0, idle);

raw_spin_lock_irqsave(&idle->pi_lock, flags);
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));

idle->state = TASK_RUNNING;
idle->se.exec_start = sched_clock();
@@ -6560,7 +6560,7 @@ void init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP
idle->on_cpu = 1;
#endif
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&idle->pi_lock, flags);

/* Set the preempt count _outside_ the spinlocks! */
@@ -7132,7 +7132,7 @@ void __init sched_init(void)
struct rq *rq;

rq = cpu_rq(i);
- raw_spin_lock_init(&rq->lock);
+ raw_spin_lock_init(&rq->__lock);
rq->nr_running = 0;
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 941c28cf9738..38c1a68e91f0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -112,7 +112,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
/*
* Take rq->lock to make 64-bit read safe on 32-bit platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

if (index == CPUACCT_STAT_NSTATS) {
@@ -126,7 +126,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
}

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif

return data;
@@ -141,14 +141,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
/*
* Take rq->lock to make 64-bit write safe on 32-bit platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
cpuusage->usages[i] = val;

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif
}

@@ -253,13 +253,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
* Take rq->lock to make 64-bit read safe on 32-bit
* platforms.
*/
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
#endif

seq_printf(m, " %llu", cpuusage->usages[index]);

#ifndef CONFIG_64BIT
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+ raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
#endif
}
seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3862a28cd05d..c588a06057dc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -119,7 +119,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -132,7 +132,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->running_bw -= dl_bw;
SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
if (dl_rq->running_bw > old)
@@ -146,7 +146,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->this_bw += dl_bw;
SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
}
@@ -156,7 +156,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;

- lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
dl_rq->this_bw -= dl_bw;
SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
if (dl_rq->this_bw > old)
@@ -966,7 +966,7 @@ static int start_dl_timer(struct task_struct *p)
ktime_t now, act;
s64 delta;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

/*
* We want the timer to fire at the deadline, but considering
@@ -1076,9 +1076,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
* If the runqueue is no longer available, migrate the
* task elsewhere. This necessarily changes rq.
*/
- lockdep_unpin_lock(&rq->lock, rf.cookie);
+ lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
rq = dl_task_offline_migration(rq, p);
- rf.cookie = lockdep_pin_lock(&rq->lock);
+ rf.cookie = lockdep_pin_lock(rq_lockp(rq));
update_rq_clock(rq);

/*
@@ -1703,7 +1703,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
* from try_to_wake_up(). Hence, p->pi_lock is locked, but
* rq->lock is not... So, lock it
*/
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
if (p->dl.dl_non_contending) {
sub_running_bw(&p->dl, &rq->dl);
p->dl.dl_non_contending = 0;
@@ -1718,7 +1718,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
put_task_struct(p);
}
sub_rq_bw(&p->dl, &rq->dl);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 36c54265bb2b..6e4b801fb80e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -497,7 +497,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "exec_clock",
SPLIT_NS(cfs_rq->exec_clock));

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (rb_first_cached(&cfs_rq->tasks_timeline))
MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
last = __pick_last_entity(cfs_rq);
@@ -505,7 +505,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
max_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "MIN_vruntime",
SPLIT_NS(MIN_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a68a0536add..9be6cf92cc3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1101,7 +1101,7 @@ struct numa_group {
static struct numa_group *deref_task_numa_group(struct task_struct *p)
{
return rcu_dereference_check(p->numa_group, p == current ||
- (lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+ (lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
}

static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5287,7 +5287,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
{
struct task_group *tg;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5306,7 +5306,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
{
struct task_group *tg;

- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6766,7 +6766,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
* rq->lock and can modify state directly.
*/
- lockdep_assert_held(&task_rq(p)->lock);
+ lockdep_assert_held(rq_lockp(task_rq(p)));
detach_entity_cfs_rq(&p->se);

} else {
@@ -7394,7 +7394,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
{
s64 delta;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

if (p->sched_class != &fair_sched_class)
return 0;
@@ -7488,7 +7488,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
int tsk_cache_hot;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

/*
* We do not migrate tasks that are:
@@ -7566,7 +7566,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
*/
static void detach_task(struct task_struct *p, struct lb_env *env)
{
- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, env->dst_cpu);
@@ -7582,7 +7582,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
{
struct task_struct *p;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

list_for_each_entry_reverse(p,
&env->src_rq->cfs_tasks, se.group_node) {
@@ -7618,7 +7618,7 @@ static int detach_tasks(struct lb_env *env)
struct task_struct *p;
int detached = 0;

- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(rq_lockp(env->src_rq));

if (env->imbalance <= 0)
return 0;
@@ -7740,7 +7740,7 @@ static int detach_tasks(struct lb_env *env)
*/
static void attach_task(struct rq *rq, struct task_struct *p)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

BUG_ON(task_rq(p) != rq);
activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9672,7 +9672,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (need_active_balance(&env)) {
unsigned long flags;

- raw_spin_lock_irqsave(&busiest->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(busiest), flags);

/*
* Don't kick the active_load_balance_cpu_stop,
@@ -9680,7 +9680,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* moved to this_cpu:
*/
if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
- raw_spin_unlock_irqrestore(&busiest->lock,
+ raw_spin_unlock_irqrestore(rq_lockp(busiest),
flags);
env.flags |= LBF_ALL_PINNED;
goto out_one_pinned;
@@ -9696,7 +9696,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
busiest->push_cpu = this_cpu;
active_balance = 1;
}
- raw_spin_unlock_irqrestore(&busiest->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);

if (active_balance) {
stop_one_cpu_nowait(cpu_of(busiest),
@@ -10439,7 +10439,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
time_before(jiffies, READ_ONCE(nohz.next_blocked)))
return;

- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));
/*
* This CPU is going to be idle and blocked load of idle CPUs
* need to be updated. Run the ilb locally as it is a good
@@ -10448,7 +10448,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
*/
if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
kick_ilb(NOHZ_STATS_KICK);
- raw_spin_lock(&this_rq->lock);
+ raw_spin_lock(rq_lockp(this_rq));
}

#else /* !CONFIG_NO_HZ_COMMON */
@@ -10514,7 +10514,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
goto out;
}

- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));

update_blocked_averages(this_cpu);
rcu_read_lock();
@@ -10552,7 +10552,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}
rcu_read_unlock();

- raw_spin_lock(&this_rq->lock);
+ raw_spin_lock(rq_lockp(this_rq));

if (curr_cost > this_rq->max_idle_balance_cost)
this_rq->max_idle_balance_cost = curr_cost;
@@ -11028,9 +11028,9 @@ void unregister_fair_sched_group(struct task_group *tg)

rq = cpu_rq(cpu);

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);
list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
}
}

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6bf34986f45c..5c33beb1a525 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -424,10 +424,10 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
static void
dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
{
- raw_spin_unlock_irq(&rq->lock);
+ raw_spin_unlock_irq(rq_lockp(rq));
printk(KERN_ERR "bad: scheduling from the idle thread!\n");
dump_stack();
- raw_spin_lock_irq(&rq->lock);
+ raw_spin_lock_irq(rq_lockp(rq));
}

/*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 795e43e02afc..e850bd71a8ce 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -141,7 +141,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)

static inline u64 rq_clock_pelt(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f215eea6a966..e57fca05b660 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -887,7 +887,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
if (skip)
continue;

- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
update_rq_clock(rq);

if (rt_rq->rt_time) {
@@ -925,7 +925,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)

if (enqueue)
sched_rt_rq_enqueue(rt_rq);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -2094,9 +2094,9 @@ void rto_push_irq_work_func(struct irq_work *work)
* When it gets updated, a check is made if a push is possible.
*/
if (has_pushable_tasks(rq)) {
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
push_rt_tasks(rq);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

raw_spin_lock(&rd->rto_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6b0975..9145be83d80b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -894,7 +894,7 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
*/
struct rq {
/* runqueue lock: */
- raw_spinlock_t lock;
+ raw_spinlock_t __lock;

/*
* nr_running and cpu_load should be in the same cacheline because
@@ -1075,6 +1075,10 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+ return &rq->__lock;
+}

#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);
@@ -1142,7 +1146,7 @@ static inline void assert_clock_updated(struct rq *rq)

static inline u64 rq_clock(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock;
@@ -1150,7 +1154,7 @@ static inline u64 rq_clock(struct rq *rq)

static inline u64 rq_clock_task(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
assert_clock_updated(rq);

return rq->clock_task;
@@ -1176,7 +1180,7 @@ static inline u64 rq_clock_thermal(struct rq *rq)

static inline void rq_clock_skip_update(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
rq->clock_update_flags |= RQCF_REQ_SKIP;
}

@@ -1186,7 +1190,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
*/
static inline void rq_clock_cancel_skipupdate(struct rq *rq)
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));
rq->clock_update_flags &= ~RQCF_REQ_SKIP;
}

@@ -1215,7 +1219,7 @@ struct rq_flags {
*/
static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
- rf->cookie = lockdep_pin_lock(&rq->lock);
+ rf->cookie = lockdep_pin_lock(rq_lockp(rq));

#ifdef CONFIG_SCHED_DEBUG
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1230,12 +1234,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
rf->clock_update_flags = RQCF_UPDATED;
#endif

- lockdep_unpin_lock(&rq->lock, rf->cookie);
+ lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
}

static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
{
- lockdep_repin_lock(&rq->lock, rf->cookie);
+ lockdep_repin_lock(rq_lockp(rq), rf->cookie);

#ifdef CONFIG_SCHED_DEBUG
/*
@@ -1256,7 +1260,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static inline void
@@ -1265,7 +1269,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
__releases(p->pi_lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}

@@ -1273,7 +1277,7 @@ static inline void
rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock_irqsave(&rq->lock, rf->flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
rq_pin_lock(rq, rf);
}

@@ -1281,7 +1285,7 @@ static inline void
rq_lock_irq(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock_irq(&rq->lock);
+ raw_spin_lock_irq(rq_lockp(rq));
rq_pin_lock(rq, rf);
}

@@ -1289,7 +1293,7 @@ static inline void
rq_lock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
rq_pin_lock(rq, rf);
}

@@ -1297,7 +1301,7 @@ static inline void
rq_relock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
- raw_spin_lock(&rq->lock);
+ raw_spin_lock(rq_lockp(rq));
rq_repin_lock(rq, rf);
}

@@ -1306,7 +1310,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
}

static inline void
@@ -1314,7 +1318,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock_irq(&rq->lock);
+ raw_spin_unlock_irq(rq_lockp(rq));
}

static inline void
@@ -1322,7 +1326,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock(rq_lockp(rq));
}

static inline struct rq *
@@ -1387,7 +1391,7 @@ queue_balance_callback(struct rq *rq,
struct callback_head *head,
void (*func)(struct rq *rq))
{
- lockdep_assert_held(&rq->lock);
+ lockdep_assert_held(rq_lockp(rq));

if (unlikely(head->next))
return;
@@ -2084,7 +2088,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
- raw_spin_unlock(&this_rq->lock);
+ raw_spin_unlock(rq_lockp(this_rq));
double_rq_lock(this_rq, busiest);

return 1;
@@ -2103,20 +2107,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
- int ret = 0;
-
- if (unlikely(!raw_spin_trylock(&busiest->lock))) {
- if (busiest < this_rq) {
- raw_spin_unlock(&this_rq->lock);
- raw_spin_lock(&busiest->lock);
- raw_spin_lock_nested(&this_rq->lock,
- SINGLE_DEPTH_NESTING);
- ret = 1;
- } else
- raw_spin_lock_nested(&busiest->lock,
- SINGLE_DEPTH_NESTING);
+ if (rq_lockp(this_rq) == rq_lockp(busiest))
+ return 0;
+
+ if (likely(raw_spin_trylock(rq_lockp(busiest))))
+ return 0;
+
+ if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+ raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+ return 0;
}
- return ret;
+
+ raw_spin_unlock(rq_lockp(this_rq));
+ raw_spin_lock(rq_lockp(busiest));
+ raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+ return 1;
}

#endif /* CONFIG_PREEMPTION */
@@ -2126,11 +2132,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
*/
static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
- if (unlikely(!irqs_disabled())) {
- /* printk() doesn't work well under rq->lock */
- raw_spin_unlock(&this_rq->lock);
- BUG_ON(1);
- }
+ lockdep_assert_irqs_disabled();

return _double_lock_balance(this_rq, busiest);
}
@@ -2138,8 +2140,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
__releases(busiest->lock)
{
- raw_spin_unlock(&busiest->lock);
- lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+ if (rq_lockp(this_rq) != rq_lockp(busiest))
+ raw_spin_unlock(rq_lockp(busiest));
+ lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
}

static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2180,16 +2183,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
__acquires(rq2->lock)
{
BUG_ON(!irqs_disabled());
- if (rq1 == rq2) {
- raw_spin_lock(&rq1->lock);
+ if (rq_lockp(rq1) == rq_lockp(rq2)) {
+ raw_spin_lock(rq_lockp(rq1));
__acquire(rq2->lock); /* Fake it out ;) */
} else {
- if (rq1 < rq2) {
- raw_spin_lock(&rq1->lock);
- raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+ if (rq_lockp(rq1) < rq_lockp(rq2)) {
+ raw_spin_lock(rq_lockp(rq1));
+ raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
} else {
- raw_spin_lock(&rq2->lock);
- raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+ raw_spin_lock(rq_lockp(rq2));
+ raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
}
}
}
@@ -2204,9 +2207,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
__releases(rq1->lock)
__releases(rq2->lock)
{
- raw_spin_unlock(&rq1->lock);
- if (rq1 != rq2)
- raw_spin_unlock(&rq2->lock);
+ raw_spin_unlock(rq_lockp(rq1));
+ if (rq_lockp(rq1) != rq_lockp(rq2))
+ raw_spin_unlock(rq_lockp(rq2));
else
__release(rq2->lock);
}
@@ -2229,7 +2232,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
BUG_ON(!irqs_disabled());
BUG_ON(rq1 != rq2);
- raw_spin_lock(&rq1->lock);
+ raw_spin_lock(rq_lockp(rq1));
__acquire(rq2->lock); /* Fake it out ;) */
}

@@ -2244,7 +2247,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
__releases(rq2->lock)
{
BUG_ON(rq1 != rq2);
- raw_spin_unlock(&rq1->lock);
+ raw_spin_unlock(rq_lockp(rq1));
__release(rq2->lock);
}

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 007b0a6b0152..d6d727420952 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -440,7 +440,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
struct root_domain *old_rd = NULL;
unsigned long flags;

- raw_spin_lock_irqsave(&rq->lock, flags);
+ raw_spin_lock_irqsave(rq_lockp(rq), flags);

if (rq->rd) {
old_rd = rq->rd;
@@ -466,7 +466,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);

- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ raw_spin_unlock_irqrestore(rq_lockp(rq), flags);

if (old_rd)
call_rcu(&old_rd->rcu, free_rootdomain);
--
2.17.1

2020-08-28 19:54:01

by Julien Desfossez

Subject: [RFC PATCH v7 04/23] sched/fair: Add a few assertions

From: Peter Zijlstra <[email protected]>

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a1bf726264a..af8c40191a19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6223,6 +6223,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
}

symmetric:
+ /*
+ * per-cpu select_idle_mask usage
+ */
+ lockdep_assert_irqs_disabled();
+
if (available_idle_cpu(target) || sched_idle_cpu(target))
return target;

@@ -6664,8 +6669,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
* certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
*
* Returns the target CPU number.
- *
- * preempt must be disabled.
*/
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
@@ -6676,6 +6679,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);

+ /*
+ * required for stable ->cpus_allowed
+ */
+ lockdep_assert_held(&p->pi_lock);
+
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);

--
2.17.1

2020-08-28 19:54:12

by Julien Desfossez

Subject: [RFC PATCH v7 06/23] bitops: Introduce find_next_or_bit

From: Vineeth Pillai <[email protected]>

Hotplug fixes to core-scheduling require a new bitops API.

Introduce a new API, find_next_or_bit(), which returns the bit number
of the next set bit in the OR (union) of the given bitmasks.
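
For illustration, a minimal usage sketch based on the semantics above
(the bitmaps, sizes and bit positions are arbitrary examples):

  /* Illustrative only: walk the set bits in the union of two bitmaps. */
  DECLARE_BITMAP(a, 64);
  DECLARE_BITMAP(b, 64);
  unsigned long bit;

  bitmap_zero(a, 64);
  bitmap_zero(b, 64);
  set_bit(3, a);
  set_bit(40, b);

  /* Visits bit 3, then bit 40: every bit set in (a | b). */
  for (bit = find_next_or_bit(a, b, 64, 0);
       bit < 64;
       bit = find_next_or_bit(a, b, 64, bit + 1))
          pr_info("set bit: %lu\n", bit);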

Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/asm-generic/bitops/find.h | 16 +++++++++
lib/find_bit.c | 58 +++++++++++++++++++++++++------
2 files changed, 63 insertions(+), 11 deletions(-)

diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 9fdf21302fdf..0b476ca0d665 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -32,6 +32,22 @@ extern unsigned long find_next_and_bit(const unsigned long *addr1,
unsigned long offset);
#endif

+#ifndef find_next_or_bit
+/**
+ * find_next_or_bit - find the next set bit in any memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+extern unsigned long find_next_or_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long size,
+ unsigned long offset);
+#endif
+
#ifndef find_next_zero_bit
/**
* find_next_zero_bit - find the next cleared bit in a memory region
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 49f875f1baf7..e36257bb0701 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -19,7 +19,16 @@

#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) || \
- !defined(find_next_and_bit)
+ !defined(find_next_and_bit) || !defined(find_next_or_bit)
+
+/*
+ * find_next_bit bitwise operation types
+ */
+enum fnb_bwops_type {
+ FNB_AND = 0,
+ FNB_OR = 1
+};
+
/*
* This is a common helper function for find_next_bit, find_next_zero_bit, and
* find_next_and_bit. The differences are:
@@ -29,7 +38,8 @@
*/
static unsigned long _find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
+ unsigned long start, unsigned long invert,
+ enum fnb_bwops_type type, unsigned long le)
{
unsigned long tmp, mask;

@@ -37,8 +47,16 @@ static unsigned long _find_next_bit(const unsigned long *addr1,
return nbits;

tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
+ if (addr2) {
+ switch (type) {
+ case FNB_AND:
+ tmp &= addr2[start / BITS_PER_LONG];
+ break;
+ case FNB_OR:
+ tmp |= addr2[start / BITS_PER_LONG];
+ break;
+ }
+ }
tmp ^= invert;

/* Handle 1st word. */
@@ -56,8 +74,16 @@ static unsigned long _find_next_bit(const unsigned long *addr1,
return nbits;

tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
+ if (addr2) {
+ switch (type) {
+ case FNB_AND:
+ tmp &= addr2[start / BITS_PER_LONG];
+ break;
+ case FNB_OR:
+ tmp |= addr2[start / BITS_PER_LONG];
+ break;
+ }
+ }
tmp ^= invert;
}

@@ -75,7 +101,7 @@ static unsigned long _find_next_bit(const unsigned long *addr1,
unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, FNB_AND, 0);
}
EXPORT_SYMBOL(find_next_bit);
#endif
@@ -84,7 +110,7 @@ EXPORT_SYMBOL(find_next_bit);
unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, FNB_AND, 0);
}
EXPORT_SYMBOL(find_next_zero_bit);
#endif
@@ -94,11 +120,21 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size,
unsigned long offset)
{
- return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+ return _find_next_bit(addr1, addr2, size, offset, 0UL, FNB_AND, 0);
}
EXPORT_SYMBOL(find_next_and_bit);
#endif

+#if !defined(find_next_or_bit)
+unsigned long find_next_or_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long size,
+ unsigned long offset)
+{
+ return _find_next_bit(addr1, addr2, size, offset, 0UL, FNB_OR, 0);
+}
+EXPORT_SYMBOL(find_next_or_bit);
+#endif
+
#ifndef find_first_bit
/*
* Find the first set bit in a memory region.
@@ -161,7 +197,7 @@ EXPORT_SYMBOL(find_last_bit);
unsigned long find_next_zero_bit_le(const void *addr, unsigned
long size, unsigned long offset)
{
- return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, FNB_AND, 1);
}
EXPORT_SYMBOL(find_next_zero_bit_le);
#endif
@@ -170,7 +206,7 @@ EXPORT_SYMBOL(find_next_zero_bit_le);
unsigned long find_next_bit_le(const void *addr, unsigned
long size, unsigned long offset)
{
- return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, FNB_AND, 1);
}
EXPORT_SYMBOL(find_next_bit_le);
#endif
--
2.17.1

2020-08-28 19:54:35

by Julien Desfossez

Subject: [RFC PATCH v7 03/23] sched: Core-wide rq->lock

From: Peter Zijlstra <[email protected]>

Introduce the basic infrastructure to have a core-wide rq->lock.
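
A hedged sketch of how a later consumer (for example, the cgroup
tagging interface added further in this series) would be expected to
use the refcounted enable/disable introduced here; the helper name
below is hypothetical:

  /*
   * Hypothetical caller: keep core scheduling enabled while any
   * tagged group exists.
   */
  static void sketch_set_tag(bool enable)
  {
          if (enable)
                  sched_core_get();   /* first user flips the static key on */
          else
                  sched_core_put();   /* last user flips it back off */
  }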

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
kernel/Kconfig.preempt | 6 +++
kernel/sched/core.c | 95 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 31 ++++++++++++++
3 files changed, 132 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..4488fbf4d3a8 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,9 @@ config PREEMPT_COUNT
config PREEMPTION
bool
select PREEMPT_COUNT
+
+config SCHED_CORE
+ bool "Core Scheduling for SMT"
+ default y
+ depends on SCHED_SMT
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b85d5e56d5fe..e2642c5dbd61 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -73,6 +73,70 @@ unsigned int sysctl_sched_rt_period = 1000000;

__read_mostly int scheduler_running;

+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+ bool enabled = !!(unsigned long)data;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ cpu_rq(cpu)->core_enabled = enabled;
+
+ return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+ // XXX verify there are no cookie tasks (yet)
+
+ static_branch_enable(&__sched_core_enabled);
+ stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+ // XXX verify there are no cookie tasks (left)
+
+ stop_machine(__sched_core_stopper, (void *)false, NULL);
+ static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+ mutex_lock(&sched_core_mutex);
+ if (!sched_core_count++)
+ __sched_core_enable();
+ mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+ mutex_lock(&sched_core_mutex);
+ if (!--sched_core_count)
+ __sched_core_disable();
+ mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
/*
* part of the period that we allow rt tasks to run in us.
* default: 0.95s
@@ -6964,6 +7028,32 @@ static void sched_rq_cpu_starting(unsigned int cpu)

int sched_cpu_starting(unsigned int cpu)
{
+#ifdef CONFIG_SCHED_CORE
+ const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+ struct rq *rq, *core_rq = NULL;
+ int i;
+
+ core_rq = cpu_rq(cpu)->core;
+
+ if (!core_rq) {
+ for_each_cpu(i, smt_mask) {
+ rq = cpu_rq(i);
+ if (rq->core && rq->core == rq)
+ core_rq = rq;
+ }
+
+ if (!core_rq)
+ core_rq = cpu_rq(cpu);
+
+ for_each_cpu(i, smt_mask) {
+ rq = cpu_rq(i);
+
+ WARN_ON_ONCE(rq->core && rq->core != core_rq);
+ rq->core = core_rq;
+ }
+ }
+#endif /* CONFIG_SCHED_CORE */
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -7194,6 +7284,11 @@ void __init sched_init(void)
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+ rq->core = NULL;
+ rq->core_enabled = 0;
+#endif
}

set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 293d031480d8..6ab8adff169b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1048,6 +1048,12 @@ struct rq {
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
#endif
+
+#ifdef CONFIG_SCHED_CORE
+ /* per rq */
+ struct rq *core;
+ unsigned int core_enabled;
+#endif
};

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1075,11 +1081,36 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+ return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+ if (sched_core_enabled(rq))
+ return &rq->core->__lock;
+
+ return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+ return false;
+}
+
static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
return &rq->__lock;
}

+#endif /* CONFIG_SCHED_CORE */
+
#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);

--
2.17.1

2020-08-28 19:54:50

by Julien Desfossez

Subject: [RFC PATCH v7 07/23] cpumask: Introduce a new iterator for_each_cpu_wrap_or

From: Vineeth Pillai <[email protected]>

Hotplug fixes to core-scheduling require a new cpumask iterator
which iterates through all the CPUs set in the union of the given
cpumasks. This patch introduces it.
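
A minimal usage sketch of the new iterator ('mask1' and 'mask2' are
placeholder cpumasks set up elsewhere):

  int cpu;

  for_each_cpu_wrap_or(cpu, mask1, mask2, smp_processor_id()) {
          /*
           * 'cpu' visits every bit set in (mask1 | mask2), starting at
           * smp_processor_id() and wrapping around to the beginning.
           */
  }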

Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/cpumask.h | 42 ++++++++++++++++++++++++++++++++
lib/cpumask.c | 53 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 95 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index f0d895d6ac39..27e7e019237b 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -207,6 +207,10 @@ static inline int cpumask_any_and_distribute(const struct cpumask *src1p,
for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)(start))
#define for_each_cpu_and(cpu, mask1, mask2) \
for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
+#define for_each_cpu_or(cpu, mask1, mask2) \
+ for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
+#define for_each_cpu_wrap_or(cpu, mask1, mask2, start) \
+ for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2, (void)(start))
#else
/**
* cpumask_first - get the first cpu in a cpumask
@@ -248,6 +252,7 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
}

int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
+int cpumask_next_or(int n, const struct cpumask *, const struct cpumask *);
int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int cpumask_local_spread(unsigned int i, int node);
int cpumask_any_and_distribute(const struct cpumask *src1p,
@@ -278,6 +283,8 @@ int cpumask_any_and_distribute(const struct cpumask *src1p,
(cpu) < nr_cpu_ids;)

extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap);
+extern int cpumask_next_wrap_or(int n, const struct cpumask *mask1,
+ const struct cpumask *mask2, int start, bool wrap);

/**
* for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location
@@ -294,6 +301,22 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
(cpu) < nr_cpumask_bits; \
(cpu) = cpumask_next_wrap((cpu), (mask), (start), true))

+/**
+ * for_each_cpu_wrap_or - iterate over every cpu in both masks, starting at a specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in either mask is set (including @start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap_or(cpu, mask1, mask2, start) \
+ for ((cpu) = cpumask_next_wrap_or((start)-1, (mask1), (mask2), (start), false); \
+ (cpu) < nr_cpumask_bits; \
+ (cpu) = cpumask_next_wrap_or((cpu), (mask1), (mask2), (start), true))
+
/**
* for_each_cpu_and - iterate over every cpu in both masks
* @cpu: the (optionally unsigned) integer iterator
@@ -312,6 +335,25 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
for ((cpu) = -1; \
(cpu) = cpumask_next_and((cpu), (mask1), (mask2)), \
(cpu) < nr_cpu_ids;)
+
+/**
+ * for_each_cpu_or - iterate over every cpu in (mask1 | mask2)
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * This saves a temporary CPU mask in many places. It is equivalent to:
+ * struct cpumask tmp;
+ * cpumask_or(&tmp, &mask1, &mask2);
+ * for_each_cpu(cpu, &tmp)
+ * ...
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_or(cpu, mask1, mask2) \
+ for ((cpu) = -1; \
+ (cpu) = cpumask_next_or((cpu), (mask1), (mask2)), \
+ (cpu) < nr_cpu_ids;)
#endif /* SMP */

#define CPU_BITS_NONE \
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 85da6ab4fbb5..351c56de4632 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -43,6 +43,25 @@ int cpumask_next_and(int n, const struct cpumask *src1p,
}
EXPORT_SYMBOL(cpumask_next_and);

+/**
+ * cpumask_next_or - get the next cpu in *src1p | *src2p
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @src1p: the first cpumask pointer
+ * @src2p: the second cpumask pointer
+ *
+ * Returns >= nr_cpu_ids if no further cpus set in both.
+ */
+int cpumask_next_or(int n, const struct cpumask *src1p,
+ const struct cpumask *src2p)
+{
+ /* -1 is a legal arg here. */
+ if (n != -1)
+ cpumask_check(n);
+ return find_next_or_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+ nr_cpumask_bits, n + 1);
+}
+EXPORT_SYMBOL(cpumask_next_or);
+
/**
* cpumask_any_but - return a "random" in a cpumask, but not this one.
* @mask: the cpumask to search
@@ -95,6 +114,40 @@ int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
}
EXPORT_SYMBOL(cpumask_next_wrap);

+/**
+ * cpumask_next_wrap_or - helper to implement for_each_cpu_wrap_or
+ * @n: the cpu prior to the place to search
+ * @mask1: first cpumask pointer
+ * @mask2: second cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in @mask.
+ */
+int cpumask_next_wrap_or(int n, const struct cpumask *mask1, const struct cpumask *mask2,
+ int start, bool wrap)
+{
+ int next;
+
+again:
+ next = cpumask_next_or(n, mask1, mask2);
+
+ if (wrap && n < start && next >= start) {
+ return nr_cpumask_bits;
+
+ } else if (next >= nr_cpumask_bits) {
+ wrap = true;
+ n = -1;
+ goto again;
+ }
+
+ return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap_or);
+
/* These are not inline because of header tangles. */
#ifdef CONFIG_CPUMASK_OFFSTACK
/**
--
2.17.1

2020-08-28 19:56:00

by Julien Desfossez

Subject: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

From: "Joel Fernandes (Google)" <[email protected]>

Core-scheduling prevents hyperthreads in usermode from attacking each
other, but it does not do anything about one of the hyperthreads
entering the kernel for any reason. This leaves the door open for MDS
and L1TF attacks with concurrent execution sequences between
hyperthreads.

This patch therefore adds support for protecting all syscall and IRQ
kernel-mode entries. Care is taken to track the outermost usermode exit
and entry using per-cpu counters. In cases where one of the hyperthreads
enters the kernel, no additional IPIs are sent. Further, IPIs are
avoided when not needed; for example, idle and non-cookie HTs do not
need to be forced into kernel mode.

More information about attacks:
For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
data to either host or guest attackers. For L1TF, it is possible to leak
to guest attackers. There is no possible mitigation involving flushing
of buffers to avoid this, since the attacker and the victims execute
concurrently on 2 or more HTs.
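
The sketch below illustrates the outermost-entry/exit tracking described
above. It is deliberately simplified and uses hypothetical names; it is
not the actual helpers added by this patch.

  static DEFINE_PER_CPU(unsigned int, sketch_kernel_nesting);

  void sketch_unsafe_enter(void)
  {
          /* Only the outermost user->kernel transition matters. */
          if (this_cpu_inc_return(sketch_kernel_nesting) == 1) {
                  /*
                   * Outermost entry: siblings currently running untrusted
                   * (tagged) user tasks are IPI'd so that they wait in the
                   * kernel until this CPU returns to user mode.  Idle
                   * siblings and siblings already in the kernel need no IPI.
                   */
          }
  }

  void sketch_unsafe_exit(void)
  {
          /* Outermost exit: waiting siblings may return to user mode. */
          this_cpu_dec(sketch_kernel_nesting);
  }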

Cc: Julien Desfossez <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Aubrey Li <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Co-developed-by: Vineeth Pillai <[email protected]>
Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/sched.h | 12 +++
kernel/entry/common.c | 90 +++++++++++-------
kernel/sched/core.c | 212 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 +
4 files changed, 285 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eee53fa7b8d4..1e04ffe689cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2055,4 +2055,16 @@ int sched_trace_rq_nr_running(struct rq *rq);

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+void sched_core_unsafe_exit_wait(unsigned long ti_check);
+void sched_core_wait_till_safe(unsigned long ti_check);
+#else
+#define sched_core_unsafe_enter(ignore) do { } while (0)
+#define sched_core_unsafe_exit(ignore) do { } while (0)
+#define sched_core_wait_till_safe(ignore) do { } while (0)
+#define sched_core_unsafe_exit_wait(ignore) do { } while (0)
+#endif
+
#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index fcae019158ca..86ab8b532399 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -28,6 +28,7 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)

instrumentation_begin();
trace_hardirqs_off_finish();
+ sched_core_unsafe_enter();
instrumentation_end();
}

@@ -112,59 +113,84 @@ static __always_inline void exit_to_user_mode(void)
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal(struct pt_regs *regs) { }

-static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work)
+static inline bool exit_to_user_mode_work_pending(unsigned long ti_work)
{
- /*
- * Before returning to user space ensure that all pending work
- * items have been completed.
- */
- while (ti_work & EXIT_TO_USER_MODE_WORK) {
+ return (ti_work & EXIT_TO_USER_MODE_WORK);
+}

- local_irq_enable_exit_to_user(ti_work);
+static inline void exit_to_user_mode_work(struct pt_regs *regs,
+ unsigned long ti_work)
+{

- if (ti_work & _TIF_NEED_RESCHED)
- schedule();
+ local_irq_enable_exit_to_user(ti_work);

- if (ti_work & _TIF_UPROBE)
- uprobe_notify_resume(regs);
+ if (ti_work & _TIF_NEED_RESCHED)
+ schedule();

- if (ti_work & _TIF_PATCH_PENDING)
- klp_update_patch_state(current);
+ if (ti_work & _TIF_UPROBE)
+ uprobe_notify_resume(regs);

- if (ti_work & _TIF_SIGPENDING)
- arch_do_signal(regs);
+ if (ti_work & _TIF_PATCH_PENDING)
+ klp_update_patch_state(current);

- if (ti_work & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- rseq_handle_notify_resume(NULL, regs);
- }
+ if (ti_work & _TIF_SIGPENDING)
+ arch_do_signal(regs);
+
+ if (ti_work & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ rseq_handle_notify_resume(NULL, regs);
+ }
+
+ /* Architecture specific TIF work */
+ arch_exit_to_user_mode_work(regs, ti_work);
+
+ local_irq_disable_exit_to_user();
+}

- /* Architecture specific TIF work */
- arch_exit_to_user_mode_work(regs, ti_work);
+static unsigned long exit_to_user_mode_loop(struct pt_regs *regs)
+{
+ unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+ /*
+ * Before returning to user space ensure that all pending work
+ * items have been completed.
+ */
+ while (1) {
+ /* Both interrupts and preemption could be enabled here. */
+ if (exit_to_user_mode_work_pending(ti_work))
+ exit_to_user_mode_work(regs, ti_work);
+
+ /* Interrupts may be reenabled with preemption disabled. */
+ sched_core_unsafe_exit_wait(EXIT_TO_USER_MODE_WORK);

/*
- * Disable interrupts and reevaluate the work flags as they
- * might have changed while interrupts and preemption was
- * enabled above.
+ * Reevaluate the work flags as they might have changed
+ * while interrupts and preemption were enabled.
*/
- local_irq_disable_exit_to_user();
ti_work = READ_ONCE(current_thread_info()->flags);
+
+ /*
+ * We may be switching out to another task in kernel mode. That
+ * process is responsible for exiting the "unsafe" kernel mode
+ * when it returns to user or guest.
+ */
+ if (exit_to_user_mode_work_pending(ti_work))
+ sched_core_unsafe_enter();
+ else
+ break;
}

- /* Return the latest work state for arch_exit_to_user_mode() */
- return ti_work;
+ return ti_work;
}

static void exit_to_user_mode_prepare(struct pt_regs *regs)
{
- unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+ unsigned long ti_work;

lockdep_assert_irqs_disabled();

- if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
- ti_work = exit_to_user_mode_loop(regs, ti_work);
+ ti_work = exit_to_user_mode_loop(regs);

arch_exit_to_user_mode_prepare(regs, ti_work);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7e2f310ec0c0..0dc9172be04d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4600,6 +4600,217 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+/*
+ * Handler to attempt to enter kernel. It does nothing because the exit to
+ * usermode or guest mode will do the actual work (of waiting if needed).
+ */
+static void sched_core_irq_work(struct irq_work *work)
+{
+ return;
+}
+
+/*
+ * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
+ * exits the core-wide unsafe state. Obviously the CPU calling this function
+ * should not be responsible for the core being in the core-wide unsafe state
+ * otherwise it will deadlock.
+ *
+ * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
+ * the loop if TIF flags are set and notify caller about it.
+ *
+ * IRQs should be disabled.
+ */
+void sched_core_wait_till_safe(unsigned long ti_check)
+{
+ bool restart = false;
+ struct rq *rq;
+ int cpu;
+
+ /*
+ * During exit to user mode, it is possible that preemption was briefly
+ * enabled before this function was called. During that window, an
+ * interface could have removed the task's tag through stop_machine().
+ * Thus, it is normal for a task to appear tagged at unsafe_enter() but
+ * no longer at the corresponding unsafe_exit(). unsafe_exit() still has
+ * to be called for cleanup.
+ */
+ if (!test_tsk_thread_flag(current, TIF_UNSAFE_RET) || !current->core_cookie)
+ goto ret;
+
+ cpu = smp_processor_id();
+ rq = cpu_rq(cpu);
+
+ if (!sched_core_enabled(rq))
+ goto ret;
+
+ /* Downgrade to allow interrupts. */
+ preempt_disable();
+ local_irq_enable();
+
+ /*
+ * Wait till the core of this HT is not in an unsafe state.
+ *
+ * Pair with smp_store_release() in sched_core_unsafe_exit().
+ */
+ while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
+ cpu_relax();
+ if (READ_ONCE(current_thread_info()->flags) & ti_check) {
+ restart = true;
+ break;
+ }
+ }
+
+ /* Upgrade it back to the expectations of entry code. */
+ local_irq_disable();
+ preempt_enable();
+
+ret:
+ if (!restart)
+ clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+ return;
+}
+
+/*
+ * Enter the core-wide unsafe state. A sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt
+ * is made to avoid sending useless IPIs. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_unsafe_enter(void)
+{
+ const struct cpumask *smt_mask;
+ unsigned long flags;
+ struct rq *rq;
+ int i, cpu;
+
+ /* Ensure that on return to user/guest, we check whether to wait. */
+ if (current->core_cookie)
+ set_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ rq = cpu_rq(cpu);
+ if (!sched_core_enabled(rq))
+ goto ret;
+
+ /* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
+ rq->core_this_unsafe_nest++;
+
+ /* Should not nest: enter() should only pair with exit(). */
+ if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
+ goto ret;
+
+ raw_spin_lock(rq_lockp(rq));
+ smt_mask = cpu_smt_mask(cpu);
+
+ /* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
+ WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
+
+ if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
+ goto unlock;
+
+ if (irq_work_pending(&rq->core_irq_work)) {
+ /*
+ * Do nothing more since we are in an IPI sent from another
+ * sibling to enforce safety. That sibling would have sent IPIs
+ * to all of the HTs.
+ */
+ goto unlock;
+ }
+
+ /*
+ * If we are not the first ones on the core to enter core-wide unsafe
+ * state, do nothing.
+ */
+ if (rq->core->core_unsafe_nest > 1)
+ goto unlock;
+
+ /* Do nothing more if the core is not tagged. */
+ if (!rq->core->core_cookie)
+ goto unlock;
+
+ for_each_cpu(i, smt_mask) {
+ struct rq *srq = cpu_rq(i);
+
+ if (i == cpu || cpu_is_offline(i))
+ continue;
+
+ if (!srq->curr->mm || is_idle_task(srq->curr))
+ continue;
+
+ /* Skip if HT is not running a tagged task. */
+ if (!srq->curr->core_cookie && !srq->core_pick)
+ continue;
+
+ /*
+ * Force sibling into the kernel by IPI. If work was already
+ * pending, no new IPIs are sent. This is Ok since the receiver
+ * would already be in the kernel, or on its way to it.
+ */
+ irq_work_queue_on(&srq->core_irq_work, i);
+ }
+unlock:
+ raw_spin_unlock(rq_lockp(rq));
+ret:
+ local_irq_restore(flags);
+}
+
+/*
+ * Process any work needed for exiting the core-wide unsafe state: drop this
+ * hyperthread's contribution to the core-wide nesting count so that siblings
+ * waiting in sched_core_wait_till_safe() can make progress. Pairs with
+ * sched_core_unsafe_enter().
+ */
+void sched_core_unsafe_exit(void)
+{
+ unsigned long flags;
+ unsigned int nest;
+ struct rq *rq;
+ int cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ rq = cpu_rq(cpu);
+
+ /* Do nothing if core-sched disabled. */
+ if (!sched_core_enabled(rq))
+ goto ret;
+
+ /*
+ * Can happen when a process is forked and the first return to user
+ * mode is a syscall exit. Either way, there's nothing to do.
+ */
+ if (rq->core_this_unsafe_nest == 0)
+ goto ret;
+
+ rq->core_this_unsafe_nest--;
+
+ /* enter() should be paired with exit() only. */
+ if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
+ goto ret;
+
+ raw_spin_lock(rq_lockp(rq));
+ /*
+ * Core-wide nesting counter can never be 0 because we are
+ * still in it on this CPU.
+ */
+ nest = rq->core->core_unsafe_nest;
+ WARN_ON_ONCE(!nest);
+
+ /* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
+ smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
+ raw_spin_unlock(rq_lockp(rq));
+ret:
+ local_irq_restore(flags);
+}
+
+void sched_core_unsafe_exit_wait(unsigned long ti_check)
+{
+ sched_core_unsafe_exit();
+ sched_core_wait_till_safe(ti_check);
+}
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -7584,6 +7795,7 @@ int sched_cpu_starting(unsigned int cpu)
rq = cpu_rq(i);
if (rq->core && rq->core == rq)
core_rq = rq;
+ init_irq_work(&rq->core_irq_work, sched_core_irq_work);
}

if (!core_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 59827ae58019..dbd8416ddaba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1058,10 +1058,13 @@ struct rq {
unsigned int core_sched_seq;
struct rb_root core_tree;
unsigned char core_forceidle;
+ struct irq_work core_irq_work; /* To force HT into kernel */
+ unsigned int core_this_unsafe_nest;

/* shared state */
unsigned int core_task_seq;
unsigned long core_cookie;
+ unsigned int core_unsafe_nest;
#endif
};

--
2.17.1

2020-08-28 19:56:27

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 02/23] sched: Introduce sched_class::pick_task()

From: Peter Zijlstra <[email protected]>

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.
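
For illustration, here is a user-space toy model (the structures and helpers
are invented for this sketch, not the scheduler code) of the property that
matters: the pick step has no side effects, so a remote run queue can be
inspected without committing its choice, and only the local choice is
committed via the set_next step:

#include <stdio.h>
#include <stddef.h>

struct toy_task { const char *name; int prio; };

struct toy_rq {
        struct toy_task *tasks[4];
        int nr;
        struct toy_task *curr;  /* only toy_set_next_task() may change this */
};

/* State-invariant: inspect the queue, touch nothing. */
static struct toy_task *toy_pick_task(struct toy_rq *rq)
{
        struct toy_task *best = NULL;

        for (int i = 0; i < rq->nr; i++)
                if (!best || rq->tasks[i]->prio < best->prio)
                        best = rq->tasks[i];
        return best;
}

/* Commit step, only done for the local CPU's own choice. */
static void toy_set_next_task(struct toy_rq *rq, struct toy_task *p)
{
        rq->curr = p;
}

int main(void)
{
        struct toy_task a = { "A", 120 }, b = { "B", 100 };
        struct toy_rq local = { { &a }, 1, NULL };
        struct toy_rq sibling = { { &b }, 1, NULL };

        /* Core selection may peek at the sibling without disturbing it. */
        struct toy_task *remote = toy_pick_task(&sibling);
        struct toy_task *mine = toy_pick_task(&local);

        toy_set_next_task(&local, mine);        /* commit only our own pick */
        printf("local runs %s, sibling's best is %s\n",
               local.curr->name, remote->name);
        return 0;
}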

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/deadline.c | 16 ++++++++++++++--
kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++---
kernel/sched/idle.c | 8 ++++++++
kernel/sched/rt.c | 14 ++++++++++++--
kernel/sched/sched.h | 3 +++
kernel/sched/stop_task.c | 13 +++++++++++--
6 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c588a06057dc..6389aff32558 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1824,7 +1824,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
}

-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
{
struct sched_dl_entity *dl_se;
struct dl_rq *dl_rq = &rq->dl;
@@ -1836,7 +1836,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
dl_se = pick_next_dl_entity(rq, dl_rq);
BUG_ON(!dl_se);
p = dl_task_of(dl_se);
- set_next_task_dl(rq, p, true);
+
+ return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+ struct task_struct *p;
+
+ p = pick_task_dl(rq);
+ if (p)
+ set_next_task_dl(rq, p, true);
+
return p;
}

@@ -2493,6 +2504,7 @@ const struct sched_class dl_sched_class

#ifdef CONFIG_SMP
.balance = balance_dl,
+ .pick_task = pick_task_dl,
.select_task_rq = select_task_rq_dl,
.migrate_task_rq = migrate_task_rq_dl,
.set_cpus_allowed = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9be6cf92cc3d..1a1bf726264a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4446,7 +4446,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
* Avoid running the skip buddy, if running something else can
* be done without getting too unfair.
*/
- if (cfs_rq->skip == se) {
+ if (cfs_rq->skip && cfs_rq->skip == se) {
struct sched_entity *second;

if (se == curr) {
@@ -4464,13 +4464,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
/*
* Prefer last buddy, try to return the CPU to a preempted task.
*/
- if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+ if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
se = cfs_rq->last;

/*
* Someone really wants this to run. If it's not unfair, run it.
*/
- if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+ if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
se = cfs_rq->next;

clear_buddies(cfs_rq, se);
@@ -6970,6 +6970,35 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
set_last_buddy(se);
}

+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ struct sched_entity *se;
+
+ if (!cfs_rq->nr_running)
+ return NULL;
+
+ do {
+ struct sched_entity *curr = cfs_rq->curr;
+
+ se = pick_next_entity(cfs_rq, NULL);
+
+ if (curr) {
+ if (se && curr->on_rq)
+ update_curr(cfs_rq);
+
+ if (!se || entity_before(curr, se))
+ se = curr;
+ }
+
+ cfs_rq = group_cfs_rq(se);
+ } while (cfs_rq);
+
+ return task_of(se);
+}
+#endif
+
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
@@ -11152,6 +11181,7 @@ const struct sched_class fair_sched_class

#ifdef CONFIG_SMP
.balance = balance_fair,
+ .pick_task = pick_task_fair,
.select_task_rq = select_task_rq_fair,
.migrate_task_rq = migrate_task_rq_fair,

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5c33beb1a525..8c41b023dd14 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -408,6 +408,13 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
schedstat_inc(rq->sched_goidle);
}

+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+ return rq->idle;
+}
+#endif
+
struct task_struct *pick_next_task_idle(struct rq *rq)
{
struct task_struct *next = rq->idle;
@@ -475,6 +482,7 @@ const struct sched_class idle_sched_class

#ifdef CONFIG_SMP
.balance = balance_idle,
+ .pick_task = pick_task_idle,
.select_task_rq = select_task_rq_idle,
.set_cpus_allowed = set_cpus_allowed_common,
#endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e57fca05b660..a5851c775270 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1624,7 +1624,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
return rt_task_of(rt_se);
}

-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq)
{
struct task_struct *p;

@@ -1632,7 +1632,16 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
return NULL;

p = _pick_next_task_rt(rq);
- set_next_task_rt(rq, p, true);
+
+ return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+ struct task_struct *p = pick_task_rt(rq);
+ if (p)
+ set_next_task_rt(rq, p, true);
+
return p;
}

@@ -2443,6 +2452,7 @@ const struct sched_class rt_sched_class

#ifdef CONFIG_SMP
.balance = balance_rt,
+ .pick_task = pick_task_rt,
.select_task_rq = select_task_rq_rt,
.set_cpus_allowed = set_cpus_allowed_common,
.rq_online = rq_online_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9145be83d80b..293d031480d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1793,6 +1793,9 @@ struct sched_class {

#ifdef CONFIG_SMP
int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+
+ struct task_struct * (*pick_task)(struct rq *rq);
+
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 394bc8126a1e..8f92915dd95e 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,15 +34,23 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
stop->se.exec_start = rq_clock_task(rq);
}

-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq)
{
if (!sched_stop_runnable(rq))
return NULL;

- set_next_task_stop(rq, rq->stop, true);
return rq->stop;
}

+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+ struct task_struct *p = pick_task_stop(rq);
+ if (p)
+ set_next_task_stop(rq, p, true);
+
+ return p;
+}
+
static void
enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
{
@@ -124,6 +132,7 @@ const struct sched_class stop_sched_class

#ifdef CONFIG_SMP
.balance = balance_stop,
+ .pick_task = pick_task_stop,
.select_task_rq = select_task_rq_stop,
.set_cpus_allowed = set_cpus_allowed_common,
#endif
--
2.17.1

2020-08-28 19:56:32

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 14/23] irq_work: Add support to detect if work is pending

From: "Joel Fernandes (Google)" <[email protected]>

When an unsafe region is entered on an HT, an IPI needs to be sent to
siblings to ensure they enter the kernel.

Following are the reasons why we would like to use irq_work to implement
forcing of sibling into kernel mode:

1. Existing smp_call infrastructure cannot be used easily since we could
end up waiting on CSD lock if previously an smp_call was not yet
serviced.

2. I'd like to use generic code, such that there is no need to add an
arch-specific IPI.

3. IRQ work already has support to detect that previous work was not yet
executed through the IRQ_WORK_PENDING bit.

4. We need a way for the destination of the IPI to detect that its entry
into the unsafe region was caused by that IPI itself, so that it does not
send more IPIs.

Support for 4. requires us to be able to detect that irq_work is
pending.

This commit therefore adds a way for irq_work users to know if a
previous per-HT irq_work is pending. If it is, we need not send new
IPIs.
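
As a user-space sketch of the idea (names and the simplified claim step are
invented here; the kernel side operates on struct irq_work's flags with
atomic_read_acquire(), as in the diff below), the caller checks the pending
bit with acquire ordering before deciding whether to queue new work:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_WORK_PENDING        0x1

struct toy_irq_work {
        atomic_int flags;
};

static bool toy_work_pending(struct toy_irq_work *w)
{
        /* Acquire: anything published before the flag was set is visible. */
        return atomic_load_explicit(&w->flags, memory_order_acquire) &
               TOY_WORK_PENDING;
}

static bool toy_queue(struct toy_irq_work *w)
{
        /* Claim the entry; returns false if an IPI is already on its way. */
        int old = atomic_fetch_or_explicit(&w->flags, TOY_WORK_PENDING,
                                           memory_order_acq_rel);
        return !(old & TOY_WORK_PENDING);
}

int main(void)
{
        struct toy_irq_work w = { 0 };

        if (!toy_work_pending(&w) && toy_queue(&w))
                printf("sent IPI\n");
        if (toy_work_pending(&w))
                printf("previous IPI still pending, not sending another\n");
        return 0;
}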

Memory ordering:

I was trying to handle the MP-pattern below. Consider the flag to be the
pending bit. P0() is the IRQ work handler. P1() is the code calling
irq_work_pending(). P0() already implicitly adds a memory barrier as a
part of the atomic_fetch_andnot() before calling work->func(). For P1(),
this patch adds the memory barrier as the atomic_read() in this patch's
irq_work_pending() is not sufficient.

P0()
{
WRITE_ONCE(buf, 1);
WRITE_ONCE(flag, 1);
}

P1()
{
int r1;
int r2 = 0;

r1 = READ_ONCE(flag);
if (r1)
r2 = READ_ONCE(buf);
}

Note: This patch is included in the following:
https://lore.kernel.org/lkml/[email protected]/

This could be removed when the above patch gets merged.

Cc: [email protected]
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/irq_work.h | 1 +
kernel/irq_work.c | 11 +++++++++++
2 files changed, 12 insertions(+)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 30823780c192..b26466f95d04 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -42,6 +42,7 @@ bool irq_work_queue_on(struct irq_work *work, int cpu);

void irq_work_tick(void);
void irq_work_sync(struct irq_work *work);
+bool irq_work_pending(struct irq_work *work);

#ifdef CONFIG_IRQ_WORK
#include <asm/irq_work.h>
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index eca83965b631..2d206d511aa0 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -24,6 +24,17 @@
static DEFINE_PER_CPU(struct llist_head, raised_list);
static DEFINE_PER_CPU(struct llist_head, lazy_list);

+bool irq_work_pending(struct irq_work *work)
+{
+ /*
+ * Provide ordering to callers who may read other stuff
+ * after the atomic read (MP-pattern).
+ */
+ bool ret = atomic_read_acquire(&work->flags) & IRQ_WORK_PENDING;
+
+ return ret;
+}
+
/*
* Claim the entry so that no one else will poke at it.
*/
--
2.17.1

2020-08-28 19:56:46

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 12/23] sched: Trivial forced-newidle balancer

From: Peter Zijlstra <[email protected]>

When a sibling is forced-idle to match the core-cookie, search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.
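
As a sketch of the eligibility test (user-space toy code with invented names;
the real checks in try_steal_cookie() below additionally look at core_pick,
the task's cpus_mask and core_occupation):

#include <stdbool.h>
#include <stdio.h>

struct toy_cpu {
        unsigned long core_cookie;      /* cookie the core is currently running */
        bool forced_idle;               /* this sibling had nothing compatible  */
};

struct toy_taskinfo {
        unsigned long cookie;
        bool running;
};

static bool toy_can_steal(const struct toy_cpu *dst,
                          const struct toy_taskinfo *p)
{
        if (!dst->forced_idle)                  /* only fill a forced-idle sibling */
                return false;
        if (p->running)                         /* don't migrate the remote current */
                return false;
        return p->cookie == dst->core_cookie;   /* must match the core cookie */
}

int main(void)
{
        struct toy_cpu dst = { .core_cookie = 0xabc, .forced_idle = true };
        struct toy_taskinfo queued = { .cookie = 0xabc, .running = false };

        printf("steal? %s\n", toy_can_steal(&dst, &queued) ? "yes" : "no");
        return 0;
}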

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 130 +++++++++++++++++++++++++++++++++++++++++-
kernel/sched/idle.c | 1 +
kernel/sched/sched.h | 6 ++
4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5fe9878502cb..eee53fa7b8d4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -687,6 +687,7 @@ struct task_struct {
#ifdef CONFIG_SCHED_CORE
struct rb_node core_node;
unsigned long core_cookie;
+ unsigned int core_occupation;
#endif

#ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 707a56ef0fa3..7e2f310ec0c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -201,6 +201,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
return match;
}

+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+ struct rb_node *node = &p->core_node;
+
+ node = rb_next(node);
+ if (!node)
+ return NULL;
+
+ p = container_of(node, struct task_struct, core_node);
+ if (p->core_cookie != cookie)
+ return NULL;
+
+ return p;
+}
+
/*
* The static-key + stop-machine variable are needed such that:
*
@@ -4641,7 +4656,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
- int i, j, cpu;
+ int i, j, cpu, occ = 0;
int smt_weight;
bool need_sync;

@@ -4750,6 +4765,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
goto done;
}

+ if (!is_idle_task(p))
+ occ++;
+
rq_i->core_pick = p;

/*
@@ -4775,6 +4793,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

cpu_rq(j)->core_pick = NULL;
}
+ occ = 1;
goto again;
} else {
/*
@@ -4820,6 +4839,8 @@ next_class:;
if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
rq_i->core_forceidle = true;

+ rq_i->core_pick->core_occupation = occ;
+
if (i == cpu)
continue;

@@ -4837,6 +4858,113 @@ next_class:;
return next;
}

+static bool try_steal_cookie(int this, int that)
+{
+ struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+ struct task_struct *p;
+ unsigned long cookie;
+ bool success = false;
+
+ local_irq_disable();
+ double_rq_lock(dst, src);
+
+ cookie = dst->core->core_cookie;
+ if (!cookie)
+ goto unlock;
+
+ if (dst->curr != dst->idle)
+ goto unlock;
+
+ p = sched_core_find(src, cookie);
+ if (p == src->idle)
+ goto unlock;
+
+ do {
+ if (p == src->core_pick || p == src->curr)
+ goto next;
+
+ if (!cpumask_test_cpu(this, &p->cpus_mask))
+ goto next;
+
+ if (p->core_occupation > dst->idle->core_occupation)
+ goto next;
+
+ p->on_rq = TASK_ON_RQ_MIGRATING;
+ deactivate_task(src, p, 0);
+ set_task_cpu(p, this);
+ activate_task(dst, p, 0);
+ p->on_rq = TASK_ON_RQ_QUEUED;
+
+ resched_curr(dst);
+
+ success = true;
+ break;
+
+next:
+ p = sched_core_next(p, cookie);
+ } while (p);
+
+unlock:
+ double_rq_unlock(dst, src);
+ local_irq_enable();
+
+ return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+ int i;
+
+ for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+ if (i == cpu)
+ continue;
+
+ if (need_resched())
+ break;
+
+ if (try_steal_cookie(cpu, i))
+ return true;
+ }
+
+ return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+ struct sched_domain *sd;
+ int cpu = cpu_of(rq);
+
+ preempt_disable();
+ rcu_read_lock();
+ raw_spin_unlock_irq(rq_lockp(rq));
+ for_each_domain(cpu, sd) {
+ if (need_resched())
+ break;
+
+ if (steal_cookie_task(cpu, sd))
+ break;
+ }
+ raw_spin_lock_irq(rq_lockp(rq));
+ rcu_read_unlock();
+ preempt_enable();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+ if (!sched_core_enabled(rq))
+ return;
+
+ if (!rq->core->core_cookie)
+ return;
+
+ if (!rq->nr_running) /* not forced idle */
+ return;
+
+ queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
#else /* !CONFIG_SCHED_CORE */

static struct task_struct *
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8c41b023dd14..9c5637d866fd 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -406,6 +406,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
{
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
+ queue_core_balance(rq);
}

#ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e9346de3f59..be6db45276e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1109,6 +1109,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);

+extern void queue_core_balance(struct rq *rq);
+
#else /* !CONFIG_SCHED_CORE */

static inline bool sched_core_enabled(struct rq *rq)
@@ -1121,6 +1123,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
return &rq->__lock;
}

+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
#endif /* CONFIG_SCHED_CORE */

#ifdef CONFIG_SCHED_SMT
--
2.17.1

2020-08-28 19:56:52

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 22/23] Documentation: Add documentation on core scheduling

From: "Joel Fernandes (Google)" <[email protected]>

Co-developed-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai <[email protected]>
---
.../admin-guide/hw-vuln/core-scheduling.rst | 253 ++++++++++++++++++
Documentation/admin-guide/hw-vuln/index.rst | 1 +
2 files changed, 254 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index 000000000000..7ebece93f1d3
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,253 @@
+Core Scheduling
+================
+MDS and L1TF mitigations do not protect from cross-HT attacks (attacker running
+on one HT with victim running on another). For proper mitigation of this,
+core scheduling support is available via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core while still meeting
+the scheduler's requirements.
+
+Usage
+-----
+The current interface implementation is just for testing and uses CPU
+controller CGroups, which will change soon. A ``cpu.tag`` file has been added to
+the CPU controller CGroup. If the content of this file is 1, then all the
+CGroup's tasks trust each other and are allowed to run concurrently on a core's
+hyperthreads (also called siblings).
+
+As mentioned, the interface is for testing purposes and has drawbacks. Trusted
+tasks have to be grouped into a CPU CGroup, which is not always possible
+depending on the system's existing CGroup configuration, where trusted tasks
+could already be in different CPU CGroups. Also, this feature would have a hard
+dependency on CGroups, and systems with CGroups disabled would not be able to
+use core scheduling, so another API is needed in conjunction with
+CGroups. See `Future work`_ for other API proposals.
+
+Design/Implementation
+---------------------
+Each task that is tagged is assigned a cookie internally in the kernel. As
+mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
+each other and share a core.
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks running on a core are
+trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
+The idle task is considered special, in that it trusts everything.
+
+During a schedule() event on any sibling of a core, the highest priority task for
+that core is picked and assigned to the sibling calling schedule(), if that
+sibling has it enqueued. For the rest of the siblings in the core, the highest
+priority task with the same cookie is selected if one is runnable in their
+individual run queues. If a task with the same cookie is not available, the idle
+task is selected. The idle task is globally trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent to
+the siblings for whom a new task was selected. On receiving the IPI, the siblings
+switch to the new task immediately. If an idle task is selected for a sibling,
+then the sibling is considered to be in a "forced idle" state, i.e. it may
+have tasks on its own runqueue to run, yet it still has to run the idle task.
+More on this in the next section.
+
+Forced-idling of tasks
+----------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in a core. However,
+it is possible that some runqueues had tasks that were incompatible with the
+highest priority ones in the core. Favoring security over fairness, one or more
+siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core-wide highest priority
+task. If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (the idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT:
+
+::
+
+          HT1 (attack)            HT2 (victim)
+
+     A    idle -> user space      user space -> idle
+
+     B    idle -> user space      guest -> idle
+
+     C    idle -> guest           user space -> idle
+
+     D    idle -> guest           guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode. This is because the sending of the IPI would bring
+the destination CPU immediately into kernel mode from user space, or cause a VMEXIT
+in the case of guests. At best, this would only leak some scheduler metadata
+which may not be worth protecting. It is also possible that the IPI is received
+too late on some architectures, but this has not been observed in the case of
+x86.
+
+Kernel protection from untrusted tasks
+--------------------------------------
+The scheduler on its own cannot protect the kernel executing concurrently with
+an untrusted task in a core. This is because the scheduler is unaware of
+interrupts/syscalls at scheduling time. To mitigate this, we send an IPI to
+siblings on kernel entry. This forces the sibling to enter kernel mode, and it
+waits before returning to user space until all siblings of the core have left kernel
+mode. For good performance, we send an IPI only if it is detected that the
+core is running tasks that have been marked for core scheduling. If a sibling
+is running kernel threads or is idle, no IPI is sent.
+
+For easier testing, a temporary (not intended for mainline) patch is included
+in this series to make kernel protection configurable via a
+``CONFIG_SCHED_CORE_KERNEL_PROTECTION`` config option or a
+``sched_core_kernel_protection`` boot parameter.
+
+Other ideas for kernel protection are:
+
+1. Changing interrupt affinities to a trusted core which does not execute untrusted tasks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+By changing the interrupt affinities to a designated safe-CPU which runs
+only trusted tasks, IRQ data can be protected. One issue is this involves
+giving up a full CPU core of the system to run safe tasks. Another is that,
+per-cpu interrupts such as the local timer interrupt cannot have their
+affinity changed. Also, sensitive timer callbacks such as the random entropy timer
+can run in softirq on return from these interrupts and expose sensitive
+data. In the future, that could be mitigated by forcing softirqs into threaded
+mode by utilizing a mechanism similar to ``PREEMPT_RT``.
+
+Yet another issue with this is that, for multiqueue devices with managed
+interrupts, the IRQ affinities cannot be changed. However, it could be
+possible to force a reduced number of queues, which would in turn allow
+shielding one or two CPUs from such interrupts and queue handling, at the price
+of indirection.
+
+2. Running IRQs as threaded-IRQs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+This would result in forcing IRQs into the scheduler which would then provide
+the process-context mitigation. However, not all interrupts can be threaded.
+Also this does nothing about syscall entries.
+
+3. Kernel Address Space Isolation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+System calls could run in a much restricted address space which is
+guaranteed not to leak any sensitive data. There are practical limitations in
+implementing this - the main concern being how to decide on an address space
+that is guaranteed to not have any sensitive data.
+
+4. Limited cookie-based protection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+On a system call, change the cookie to the system trusted cookie and initiate a
+schedule event. This would be better than pausing all the siblings during the
+entire duration of the system call, but would still be a big hit to
+performance.
+
+Trust model
+-----------
+Core scheduling understands trust relationships by assignment of a cookie to
+related tasks using the above mentioned interface. When a system with core
+scheduling boots, all tasks are considered to trust each other. This is because
+the scheduler does not have information about trust relationships. That is, all
+tasks have a default cookie value of 0. This cookie value is also considered
+the system-wide cookie value and the IRQ-pausing mitigation is avoided if
+siblings are running these cookie-0 tasks.
+
+By default, all system processes on boot are considered trusted and userspace
+has to explicitly use the interfaces mentioned above to group sets of tasks.
+Tasks within the group trust each other, but not those outside. Tasks outside
+the group don't trust the tasks inside.
+
+Limitations
+-----------
+Core scheduling tries to guarantee that only trusted tasks run concurrently on a
+core. But there could be a small window of time during which untrusted tasks run
+concurrently, or the kernel could be running concurrently with a task not trusted
+by the kernel.
+
+1. IPI processing delays
+^^^^^^^^^^^^^^^^^^^^^^^^
+Core scheduling selects only trusted tasks to run together. An IPI is used to notify
+the siblings to switch to the new task. But there could be hardware delays in
+receiving the IPI on some architectures (on x86, this has not been observed). This may
+cause an attacker task to start running on a cpu before its siblings receive the
+IPI. Even though the cache is flushed on entry to user mode, victim tasks on siblings
+may populate data in the cache and micro-architectural buffers after the attacker
+starts to run, and this opens a possibility for data leakage.
+
+Open cross-HT issues that core scheduling does not solve
+--------------------------------------------------------
+1. For MDS
+^^^^^^^^^^
+Core scheduling cannot protect against MDS attacks between an HT running in
+user mode and another running in kernel mode. Even though both HTs run tasks
+which trust each other, kernel memory is still considered untrusted. Such
+attacks are possible for any combination of sibling CPU modes (host or guest mode).
+
+2. For L1TF
+^^^^^^^^^^^
+Core scheduling cannot protect against an L1TF guest attacker exploiting a
+guest or host victim. This is because the guest attacker can craft invalid
+PTEs which are not inverted due to a vulnerable guest kernel. The only
+solution is to disable EPT.
+
+For both MDS and L1TF, if the guest's vCPUs are configured to not trust each
+other (by tagging them separately), then the guest-to-guest attacks would go away.
+Or it could be a system admin policy which considers guest-to-guest attacks as
+a guest problem.
+
+Another approach to resolve these would be to make every untrusted task on the
+system not trust every other untrusted task. While this could reduce
+parallelism of the untrusted tasks, it would still solve the above issues while
+allowing system processes (trusted tasks) to share a core.
+
+Use cases
+---------
+The main use case for Core scheduling is mitigating the cross-HT vulnerabilities
+with SMT enabled. There are other use cases where this feature could be used:
+
+- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
+ that use SIMD instructions, etc.
+- Gang scheduling: Requirements for a group of tasks that need to be scheduled
+ together could also be realized using core scheduling. One example is the vCPUs
+ of a VM.
+
+Future work
+-----------
+1. API Proposals
+^^^^^^^^^^^^^^^^
+
+As mentioned in the `Usage`_ section, various API proposals are listed here:
+
+- ``prctl`` : We can pass in a tag, and all tasks with the same tag set by prctl form
+ a trusted group.
+
+- ``sched_setattr`` : Similar to prctl, but has the advantage that tasks could be
+ tagged by other tasks with appropriate permissions.
+
+- ``Auto Tagging`` : Related tasks are tagged automatically. The relation could be
+ threads of the same process, tasks owned by a user, group or session, etc.
+
+- Dedicated CGroup or procfs/sysfs interface for grouping trusted tasks. This could
+ be combined with prctl/sched_setattr as well.
+
+2. Auto-tagging of KVM vCPU threads
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+To make configuration easier, it would be great if KVM auto-tags vCPU threads
+such that a given VM only trusts other vCPUs of the same VM. Or something more
+aggressive, like assigning a vCPU thread a unique tag.
+
+3. Auto-tagging of processes by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Currently core scheduling does not prevent 'unconfigured' tasks from being
+co-scheduled on the same core. In other words, everything trusts everything
+else by default. If a user wants everything untrusted by default, a CONFIG option
+could be added to assign every task a unique tag by default.
+
+4. Auto-tagging on fork
+^^^^^^^^^^^^^^^^^^^^^^^
+Currently, on fork a thread is added to the same trust-domain as the parent. For
+systems which want all tasks to have a unique tag, it could be desirable to assign
+a unique tag to a task so that the parent does not trust the child by default.
+
+5. Skipping per-HT mitigations if task is trusted
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If core scheduling is enabled, by default all tasks trust each other as
+mentioned above. In such scenario, it may be desirable to skip the same-HT
+mitigations on return to the trusted user-mode to improve performance.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd9016d..f12cda55538b 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
tsx_async_abort
multihit.rst
special-register-buffer-data-sampling.rst
+ core-scheduling.rst
--
2.17.1

2020-08-28 19:57:09

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 10/23] sched/fair: wrapper for cfs_rq->min_vruntime

From: Aaron Lu <[email protected]>

Add a wrapper function cfs_rq_min_vruntime(cfs_rq) to
return cfs_rq->min_vruntime.

It will be used in the following patch, no functionality
change.

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/fair.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 409edc736297..298d2c521c1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -460,6 +460,11 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#endif /* CONFIG_FAIR_GROUP_SCHED */

+static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->min_vruntime;
+}
+
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);

@@ -496,7 +501,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
struct sched_entity *curr = cfs_rq->curr;
struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);

- u64 vruntime = cfs_rq->min_vruntime;
+ u64 vruntime = cfs_rq_min_vruntime(cfs_rq);

if (curr) {
if (curr->on_rq)
@@ -516,7 +521,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
}

/* ensure we never gain time by being placed backwards. */
- cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+ cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -4044,7 +4049,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_SCHED_DEBUG
- s64 d = se->vruntime - cfs_rq->min_vruntime;
+ s64 d = se->vruntime - cfs_rq_min_vruntime(cfs_rq);

if (d < 0)
d = -d;
@@ -4057,7 +4062,7 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
- u64 vruntime = cfs_rq->min_vruntime;
+ u64 vruntime = cfs_rq_min_vruntime(cfs_rq);

/*
* The 'current' period is already promised to the current tasks,
@@ -4151,7 +4156,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* update_curr().
*/
if (renorm && curr)
- se->vruntime += cfs_rq->min_vruntime;
+ se->vruntime += cfs_rq_min_vruntime(cfs_rq);

update_curr(cfs_rq);

@@ -4162,7 +4167,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* fairness detriment of existing tasks.
*/
if (renorm && !curr)
- se->vruntime += cfs_rq->min_vruntime;
+ se->vruntime += cfs_rq_min_vruntime(cfs_rq);

/*
* When enqueuing a sched_entity, we must:
@@ -4281,7 +4286,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* can move min_vruntime forward still more.
*/
if (!(flags & DEQUEUE_SLEEP))
- se->vruntime -= cfs_rq->min_vruntime;
+ se->vruntime -= cfs_rq_min_vruntime(cfs_rq);

/* return excess runtime on last dequeue */
return_cfs_rq_runtime(cfs_rq);
@@ -6717,7 +6722,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
min_vruntime = cfs_rq->min_vruntime;
} while (min_vruntime != min_vruntime_copy);
#else
- min_vruntime = cfs_rq->min_vruntime;
+ min_vruntime = cfs_rq_min_vruntime(cfs_rq);
#endif

se->vruntime -= min_vruntime;
@@ -10727,7 +10732,7 @@ static void task_fork_fair(struct task_struct *p)
resched_curr(rq);
}

- se->vruntime -= cfs_rq->min_vruntime;
+ se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
rq_unlock(rq, &rf);
}

@@ -10850,7 +10855,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
* cause 'unlimited' sleep bonus.
*/
place_entity(cfs_rq, se, 0);
- se->vruntime -= cfs_rq->min_vruntime;
+ se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
}

detach_entity_cfs_rq(se);
@@ -10864,7 +10869,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
attach_entity_cfs_rq(se);

if (!vruntime_normalized(p))
- se->vruntime += cfs_rq->min_vruntime;
+ se->vruntime += cfs_rq_min_vruntime(cfs_rq);
}

static void switched_from_fair(struct rq *rq, struct task_struct *p)
--
2.17.1

2020-08-28 19:57:09

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

From: Aaron Lu <[email protected]>

This patch provides a vruntime based way to compare two cfs tasks'
priority, be they on the same cpu or on different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, each thread
will choose the next task to run the usual way and then the root level
sched entities which the two tasks belong to will be used to decide
which task runs next core wide.

An illustration for the cross CPU case:

            cpu0                      cpu1
          /  |  \                   /  |  \
        se1 se2 se3               se4 se5 se6
            /  \                        /  \
         se21  se22                 se61  se62
          (A)                        /
                                  se621
                                   (B)

Assume cpu0 and cpu1 are SMT siblings, cpu0 has decided to run task A
next and cpu1 has decided to run task B next. To compare the priority
of tasks A and B, we compare the priority of se2 and se6: whichever has
the smaller vruntime wins.

To make this work, the root level sched entities' vruntime of the two
threads must be directly comparable. So one of the hyperthread's root
cfs_rq's min_vruntime is chosen as the core wide one and all root level
sched entities' vruntime is normalized against it.

Sub cfs_rqs and their sched entities are not interesting for cross-cpu
priority comparison, as they only participate in the usual cpu-local
scheduling decisions, so there is no need to normalize their vruntimes.
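
A toy numeric illustration of the rule (not kernel code; the values are
arbitrary and the root-level vruntimes are assumed to already be normalized
against the shared core-wide min_vruntime):

#include <stdio.h>
#include <stdint.h>

struct toy_se { uint64_t vruntime; };

/* "Whose vruntime is smaller, who wins": true if @a should run before @b. */
static int toy_runs_first(const struct toy_se *a, const struct toy_se *b)
{
        return (int64_t)(a->vruntime - b->vruntime) < 0;
}

int main(void)
{
        /* Root-level entities of task A (cpu0) and task B (cpu1). */
        struct toy_se se2 = { .vruntime = 1000 };       /* task A's root se */
        struct toy_se se6 = { .vruntime =  800 };       /* task B's root se */

        printf("%s runs next core-wide\n",
               toy_runs_first(&se2, &se6) ? "A" : "B"); /* prints B */
        return 0;
}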

Signed-off-by: Aaron Lu <[email protected]>
---
kernel/sched/core.c | 23 +++----
kernel/sched/fair.c | 142 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 3 +
3 files changed, 150 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1f480b6025cd..707a56ef0fa3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -114,19 +114,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);

- if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
- u64 vruntime = b->se.vruntime;
-
- /*
- * Normalize the vruntime if tasks are in different cpus.
- */
- if (task_cpu(a) != task_cpu(b)) {
- vruntime -= task_cfs_rq(b)->min_vruntime;
- vruntime += task_cfs_rq(a)->min_vruntime;
- }
-
- return !((s64)(a->se.vruntime - vruntime) <= 0);
- }
+ if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+ return cfs_prio_less(a, b);

return false;
}
@@ -229,8 +218,12 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;

- for_each_possible_cpu(cpu)
- cpu_rq(cpu)->core_enabled = enabled;
+ for_each_possible_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+ rq->core_enabled = enabled;
+ if (cpu_online(cpu) && rq->core != rq)
+ sched_core_adjust_sibling_vruntime(cpu, enabled);
+ }

return 0;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 298d2c521c1e..e54a35612efa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -460,11 +460,142 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_SCHED_CORE
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ return &rq_of(cfs_rq)->core->cfs;
+}
+
+static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+ if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+ return cfs_rq->min_vruntime;
+
+ return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+#ifndef CONFIG_64BIT
+static inline u64 cfs_rq_min_vruntime_copy(struct cfs_rq *cfs_rq)
+{
+ if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+ return cfs_rq->min_vruntime_copy;
+
+ return core_cfs_rq(cfs_rq)->min_vruntime_copy;
+}
+#endif /* CONFIG_64BIT */
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+ struct sched_entity *sea = &a->se;
+ struct sched_entity *seb = &b->se;
+ bool samecpu = task_cpu(a) == task_cpu(b);
+ s64 delta;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ if (samecpu) {
+ /* vruntime is per cfs_rq */
+ while (!is_same_group(sea, seb)) {
+ int sea_depth = sea->depth;
+ int seb_depth = seb->depth;
+
+ if (sea_depth >= seb_depth)
+ sea = parent_entity(sea);
+ if (sea_depth <= seb_depth)
+ seb = parent_entity(seb);
+ }
+
+ delta = (s64)(sea->vruntime - seb->vruntime);
+ goto out;
+ }
+
+ /* crosscpu: compare root level se's vruntime to decide priority */
+ while (sea->parent)
+ sea = sea->parent;
+ while (seb->parent)
+ seb = seb->parent;
+#else
+ /*
+ * Use the min_vruntime of root cfs_rq if the tasks are
+ * enqueued in different cpus.
+ */
+ if (!samecpu) {
+ delta = (s64)(task_rq(a)->cfs.min_vruntime -
+ task_rq(b)->cfs.min_vruntime);
+ goto out;
+ }
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+ delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+ return delta > 0;
+}
+
+/*
+ * This function takes care of adjusting the min_vruntime of siblings of
+ * a core during coresched enable/disable.
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Coresched enable case
+ * Once Core scheduling is enabled, the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ *
+ * Coresched disable case
+ * During the entire run of core scheduling, sibling_cfs_rq's min_vruntime
+ * is left unused and could lag far behind its still queued sched entities.
+ * Sync it to the up-to-date core wide one to avoid problems.
+ */
+void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled)
+{
+ struct cfs_rq *cfs = &cpu_rq(cpu)->cfs;
+ struct cfs_rq *core_cfs = &cpu_rq(cpu)->core->cfs;
+ if (coresched_enabled) {
+ struct sched_entity *se, *next;
+ s64 delta = core_cfs->min_vruntime - cfs->min_vruntime;
+ rbtree_postorder_for_each_entry_safe(se, next,
+ &cfs->tasks_timeline.rb_root,
+ run_node) {
+ se->vruntime += delta;
+ }
+ } else {
+ cfs->min_vruntime = core_cfs->min_vruntime;
+#ifndef CONFIG_64BIT
+ smp_wmb();
+ cfs->min_vruntime_copy = core_cfs->min_vruntime;
+#endif
+ }
+}
+
+#else
static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
{
return cfs_rq->min_vruntime;
}

+#ifndef CONFIG_64BIT
+static inline u64 cfs_rq_min_vruntime_copy(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->min_vruntime_copy;
+}
+#endif /* CONFIG_64BIT */
+
+#endif /* CONFIG_SCHED_CORE */
+
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);

@@ -520,8 +651,13 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
vruntime = min_vruntime(vruntime, se->vruntime);
}

+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+ cfs_rq = core_cfs_rq(cfs_rq);
+#endif
+
/* ensure we never gain time by being placed backwards. */
- cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+ cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -6717,9 +6853,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
u64 min_vruntime_copy;

do {
- min_vruntime_copy = cfs_rq->min_vruntime_copy;
+ min_vruntime_copy = cfs_rq_min_vruntime_copy(cfs_rq);
smp_rmb();
- min_vruntime = cfs_rq->min_vruntime;
+ min_vruntime = cfs_rq_min_vruntime(cfs_rq);
} while (min_vruntime != min_vruntime_copy);
#else
min_vruntime = cfs_rq_min_vruntime(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index def442f2c690..7e9346de3f59 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1106,6 +1106,9 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
return &rq->__lock;
}

+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);
+
#else /* !CONFIG_SCHED_CORE */

static inline bool sched_core_enabled(struct rq *rq)
--
2.17.1

2020-08-28 19:57:15

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 16/23] arch/x86: Add a new TIF flag for untrusted tasks

From: "Joel Fernandes (Google)" <[email protected]>

Add a new TIF flag to indicate whether the kernel needs to be careful
and take additional steps to mitigate micro-architectural issues during
entry into user or guest mode.

This new flag will be used by the series to determine if waiting is
needed or not, during exit to user or guest mode.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
arch/x86/include/asm/thread_info.h | 2 ++
kernel/sched/sched.h | 6 ++++++
2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..42e63969acb3 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -98,6 +98,7 @@ struct thread_info {
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
+#define TIF_UNSAFE_RET 26 /* On return to process/guest, perform safety checks. */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
@@ -127,6 +128,7 @@ struct thread_info {
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
+#define _TIF_UNSAFE_RET (1 << TIF_UNSAFE_RET)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1c94b2862069..59827ae58019 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2760,3 +2760,9 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)

void swake_up_all_locked(struct swait_queue_head *q);
void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+
+#ifdef CONFIG_SCHED_CORE
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif
+#endif
--
2.17.1

2020-08-28 19:57:32

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 19/23] entry/kvm: Protect the kernel when entering from guest

From: Vineeth Pillai <[email protected]>

Similar to how user to kernel mode transitions are protected in earlier
patches, protect the entry into kernel from guest mode as well.
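
For illustration only, the intended ordering of the two hooks around a guest
run looks roughly like this (stub functions invented for this sketch; the real
hooks are wired into vcpu_enter_guest() in the diff below):

#include <stdio.h>

/* Stand-ins for the kernel helpers used by the hooks below. */
static void toy_unsafe_exit_wait(void)
{
        printf("leave unsafe state, wait until no sibling is in the host kernel\n");
}

static void toy_unsafe_enter(void)
{
        printf("back in the host kernel: mark the core unsafe again\n");
}

static void toy_run_guest(void)
{
        printf("  ... guest executes until the next VMEXIT ...\n");
}

int main(void)
{
        /* kvm_exit_to_guest_mode(): called just before entering the guest. */
        toy_unsafe_exit_wait();
        toy_run_guest();
        /* kvm_enter_from_guest_mode(): called right after the VMEXIT. */
        toy_unsafe_enter();
        return 0;
}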

Signed-off-by: Vineeth Pillai <[email protected]>
---
arch/x86/kvm/x86.c | 3 +++
include/linux/entry-kvm.h | 12 ++++++++++++
kernel/entry/kvm.c | 12 ++++++++++++
3 files changed, 27 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 33945283fe07..2c2f211a556a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8526,6 +8526,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
smp_mb__after_srcu_read_unlock();

+ kvm_exit_to_guest_mode(vcpu);
+
/*
* This handles the case where a posted interrupt was
* notified with kvm_vcpu_kick.
@@ -8619,6 +8621,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
}

+ kvm_enter_from_guest_mode(vcpu);
local_irq_enable();
preempt_enable();

diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 0cef17afb41a..32aabb7f3e6d 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -77,4 +77,16 @@ static inline bool xfer_to_guest_mode_work_pending(void)
}
#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */

+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from guest.
+ * @vcpu: Pointer to the current VCPU data
+ */
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * @vcpu: Pointer to the current VCPU data
+ */
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu);
+
#endif
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index eb1a8a4c867c..994af4241646 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -49,3 +49,15 @@ int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
return xfer_to_guest_mode_work(vcpu, ti_work);
}
EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
+
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu)
+{
+ sched_core_unsafe_enter();
+}
+EXPORT_SYMBOL_GPL(kvm_enter_from_guest_mode);
+
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu)
+{
+ sched_core_unsafe_exit_wait(XFER_TO_GUEST_MODE_WORK);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_to_guest_mode);
--
2.17.1

2020-08-28 19:57:32

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 09/23] sched/fair: Fix forced idle sibling starvation corner case

From: Vineeth Pillai <[email protected]>

If there is only one long-running local task and the sibling is
forced idle, the forced-idle sibling might not get a chance to run
until a schedule event happens on any cpu in the core.

So check for this condition during a tick: if a sibling is starved,
trigger a reschedule to give it a chance to run.

Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 285002a2f641..409edc736297 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10631,6 +10631,40 @@ static void rq_offline_fair(struct rq *rq)

#endif /* CONFIG_SMP */

+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+ return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+ sched_slice(cfs_rq_of(se), se);
+}
+
+/*
+ * If runqueue has only one task which used up its slice and if the sibling
+ * is forced idle, then trigger schedule to give forced idle task a chance.
+ */
+static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
+{
+ int cpu = cpu_of(rq), sibling_cpu;
+
+ if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+ return;
+
+ for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+ struct rq *sibling_rq;
+ if (sibling_cpu == cpu)
+ continue;
+ if (cpu_is_offline(sibling_cpu))
+ continue;
+
+ sibling_rq = cpu_rq(sibling_cpu);
+ if (sibling_rq->core_forceidle) {
+ resched_curr(sibling_rq);
+ }
+ }
+}
+#endif
+
/*
* scheduler tick hitting a task of our scheduling class.
*
@@ -10654,6 +10688,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+
+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(rq))
+ resched_forceidle_sibling(rq, &curr->se);
+#endif
}

/*
--
2.17.1

2020-08-28 19:57:51

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 21/23] sched: cgroup tagging interface for core scheduling

From: Peter Zijlstra <[email protected]>

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling. Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after a task is forked, its presence in the core scheduler
queue needs to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to
prevent a new task from sneaking into the cgroup and being missed by
the update while we iterate through all the tasks in the cgroup. A
more complicated scheme could probably avoid the stop machine. Such a
scheme would also need to resolve inconsistencies between a task's
cgroup core scheduling tag and its residency in the core scheduler
queue.

We are opting for the simple stop machine mechanism for now, which
avoids such complications.

Core scheduling has extra overhead. Enable it only for cores with
more than one SMT hardware thread.
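
As a usage sketch only (not part of this patch; the cgroup mount point
and group name below are assumptions), a group can be tagged from
userspace by writing 1 to its cpu.tag file:

#include <fcntl.h>
#include <unistd.h>

/* Tag the "trusted" group so its tasks only share a core with each other. */
static int tag_trusted_cgroup(void)
{
	int fd = open("/sys/fs/cgroup/cpu/trusted/cpu.tag", O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}

Writing 1 takes a sched_core_get() reference and cookies all member
tasks under stop_machine(); writing 0 reverses it.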

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
kernel/sched/core.c | 183 +++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 +
2 files changed, 180 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 34238fd67f31..5f77e575bbac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -153,6 +153,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
return false;
}

+static bool sched_core_empty(struct rq *rq)
+{
+ return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+ return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+ struct task_struct *task;
+
+ task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+ return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct task_struct *task;
+
+ while (!sched_core_empty(rq)) {
+ task = sched_core_first(rq);
+ rb_erase(&task->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&task->core_node);
+ }
+ rq->core->core_task_seq++;
+}
+
static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
{
struct rb_node *parent, **node;
@@ -184,10 +215,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
{
rq->core->core_task_seq++;

- if (!p->core_cookie)
+ if (!sched_core_enqueued(p))
return;

rb_erase(&p->core_node, &rq->core_tree);
+ RB_CLEAR_NODE(&p->core_node);
}

/*
@@ -253,9 +285,23 @@ static int __sched_core_stopper(void *data)

for_each_possible_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);
- rq->core_enabled = enabled;
- if (cpu_online(cpu) && rq->core != rq)
- sched_core_adjust_sibling_vruntime(cpu, enabled);
+
+ WARN_ON_ONCE(enabled == rq->core_enabled);
+
+ if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+ /*
+ * All active and migrating tasks will have already
+ * been removed from core queue when we clear the
+ * cgroup tags. However, dying tasks could still be
+ * left in core queue. Flush them here.
+ */
+ if (!enabled)
+ sched_core_flush(cpu);
+
+ rq->core_enabled = enabled;
+ if (cpu_online(cpu) && rq->core != rq)
+ sched_core_adjust_sibling_vruntime(cpu, enabled);
+ }
}

return 0;
@@ -266,7 +312,11 @@ static int sched_core_count;

static void __sched_core_enable(void)
{
- // XXX verify there are no cookie tasks (yet)
+ int cpu;
+
+ /* verify there are no cookie tasks (yet) */
+ for_each_online_cpu(cpu)
+ BUG_ON(!sched_core_empty(cpu_rq(cpu)));

static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -274,8 +324,6 @@ static void __sched_core_enable(void)

static void __sched_core_disable(void)
{
- // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
}
@@ -300,6 +348,7 @@ void sched_core_put(void)

static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }

#endif /* CONFIG_SCHED_CORE */

@@ -3534,6 +3583,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&p->core_node);
#endif
return 0;
}
@@ -7431,6 +7483,9 @@ void init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP
sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
#endif
+#ifdef CONFIG_SCHED_CORE
+ RB_CLEAR_NODE(&idle->core_node);
+#endif
}

#ifdef CONFIG_SMP
@@ -8416,6 +8471,15 @@ static void sched_change_group(struct task_struct *tsk, int type)
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
struct task_group, css);
tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+ if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+ tsk->core_cookie = 0UL;
+
+ if (tg->tagged /* && !tsk->core_cookie ? */)
+ tsk->core_cookie = (unsigned long)tg;
+#endif
+
tsk->sched_task_group = tg;

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8508,6 +8572,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
return 0;
}

+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+ struct task_group *tg = css_tg(css);
+
+ if (tg->tagged) {
+ sched_core_put();
+ tg->tagged = 0;
+ }
+#endif
+}
+
static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
{
struct task_group *tg = css_tg(css);
@@ -9073,6 +9149,82 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
}
#endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+
+ return !!tg->tagged;
+}
+
+struct write_core_tag {
+ struct cgroup_subsys_state *css;
+ int val;
+};
+
+static int __sched_write_tag(void *data)
+{
+ struct write_core_tag *tag = (struct write_core_tag *) data;
+ struct cgroup_subsys_state *css = tag->css;
+ int val = tag->val;
+ struct task_group *tg = css_tg(tag->css);
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ tg->tagged = !!val;
+
+ css_task_iter_start(css, 0, &it);
+ /*
+ * Note: css_task_iter_next will skip dying tasks.
+ * There could still be dying tasks left in the core queue
+ * after the loop below is done when we set the cgroup tag to 0.
+ */
+ while ((p = css_task_iter_next(&it))) {
+ p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+ if (sched_core_enqueued(p)) {
+ sched_core_dequeue(task_rq(p), p);
+ if (!p->core_cookie)
+ continue;
+ }
+
+ if (sched_core_enabled(task_rq(p)) &&
+ p->core_cookie && task_on_rq_queued(p))
+ sched_core_enqueue(task_rq(p), p);
+
+ }
+ css_task_iter_end(&it);
+
+ return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+ struct task_group *tg = css_tg(css);
+ struct write_core_tag wtag;
+
+ if (val > 1)
+ return -ERANGE;
+
+ if (!static_branch_likely(&sched_smt_present))
+ return -EINVAL;
+
+ if (tg->tagged == !!val)
+ return 0;
+
+ if (!!val)
+ sched_core_get();
+
+ wtag.css = css;
+ wtag.val = val;
+ stop_machine(__sched_write_tag, (void *) &wtag, NULL);
+ if (!val)
+ sched_core_put();
+
+ return 0;
+}
+#endif
+
static struct cftype cpu_legacy_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -9109,6 +9261,14 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+#ifdef CONFIG_SCHED_CORE
+ {
+ .name = "tag",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_tag_read_u64,
+ .write_u64 = cpu_core_tag_write_u64,
+ },
+#endif
#ifdef CONFIG_UCLAMP_TASK_GROUP
{
.name = "uclamp.min",
@@ -9282,6 +9442,14 @@ static struct cftype cpu_files[] = {
.write_s64 = cpu_weight_nice_write_s64,
},
#endif
+#ifdef CONFIG_SCHED_CORE
+ {
+ .name = "tag",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_core_tag_read_u64,
+ .write_u64 = cpu_core_tag_write_u64,
+ },
+#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
.name = "max",
@@ -9310,6 +9478,7 @@ static struct cftype cpu_files[] = {
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+ .css_offline = cpu_cgroup_css_offline,
.css_released = cpu_cgroup_css_released,
.css_free = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 676818bdb9df..8ef7ab3061ed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -384,6 +384,10 @@ struct cfs_bandwidth {
struct task_group {
struct cgroup_subsys_state css;

+#ifdef CONFIG_SCHED_CORE
+ int tagged;
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
struct sched_entity **se;
--
2.17.1

2020-08-28 19:57:52

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 20/23] sched/coresched: config option for kernel protection

From: Vineeth Pillai <[email protected]>

There are use cases where the kernel protection is not needed. One
example is using core scheduling for non-security related purposes,
such as dynamically isolating a core for a particular process. Another
is testing/benchmarking the overhead of kernel protection.

Add a compile time and a boot time option to disable the feature.
CONFIG_SCHED_CORE_KERNEL_PROTECTION enables this feature at compile
time and defaults to y when CONFIG_SCHED_CORE=y. The
sched_core_kernel_protection= boot time option controls it at runtime;
for example, booting with sched_core_kernel_protection=0 disables the
feature.

Signed-off-by: Vineeth Pillai <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 9 +++++
include/linux/sched.h | 2 +-
kernel/Kconfig.preempt | 13 +++++++
kernel/sched/core.c | 39 ++++++++++++++++++-
kernel/sched/sched.h | 2 +
5 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a1068742a6df..01e442388e4a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4611,6 +4611,15 @@

sbni= [NET] Granch SBNI12 leased line adapter

+ sched_core_kernel_protection=
+ [SCHED_CORE, SCHED_CORE_KERNEL_PROTECTION] Pause SMT
+ siblings of a core running in user mode if at least one
+ of the siblings of the core is running in kernel. This
+ is to guarantee that kernel data is not leaked to tasks
+ which are not trusted by the kernel.
+ This feature is valid only when core scheduling is
+ enabled (CONFIG_SCHED_CORE).
+
sched_debug [KNL] Enables verbose scheduler debug messages.

schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e04ffe689cb..4d9ae6b4dcc9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2055,7 +2055,7 @@ int sched_trace_rq_nr_running(struct rq *rq);

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

-#ifdef CONFIG_SCHED_CORE
+#ifdef CONFIG_SCHED_CORE_KERNEL_PROTECTION
void sched_core_unsafe_enter(void);
void sched_core_unsafe_exit(void);
void sched_core_unsafe_exit_wait(unsigned long ti_check);
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 4488fbf4d3a8..52f86739f910 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,3 +86,16 @@ config SCHED_CORE
default y
depends on SCHED_SMT

+config SCHED_CORE_KERNEL_PROTECTION
+ bool "Core Scheduling for SMT"
+ default y
+ depends on SCHED_CORE
+ help
+ This option enables pausing all SMT siblings of a core running in
+ user mode when at least one of the siblings in the core is in kernel.
+ This is to enforce security such that information from the kernel is
+ not leaked to non-trusted tasks running on siblings. This option is
+ valid only if Core Scheduling (CONFIG_SCHED_CORE) is enabled.
+
+ If in doubt, select 'Y' when CONFIG_SCHED_CORE=y.
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0dc9172be04d..34238fd67f31 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -75,6 +75,24 @@ __read_mostly int scheduler_running;

#ifdef CONFIG_SCHED_CORE

+#ifdef CONFIG_SCHED_CORE_KERNEL_PROTECTION
+
+DEFINE_STATIC_KEY_TRUE(sched_core_kernel_protection);
+static int __init set_sched_core_kernel_protection(char *str)
+{
+ unsigned long val = 0;
+
+ if (!str)
+ return 0;
+
+ if (!kstrtoul(str, 0, &val) && !val)
+ static_branch_disable(&sched_core_kernel_protection);
+
+ return 1;
+}
+__setup("sched_core_kernel_protection=", set_sched_core_kernel_protection);
+#endif
+
DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);

/* kernel prio, less is more */
@@ -4600,6 +4618,8 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+#ifdef CONFIG_SCHED_CORE_KERNEL_PROTECTION
+
/*
* Handler to attempt to enter kernel. It does nothing because the exit to
* usermode or guest mode will do the actual work (of waiting if needed).
@@ -4609,6 +4629,11 @@ static void sched_core_irq_work(struct irq_work *work)
return;
}

+static inline void init_sched_core_irq_work(struct rq *rq)
+{
+ init_irq_work(&rq->core_irq_work, sched_core_irq_work);
+}
+
/*
* sched_core_wait_till_safe - Pause the caller's hyperthread until the core
* exits the core-wide unsafe state. Obviously the CPU calling this function
@@ -4684,6 +4709,9 @@ void sched_core_unsafe_enter(void)
struct rq *rq;
int i, cpu;

+ if (!static_branch_likely(&sched_core_kernel_protection))
+ return;
+
/* Ensure that on return to user/guest, we check whether to wait. */
if (current->core_cookie)
set_tsk_thread_flag(current, TIF_UNSAFE_RET);
@@ -4769,6 +4797,9 @@ void sched_core_unsafe_exit(void)
struct rq *rq;
int cpu;

+ if (!static_branch_likely(&sched_core_kernel_protection))
+ return;
+
local_irq_save(flags);
cpu = smp_processor_id();
rq = cpu_rq(cpu);
@@ -4807,9 +4838,15 @@ void sched_core_unsafe_exit(void)

void sched_core_unsafe_exit_wait(unsigned long ti_check)
{
+ if (!static_branch_likely(&sched_core_kernel_protection))
+ return;
+
sched_core_unsafe_exit();
sched_core_wait_till_safe(ti_check);
}
+#else
+static inline void init_sched_core_irq_work(struct rq *rq) {}
+#endif /* CONFIG_SCHED_CORE_KERNEL_PROTECTION */

// XXX fairness/fwd progress conditions
/*
@@ -7795,7 +7832,7 @@ int sched_cpu_starting(unsigned int cpu)
rq = cpu_rq(i);
if (rq->core && rq->core == rq)
core_rq = rq;
- init_irq_work(&rq->core_irq_work, sched_core_irq_work);
+ init_sched_core_irq_work(rq);
}

if (!core_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dbd8416ddaba..676818bdb9df 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1058,8 +1058,10 @@ struct rq {
unsigned int core_sched_seq;
struct rb_root core_tree;
unsigned char core_forceidle;
+#ifdef CONFIG_SCHED_CORE_KERNEL_PROTECTION
struct irq_work core_irq_work; /* To force HT into kernel */
unsigned int core_this_unsafe_nest;
+#endif

/* shared state */
unsigned int core_task_seq;
--
2.17.1

2020-08-28 19:58:03

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 18/23] entry/idle: Enter and exit kernel protection during idle entry and exit

From: "Joel Fernandes (Google)" <[email protected]>

Make use of the generic_idle_{enter,exit} helper functions added in
an earlier patch to enter and exit kernel protection.

On exiting idle, protection is re-enabled.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/entry-common.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2ea0e09b00d5..c833f2fda542 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -374,6 +374,9 @@ void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
*/
static inline void generic_idle_enter(void)
{
+ /* Entering idle ends the protected kernel region. */
+ sched_core_unsafe_exit();
+
rcu_idle_enter();
}

@@ -383,6 +386,9 @@ static inline void generic_idle_enter(void)
static inline void generic_idle_exit(void)
{
rcu_idle_exit();
+
+ /* Exiting idle (re)starts the protected kernel region. */
+ sched_core_unsafe_enter();
}

#endif
--
2.17.1

2020-08-28 19:58:08

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 23/23] sched: Debug bits...

From: Peter Zijlstra <[email protected]>

Not-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f77e575bbac..def25fe5e0d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -123,6 +123,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)

int pa = __task_prio(a), pb = __task_prio(b);

+ trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+ a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+ b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;

@@ -320,12 +324,16 @@ static void __sched_core_enable(void)

static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+ printk("core sched enabled\n");
}

static void __sched_core_disable(void)
{
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+ printk("core sched disabled\n");
}

void sched_core_get(void)
@@ -4977,6 +4985,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
put_prev_task(rq, prev);
set_next_task(rq, next);
}
+
+ trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+ rq->core->core_task_seq,
+ rq->core->core_pick_seq,
+ rq->core_sched_seq,
+ next->comm, next->pid,
+ next->core_cookie);
+
return next;
}

@@ -5062,6 +5078,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
if (i == cpu && !need_sync && !p->core_cookie) {
next = p;
+ trace_printk("unconstrained pick: %s/%d %lx\n",
+ next->comm, next->pid, next->core_cookie);
goto done;
}

@@ -5070,6 +5088,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

rq_i->core_pick = p;

+ trace_printk("cpu(%d): selected: %s/%d %lx\n",
+ i, p->comm, p->pid, p->core_cookie);
+
/*
* If this new candidate is of higher priority than the
* previous; and they're incompatible; we need to wipe
@@ -5086,6 +5107,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;

+ trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
+
if (old_max) {
for_each_cpu(j, smt_mask) {
if (j == i)
@@ -5114,6 +5137,7 @@ next_class:;

/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+ trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);

/*
* Reschedule siblings
@@ -5145,12 +5169,20 @@ next_class:;
continue;

if (rq_i->curr != rq_i->core_pick) {
+ trace_printk("IPI(%d)\n", i);
WRITE_ONCE(rq_i->core_pick_seq, rq->core->core_task_seq);
resched_curr(rq_i);
}

/* Did we break L1TF mitigation requirements? */
- WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+ if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+ trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+ rq_i->cpu, rq_i->core_pick->comm,
+ rq_i->core_pick->pid,
+ rq_i->core_pick->core_cookie,
+ rq_i->core->core_cookie);
+ WARN_ON_ONCE(1);
+ }
}

done:
@@ -5189,6 +5221,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;

+ trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+ p->comm, p->pid, that, this,
+ p->core_occupation, dst->idle->core_occupation, cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
@@ -7900,6 +7936,8 @@ int sched_cpu_starting(unsigned int cpu)
rq->core = core_rq;
}
}
+
+ printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
#endif /* CONFIG_SCHED_CORE */

sched_rq_cpu_starting(cpu);
--
2.17.1

2020-08-28 19:58:14

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 15/23] entry/idle: Add a common function for activities during idle entry/exit

From: "Joel Fernandes (Google)" <[email protected]>

Currently only RCU hooks for idle entry/exit are called. In later
patches, kernel-entry protection functionality will be added.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/entry-common.h | 16 ++++++++++++++++
kernel/sched/idle.c | 17 +++++++++--------
2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index efebbffcd5cc..2ea0e09b00d5 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -369,4 +369,20 @@ void irqentry_exit_cond_resched(void);
*/
void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);

+/**
+ * generic_idle_enter - Called during entry into idle for housekeeping.
+ */
+static inline void generic_idle_enter(void)
+{
+ rcu_idle_enter();
+}
+
+/**
+ * generic_idle_exit - Called when exiting idle for housekeeping.
+ */
+static inline void generic_idle_exit(void)
+{
+ rcu_idle_exit();
+}
+
#endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 9c5637d866fd..269de55086c1 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -8,6 +8,7 @@
*/
#include "sched.h"

+#include <linux/entry-common.h>
#include <trace/events/power.h>

/* Linker adds these: start and end of __cpuidle functions */
@@ -54,7 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);

static noinline int __cpuidle cpu_idle_poll(void)
{
- rcu_idle_enter();
+ generic_idle_enter();
trace_cpu_idle_rcuidle(0, smp_processor_id());
local_irq_enable();
stop_critical_timings();
@@ -64,7 +65,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
cpu_relax();
start_critical_timings();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
- rcu_idle_exit();
+ generic_idle_exit();

return 1;
}
@@ -158,7 +159,7 @@ static void cpuidle_idle_call(void)

if (cpuidle_not_available(drv, dev)) {
tick_nohz_idle_stop_tick();
- rcu_idle_enter();
+ generic_idle_enter();

default_idle_call();
goto exit_idle;
@@ -178,13 +179,13 @@ static void cpuidle_idle_call(void)
u64 max_latency_ns;

if (idle_should_enter_s2idle()) {
- rcu_idle_enter();
+ generic_idle_enter();

entered_state = call_cpuidle_s2idle(drv, dev);
if (entered_state > 0)
goto exit_idle;

- rcu_idle_exit();
+ generic_idle_exit();

max_latency_ns = U64_MAX;
} else {
@@ -192,7 +193,7 @@ static void cpuidle_idle_call(void)
}

tick_nohz_idle_stop_tick();
- rcu_idle_enter();
+ generic_idle_enter();

next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
call_cpuidle(drv, dev, next_state);
@@ -209,7 +210,7 @@ static void cpuidle_idle_call(void)
else
tick_nohz_idle_retain_tick();

- rcu_idle_enter();
+ generic_idle_enter();

entered_state = call_cpuidle(drv, dev, next_state);
/*
@@ -227,7 +228,7 @@ static void cpuidle_idle_call(void)
if (WARN_ON_ONCE(irqs_disabled()))
local_irq_enable();

- rcu_idle_exit();
+ generic_idle_exit();
}

/*
--
2.17.1

2020-08-28 19:58:52

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 13/23] sched: migration changes for core scheduling

From: Aubrey Li <[email protected]>

- Don't migrate if there is a cookie mismatch
Load balance tries to move task from busiest CPU to the
destination CPU. When core scheduling is enabled, if the
task's cookie does not match with the destination CPU's
core cookie, this task will be skipped by this CPU. This
mitigates the forced idle time on the destination CPU.

- Select cookie matched idle CPU
In the fast path of task wakeup, select the first cookie matched
idle CPU instead of the first idle CPU.

- Find cookie matched idlest CPU
In the slow path of task wakeup, find the idlest CPU whose core
cookie matches the task's cookie.

- Don't migrate task if cookie does not match
For NUMA load balancing, don't migrate a task to a CPU whose
core cookie does not match the task's cookie.

Signed-off-by: Aubrey Li <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
---
kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 29 ++++++++++++++++++++
2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e54a35612efa..ae11a18644b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2048,6 +2048,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;

+#ifdef CONFIG_SCHED_CORE
+ /*
+ * Skip this cpu if source task's cookie does not match
+ * with CPU's core cookie.
+ */
+ if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+ continue;
+#endif
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5983,11 +5992,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this

/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+ struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+ if (!sched_core_cookie_match(rq, p))
+ continue;
+#endif
+
if (sched_idle_cpu(i))
return i;

if (available_idle_cpu(i)) {
- struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6244,8 +6259,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
- if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
- break;
+
+ if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+ /*
+ * If Core Scheduling is enabled, select this cpu
+ * only if the process cookie matches core cookie.
+ */
+ if (sched_core_enabled(cpu_rq(cpu)) &&
+ p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+ break;
+ }
}

time = cpu_clock(this) - time;
@@ -7626,8 +7651,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
* 2) cannot be migrated to this CPU due to cpus_ptr, or
- * 3) running (obviously), or
- * 4) are cache-hot on their current CPU.
+ * 3) task's cookie does not match with this CPU's core cookie
+ * 4) running (obviously), or
+ * 5) are cache-hot on their current CPU.
*/
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7662,6 +7688,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}

+#ifdef CONFIG_SCHED_CORE
+ /*
+ * Don't migrate task if the task's cookie does not match
+ * with the destination CPU's core cookie.
+ */
+ if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+ return 0;
+#endif
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;

@@ -8886,6 +8921,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
p->cpus_ptr))
continue;

+#ifdef CONFIG_SCHED_CORE
+ if (sched_core_enabled(cpu_rq(this_cpu))) {
+ int i = 0;
+ bool cookie_match = false;
+
+ for_each_cpu(i, sched_group_span(group)) {
+ struct rq *rq = cpu_rq(i);
+
+ if (sched_core_cookie_match(rq, p)) {
+ cookie_match = true;
+ break;
+ }
+ }
+ /* Skip over this group if no cookie matched */
+ if (!cookie_match)
+ continue;
+ }
+#endif
+
local_group = cpumask_test_cpu(this_cpu,
sched_group_span(group));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be6db45276e7..1c94b2862069 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1109,6 +1109,35 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
void sched_core_adjust_sibling_vruntime(int cpu, bool coresched_enabled);

+/*
+ * Helper to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+ bool idle_core = true;
+ int cpu;
+
+ /* Ignore cookie match if core scheduler is not enabled on the CPU. */
+ if (!sched_core_enabled(rq))
+ return true;
+
+ for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+ if (!available_idle_cpu(cpu)) {
+ idle_core = false;
+ break;
+ }
+ }
+
+ /*
+ * A CPU in an idle core is always the best choice for tasks with
+ * cookies.
+ */
+ return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
extern void queue_core_balance(struct rq *rq);

#else /* !CONFIG_SCHED_CORE */
--
2.17.1

2020-08-28 19:58:59

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 05/23] sched: Basic tracking of matching tasks

From: Peter Zijlstra <[email protected]>

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core; idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled),
these tasks are indexed in a second RB-tree, first on cookie value and
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
---
include/linux/sched.h | 8 ++-
kernel/sched/core.c | 146 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 46 -------------
kernel/sched/sched.h | 55 ++++++++++++++++
4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 93ecd930efd3..5fe9878502cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -682,10 +682,16 @@ struct task_struct {
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
+ struct sched_dl_entity dl;
+
+#ifdef CONFIG_SCHED_CORE
+ struct rb_node core_node;
+ unsigned long core_cookie;
+#endif
+
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
- struct sched_dl_entity dl;

#ifdef CONFIG_UCLAMP_TASK
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e2642c5dbd61..eea18956a9ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -77,6 +77,141 @@ __read_mostly int scheduler_running;

DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);

+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+ if (p->sched_class == &stop_sched_class) /* trumps deadline */
+ return -2;
+
+ if (rt_prio(p->prio)) /* includes deadline */
+ return p->prio; /* [-1, 99] */
+
+ if (p->sched_class == &idle_sched_class)
+ return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+ return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b) := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+ int pa = __task_prio(a), pb = __task_prio(b);
+
+ if (-pa < -pb)
+ return true;
+
+ if (-pb < -pa)
+ return false;
+
+ if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+ return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+ if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
+ u64 vruntime = b->se.vruntime;
+
+ /*
+ * Normalize the vruntime if tasks are in different cpus.
+ */
+ if (task_cpu(a) != task_cpu(b)) {
+ vruntime -= task_cfs_rq(b)->min_vruntime;
+ vruntime += task_cfs_rq(a)->min_vruntime;
+ }
+
+ return !((s64)(a->se.vruntime - vruntime) <= 0);
+ }
+
+ return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+ if (a->core_cookie < b->core_cookie)
+ return true;
+
+ if (a->core_cookie > b->core_cookie)
+ return false;
+
+ /* flip prio, so high prio is leftmost */
+ if (prio_less(b, a))
+ return true;
+
+ return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+ struct rb_node *parent, **node;
+ struct task_struct *node_task;
+
+ rq->core->core_task_seq++;
+
+ if (!p->core_cookie)
+ return;
+
+ node = &rq->core_tree.rb_node;
+ parent = *node;
+
+ while (*node) {
+ node_task = container_of(*node, struct task_struct, core_node);
+ parent = *node;
+
+ if (__sched_core_less(p, node_task))
+ node = &parent->rb_left;
+ else
+ node = &parent->rb_right;
+ }
+
+ rb_link_node(&p->core_node, parent, node);
+ rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+ rq->core->core_task_seq++;
+
+ if (!p->core_cookie)
+ return;
+
+ rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+ struct rb_node *node = rq->core_tree.rb_node;
+ struct task_struct *node_task, *match;
+
+ /*
+ * The idle task always matches any cookie!
+ */
+ match = idle_sched_class.pick_task(rq);
+
+ while (node) {
+ node_task = container_of(node, struct task_struct, core_node);
+
+ if (cookie < node_task->core_cookie) {
+ node = node->rb_left;
+ } else if (cookie > node_task->core_cookie) {
+ node = node->rb_right;
+ } else {
+ match = node_task;
+ node = node->rb_left;
+ }
+ }
+
+ return match;
+}
+
/*
* The static-key + stop-machine variable are needed such that:
*
@@ -135,6 +270,11 @@ void sched_core_put(void)
mutex_unlock(&sched_core_mutex);
}

+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
#endif /* CONFIG_SCHED_CORE */

/*
@@ -1628,6 +1768,9 @@ static inline void init_uclamp(void) { }

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
+ if (sched_core_enabled(rq))
+ sched_core_enqueue(rq, p);
+
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);

@@ -1642,6 +1785,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)

static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
+ if (sched_core_enabled(rq))
+ sched_core_dequeue(rq, p);
+
if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af8c40191a19..285002a2f641 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -258,33 +258,11 @@ const struct sched_class fair_sched_class;
*/

#ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
- SCHED_WARN_ON(!entity_is_task(se));
- return container_of(se, struct task_struct, se);
-}

/* Walk up scheduling entities hierarchy */
#define for_each_sched_entity(se) \
for (; se; se = se->parent)

-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
- return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
- return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
- return grp->my_q;
-}
-
static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
{
if (!path)
@@ -445,33 +423,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#else /* !CONFIG_FAIR_GROUP_SCHED */

-static inline struct task_struct *task_of(struct sched_entity *se)
-{
- return container_of(se, struct task_struct, se);
-}
-
#define for_each_sched_entity(se) \
for (; se; se = NULL)

-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
- return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
- struct task_struct *p = task_of(se);
- struct rq *rq = task_rq(p);
-
- return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
- return NULL;
-}
-
static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
{
if (path)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6ab8adff169b..92e0b8679eef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1053,6 +1053,10 @@ struct rq {
/* per rq */
struct rq *core;
unsigned int core_enabled;
+ struct rb_root core_tree;
+
+ /* shared state */
+ unsigned int core_task_seq;
#endif
};

@@ -1132,6 +1136,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)

+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+ SCHED_WARN_ON(!entity_is_task(se));
+ return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+ return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ struct task_struct *p = task_of(se);
+ struct rq *rq = task_rq(p);
+
+ return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return NULL;
+}
+#endif
+
extern void update_rq_clock(struct rq *rq);

static inline u64 __rq_clock_broken(struct rq *rq)
--
2.17.1

2020-08-28 19:59:28

by Julien Desfossez

[permalink] [raw]
Subject: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

From: Peter Zijlstra <[email protected]>

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective which logical
CPU does the reschedule).

During a CPU hotplug event, schedule would be called with the hotplugged
CPU not in the cpumask. So use for_each_cpu(_wrap)_or to include the
current cpu in the task pick loop.

There are multiple loops in pick_next_task that iterate over CPUs in
smt_mask. During a hotplug event, sibling could be removed from the
smt_mask while pick_next_task is running. So we cannot trust the mask
across the different loops. This can confuse the logic. Add a retry logic
if smt_mask changes between the loops.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/core.c | 284 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 6 +-
2 files changed, 288 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eea18956a9ef..1f480b6025cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4537,7 +4537,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
* Pick up the highest-prio task:
*/
static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
@@ -4577,6 +4577,283 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
BUG();
}

+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+ return is_idle_task(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+ if (is_idle_task(a) || is_idle_task(b))
+ return true;
+
+ return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ * rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+ struct task_struct *class_pick, *cookie_pick;
+ unsigned long cookie = rq->core->core_cookie;
+
+ class_pick = class->pick_task(rq);
+ if (!class_pick)
+ return NULL;
+
+ if (!cookie) {
+ /*
+ * If class_pick is tagged, return it only if it has
+ * higher priority than max.
+ */
+ if (max && class_pick->core_cookie &&
+ prio_less(class_pick, max))
+ return idle_sched_class.pick_task(rq);
+
+ return class_pick;
+ }
+
+ /*
+ * If class_pick is idle or matches cookie, return early.
+ */
+ if (cookie_equals(class_pick, cookie))
+ return class_pick;
+
+ cookie_pick = sched_core_find(rq, cookie);
+
+ /*
+ * If class > max && class > cookie, it is the highest priority task on
+ * the core (so far) and it must be selected, otherwise we must go with
+ * the cookie pick in order to satisfy the constraint.
+ */
+ if (prio_less(cookie_pick, class_pick) &&
+ (!max || prio_less(max, class_pick)))
+ return class_pick;
+
+ return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ struct task_struct *next, *max = NULL;
+ const struct sched_class *class;
+ const struct cpumask *smt_mask;
+ int i, j, cpu;
+ int smt_weight;
+ bool need_sync;
+
+ if (!sched_core_enabled(rq))
+ return __pick_next_task(rq, prev, rf);
+
+ /*
+ * If there were no {en,de}queues since we picked (IOW, the task
+ * pointers are all still valid), and we haven't scheduled the last
+ * pick yet, do so now.
+ */
+ if (rq->core_pick_seq == rq->core->core_task_seq &&
+ rq->core_pick_seq != rq->core_sched_seq) {
+ WRITE_ONCE(rq->core_sched_seq, rq->core_pick_seq);
+
+ next = rq->core_pick;
+ if (next != prev) {
+ put_prev_task(rq, prev);
+ set_next_task(rq, next);
+ }
+ return next;
+ }
+
+ put_prev_task_balance(rq, prev, rf);
+
+ cpu = cpu_of(rq);
+ smt_mask = cpu_smt_mask(cpu);
+
+ /*
+ * core->core_task_seq, rq->core_pick_seq, rq->core_sched_seq
+ *
+ * @task_seq guards the task state ({en,de}queues)
+ * @pick_seq is the @task_seq we did a selection on
+ * @sched_seq is the @pick_seq we scheduled
+ *
+ * However, preemptions can cause multiple picks on the same task set.
+ * 'Fix' this by also increasing @task_seq for every pick.
+ */
+ rq->core->core_task_seq++;
+ need_sync = !!rq->core->core_cookie;
+
+retry_select:
+ smt_weight = cpumask_weight(smt_mask);
+
+ /* reset state */
+ rq->core->core_cookie = 0UL;
+ for_each_cpu_or(i, smt_mask, cpumask_of(cpu)) {
+ struct rq *rq_i = cpu_rq(i);
+
+ rq_i->core_pick = NULL;
+
+ if (rq_i->core_forceidle) {
+ need_sync = true;
+ rq_i->core_forceidle = false;
+ }
+
+ if (i != cpu)
+ update_rq_clock(rq_i);
+ }
+
+ /*
+ * Try and select tasks for each sibling in descending sched_class
+ * order.
+ */
+ for_each_class(class) {
+again:
+ for_each_cpu_wrap_or(i, smt_mask, cpumask_of(cpu), cpu) {
+ struct rq *rq_i = cpu_rq(i);
+ struct task_struct *p;
+
+ /*
+ * During hotplug online, a sibling can be added to
+ * the smt_mask while we are here. If so, we would
+ * need to restart selection by resetting it all over.
+ */
+ if (unlikely(smt_weight != cpumask_weight(smt_mask)))
+ goto retry_select;
+
+ if (rq_i->core_pick)
+ continue;
+
+ /*
+ * If this sibling doesn't yet have a suitable task to
+ * run, ask for the most eligible task, given the
+ * highest priority task already selected for this
+ * core.
+ */
+ p = pick_task(rq_i, class, max);
+ if (!p) {
+ /*
+ * If there weren't any cookies, we don't need
+ * to bother with the other siblings.
+ */
+ if (i == cpu && !need_sync)
+ goto next_class;
+
+ continue;
+ }
+
+ /*
+ * Optimize the 'normal' case where there aren't any
+ * cookies and we don't need to sync up.
+ */
+ if (i == cpu && !need_sync && !p->core_cookie) {
+ next = p;
+ goto done;
+ }
+
+ rq_i->core_pick = p;
+
+ /*
+ * If this new candidate is of higher priority than the
+ * previous; and they're incompatible; we need to wipe
+ * the slate and start over. pick_task makes sure that
+ * p's priority is more than max if it doesn't match
+ * max's cookie.
+ *
+ * NOTE: this is a linear max-filter and is thus bounded
+ * in execution time.
+ */
+ if (!max || !cookie_match(max, p)) {
+ struct task_struct *old_max = max;
+
+ rq->core->core_cookie = p->core_cookie;
+ max = p;
+
+ if (old_max) {
+ for_each_cpu(j, smt_mask) {
+ if (j == i)
+ continue;
+
+ cpu_rq(j)->core_pick = NULL;
+ }
+ goto again;
+ } else {
+ /*
+ * Once we select a task for a cpu, we
+ * should not be doing an unconstrained
+ * pick because it might starve a task
+ * on a forced idle cpu.
+ */
+ need_sync = true;
+ }
+
+ }
+ }
+next_class:;
+ }
+
+ next = rq->core_pick;
+
+ /* Something should have been selected for current CPU */
+ WARN_ON_ONCE(!next);
+
+ /*
+ * Reschedule siblings
+ *
+ * NOTE: L1TF -- at this point we're no longer running the old task and
+ * sending an IPI (below) ensures the sibling will no longer be running
+ * their task. This ensures there is no inter-sibling overlap between
+ * non-matching user state.
+ */
+ for_each_cpu_or(i, smt_mask, cpumask_of(cpu)) {
+ struct rq *rq_i = cpu_rq(i);
+
+ WARN_ON_ONCE(smt_weight == cpumask_weight(smt_mask) && !rq->core_pick);
+
+ /*
+ * During hotplug online a sibling can be added in the smt_mask
+ * while we are here. We might have missed picking a task for it.
+ * Ignore it now as a schedule on that sibling will correct itself.
+ */
+ if (!rq_i->core_pick)
+ continue;
+
+ if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+ rq_i->core_forceidle = true;
+
+ if (i == cpu)
+ continue;
+
+ if (rq_i->curr != rq_i->core_pick) {
+ WRITE_ONCE(rq_i->core_pick_seq, rq->core->core_task_seq);
+ resched_curr(rq_i);
+ }
+
+ /* Did we break L1TF mitigation requirements? */
+ WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+ }
+
+done:
+ set_next_task(rq, next);
+ return next;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ return __pick_next_task(rq, prev, rf);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
/*
* __schedule() is the main scheduler function.
*
@@ -7433,7 +7710,12 @@ void __init sched_init(void)

#ifdef CONFIG_SCHED_CORE
rq->core = NULL;
+ rq->core_pick = NULL;
rq->core_enabled = 0;
+ rq->core_tree = RB_ROOT;
+ rq->core_forceidle = false;
+
+ rq->core_cookie = 0UL;
#endif
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 92e0b8679eef..def442f2c690 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1052,11 +1052,16 @@ struct rq {
#ifdef CONFIG_SCHED_CORE
/* per rq */
struct rq *core;
+ struct task_struct *core_pick;
+ unsigned int core_pick_seq;
unsigned int core_enabled;
+ unsigned int core_sched_seq;
struct rb_root core_tree;
+ unsigned char core_forceidle;

/* shared state */
unsigned int core_task_seq;
+ unsigned long core_cookie;
#endif
};

@@ -1929,7 +1934,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)

static inline void set_next_task(struct rq *rq, struct task_struct *next)
{
- WARN_ON_ONCE(rq->curr != next);
next->sched_class->set_next_task(rq, next, false);
}

--
2.17.1

2020-08-28 20:54:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

On Fri, Aug 28, 2020 at 03:51:09PM -0400, Julien Desfossez wrote:
> + smt_weight = cpumask_weight(smt_mask);

> + for_each_cpu_wrap_or(i, smt_mask, cpumask_of(cpu), cpu) {
> + struct rq *rq_i = cpu_rq(i);
> + struct task_struct *p;
> +
> + /*
> + * During hotplug online, a sibling can be added to
> + * the smt_mask while we are here. If so, we would
> + * need to restart selection by resetting it all over.
> + */
> + if (unlikely(smt_weight != cpumask_weight(smt_mask)))
> + goto retry_select;

cpumask_weight() is fairly expensive, esp. for something that should
'never' happen.

What exactly is the race here?

We'll update the cpu_smt_mask() fairly early in secondary bringup, but
where does it become a problem?

The moment the new thread starts scheduling it'll block on the common
rq->lock and then it'll cycle task_seq and do a new pick.

So where do things go side-ways?

Can we please split out this hotplug 'fix' into a separate patch with a
coherent changelog.

2020-08-28 20:57:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

On Fri, Aug 28, 2020 at 03:51:09PM -0400, Julien Desfossez wrote:
> + if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> + rq_i->core_forceidle = true;

Did you mean: rq_i->core_pick == rq_i->idle ?

is_idle_task() will also match idle-injection threads, which I'm not
sure we want here.

2020-08-28 21:29:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v7 09/23] sched/fair: Fix forced idle sibling starvation corner case

On Fri, Aug 28, 2020 at 03:51:10PM -0400, Julien Desfossez wrote:
> +/*
> + * If runqueue has only one task which used up its slice and if the sibling
> + * is forced idle, then trigger schedule to give forced idle task a chance.
> + */
> +static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
> +{
> + int cpu = cpu_of(rq), sibling_cpu;
> +
> + if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
> + return;
> +
> + for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
> + struct rq *sibling_rq;
> + if (sibling_cpu == cpu)
> + continue;
> + if (cpu_is_offline(sibling_cpu))
> + continue;
> +
> + sibling_rq = cpu_rq(sibling_cpu);
> + if (sibling_rq->core_forceidle) {
> + resched_curr(sibling_rq);
> + }
> + }

The only purpose of this loop seems to be to find if we have a forceidle;
surely we can avoid that by storing this during the pick.

> +}

static void task_tick_core(struct rq *rq)
{
if (sched_core_enabled(rq))
resched_forceidle_sibling(rq, &rq->curr->se);
}

#else

static void task_tick_core(struct rq *rq) { }

> +#endif
> +
> /*
> * scheduler tick hitting a task of our scheduling class.
> *
> @@ -10654,6 +10688,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> update_misfit_status(curr, rq);
> update_overutilized_status(task_rq(curr));
> +
> +#ifdef CONFIG_SCHED_CORE
> + if (sched_core_enabled(rq))
> + resched_forceidle_sibling(rq, &curr->se);
> +#endif

Then you can ditch the #ifdef here

> }
>
> /*
> --
> 2.17.1
>

2020-08-28 21:31:58

by Peter Zijlstra

[permalink] [raw]

2020-08-28 22:04:11

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.



On 8/28/20 4:51 PM, Peter Zijlstra wrote:
> cpumask_weight() is fairly expensive, esp. for something that should
> 'never' happen.
>
> What exactly is the race here?
>
> We'll update the cpu_smt_mask() fairly early in secondary bringup, but
> where does it become a problem?
>
> The moment the new thread starts scheduling it'll block on the common
> rq->lock and then it'll cycle task_seq and do a new pick.
>
> So where do things go side-ways?
During hotplug stress tests, we have noticed that while a sibling is in
pick_next_task, another sibling can go offline or come online. What
we have observed is that smt_mask gets updated underneath us even if
we hold the lock. From reading the code, it looks like we don't hold the
rq lock when the mask is updated. This extra logic was to take care of that.

> Can we please split out this hotplug 'fix' into a separate patch with a
> coherent changelog.
Sorry about this. I had posted this as separate patches in v6 list,
but merged it for v7. Will split it and have details about the fix in
next iteration.

Thanks,
Vineeth

2020-08-28 22:17:51

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.



On 8/28/20 4:55 PM, Peter Zijlstra wrote:
> On Fri, Aug 28, 2020 at 03:51:09PM -0400, Julien Desfossez wrote:
>> + if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
>> + rq_i->core_forceidle = true;
> Did you mean: rq_i->core_pick == rq_i->idle ?
>
> is_idle_task() will also match idle-injection threads, which I'm not
> sure we want here.
Thanks for catching this. You are right, we should be checking for
rq_i->idle. There are a couple of other places where we use is_idle_task.
Will go through them and verify whether we need to fix those as well.

Thanks,
Vineeth

2020-08-28 22:27:26

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

On Fri, Aug 28, 2020 at 06:02:25PM -0400, Vineeth Pillai wrote:
[...]
> > Can we please split out this hotplug 'fix' into a separate patch with a
> > coherent changelog.
> Sorry about this. I had posted this as separate patches in v6 list,
> but merged it for v7. Will split it and have details about the fix in
> next iteration.

Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6)
detail the individual hotplug changes squashed into this patch:
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

Agreed we can split the patches for the next series, however for final
upstream merge, I suggest we fix hotplug issues in this patch itself so that
we don't break bisectability.

thanks,

- Joel

2020-08-28 23:26:43

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 09/23] sched/fair: Fix forced idle sibling starvation corner case



On 8/28/20 5:25 PM, Peter Zijlstra wrote:
> The only purpose of this loop seems to be to find if we have a forceidle;
> surely we can avoid that by storing this during the pick.
The idea was to kick each cpu that was force idle. But now, thinking
about it, we just need to kick one as it will pick for all the siblings.
Will optimize this as you suggested.
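
For illustration, the optimized tick path might look roughly like this; this
is only a sketch and assumes a core-wide force-idle indicator (here
rq->core->core_forceidle, an illustrative name) is recorded during the pick:

static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
{
        if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
                return;

        /* Kicking this CPU is enough; the next pick is core wide anyway. */
        if (rq->core->core_forceidle)
                resched_curr(rq);
}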

>
> static void task_tick_core(struct rq *rq)
> {
>         if (sched_core_enabled(rq))
>                 resched_forceidle_sibling(rq, &rq->curr->se);
> }
>
> #else
>
> static void task_tick_core(struct rq *rq) { }
>
>> +#endif
>> +
>> /*
>> * scheduler tick hitting a task of our scheduling class.
>> *
>> @@ -10654,6 +10688,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>>
>> update_misfit_status(curr, rq);
>> update_overutilized_status(task_rq(curr));
>> +
>> +#ifdef CONFIG_SCHED_CORE
>> + if (sched_core_enabled(rq))
>> + resched_forceidle_sibling(rq, &curr->se);
>> +#endif
> Then you can ditch the #ifdef here
Makes sense, will do.

Thanks,
Vineeth

2020-08-29 07:51:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

On Fri, Aug 28, 2020 at 06:02:25PM -0400, Vineeth Pillai wrote:
> On 8/28/20 4:51 PM, Peter Zijlstra wrote:

> > So where do things go side-ways?

> During hotplug stress test, we have noticed that while a sibling is in
> pick_next_task, another sibling can go offline or come online. What
> we have observed is smt_mask get updated underneath us even if
> we hold the lock. From reading the code, looks like we don't hold the
> rq lock when the mask is updated. This extra logic was to take care of that.

Sure, the mask is updated async, but _where_ is the actual problem with
that?

On Fri, Aug 28, 2020 at 06:23:55PM -0400, Joel Fernandes wrote:
> Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6)
> detail the individual hotplug changes squashed into this patch:
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/

That one looks fishy, the pick is core wide, making that pick_seq per rq
just doesn't make sense.

> https://lore.kernel.org/lkml/[email protected]/

This one reads like tinkering, there is no description of the actual
problem just some code that makes a symptom go away.

Sure, on hotplug the smt mask can change, but only for a CPU that isn't
actually scheduling, so who cares.

/me re-reads the hotplug code...

..ooOO is the problem that we clear the cpumasks on take_cpu_down()
instead of play_dead() ?! That should be fixable.

> https://lore.kernel.org/lkml/[email protected]/

This is the only one that makes some sense, it makes rq->core consistent
over hotplug.

> Agreed we can split the patches for the next series, however for final
> upstream merge, I suggest we fix hotplug issues in this patch itself so that
> we don't break bisectability.

Meh, who sodding cares about hotplug :-). Also you can 'fix' such things
by making sure you can't actually enable core-sched until after
everything is in place.


2020-08-31 13:05:01

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.



On 8/29/20 3:47 AM, [email protected] wrote:
>> During hotplug stress test, we have noticed that while a sibling is in
>> pick_next_task, another sibling can go offline or come online. What
>> we have observed is smt_mask get updated underneath us even if
>> we hold the lock. From reading the code, looks like we don't hold the
>> rq lock when the mask is updated. This extra logic was to take care of that.
> Sure, the mask is updated async, but _where_ is the actual problem with
> that?
One issue that we observed was that an inconsistent view of smt_mask across
the three loops in pick_next_task can lead to corner cases. The first loop
resets all coresched fields, the second loop picks the task for each sibling,
and then if a sibling came online in the third loop, we IPI it and it crashes
on a NULL core_pick. Similarly, I think we might have issues if a sibling goes
offline in the last loop, or goes offline/online in the pick loop.

It might be possible to do specific checks for core_pick in the loops to fix
the corner cases above. But I was not sure if there were more, and hence took
this approach. I understand that cpumask_weight is expensive and will try to
fix the corner cases using core_pick instead.

>
> On Fri, Aug 28, 2020 at 06:23:55PM -0400, Joel Fernandes wrote:
>> Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6)
>> detail the individual hotplug changes squashed into this patch:
>> https://lore.kernel.org/lkml/[email protected]/
>> https://lore.kernel.org/lkml/[email protected]/
> That one looks fishy, the pick is core wide, making that pick_seq per rq
> just doesn't make sense.
I think there are a couple of scenarios where pick_seq per sibling will be
useful. One is this hotplug case, where you need to pick only for the online
siblings and the offline siblings can come online async. The second case that
we have seen is that we don't need to mark a pick for siblings when we pick a
task which is currently running. Marking the pick core wide will make the
sibling take the fast path and re-select the same task during a schedule event
due to preemption or its time slice expiry.

>> https://lore.kernel.org/lkml/[email protected]/
> This one reads like tinkering, there is no description of the actual
> problem just some code that makes a symptom go away.
>
> Sure, on hotplug the smt mask can change, but only for a CPU that isn't
> actually scheduling, so who cares.
>
> /me re-reads the hotplug code...
>
> ..ooOO is the problem that we clear the cpumasks on take_cpu_down()
> instead of play_dead() ?! That should be fixable.
Yes, we get called into schedule() (for going idle) for an offline cpu and it
gets confused. Also, I think there could be problems while it comes online as
well, like I mentioned above: we might IPI a sibling which just came online
with a NULL core_pick. But I think we can fix it with specific checks for
core_pick.

I shall look into the smt mask update side of the hotplug again and see
if corner cases could be better handled there.

Thanks,
Vineeth

2020-08-31 14:25:55

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

Hi Peter,

On Sat, Aug 29, 2020 at 09:47:19AM +0200, [email protected] wrote:
> On Fri, Aug 28, 2020 at 06:02:25PM -0400, Vineeth Pillai wrote:
> > On 8/28/20 4:51 PM, Peter Zijlstra wrote:
>
> > > So where do things go side-ways?
>
> > During hotplug stress test, we have noticed that while a sibling is in
> > pick_next_task, another sibling can go offline or come online. What
> > we have observed is smt_mask get updated underneath us even if
> > we hold the lock. From reading the code, looks like we don't hold the
> > rq lock when the mask is updated. This extra logic was to take care of that.
>
> Sure, the mask is updated async, but _where_ is the actual problem with
> that?
>
> On Fri, Aug 28, 2020 at 06:23:55PM -0400, Joel Fernandes wrote:
> > Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6)
> > detail the individual hotplug changes squashed into this patch:
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/lkml/[email protected]/
>
> That one looks fishy, the pick is core wide, making that pick_seq per rq
> just doesn't make sense.

I think Vineeth was trying to handle the case where rq->core_pick happened to
be NULL for an offline CPU, and then schedule() is called when it came online
but its sched_seq != core-wide pick_seq. The reason for this situation is
that a sibling did selection for the offline CPU and ended up leaving its
rq->core_pick as NULL as the then-offline CPU was missing from the
cpu_smt_mask, but it incremented the core-wide pick_seq anyway.

Due to this, the pick_next_task() can crash after entering this if() block:
+ if (rq->core_pick_seq == rq->core->core_task_seq &&
+ rq->core_pick_seq != rq->core_sched_seq) {

How would you suggest fixing it? Maybe we can just assign rq->core_sched_seq
= rq->core_pick_seq for an offline CPU (or any CPU where rq->core_pick ==
NULL), so it does not end up using rq->core_pick and does a full core-wide
selection again when it comes online?

Or easier, check for rq->core_pick == NULL and skip this fast-path if() block
completely.
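
For illustration, the simpler second option might look roughly like this (a
sketch only, not the diff that was eventually posted):

        /*
         * Sketch: only take the fast path when a pick was actually recorded
         * for this rq; otherwise fall through to a full core-wide selection.
         */
        if (rq->core->core_pick_seq == rq->core->core_task_seq &&
            rq->core->core_pick_seq != rq->core_sched_seq &&
            rq->core_pick) {
                /* ... use rq->core_pick as before ... */
        }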

> > https://lore.kernel.org/lkml/[email protected]/
>
> This one reads like tinkering, there is no description of the actual
> problem just some code that makes a symptom go away.
>
> Sure, on hotplug the smt mask can change, but only for a CPU that isn't
> actually scheduling, so who cares.
>
> /me re-reads the hotplug code...
>
> ..ooOO is the problem that we clear the cpumasks on take_cpu_down()
> instead of play_dead() ?! That should be fixable.

I think Vineeth explained this in his email, there is logic across the loops
in the pick_next_task() that depend on the cpu_smt_mask not change. I am not
sure if play_dead() will fix it, the issue is seen in the code doing the
selection and the cpu_smt_mask changing under it due to possibly other CPUs
going offline.

For example, you have splat and NULL pointer dereference possibilities in the
loop below if rq_i->core_pick == NULL, because a sibling CPU came online but
a task was not selected for it in the for loops prior to this one:

        /*
         * Reschedule siblings
         *
         * NOTE: L1TF -- at this point we're no longer running the old task and
         * sending an IPI (below) ensures the sibling will no longer be running
         * their task. This ensures there is no inter-sibling overlap between
         * non-matching user state.
         */
        for_each_cpu(i, smt_mask) {
                struct rq *rq_i = cpu_rq(i);

                WARN_ON_ONCE(!rq_i->core_pick);

                if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
                        rq_i->core_forceidle = true;

                rq_i->core_pick->core_occupation = occ;

Probably the code can be rearchitected so it does not depend on cpu_smt_mask
changing. What I did in my old tree is I made a copy of the cpu_smt_mask at
the beginning of this function, and that makes all the problems go away. But
I was afraid of the overhead of that copying.

(btw, I would not complain one bit if this function was nuked and rewritten
to be simpler).

> > https://lore.kernel.org/lkml/[email protected]/
>
> This is the only one that makes some sense, it makes rq->core consistent
> over hotplug.

Cool at least we got one thing right ;)

> > Agreed we can split the patches for the next series, however for final
> > upstream merge, I suggest we fix hotplug issues in this patch itself so that
> > we don't break bisectability.
>
> Meh, who sodding cares about hotplug :-). Also you can 'fix' such things
> by making sure you can't actually enable core-sched until after
> everything is in place.

Fair enough :)

thanks,

- Joel

2020-09-01 03:41:27

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

On Sat, Aug 29, 2020 at 09:47:19AM +0200, [email protected] wrote:
> On Fri, Aug 28, 2020 at 06:02:25PM -0400, Vineeth Pillai wrote:
> > On 8/28/20 4:51 PM, Peter Zijlstra wrote:
>
> > > So where do things go side-ways?
>
> > During hotplug stress test, we have noticed that while a sibling is in
> > pick_next_task, another sibling can go offline or come online. What
> > we have observed is smt_mask get updated underneath us even if
> > we hold the lock. From reading the code, looks like we don't hold the
> > rq lock when the mask is updated. This extra logic was to take care of that.
>
> Sure, the mask is updated async, but _where_ is the actual problem with
> that?
>
> On Fri, Aug 28, 2020 at 06:23:55PM -0400, Joel Fernandes wrote:
> > Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6)
> > detail the individual hotplug changes squashed into this patch:
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/lkml/[email protected]/
>
> That one looks fishy, the pick is core wide, making that pick_seq per rq
> just doesn't make sense.
>
> > https://lore.kernel.org/lkml/[email protected]/
>
> This one reads like tinkering, there is no description of the actual
> problem just some code that makes a symptom go away.
>
> Sure, on hotplug the smt mask can change, but only for a CPU that isn't
> actually scheduling, so who cares.
>
> /me re-reads the hotplug code...
>
> ..ooOO is the problem that we clear the cpumasks on take_cpu_down()
> instead of play_dead() ?! That should be fixable.

That is indeed the problem.

I was wondering, is there any harm in just selecting the idle task if the CPU
calling schedule() is missing from cpu_smt_mask? Does it need to do a
core-wide selection?

That would be best, and avoid any unnecessary surgery of the already
complicated function. This is sort of what Tim was doing in v4 and v5.

Also, what do we do if cpu_smt_mask changes while this function is running? I
tried something like the following and it solves the issues but the overhead
probably really sucks. I was thinking of doing a variation of the below that
just stored the cpu_smt_mask's rq pointers in an array of size SMTS_PER_CORE
on the stack, instead of a cpumask but I am not sure if that will keep the
code clean while still having similar storage overhead.
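
For illustration, that on-stack variant might look roughly like this;
SMTS_PER_CORE and the placement are assumptions for the sketch, not code from
the series:

        /*
         * Sketch: snapshot the sibling runqueues once, so the later loops are
         * immune to concurrent cpu_smt_mask updates.
         */
        struct rq *sibling_rq[SMTS_PER_CORE];   /* assumed small constant, e.g. 2 or 4 */
        int nr_siblings = 0;

        for_each_cpu(i, cpu_smt_mask(cpu)) {
                if (nr_siblings >= SMTS_PER_CORE)
                        break;
                sibling_rq[nr_siblings++] = cpu_rq(i);
        }
        /* The selection loops would then walk sibling_rq[0..nr_siblings-1]. */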

---8<-----------------------

From 5e905e7e620177075a9bcf78fb0dc89a136434bb Mon Sep 17 00:00:00 2001
From: Joel Fernandes <[email protected]>
Date: Tue, 30 Jun 2020 19:39:45 -0400
Subject: [PATCH] Fix CPU hotplug causing crashes in task selection logic

Signed-off-by: Joel Fernandes <[email protected]>
---
kernel/sched/core.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0362102fa3d2..47a21013ba0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4464,7 +4464,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct task_struct *next, *max = NULL;
const struct sched_class *class;
- const struct cpumask *smt_mask;
+ struct cpumask select_mask;
int i, j, cpu, occ = 0;
bool need_sync;

@@ -4499,7 +4499,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
finish_prev_task(rq, prev, rf);

cpu = cpu_of(rq);
- smt_mask = cpu_smt_mask(cpu);
+ cpumask_copy(&select_mask, cpu_smt_mask(cpu));
+
+ /*
+ * Always make sure current CPU is added to smt_mask so that below
+ * selection logic runs on it.
+ */
+ cpumask_set_cpu(cpu, &select_mask);

/*
* core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -4516,7 +4522,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

/* reset state */
rq->core->core_cookie = 0UL;
- for_each_cpu(i, smt_mask) {
+ for_each_cpu(i, &select_mask) {
struct rq *rq_i = cpu_rq(i);

rq_i->core_pick = NULL;
@@ -4536,7 +4542,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
for_each_class(class) {
again:
- for_each_cpu_wrap(i, smt_mask, cpu) {
+ for_each_cpu_wrap(i, &select_mask, cpu) {
struct rq *rq_i = cpu_rq(i);
struct task_struct *p;

@@ -4600,7 +4608,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);

if (old_max) {
- for_each_cpu(j, smt_mask) {
+ for_each_cpu(j, &select_mask) {
if (j == i)
continue;

@@ -4625,6 +4633,10 @@ next_class:;

rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
+
+ /* Something should have been selected for current CPU*/
+ WARN_ON_ONCE(!next);
+
rq->core_sched_seq = rq->core->core_pick_seq;
trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);

@@ -4636,7 +4648,7 @@ next_class:;
* their task. This ensures there is no inter-sibling overlap between
* non-matching user state.
*/
- for_each_cpu(i, smt_mask) {
+ for_each_cpu(i, &select_mask) {
struct rq *rq_i = cpu_rq(i);

WARN_ON_ONCE(!rq_i->core_pick);
--
2.28.0.402.g5ffc5be6b7-goog

2020-09-01 05:11:29

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

On Sat, Aug 29, 2020 at 09:47:19AM +0200, [email protected] wrote:
> On Fri, Aug 28, 2020 at 06:02:25PM -0400, Vineeth Pillai wrote:
> > On 8/28/20 4:51 PM, Peter Zijlstra wrote:
>
> > > So where do things go side-ways?
>
> > During hotplug stress test, we have noticed that while a sibling is in
> > pick_next_task, another sibling can go offline or come online. What
> > we have observed is smt_mask get updated underneath us even if
> > we hold the lock. From reading the code, looks like we don't hold the
> > rq lock when the mask is updated. This extra logic was to take care of that.
>
> Sure, the mask is updated async, but _where_ is the actual problem with
> that?

Hi Peter,

I tried again and came up with the simple patch below which handles all the
issues and does not cause any more crashes. I added elaborate commit messages
and code comments listing all the issues. Hope it makes sense now. IMHO any
other solution seems unclear or adds overhead. The simple solution below Just
Works (Tm) and does not add overhead.

Let me know what you think, thanks.

---8<-----------------------

From 546c5b48f372111589117f51fd79ac1e9493c7e7 Mon Sep 17 00:00:00 2001
From: "Joel Fernandes (Google)" <[email protected]>
Date: Tue, 1 Sep 2020 00:56:36 -0400
Subject: [PATCH] sched/core: Hotplug fixes to pick_next_task()

The following 3 cases need to be handled to avoid crashes in pick_next_task() when
CPUs in a core are going offline or coming online.

1. The stopper task is switching into idle when it is brought down by CPU
hotplug. It is not in the cpu_smt_mask so nothing need be selected for it.
Further, the current code ends up not selecting anything for it, not even idle.
This ends up causing crashes in set_next_task(). Just do the __pick_next_task()
selection which will select the idle task. No need to do core-wide selection as
other siblings will handle it for themselves when they call schedule.

2. The rq->core_pick for a sibling in a core can be NULL if no selection was
made for it because it was either offline or went offline during a sibling's
core-wide selection. In that case, do a core-wide selection; to do so, we
have to completely ignore the checks:
if (rq->core->core_pick_seq == rq->core->core_task_seq &&
rq->core->core_pick_seq != rq->core_sched_seq)

Otherwise, it would again end up crashing like #1.

3. The 'Rescheduling siblings' loop of pick_next_task() is quite fragile. It
calls various functions on rq->core_pick which could very well be NULL because:
An online sibling might have gone offline before a task could be picked for it,
or it might be offline but later happen to come online, but its too late and
nothing was picked for it. Just ignore the siblings for which nothing could be
picked. This avoids any crashes that may occur in this loop that assume
rq->core_pick is not NULL.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 717122a3dca1..4966e9f14f39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4610,13 +4610,24 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);

+ cpu = cpu_of(rq);
+
+ /* Stopper task is switching into idle, no need core-wide selection. */
+ if (cpu_is_offline(cpu))
+ return __pick_next_task(rq, prev, rf);
+
/*
* If there were no {en,de}queues since we picked (IOW, the task
* pointers are all still valid), and we haven't scheduled the last
* pick yet, do so now.
+ *
+ * rq->core_pick can be NULL if no selection was made for a CPU because
+ * it was either offline or went offline during a sibling's core-wide
+ * selection. In this case, do a core-wide selection.
*/
if (rq->core->core_pick_seq == rq->core->core_task_seq &&
- rq->core->core_pick_seq != rq->core_sched_seq) {
+ rq->core->core_pick_seq != rq->core_sched_seq &&
+ !rq->core_pick) {
WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);

next = rq->core_pick;
@@ -4629,7 +4640,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

put_prev_task_balance(rq, prev, rf);

- cpu = cpu_of(rq);
smt_mask = cpu_smt_mask(cpu);

/*
@@ -4761,7 +4771,15 @@ next_class:;
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);

- WARN_ON_ONCE(!rq_i->core_pick);
+ /*
+ * An online sibling might have gone offline before a task
+ * could be picked for it, or it might be offline but later
+ * happen to come online, but its too late and nothing was
+ * picked for it. That's Ok - it will pick tasks for itself,
+ * so ignore it.
+ */
+ if (!rq_i->core_pick)
+ continue;

if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
rq_i->core_forceidle = true;
--
2.28.0.402.g5ffc5be6b7-goog

2020-09-01 12:37:42

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

Hi Joel,

On 9/1/20 1:10 AM, Joel Fernandes wrote:
> 3. The 'Rescheduling siblings' loop of pick_next_task() is quite fragile. It
> calls various functions on rq->core_pick which could very well be NULL because:
> An online sibling might have gone offline before a task could be picked for it,
> or it might be offline but later happen to come online, but its too late and
> nothing was picked for it. Just ignore the siblings for which nothing could be
> picked. This avoids any crashes that may occur in this loop that assume
> rq->core_pick is not NULL.
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
I like this idea, its much simpler :-)

> ---
> kernel/sched/core.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 717122a3dca1..4966e9f14f39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4610,13 +4610,24 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> if (!sched_core_enabled(rq))
> return __pick_next_task(rq, prev, rf);
>
> + cpu = cpu_of(rq);
> +
> + /* Stopper task is switching into idle, no need core-wide selection. */
I think we can come here when hotplug thread is scheduled during online, but
mask is not yet updated. Probably can add it with this comment as well.

> + if (cpu_is_offline(cpu))
> + return __pick_next_task(rq, prev, rf);
> +
We would need to reset core_pick here I think. Something like:
        if (cpu_is_offline(cpu)) {
                rq->core_pick = NULL;
                return __pick_next_task(rq, prev, rf);
        }

Without this we can end up in a crash like this:
1. Sibling of this cpu picks a task (rq_i->core_pick) and this cpu goes
    offline soon after.
2. Before this cpu comes online, sibling goes through another pick loop
    and before its IPI loop, this cpu comes online and we get an IPI.
3. So when this cpu gets into schedule, we have core_pick set and
    core_pick_seq != core_sched_seq. So we enter the fast path. But
    core_pick might no longer be in this runqueue.

So, to protect against this, we should reset core_pick I think. I have seen
this crash occasionally.

> /*
> * If there were no {en,de}queues since we picked (IOW, the task
> * pointers are all still valid), and we haven't scheduled the last
> * pick yet, do so now.
> + *
> + * rq->core_pick can be NULL if no selection was made for a CPU because
> + * it was either offline or went offline during a sibling's core-wide
> + * selection. In this case, do a core-wide selection.
> */
> if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> - rq->core->core_pick_seq != rq->core_sched_seq) {
> + rq->core->core_pick_seq != rq->core_sched_seq &&
> + !rq->core_pick) {
Should this check be reversed? I mean, we should enter the fastpath if
rq->core_pick is set, right?


Another unrelated, but related note :-)
Besides this, I think we need to retain on more change from the previous
patch. We would need to make core_pick_seq per sibling instead of per
core. Having it per core might lead to unfairness. For eg: When a cpu
sees that its sibling's core_pick is the one which is already running, it
will not send IPI. but core_pick remains set and core->core_pick_seq is
incremented. Now if the sibling is preempted due to a high priority task
or its time slice expired, it enters schedule. But it goes to fast path and
selects the running task there by starving the high priority task. Having
the core_pick_seq per sibling will avoid this. It might also help in some
hotplug corner cases as well.

Thanks,
Vineeth

2020-09-01 15:57:03

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

On Fri, Aug 28 2020 at 15:51, Julien Desfossez wrote:
> @@ -112,59 +113,84 @@ static __always_inline void exit_to_user_mode(void)
> /* Workaround to allow gradual conversion of architecture code */
> void __weak arch_do_signal(struct pt_regs *regs) { }
>
> -static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> - unsigned long ti_work)

Can the rework of that code please be split out into a seperate patch
and then adding that unsafe muck on top?

> +static inline bool exit_to_user_mode_work_pending(unsigned long ti_work)
> {
> - /*
> - * Before returning to user space ensure that all pending work
> - * items have been completed.
> - */
> - while (ti_work & EXIT_TO_USER_MODE_WORK) {
> + return (ti_work & EXIT_TO_USER_MODE_WORK);
> +}
>
> - local_irq_enable_exit_to_user(ti_work);
> +static inline void exit_to_user_mode_work(struct pt_regs *regs,
> + unsigned long ti_work)
> +{
>
> - if (ti_work & _TIF_NEED_RESCHED)
> - schedule();
> + local_irq_enable_exit_to_user(ti_work);
>
> - if (ti_work & _TIF_UPROBE)
> - uprobe_notify_resume(regs);
> + if (ti_work & _TIF_NEED_RESCHED)
> + schedule();
>
> - if (ti_work & _TIF_PATCH_PENDING)
> - klp_update_patch_state(current);
> + if (ti_work & _TIF_UPROBE)
> + uprobe_notify_resume(regs);
>
> - if (ti_work & _TIF_SIGPENDING)
> - arch_do_signal(regs);
> + if (ti_work & _TIF_PATCH_PENDING)
> + klp_update_patch_state(current);
>
> - if (ti_work & _TIF_NOTIFY_RESUME) {
> - clear_thread_flag(TIF_NOTIFY_RESUME);
> - tracehook_notify_resume(regs);
> - rseq_handle_notify_resume(NULL, regs);
> - }
> + if (ti_work & _TIF_SIGPENDING)
> + arch_do_signal(regs);
> +
> + if (ti_work & _TIF_NOTIFY_RESUME) {
> + clear_thread_flag(TIF_NOTIFY_RESUME);
> + tracehook_notify_resume(regs);
> + rseq_handle_notify_resume(NULL, regs);
> + }
> +
> + /* Architecture specific TIF work */
> + arch_exit_to_user_mode_work(regs, ti_work);
> +
> + local_irq_disable_exit_to_user();
> +}
>
> - /* Architecture specific TIF work */
> - arch_exit_to_user_mode_work(regs, ti_work);
> +static unsigned long exit_to_user_mode_loop(struct pt_regs *regs)
> +{
> + unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> +
> + /*
> + * Before returning to user space ensure that all pending work
> + * items have been completed.
> + */
> + while (1) {
> + /* Both interrupts and preemption could be enabled here. */

What? Preemption is enabled here, but interrupts are disabled.

> + if (exit_to_user_mode_work_pending(ti_work))
> + exit_to_user_mode_work(regs, ti_work);
> +
> + /* Interrupts may be reenabled with preemption disabled. */
> + sched_core_unsafe_exit_wait(EXIT_TO_USER_MODE_WORK);
> +
> /*
> - * Disable interrupts and reevaluate the work flags as they
> - * might have changed while interrupts and preemption was
> - * enabled above.
> + * Reevaluate the work flags as they might have changed
> + * while interrupts and preemption were enabled.

What enables preemption and interrupts? Can you pretty please write
comments which explain what's going on.

> */
> - local_irq_disable_exit_to_user();
> ti_work = READ_ONCE(current_thread_info()->flags);
> +
> + /*
> + * We may be switching out to another task in kernel mode. That
> + * process is responsible for exiting the "unsafe" kernel mode
> + * when it returns to user or guest.
> + */
> + if (exit_to_user_mode_work_pending(ti_work))
> + sched_core_unsafe_enter();
> + else
> + break;
> }
>
> - /* Return the latest work state for arch_exit_to_user_mode() */
> - return ti_work;
> + return ti_work;
> }
>
> static void exit_to_user_mode_prepare(struct pt_regs *regs)
> {
> - unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> + unsigned long ti_work;
>
> lockdep_assert_irqs_disabled();
>
> - if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> - ti_work = exit_to_user_mode_loop(regs, ti_work);
> + ti_work = exit_to_user_mode_loop(regs);

Why has the above loop to be invoked unconditionally even when that core
scheduling muck is disabled? Just to make all return to user paths more
expensive unconditionally, right?

Thanks,

tglx


2020-09-01 16:51:59

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

On Tue, Sep 01, 2020 at 05:54:53PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 28 2020 at 15:51, Julien Desfossez wrote:
> > @@ -112,59 +113,84 @@ static __always_inline void exit_to_user_mode(void)
> > /* Workaround to allow gradual conversion of architecture code */
> > void __weak arch_do_signal(struct pt_regs *regs) { }
> >
> > -static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> > - unsigned long ti_work)
>
> Can the rework of that code please be split out into a seperate patch
> and then adding that unsafe muck on top?

Yeah, good idea. Will do.

> > +static inline bool exit_to_user_mode_work_pending(unsigned long ti_work)
> > {
> > - /*
> > - * Before returning to user space ensure that all pending work
> > - * items have been completed.
> > - */
> > - while (ti_work & EXIT_TO_USER_MODE_WORK) {
> > + return (ti_work & EXIT_TO_USER_MODE_WORK);
> > +}
> >
> > - local_irq_enable_exit_to_user(ti_work);
> > +static inline void exit_to_user_mode_work(struct pt_regs *regs,
> > + unsigned long ti_work)
> > +{
> >
> > - if (ti_work & _TIF_NEED_RESCHED)
> > - schedule();
> > + local_irq_enable_exit_to_user(ti_work);
> >
> > - if (ti_work & _TIF_UPROBE)
> > - uprobe_notify_resume(regs);
> > + if (ti_work & _TIF_NEED_RESCHED)
> > + schedule();
> >
> > - if (ti_work & _TIF_PATCH_PENDING)
> > - klp_update_patch_state(current);
> > + if (ti_work & _TIF_UPROBE)
> > + uprobe_notify_resume(regs);
> >
> > - if (ti_work & _TIF_SIGPENDING)
> > - arch_do_signal(regs);
> > + if (ti_work & _TIF_PATCH_PENDING)
> > + klp_update_patch_state(current);
> >
> > - if (ti_work & _TIF_NOTIFY_RESUME) {
> > - clear_thread_flag(TIF_NOTIFY_RESUME);
> > - tracehook_notify_resume(regs);
> > - rseq_handle_notify_resume(NULL, regs);
> > - }
> > + if (ti_work & _TIF_SIGPENDING)
> > + arch_do_signal(regs);
> > +
> > + if (ti_work & _TIF_NOTIFY_RESUME) {
> > + clear_thread_flag(TIF_NOTIFY_RESUME);
> > + tracehook_notify_resume(regs);
> > + rseq_handle_notify_resume(NULL, regs);
> > + }
> > +
> > + /* Architecture specific TIF work */
> > + arch_exit_to_user_mode_work(regs, ti_work);
> > +
> > + local_irq_disable_exit_to_user();
> > +}
> >
> > - /* Architecture specific TIF work */
> > - arch_exit_to_user_mode_work(regs, ti_work);
> > +static unsigned long exit_to_user_mode_loop(struct pt_regs *regs)
> > +{
> > + unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +
> > + /*
> > + * Before returning to user space ensure that all pending work
> > + * items have been completed.
> > + */
> > + while (1) {
> > + /* Both interrupts and preemption could be enabled here. */
>
> What? Preemption is enabled here, but interrupts are disabled.

Sorry, I meant about what happens within exit_to_user_mode_work(). That's
what the comment was for. I agree I will do better with the comment next
time.

> > + if (exit_to_user_mode_work_pending(ti_work))
> > + exit_to_user_mode_work(regs, ti_work);
> > +
> > + /* Interrupts may be reenabled with preemption disabled. */
> > + sched_core_unsafe_exit_wait(EXIT_TO_USER_MODE_WORK);
> > +
> > /*
> > - * Disable interrupts and reevaluate the work flags as they
> > - * might have changed while interrupts and preemption was
> > - * enabled above.
> > + * Reevaluate the work flags as they might have changed
> > + * while interrupts and preemption were enabled.
>
> What enables preemption and interrupts? Can you pretty please write
> comments which explain what's going on.

Yes, sorry. So, sched_core_unsafe_exit_wait() will spin with IRQs enabled and
preemption disabled. I did it this way to get past stop_machine() issues
where you were unhappy with us spinning in IRQ disabled region.
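
Roughly, the wait described here would look like the sketch below;
core_is_safe() is a hypothetical helper standing in for whatever condition the
real implementation checks:

void sched_core_unsafe_exit_wait(unsigned long ti_check)
{
        /* Spin with IRQs enabled and preemption disabled, as described above. */
        preempt_disable();
        local_irq_enable();

        while (!core_is_safe(this_rq())) {      /* hypothetical helper */
                /* Bail out early if new exit work (e.g. NEED_RESCHED) shows up. */
                if (READ_ONCE(current_thread_info()->flags) & ti_check)
                        break;
                cpu_relax();
        }

        local_irq_disable();
        preempt_enable();
}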

> > */
> > - local_irq_disable_exit_to_user();
> > ti_work = READ_ONCE(current_thread_info()->flags);
> > +
> > + /*
> > + * We may be switching out to another task in kernel mode. That
> > + * process is responsible for exiting the "unsafe" kernel mode
> > + * when it returns to user or guest.
> > + */
> > + if (exit_to_user_mode_work_pending(ti_work))
> > + sched_core_unsafe_enter();
> > + else
> > + break;
> > }
> >
> > - /* Return the latest work state for arch_exit_to_user_mode() */
> > - return ti_work;
> > + return ti_work;
> > }
> >
> > static void exit_to_user_mode_prepare(struct pt_regs *regs)
> > {
> > - unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > + unsigned long ti_work;
> >
> > lockdep_assert_irqs_disabled();
> >
> > - if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> > - ti_work = exit_to_user_mode_loop(regs, ti_work);
> > + ti_work = exit_to_user_mode_loop(regs);
>
> Why has the above loop to be invoked unconditionally even when that core
> scheduling muck is disabled? Just to make all return to user paths more
> expensive unconditionally, right?

If you see the above loop, we are calling exit_to_user_mode_work()
conditionally by calling exit_to_user_mode_work_pending() which does the same
thing.

So we are still conditionally doing the usual exit_to_user_mode work, its
just that now we have to unconditionally invoke the 'loop' anyway. The reason
for that is, the loop can switch into another thread, so we have to do
unsafe_exit() for the old thread, and unsafe_enter() for the new one while
handling the tif work properly. We could get migrated to another CPU in this
loop itself, AFAICS. So the unsafe_enter() / unsafe_exit() calls could also
happen on different CPUs.

I will rework it by splitting, and adding more elaborate comments, etc.
Thanks,

- Joel


2020-09-01 17:32:40

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

Hi Vineeth,

On Tue, Sep 01, 2020 at 08:34:23AM -0400, Vineeth Pillai wrote:
> Hi Joel,
>
> On 9/1/20 1:10 AM, Joel Fernandes wrote:
> > 3. The 'Rescheduling siblings' loop of pick_next_task() is quite fragile. It
> > calls various functions on rq->core_pick which could very well be NULL because:
> > An online sibling might have gone offline before a task could be picked for it,
> > or it might be offline but later happen to come online, but its too late and
> > nothing was picked for it. Just ignore the siblings for which nothing could be
> > picked. This avoids any crashes that may occur in this loop that assume
> > rq->core_pick is not NULL.
> >
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> I like this idea, its much simpler :-)

Thanks.

> > ---
> > kernel/sched/core.c | 24 +++++++++++++++++++++---
> > 1 file changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 717122a3dca1..4966e9f14f39 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4610,13 +4610,24 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > if (!sched_core_enabled(rq))
> > return __pick_next_task(rq, prev, rf);
> > + cpu = cpu_of(rq);
> > +
> > + /* Stopper task is switching into idle, no need core-wide selection. */
>
> I think we can come here when hotplug thread is scheduled during online, but
> mask is not yet updated. Probably can add it with this comment as well.
>

I don't see how that is possible. Because the cpuhp threads run during the
CPU onlining process, the boot thread for the CPU coming online would have
already updated the mask.

> > + if (cpu_is_offline(cpu))
> > + return __pick_next_task(rq, prev, rf);
> > +
> We would need to reset core_pick here I think. Something like:
>         if (cpu_is_offline(cpu)) {
>                 rq->core_pick = NULL;
>                 return __pick_next_task(rq, prev, rf);
>         }
>
> Without this we can end up in a crash like this:
> 1. Sibling of this cpu picks a task (rq_i->core_pick) and this cpu goes
>    offline soon after.
> 2. Before this cpu comes online, sibling goes through another pick loop
>    and before its IPI loop, this cpu comes online and we get an IPI.
> 3. So when this cpu gets into schedule, we have core_pick set and
>    core_pick_seq != core_sched_seq. So we enter the fast path. But
>    core_pick might no longer be in this runqueue.
>
> So, to protect against this, we should reset core_pick I think. I have seen
> this crash occasionally.

Ok, done.

> > /*
> > * If there were no {en,de}queues since we picked (IOW, the task
> > * pointers are all still valid), and we haven't scheduled the last
> > * pick yet, do so now.
> > + *
> > + * rq->core_pick can be NULL if no selection was made for a CPU because
> > + * it was either offline or went offline during a sibling's core-wide
> > + * selection. In this case, do a core-wide selection.
> > */
> > if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> > - rq->core->core_pick_seq != rq->core_sched_seq) {
> > + rq->core->core_pick_seq != rq->core_sched_seq &&
> > + !rq->core_pick) {
> Should this check be reversed? I mean, we should enter the fastpath if
> rq->core_pick is set, right?

Done. Sorry my testing did not catch it, but it eventually caused a problem
after several hours of the stress test so I'd have eventually caught it.

> Another unrelated, but related note :-)
> Besides this, I think we need to retain on more change from the previous
> patch. We would need to make core_pick_seq per sibling instead of per
> core. Having it per core might lead to unfairness. For eg: When a cpu
> sees that its sibling's core_pick is the one which is already running, it
> will not send IPI. but core_pick remains set and core->core_pick_seq is
> incremented. Now if the sibling is preempted due to a high priority task

Then don't keep the core_pick set then. If you don't send it IPI and if
core_pick is already running, then NULL it already. I don't know why we add
to more corner cases by making assumptions. We have enough open issues that
are not hotplug related. Here's my suggestion :

1. Keep the ideas consistent, forget about the exact code currently written
and just understand the pick_seq is for siblings knowing that something was
picked for the whole core. So if their pick_seq != sched_seq, then they have
to pick what was selected.

2. If core_pick should be NULL, then NULL it in some path. If you keep some
core_pick and you increment pick_seq, then you are automatically asking the
sibling to pick that task up then next time it enters schedule(). See if [1]
will work?

Note that, we have added logic in this patch that does a full selection if
rq->core_pick == NULL.

> or its time slice expired, it enters schedule. But it goes to fast path and
> selects the running task there by starving the high priority task. Having
> the core_pick_seq per sibling will avoid this. It might also help in some
> hotplug corner cases as well.

That can be a separate patch IMHO. It has nothing to do with
stability/crashing of concurrent and rather infrequent CPU hotplug
operations.

Also, Peter said pick_seq is for core-wide picking. If you want to add
another semantic, then maybe add another counter which has a separate
meaning and justify why you are adding it.

thanks,

- Joel

[1]
---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7728ca7f6bb2..7a03b609e3b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4793,6 +4793,8 @@ next_class:;
 
                 if (rq_i->curr != rq_i->core_pick)
                         resched_curr(rq_i);
+                else
+                        rq_i->core_pick = NULL;
 
                 /* Did we break L1TF mitigation requirements? */
                 WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));

2020-09-01 20:04:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

Joel,

On Tue, Sep 01 2020 at 12:50, Joel Fernandes wrote:
> On Tue, Sep 01, 2020 at 05:54:53PM +0200, Thomas Gleixner wrote:
>> On Fri, Aug 28 2020 at 15:51, Julien Desfossez wrote:
>> > /*
>> > - * Disable interrupts and reevaluate the work flags as they
>> > - * might have changed while interrupts and preemption was
>> > - * enabled above.
>> > + * Reevaluate the work flags as they might have changed
>> > + * while interrupts and preemption were enabled.
>>
>> What enables preemption and interrupts? Can you pretty please write
>> comments which explain what's going on.
>
> Yes, sorry. So, sched_core_unsafe_exit_wait() will spin with IRQs enabled and
> preemption disabled. I did it this way to get past stop_machine() issues
> where you were unhappy with us spinning in IRQ disabled region.

So the comment is even more nonsensical :)

>> > - if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>> > - ti_work = exit_to_user_mode_loop(regs, ti_work);
>> > + ti_work = exit_to_user_mode_loop(regs);
>>
>> Why has the above loop to be invoked unconditionally even when that core
>> scheduling muck is disabled? Just to make all return to user paths more
>> expensive unconditionally, right?
>
> If you see the above loop, we are calling exit_to_user_mode_work()
> conditionally by calling exit_to_user_mode_work_pending() which does the same
> thing.

It does the same thing technically, but the fastpath, i.e. no work to
do, is no longer straightforward. Just look at the resulting ASM
before and after. It's awful.

> So we are still conditionally doing the usual exit_to_user_mode work, its
> just that now we have to unconditionally invoke the 'loop' anyway.

No.

> The reason for that is, the loop can switch into another thread, so we
> have to do unsafe_exit() for the old thread, and unsafe_enter() for
> the new one while handling the tif work properly. We could get
> migrated to another CPU in this loop itself, AFAICS. So the
> unsafe_enter() / unsafe_exit() calls could also happen on different
> CPUs.

That's fine. It still does not justify making everything slower even if
that 'pretend that HT is secure' thing is disabled.

Something like the below should be sufficient to do what you want
while restricting the wreckage to the 'pretend ht is secure' case.

The generated code for the CONFIG_PRETEND_HT_SECURE=n case is the same
as without the patch. With CONFIG_PRETEND_HT_SECURE=y the impact is
exactly two NOP-ed out jumps if the muck is not enabled on the command
line, which should be the default behaviour.

Thanks,

tglx

---
--- /dev/null
+++ b/include/linux/pretend_ht_secure.h
@@ -0,0 +1,21 @@
+#ifndef _LINUX_PRETEND_HT_SECURE_H
+#define _LINUX_PRETEND_HT_SECURE_H
+
+#ifdef CONFIG_PRETEND_HT_SECURE
+static inline void enter_from_user_ht_sucks(void)
+{
+ if (static_branch_unlikely(&pretend_ht_secure_key))
+ enter_from_user_pretend_ht_is_secure();
+}
+
+static inline void exit_to_user_ht_sucks(void)
+{
+ if (static_branch_unlikely(&pretend_ht_secure_key))
+ exit_to_user_pretend_ht_is_secure();
+}
+#else
+static inline void enter_from_user_ht_sucks(void) { }
+static inline void exit_to_user_ht_sucks(void) { }
+#endif
+
+#endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -17,6 +17,7 @@
* 1) Tell lockdep that interrupts are disabled
* 2) Invoke context tracking if enabled to reactivate RCU
* 3) Trace interrupts off state
+ * 4) Pretend that HT is secure
*/
static __always_inline void enter_from_user_mode(struct pt_regs *regs)
{
@@ -28,6 +29,7 @@ static __always_inline void enter_from_u

instrumentation_begin();
trace_hardirqs_off_finish();
+ enter_from_user_ht_sucks();
instrumentation_end();
}

@@ -111,6 +113,12 @@ static __always_inline void exit_to_user
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal(struct pt_regs *regs) { }

+static inline unsigned long exit_to_user_get_work(void)
+{
+ exit_to_user_ht_sucks();
+ return READ_ONCE(current_thread_info()->flags);
+}
+
static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -149,7 +157,7 @@ static unsigned long exit_to_user_mode_l
* enabled above.
*/
local_irq_disable_exit_to_user();
- ti_work = READ_ONCE(current_thread_info()->flags);
+ ti_work = exit_to_user_get_work();
}

/* Return the latest work state for arch_exit_to_user_mode() */
@@ -158,7 +166,7 @@ static unsigned long exit_to_user_mode_l

static void exit_to_user_mode_prepare(struct pt_regs *regs)
{
- unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+ unsigned long ti_work = exit_to_user_get_work();

lockdep_assert_irqs_disabled();

2020-09-01 21:26:21

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

Hi Joel,

On 9/1/20 1:30 PM, Joel Fernandes wrote:
>> I think we can come here when hotplug thread is scheduled during online, but
>> mask is not yet updated. Probably can add it with this comment as well.
>>
> I don't see how that is possible. Because the cpuhp threads run during the
> CPU onlining process, the boot thread for the CPU coming online would have
> already updated the mask.
Sorry my mistake. I got confused with the online state ordering.

>> Another unrelated, but related note :-)
>> Besides this, I think we need to retain on more change from the previous
>> patch. We would need to make core_pick_seq per sibling instead of per
>> core. Having it per core might lead to unfairness. For eg: When a cpu
>> sees that its sibling's core_pick is the one which is already running, it
>> will not send IPI. but core_pick remains set and core->core_pick_seq is
>> incremented. Now if the sibling is preempted due to a high priority task
> Then don't keep the core_pick set then. If you don't send it IPI and if
> core_pick is already running, then NULL it already. I don't know why we add
> to more corner cases by making assumptions. We have enough open issues that
> are not hotplug related. Here's my suggestion :
>
> 1. Keep the ideas consistent, forget about the exact code currently written
> and just understand the pick_seq is for siblings knowing that something was
> picked for the whole core. So if their pick_seq != sched_seq, then they have
> to pick what was selected.
I was trying to keep the ideas consistent. The requirement of core_pick
was to let the scheduled cpu know that a pick has been made, and the
initial idea was to have the counter core wide. But I found this gap
that the pick is not always core wide, and assuming it to be core wide can
cause fairness issues. So I was proposing the idea of changing it from
core wide to per sibling. In other words, I was trying to make sure that
core_pick, along with task_seq and sched_seq, serves its purpose of letting
a sibling know that a new task pick has been made for it. I cannot think of
a reason why core_pick should be core wide. I might be missing something.

> 2. If core_pick should be NULL, then NULL it in some path. If you keep some
> core_pick and you increment pick_seq, then you are automatically asking the
> sibling to pick that task up then next time it enters schedule(). See if [1]
> will work?
>
> Note that, we have added logic in this patch that does a full selection if
> rq->core_pick == NULL.

I agree, setting rq->core_pick = NULL is another way to solve this issue,
but I still feel it's semantically incorrect to think that a pick is core
wide when it could actually be for only a subset of siblings in the core.
If there is a valid reason for having core_pick be core wide, I completely
agree with the fix of resetting core_pick.

>> or its time slice expired, it enters schedule. But it goes to fast path and
>> selects the running task there by starving the high priority task. Having
>> the core_pick_seq per sibling will avoid this. It might also help in some
>> hotplug corner cases as well.
> That can be a separate patch IMHO. It has nothing to do with
> stability/crashing of concurrent and rather infrequent CPU hotplug
> operations.
Agree. Sorry for the confusion, my intention was to not have the logic in
this patch.

> Also, Peter said pick_seq is for core-wide picking. If you want to add
> another semantic, then maybe add another counter which has a separate
> meaning and justify why you are adding it.
I think just one counter is enough. Unless there is a need to keep the
counter to track the core wide pick, I feel it is worth changing the design
and making the counter serve its purpose. Will think through this and send
it as a separate patch if needed.

Thanks,
Vineeth

2020-09-02 01:12:44

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

On Tue, Sep 1, 2020 at 5:23 PM Vineeth Pillai
<[email protected]> wrote:
> > Also, Peter said pick_seq is for core-wide picking. If you want to add
> > another semantic, then maybe add another counter which has a separate
> > meaning and justify why you are adding it.
> I think just one counter is enough. Unless there is a need to keep the
> counter to track the core wide pick, I feel it is worth changing the design
> and making the counter serve its purpose. Will think through this and send
> it as a separate patch if needed.

Since you agree that it suffices to set core_pick to NULL if you don't
IPI a sibling, I don't see the need for your split pick_seq thing.
But let me know if I missed something.

thanks,

- Joel

2020-09-02 01:32:30

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

Hi Thomas,

On Tue, Sep 01, 2020 at 10:02:10PM +0200, Thomas Gleixner wrote:
[..]
> > The reason for that is, the loop can switch into another thread, so we
> > have to do unsafe_exit() for the old thread, and unsafe_enter() for
> > the new one while handling the tif work properly. We could get
> > migrated to another CPU in this loop itself, AFAICS. So the
> > unsafe_enter() / unsafe_exit() calls could also happen on different
> > CPUs.
>
> That's fine. It still does not justify making everything slower even if
> that 'pretend that HT is secure' thing is disabled.
>
> Something like the below should be sufficient to do what you want
> while restricting the wreckage to the 'pretend ht is secure' case.
>
> The generated code for the CONFIG_PRETEND_HT_SECURE=n case is the same

When you say 'pretend', did you mean 'make' ? The point of this patch is to
protect the kernel from the other hyperthread thus making HT secure for the
kernel contexts and not merely pretending.

> as without the patch. With CONFIG_PRETEND_HT_SECURE=y the impact is
> exactly two NOP-ed out jumps if the muck is not enabled on the command
> line, which should be the default behaviour.

I see where you're coming from, I'll try to rework it to be less intrusive
when core-scheduling is disabled. Some more comments below:

> Thanks,
>
> tglx
>
> ---
> --- /dev/null
> +++ b/include/linux/pretend_ht_secure.h
> @@ -0,0 +1,21 @@
> +#ifndef _LINUX_PRETEND_HT_SECURE_H
> +#define _LINUX_PRETEND_HT_SECURE_H
> +
> +#ifdef CONFIG_PRETEND_HT_SECURE
> +static inline void enter_from_user_ht_sucks(void)
> +{
> + if (static_branch_unlikely(&pretend_ht_secure_key))
> + enter_from_user_pretend_ht_is_secure();
> +}
> +
> +static inline void exit_to_user_ht_sucks(void)
> +{
> + if (static_branch_unlikely(&pretend_ht_secure_key))
> + exit_to_user_pretend_ht_is_secure();

We already have similar config and static keys for the core-scheduling
feature itself. Can we just make it depend on that?

Or, are you saying users may want 'core scheduling' enabled but may want to
leave out the kernel protection?

> +}
> +#else
> +static inline void enter_from_user_ht_sucks(void) { }
> +static inline void exit_to_user_ht_sucks(void) { }
> +#endif
> +
> +#endif
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -17,6 +17,7 @@
> * 1) Tell lockdep that interrupts are disabled
> * 2) Invoke context tracking if enabled to reactivate RCU
> * 3) Trace interrupts off state
> + * 4) Pretend that HT is secure
> */
> static __always_inline void enter_from_user_mode(struct pt_regs *regs)
> {
> @@ -28,6 +29,7 @@ static __always_inline void enter_from_u
>
> instrumentation_begin();
> trace_hardirqs_off_finish();
> + enter_from_user_ht_sucks();
> instrumentation_end();
> }
>
> @@ -111,6 +113,12 @@ static __always_inline void exit_to_user
> /* Workaround to allow gradual conversion of architecture code */
> void __weak arch_do_signal(struct pt_regs *regs) { }
>
> +static inline unsigned long exit_to_user_get_work(void)
> +{
> + exit_to_user_ht_sucks();

Ok, one issue with your patch is that it does not take care of the waiting logic.
sched_core_unsafe_exit_wait() needs to be called *after* all of the
exit_to_user_mode_work is processed. This is because
sched_core_unsafe_exit_wait() also checks for any new exit-to-usermode-work
that popped up while it is spinning and breaks out of its spin-till-safe loop
early. This is key to solving the stop-machine issue. If the stopper needs to
run, then the need-resched flag will be set and we break out of the spin and
redo the whole exit_to_user_mode_loop() as it should.
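
For illustration, one way to combine the static-key gating with the wait,
keeping the wait after the TIF work of each iteration (the key name is an
assumption; this is a sketch, not the reworked patch):

static inline unsigned long exit_to_user_get_work(void)
{
        unsigned long ti_work = READ_ONCE(current_thread_info()->flags);

        if (!static_branch_unlikely(&sched_core_protect_kernel_key))   /* assumed key */
                return ti_work;

        /* Spins until the core is safe; breaks out early on new TIF work. */
        sched_core_unsafe_exit_wait(EXIT_TO_USER_MODE_WORK);
        return READ_ONCE(current_thread_info()->flags);
}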

I agree with the need to make the ASM suck less if the feature is turned off
though, and I can try to cook something along those lines. Thanks for the idea!

thanks,

- Joel


> + return READ_ONCE(current_thread_info()->flags);
> +}
> +
> static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> unsigned long ti_work)
> {
> @@ -149,7 +157,7 @@ static unsigned long exit_to_user_mode_l
> * enabled above.
> */
> local_irq_disable_exit_to_user();
> - ti_work = READ_ONCE(current_thread_info()->flags);
> + ti_work = exit_to_user_get_work();
> }
>
> /* Return the latest work state for arch_exit_to_user_mode() */
> @@ -158,7 +166,7 @@ static unsigned long exit_to_user_mode_l
>
> static void exit_to_user_mode_prepare(struct pt_regs *regs)
> {
> - unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> + unsigned long ti_work = exit_to_user_get_work();
>
> lockdep_assert_irqs_disabled();
>

2020-09-02 07:10:19

by Pavankumar Kondeti

[permalink] [raw]
Subject: Re: [RFC PATCH v7 12/23] sched: Trivial forced-newidle balancer

On Fri, Aug 28, 2020 at 03:51:13PM -0400, Julien Desfossez wrote:
> /*
> * The static-key + stop-machine variable are needed such that:
> *
> @@ -4641,7 +4656,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> struct task_struct *next, *max = NULL;
> const struct sched_class *class;
> const struct cpumask *smt_mask;
> - int i, j, cpu;
> + int i, j, cpu, occ = 0;
> int smt_weight;
> bool need_sync;
>
> @@ -4750,6 +4765,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> goto done;
> }
>
> + if (!is_idle_task(p))
> + occ++;
> +
> rq_i->core_pick = p;
>
> /*
> @@ -4775,6 +4793,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> cpu_rq(j)->core_pick = NULL;
> }
> + occ = 1;
> goto again;
> } else {
> /*
> @@ -4820,6 +4839,8 @@ next_class:;
> if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> rq_i->core_forceidle = true;
>
> + rq_i->core_pick->core_occupation = occ;
> +
> if (i == cpu)
> continue;
>
> @@ -4837,6 +4858,113 @@ next_class:;
> return next;
> }
>
> +static bool try_steal_cookie(int this, int that)
> +{
> + struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> + struct task_struct *p;
> + unsigned long cookie;
> + bool success = false;
> +
> + local_irq_disable();
> + double_rq_lock(dst, src);
> +
> + cookie = dst->core->core_cookie;
> + if (!cookie)
> + goto unlock;
> +
> + if (dst->curr != dst->idle)
> + goto unlock;
> +
> + p = sched_core_find(src, cookie);
> + if (p == src->idle)
> + goto unlock;
> +
> + do {
> + if (p == src->core_pick || p == src->curr)
> + goto next;
> +
> + if (!cpumask_test_cpu(this, &p->cpus_mask))
> + goto next;
> +
> + if (p->core_occupation > dst->idle->core_occupation)
> + goto next;
> +
Can you please explain the rationale behind this check? If I understand
correctly, p->core_occupation is set in pick_next_task() to indicate
the number of cookie-matching (non-idle) tasks picked on this core.
It is not reset anywhere.

> + p->on_rq = TASK_ON_RQ_MIGRATING;
> + deactivate_task(src, p, 0);
> + set_task_cpu(p, this);
> + activate_task(dst, p, 0);
> + p->on_rq = TASK_ON_RQ_QUEUED;
> +
> + resched_curr(dst);
> +
> + success = true;
> + break;
> +
> +next:
> + p = sched_core_next(p, cookie);
> + } while (p);
> +
> +unlock:
> + double_rq_unlock(dst, src);
> + local_irq_enable();
> +
> + return success;
> +}

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-09-02 07:57:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

Joel,

On Tue, Sep 01 2020 at 21:29, Joel Fernandes wrote:
> On Tue, Sep 01, 2020 at 10:02:10PM +0200, Thomas Gleixner wrote:
>> The generated code for the CONFIG_PRETEND_HT_SECURE=n case is the same
>
> When you say 'pretend', did you mean 'make' ? The point of this patch is to
> protect the kernel from the other hyperthread thus making HT secure for the
> kernel contexts and not merely pretending.

I'm paranoid and don't trust HT at all. There is too much shared state.

>> --- /dev/null
>> +++ b/include/linux/pretend_ht_secure.h
>> @@ -0,0 +1,21 @@
>> +#ifndef _LINUX_PRETEND_HT_SECURE_H
>> +#define _LINUX_PRETEND_HT_SECURE_H
>> +
>> +#ifdef CONFIG_PRETEND_HT_SECURE
>> +static inline void enter_from_user_ht_sucks(void)
>> +{
>> + if (static_branch_unlikely(&pretend_ht_secure_key))
>> + enter_from_user_pretend_ht_is_secure();
>> +}
>> +
>> +static inline void exit_to_user_ht_sucks(void)
>> +{
>> + if (static_branch_unlikely(&pretend_ht_secure_key))
>> + exit_to_user_pretend_ht_is_secure();
>
> We already have similar config and static keys for the core-scheduling
> feature itself. Can we just make it depend on that?

Of course. This was just for illustration. :)

> Or, are you saying users may want 'core scheduling' enabled but may want to
> leave out the kernel protection?

Core scheduling per se without all the protection muck, i.e. a relaxed
version which tries to gang schedule threads of a process on a core if
feasible has advantages to some workloads.

>> @@ -111,6 +113,12 @@ static __always_inline void exit_to_user
>> /* Workaround to allow gradual conversion of architecture code */
>> void __weak arch_do_signal(struct pt_regs *regs) { }
>>
>> +static inline unsigned long exit_to_user_get_work(void)
>> +{
>> + exit_to_user_ht_sucks();
>
> Ok, one issue with your patch is it does not take care of the waiting logic.
> sched_core_unsafe_exit_wait() needs to be called *after* all of the
> exit_to_user_mode_work is processed. This is because
> sched_core_unsafe_exit_wait() also checks for any new exit-to-usermode-work
> that popped up while it is spinning and breaks out of its spin-till-safe loop
> early. This is key to solving the stop-machine issue. If the stopper needs to
> run, then the need-resched flag will be set and we break out of the spin and
> redo the whole exit_to_user_mode_loop() as it should.

And where is the problem?

syscall_entry()
    ...
    sys_foo()
        ....
        return 0;

    local_irq_disable();
    exit_to_user_mode_prepare()
        ti_work = exit_to_user_get_work()
        {
            if (ht_muck)
                syscall_exit_ht_muck() {
                    ....
                    while (wait) {
                        local_irq_enable();
                        while (wait) cpu_relax();
                        local_irq_disable();
                    }
                }
            return READ_ONCE(current_thread_info()->flags);
        }

        if (unlikely(ti_work & WORK))
            ti_work = exit_loop(ti_work)

                while (ti_work & MASK) {
                    local_irq_enable();
                    .....
                    local_irq_disable();
                    ti_work = exit_to_user_get_work()
                    {
                        See above
                    }
                }

It covers both the 'no work' and the 'do work' exit path. If that's not
sufficient, then something is fundamentally wrong with your design.

Thanks,

tglx

2020-09-02 15:17:30

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

Hi Thomas,

On Wed, Sep 02, 2020 at 09:53:29AM +0200, Thomas Gleixner wrote:
[...]
> >> --- /dev/null
> >> +++ b/include/linux/pretend_ht_secure.h
> >> @@ -0,0 +1,21 @@
> >> +#ifndef _LINUX_PRETEND_HT_SECURE_H
> >> +#define _LINUX_PRETEND_HT_SECURE_H
> >> +
> >> +#ifdef CONFIG_PRETEND_HT_SECURE
> >> +static inline void enter_from_user_ht_sucks(void)
> >> +{
> >> + if (static_branch_unlikely(&pretend_ht_secure_key))
> >> + enter_from_user_pretend_ht_is_secure();
> >> +}
> >> +
> >> +static inline void exit_to_user_ht_sucks(void)
> >> +{
> >> + if (static_branch_unlikely(&pretend_ht_secure_key))
> >> + exit_to_user_pretend_ht_is_secure();
> >
> > We already have similar config and static keys for the core-scheduling
> > feature itself. Can we just make it depend on that?
>
> Of course. This was just for illustration. :)

Got it. :)

> > Or, are you saying users may want 'core scheduling' enabled but may want to
> > leave out the kernel protection?
>
> Core scheduling per se without all the protection muck, i.e. a relaxed
> version which tries to gang schedule threads of a process on a core if
> feasible has advantages to some workloads.

Sure. So I will make it depend on the existing core-scheduling
config/static-key, so the kernel protection is there when core scheduling is
enabled (i.e. both userspace and, with this patch, the kernel are protected).
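For illustration, a rough sketch of that direction (this is not the actual patch; the static key and the sched_core_unsafe_*() helper names are assumed from the rest of the series, and the exact signatures may differ):

#include <linux/jump_label.h>

#ifdef CONFIG_SCHED_CORE
/* Assumed to be the key already defined by the core-scheduling patches. */
DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);

static inline void sched_core_entry_hook(void)
{
        if (static_branch_unlikely(&__sched_core_enabled))
                sched_core_unsafe_enter();      /* mark this core unsafe */
}

static inline void sched_core_exit_hook(void)
{
        if (static_branch_unlikely(&__sched_core_enabled))
                sched_core_unsafe_exit_wait();  /* spin until the core is safe */
}
#else
static inline void sched_core_entry_hook(void) { }
static inline void sched_core_exit_hook(void) { }
#endif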

>
> >> @@ -111,6 +113,12 @@ static __always_inline void exit_to_user
> >> /* Workaround to allow gradual conversion of architecture code */
> >> void __weak arch_do_signal(struct pt_regs *regs) { }
> >>
> >> +static inline unsigned long exit_to_user_get_work(void)
> >> +{
> >> + exit_to_user_ht_sucks();
> >
> > Ok, one issue with your patch is it does not take care of the waiting logic.
> > sched_core_unsafe_exit_wait() needs to be called *after* all of the
> > exit_to_user_mode_work is processed. This is because
> > sched_core_unsafe_exit_wait() also checks for any new exit-to-usermode-work
> > that popped up while it is spinning and breaks out of its spin-till-safe loop
> > early. This is key to solving the stop-machine issue. If the stopper needs to
> > run, then the need-resched flag will be set and we break out of the spin and
> > redo the whole exit_to_user_mode_loop() as it should.
>
> And where is the problem?
>
> syscall_entry()
>     ...
>     sys_foo()
>         ....
>         return 0;
>
>     local_irq_disable();
>     exit_to_user_mode_prepare()
>         ti_work = exit_to_user_get_work()
>         {
>             if (ht_muck)
>                 syscall_exit_ht_muck() {
>                     ....
>                     while (wait) {
>                         local_irq_enable();
>                         while (wait) cpu_relax();
>                         local_irq_disable();
>                     }
>                 }
>             return READ_ONCE(current_thread_info()->flags);
>         }
>
>         if (unlikely(ti_work & WORK))
>             ti_work = exit_loop(ti_work)
>
>                 while (ti_work & MASK) {
>                     local_irq_enable();
>                     .....
>                     local_irq_disable();
>                     ti_work = exit_to_user_get_work()
>                     {
>                         See above
>                     }
>                 }
>
> It covers both the 'no work' and the 'do work' exit path. If that's not
> sufficient, then something is fundamentally wrong with your design.

Yes, you are right, I got confused by your previous patch. This works too
and is exactly what my design intended. I will do it this way then. Thank you, Thomas!

- Joel

2020-09-02 16:59:08

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

On Wed, 2020-09-02 at 09:53 +0200, Thomas Gleixner wrote:
> On Tue, Sep 01 2020 at 21:29, Joel Fernandes wrote:
> > On Tue, Sep 01, 2020 at 10:02:10PM +0200, Thomas Gleixner wrote:
> > >
> > Or, are you saying users may want 'core scheduling' enabled but may
> > want to
> > leave out the kernel protection?
>
> Core scheduling per se without all the protection muck, i.e. a
> relaxed
> version which tries to gang schedule threads of a process on a core
> if
> feasible has advantages to some workloads.
>
Indeed! For at least two reasons, IMO:

1) what Thomas is saying already. I.e., even on a CPU which has HT but
is not affected by any of the (known!) speculation issues, one may want
to use Core Scheduling _as_a_feature_. For instance, for avoiding
threads from different processes, or vCPUs from different VMs, sharing
cores (e.g., for better managing their behavior/performance, or for
improved fairness of billing/accounting). And in this case, this
mechanism for protecting the kernel from the userspace on the other
thread may not be necessary or interesting;

2) protection of the kernel from the other thread running in userspace
may be achieved in different ways. This is one, sure. ASI will probably
be another. Hence if/when we'll have both, this and ASI, it would be
cool to be able to configure the system in such a way that there is
only one active, to avoid paying the price of both! :-)

Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-09-03 04:36:56

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

On Wed, Sep 2, 2020 at 12:57 PM Dario Faggioli <[email protected]> wrote:
>
> On Wed, 2020-09-02 at 09:53 +0200, Thomas Gleixner wrote:
> > On Tue, Sep 01 2020 at 21:29, Joel Fernandes wrote:
> > > On Tue, Sep 01, 2020 at 10:02:10PM +0200, Thomas Gleixner wrote:
> > > >
> > > Or, are you saying users may want 'core scheduling' enabled but may
> > > want to
> > > leave out the kernel protection?
> >
> > Core scheduling per se without all the protection muck, i.e. a
> > relaxed
> > version which tries to gang schedule threads of a process on a core
> > if
> > feasible has advantages to some workloads.
> >
> Indeed! For at least two reasons, IMO:
>
> 1) what Thomas is saying already. I.e., even on a CPU which has HT but
> is not affected by any of the (known!) speculation issues, one may want
> to use Core Scheduling _as_a_feature_. For instance, for avoiding
> threads from different processes, or vCPUs from different VMs, sharing
> cores (e.g., for better managing their behavior/performance, or for
> improved fairness of billing/accounting). And in this case, this
> mechanism for protecting the kernel from the userspace on the other
> thread may not be necessary or interesting;

Agreed. So I should really make this configurable and put it behind a
sysctl then. I'll do that.

> 2) protection of the kernel from the other thread running in userspace
> may be achieved in different ways. This is one, sure. ASI will probably
> be another. Hence if/when we'll have both, this and ASI, it would be
> cool to be able to configure the system in such a way that there is
> only one active, to avoid paying the price of both! :-)

Actually, no. Part of ASI will involve exactly what this patch does -
IPI-pausing siblings but ASI does so when they have no choice but to
switch away from the "limited kernel" mapping, into the full host
kernel mapping. I am not sure if they have yet implemented that part
but they do talk of it in [1] and in their pretty LPC slides. It is
just that ASI tries to avoid that scenario of kicking all siblings out
of guest mode. So, maybe this patch can be a stepping stone to ASI.
At least I got the entry hooks right, and the algorithm is efficient
IMO (useless IPIs are avoided). ASI can then come in and avoid
sending IPIs even more by doing their limited-kernel-mapping things if
needed. So, it does not need to be this vs ASI, both may be needed.

Why do you feel that ASI on its own offers some magical protection
that eliminates the need for this patch?

thanks,

- Joel

[1] The link https://lkml.org/lkml/2019/5/13/515 mentions "note that
kicking all sibling hyperthreads is not implemented in this serie"

2020-09-03 05:17:06

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC PATCH v7 06/23] bitops: Introduce find_next_or_bit

On 8/28/20 12:51 PM, Julien Desfossez wrote:
> +#ifndef find_next_or_bit
> +/**
> + * find_next_or_bit - find the next set bit in any memory regions
> + * @addr1: The first address to base the search on
> + * @addr2: The second address to base the search on
> + * @offset: The bitnumber to start searching at

preferably bit number

> + * @size: The bitmap size in bits
> + *
> + * Returns the bit number for the next set bit

* Return: the bit number for the next set bit

for kernel-doc syntax.


> + * If no bits are set, returns @size.
> + */
> +extern unsigned long find_next_or_bit(const unsigned long *addr1,
> + const unsigned long *addr2, unsigned long size,
> + unsigned long offset);
> +#endif
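For reference, a tiny user-space model of the documented semantics (this is not the kernel implementation, just an executable restatement of the comment above):

#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/*
 * Return the bit number of the next bit set in (addr1 | addr2), starting
 * the search at 'offset'; return 'size' if no further bit is set.
 */
static unsigned long find_next_or_bit(const unsigned long *addr1,
                                      const unsigned long *addr2,
                                      unsigned long size,
                                      unsigned long offset)
{
        unsigned long i;

        for (i = offset; i < size; i++) {
                unsigned long word = addr1[i / BITS_PER_LONG] |
                                     addr2[i / BITS_PER_LONG];

                if (word & (1UL << (i % BITS_PER_LONG)))
                        return i;
        }
        return size;
}

int main(void)
{
        unsigned long a[1] = { 0x01 };  /* bit 0 set */
        unsigned long b[1] = { 0x10 };  /* bit 4 set */

        /* Prints 4: the next set bit in (a | b) at or after offset 1. */
        printf("%lu\n", find_next_or_bit(a, b, BITS_PER_LONG, 1));
        return 0;
}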


thanks.
--
~Randy

2020-09-03 14:37:50

by Dario Faggioli

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

On Thu, 2020-09-03 at 00:34 -0400, Joel Fernandes wrote:
> On Wed, Sep 2, 2020 at 12:57 PM Dario Faggioli <[email protected]>
> wrote:
> > 2) protection of the kernel from the other thread running in
> > userspace
> > may be achieved in different ways. This is one, sure. ASI will
> > probably
> > be another. Hence if/when we'll have both, this and ASI, it would
> > be
> > cool to be able to configure the system in such a way that there is
> > only one active, to avoid paying the price of both! :-)
>
> Actually, no. Part of ASI will involve exactly what this patch does -
> IPI-pausing siblings but ASI does so when they have no choice but to
> switch away from the "limited kernel" mapping, into the full host
> kernel mapping. I am not sure if they have yet implemented that part
> but they do talk of it in [1] and in their pretty LPC slides. It is
> just that ASI tries to avoid that scenario of kicking all siblings
> out
> of guest mode. So, maybe this patch can be a stepping stone to ASI.
>
Ah, sure! I mean, it of course depends on how things get mixed
together, which we still don't know. :-)

I know that ASI wants to do this very same thing some times. I just
wanted to say that we need/want only one mechanism for doing that, in
place at any given time.

If that mechanism turns out to be this one, and ASI then uses it (e.g., in a
scenario where for whatever reason one wants ASI with the kicking but
not Core Scheduling), then great, I agree, and everything does make
sense.

> Why do you feel that ASI on its own offers some magical protection
> that eliminates the need for this patch?
>
I don't. I was just trying to think of different use cases, how they
mix and match, etc. But as you're suggesting (and as Thomas has shown
very clearly in another reply to this email), the thing that makes the
most sense is to have one mechanism for the kicking, used by anyone
that needs it. :-)

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-09-03 14:51:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

On Thu, Sep 03 2020 at 00:34, Joel Fernandes wrote:
> On Wed, Sep 2, 2020 at 12:57 PM Dario Faggioli <[email protected]> wrote:
>> 2) protection of the kernel from the other thread running in userspace
>> may be achieved in different ways. This is one, sure. ASI will probably
>> be another. Hence if/when we'll have both, this and ASI, it would be
>> cool to be able to configure the system in such a way that there is
>> only one active, to avoid paying the price of both! :-)
>
> Actually, no. Part of ASI will involve exactly what this patch does -
> IPI-pausing siblings but ASI does so when they have no choice but to
> switch away from the "limited kernel" mapping, into the full host
> kernel mapping. I am not sure if they have yet implemented that part
> but they do talk of it in [1] and in their pretty LPC slides. It is
> just that ASI tries to avoid that scenario of kicking all siblings out
> of guest mode. So, maybe this patch can be a stepping stone to ASI.
> At least I got the entry hooks right, and the algorithm is efficient
> IMO (useless IPIs are avoided). ASI can then come in and avoid
> sending IPIs even more by doing their limited-kernel-mapping things if
> needed. So, it does not need to be this vs ASI, both may be needed.

Right. There are different parts which are separate:

1) Core scheduling as a best effort feature (performance for certain use
cases)

2) Enforced core scheduling (utilizes #1 basics)

3) ASI

4) Kick sibling out of guest/host and wait mechanics

#1, #2, #3 can be used stand alone. #4 is a utility

Then you get combos:

A) #2 + #4:

core wide protection. i.e. what this series tries to achieve. #3
triggers the kick at the low level VMEXIT or entry from user mode
boundary. The wait happens at the same level

B) #3 + #4:

ASI plus kicking the sibling/wait mechanics independent of what's
scheduled. #3 triggers the kick at the ASI switch to full host
mapping boundary and the wait is probably the same as in #A

C) #2 + #3 + #4:

The full concert, but trigger/wait wise the same as #B

So we really want to make at least #4 an independent utility.

Thanks,

tglx





2020-09-03 15:26:23

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode



On 9/3/20 12:34 AM, Joel Fernandes wrote:
>>
>> Indeed! For at least two reasons, IMO:
>>
>> 1) what Thomas is saying already. I.e., even on a CPU which has HT but
>> is not affected by any of the (known!) speculation issues, one may want
>> to use Core Scheduling _as_a_feature_. For instance, for avoiding
>> threads from different processes, or vCPUs from different VMs, sharing
>> cores (e.g., for better managing their behavior/performance, or for
>> improved fairness of billing/accounting). And in this case, this
>> mechanism for protecting the kernel from the userspace on the other
>> thread may not be necessary or interesting;
> Agreed. So then I should really make this configurable and behind a
> sysctl then. I'll do that.
We already have a patch to wrap this feature in a build-time and
boot-time option:
https://lwn.net/ml/linux-kernel/9cd9abad06ad8c3f35228afd07c74c7d9533c412.1598643276.git.jdesfossez@digitalocean.com/

I could not come up with a safe way to make it a runtime tunable at the time
of posting this series, but I think it should also be possible.

Thanks,
Vineeth

2020-09-03 20:26:20

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

On Thu, Sep 3, 2020 at 9:43 AM Dario Faggioli <[email protected]> wrote:
>
> On Thu, 2020-09-03 at 00:34 -0400, Joel Fernandes wrote:
> > On Wed, Sep 2, 2020 at 12:57 PM Dario Faggioli <[email protected]>
> > wrote:
> > > 2) protection of the kernel from the other thread running in
> > > userspace
> > > may be achieved in different ways. This is one, sure. ASI will
> > > probably
> > > be another. Hence if/when we'll have both, this and ASI, it would
> > > be
> > > cool to be able to configure the system in such a way that there is
> > > only one active, to avoid paying the price of both! :-)
> >
> > Actually, no. Part of ASI will involve exactly what this patch does -
> > IPI-pausing siblings but ASI does so when they have no choice but to
> > switch away from the "limited kernel" mapping, into the full host
> > kernel mapping. I am not sure if they have yet implemented that part
> > but they do talk of it in [1] and in their pretty LPC slides. It is
> > just that ASI tries to avoid that scenario of kicking all siblings
> > out
> > of guest mode. So, maybe this patch can be a stepping stone to ASI.
> >
> Ah, sure! I mean, it of course depends on how things get mixed
> together, which we still don't know. :-)
>
> I know that ASI wants to do this very same thing some times. I just
> wanted to say that we need/want only one mechanism for doing that, in
> place at any given time.
>
> If that mechanism will be this one, and ASI then uses it (e.g., in a
> scenario where for whatever reason one wants ASI with the kicking but
> not Core Scheduling), then great, I agree, and everything does makes
> sense.

Agreed, with some minor modifications, it should be possible.

> > Why do you feel that ASI on its own offers some magical protection
> > that eliminates the need for this patch?
> >
> I don't. I was just trying to think of different use cases, how they
> mix and match, etc. But as you're suggesting (and as Thomas has shown
> very clearly in another reply to this email), the thing that makes the
> most sense is to have one mechanism for the kicking, used by anyone
> that needs it. :-)

Makes sense. Thanks,

- Joel

2020-09-03 20:31:52

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 17/23] kernel/entry: Add support for core-wide protection of kernel-mode

On Thu, Sep 3, 2020 at 9:20 AM Thomas Gleixner <[email protected]> wrote:
>
> On Thu, Sep 03 2020 at 00:34, Joel Fernandes wrote:
> > On Wed, Sep 2, 2020 at 12:57 PM Dario Faggioli <[email protected]> wrote:
> >> 2) protection of the kernel from the other thread running in userspace
> >> may be achieved in different ways. This is one, sure. ASI will probably
> >> be another. Hence if/when we'll have both, this and ASI, it would be
> >> cool to be able to configure the system in such a way that there is
> >> only one active, to avoid paying the price of both! :-)
> >
> > Actually, no. Part of ASI will involve exactly what this patch does -
> > IPI-pausing siblings but ASI does so when they have no choice but to
> > switch away from the "limited kernel" mapping, into the full host
> > kernel mapping. I am not sure if they have yet implemented that part
> > but they do talk of it in [1] and in their pretty LPC slides. It is
> > just that ASI tries to avoid that scenario of kicking all siblings out
> > of guest mode. So, maybe this patch can be a stepping stone to ASI.
> > At least I got the entry hooks right, and the algorithm is efficient
> > IMO (useless IPIs are avoided). ASI can then come in and avoid
> > sending IPIs even more by doing their limited-kernel-mapping things if
> > needed. So, it does not need to be this vs ASI, both may be needed.
>
> Right. There are different parts which are seperate:
>
> 1) Core scheduling as a best effort feature (performance for certain use
> cases)
>
> 2) Enforced core scheduling (utilizes #1 basics)
>
> 3) ASI
>
> 4) Kick sibling out of guest/host and wait mechanics
>
> #1, #2, #3 can be used stand alone. #4 is a utility
>
> Then you get combos:
>
> A) #2 + #4:
>
> core wide protection. i.e. what this series tries to achieve. #3
> triggers the kick at the low level VMEXIT or entry from user mode
> boundary. The wait happens at the same level
>
> B) #3 + #4:
>
> ASI plus kicking the sibling/wait mechanics independent of what's
> scheduled. #3 triggers the kick at the ASI switch to full host
> mapping boundary and the wait is probably the same as in #A
>
> C) #2 + #3 + #4:
>
> The full concert, but trigger/wait wise the same as #B
>
> So we really want to make at least #4 an independent utility.

Agreed! Thanks for listing all the cases so clearly. I believe this
could be achieved by moving the calls to unsafe_enter() and
unsafe_exit() to the point where ASI decides it is time to enter the
unsafe kernel context. I will keep that in mind when sending the next
revision as well.

thanks,

- Joel

2020-09-15 20:12:35

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

On Fri, Aug 28, 2020 at 03:51:09PM -0400, Julien Desfossez wrote:
> From: Peter Zijlstra <[email protected]>
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
>
> During a CPU hotplug event, schedule would be called with the hotplugged
> CPU not in the cpumask. So use for_each_cpu(_wrap)_or to include the
> current cpu in the task pick loop.
>
> There are multiple loops in pick_next_task that iterate over CPUs in
> smt_mask. During a hotplug event, sibling could be removed from the
> smt_mask while pick_next_task is running. So we cannot trust the mask
> across the different loops. This can confuse the logic. Add a retry logic
> if smt_mask changes between the loops.
[...]
> +static struct task_struct *
> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
[...]
> + /* reset state */
> + rq->core->core_cookie = 0UL;
> + for_each_cpu_or(i, smt_mask, cpumask_of(cpu)) {
> + struct rq *rq_i = cpu_rq(i);
> +
> + rq_i->core_pick = NULL;
> +
> + if (rq_i->core_forceidle) {
> + need_sync = true;
> + rq_i->core_forceidle = false;
> + }
> +
> + if (i != cpu)
> + update_rq_clock(rq_i);
> + }
> +
> + /*
> + * Try and select tasks for each sibling in decending sched_class
> + * order.
> + */
> + for_each_class(class) {
> +again:
> + for_each_cpu_wrap_or(i, smt_mask, cpumask_of(cpu), cpu) {
> + struct rq *rq_i = cpu_rq(i);
> + struct task_struct *p;
> +
> + /*
> + * During hotplug online a sibling can be added in
> + * the smt_mask * while we are here. If so, we would
> + * need to restart selection by resetting all over.
> + */
> + if (unlikely(smt_weight != cpumask_weight(smt_mask)))
> + goto retry_select;
> +
> + if (rq_i->core_pick)
> + continue;
> +
> + /*
> + * If this sibling doesn't yet have a suitable task to
> + * run; ask for the most elegible task, given the
> + * highest priority task already selected for this
> + * core.
> + */
> + p = pick_task(rq_i, class, max);
> + if (!p) {
> + /*
> + * If there weren't no cookies; we don't need
> + * to bother with the other siblings.
> + */
> + if (i == cpu && !need_sync)
> + goto next_class;

We found a problem with this code: it force-idles RT tasks whenever one
sibling is running a tagged CFS task while the other one is running an
untagged RT task.

The diff below should fix it; still testing it:

---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc90eb50d481..26d05043b640 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4880,10 +4880,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
p = pick_task(rq_i, class, max);
if (!p) {
/*
- * If there weren't no cookies; we don't need
- * to bother with the other siblings.
+ * If the rest of the core is not running a
+ * tagged task, i.e. need_sync == 0, and the
+ * current CPU which called into the schedule()
+ * loop does not have any tasks for this class,
+ * skip selecting for other siblings since
+ * there's no point. We don't skip for RT/DL
+ * because that could make CFS force-idle RT.
*/
- if (i == cpu && !need_sync)
+ if (i == cpu && !need_sync && class == &fair_sched_class)
goto next_class;

continue;

2020-09-15 21:53:37

by Chris Hyser

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

On 8/28/20 3:51 PM, Julien Desfossez wrote:
> From: Aaron Lu <[email protected]>
>
> This patch provides a vruntime based way to compare two cfs task's
> priority, be it on the same cpu or different threads of the same core.
>
> When the two tasks are on the same CPU, we just need to find a common
> cfs_rq both sched_entities are on and then do the comparison.
>
> When the two tasks are on differen threads of the same core, each thread
> will choose the next task to run the usual way and then the root level
> sched entities which the two tasks belong to will be used to decide
> which task runs next core wide.
>
> An illustration for the cross CPU case:
>
> cpu0 cpu1
> / | \ / | \
> se1 se2 se3 se4 se5 se6
> / \ / \
> se21 se22 se61 se62
> (A) /
> se621
> (B)
>
> Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to
> run next and cpu1 has decided task B to run next. To compare priority
> of task A and B, we compare priority of se2 and se6. Whose vruntime is
> smaller, who wins.
>
> To make this work, the root level sched entities' vruntime of the two
> threads must be directly comparable. So one of the hyperthread's root
> cfs_rq's min_vruntime is chosen as the core wide one and all root level
> sched entities' vruntime is normalized against it.
>
> All sub cfs_rqs and sched entities are not interesting in cross cpu
> priority comparison as they will only participate in the usual cpu local
> schedule decision so no need to normalize their vruntimes.
>
> Signed-off-by: Aaron Lu <[email protected]>
> ---
> kernel/sched/core.c | 23 +++----
> kernel/sched/fair.c | 142 ++++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 3 +
> 3 files changed, 150 insertions(+), 18 deletions(-)


While investigating reported 'uperf' performance regressions between core sched v5 and core sched v6/v7, this patch
appears to be the first indicator of roughly a 40% perf loss in moving from v5 to v6 (and the accounting introduced
there is carried forward into this patch). Unfortunately, it is not the easiest thing to trace back, as the patchsets
are not directly comparable in this case and, moving to v7, the base kernel revision changed from 5.6 to 5.9.

The regressions were duplicated with the following setup: on a 24-core VM, create a cgroup and, in it, fire off the uperf
server and a client running 2 minutes' worth of 100 threads doing short TCP reads and writes. Do this with the cgroup
both core-sched tagged and untagged (technically tearing everything down and rebuilding it in between). It is short and
easy to do dozens of runs for statistical averaging.

Whatever the above version of this test might map to in real life, it presumably exacerbates the competition between
softirq threads and the core-sched-tagged threads that was observed in the reports.

Here are the uperf results for the various patchsets. Note that disabling SMT is better for these tests, which
presumably reflects the overall overhead of core scheduling, and that this overhead went from bad to really bad.
The primary focus of this email is to start to understand what happened within core sched itself.

patchset          smt=on/cs=off  smt=off     smt=on/cs=on
---------------------------------------------------------
v5-v5.6.y      :    1.78Gb/s     1.57Gb/s      1.07Gb/s
pre-v6-v5.6.y  :    1.75Gb/s     1.55Gb/s    822.16Mb/s
v6-5.7         :    1.87Gb/s     1.56Gb/s     561.6Mb/s
v6-5.7-hotplug :    1.75Gb/s     1.58Gb/s    438.21Mb/s
v7             :    1.80Gb/s     1.61Gb/s    440.44Mb/s

If you start by contrasting v5 and v6 on the same base 5.6 kernel, to rule out kernel-to-kernel version
differences, bisecting v6 pointed to the v6 version of this patch, i.e.:

"[RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison"

although all that really seems to be saying is that the new means of vruntime accounting (still present in v7) has
caused performance in the patchset to drop, which is plausible: different numbers, different scheduler behavior. A rough
attempt to verify this by backporting parts of the new accounting onto the v5 patchset shows that the initial switch
from old to new accounting dropped perf to about 791Mb/s, and the rest of the changes (as shown in the v6 numbers, though
not backported) only bring the v6 patchset back to 822.16Mb/s. That is not 100% proof, but it seems very suspicious.

This change in vruntime accounting seems to account for about 40% of the total v5-to-v7 perf loss, though clearly lots of
other changes have occurred in between. I'm certainly not saying there is a bug here, just that it is time to bring in
the original authors and start a general discussion.

-chrish

2020-09-16 20:59:03

by Chris Hyser

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

On 9/16/20 10:24 AM, chris hyser wrote:
> On 9/16/20 8:57 AM, Li, Aubrey wrote:
>>> Here are the uperf results of the various patchsets. Note, that disabling smt is better for these tests and that that
>>> presumably reflects the overall overhead of core scheduling which went from bad to really bad. The primary focus in
>>> this email is to start to understand what happened within core sched itself.
>>>
>>> patchset          smt=on/cs=off  smt=off    smt=on/cs=on
>>> --------------------------------------------------------
>>> v5-v5.6.y      :    1.78Gb/s     1.57Gb/s     1.07Gb/s
>>> pre-v6-v5.6.y  :    1.75Gb/s     1.55Gb/s    822.16Mb/s
>>> v6-5.7         :    1.87Gs/s     1.56Gb/s    561.6Mb/s
>>> v6-5.7-hotplug :    1.75Gb/s     1.58Gb/s    438.21Mb/s
>>> v7             :    1.80Gb/s     1.61Gb/s    440.44Mb/s
>>
>> I haven't had a chance to play with v7, but I got something different.
>>
>>    branch        smt=on/cs=on
>> coresched/v5-v5.6.y    1.09Gb/s
>> coresched/v6-v5.7.y    1.05Gb/s
>>
>> I attached my kernel config in case you want to make a comparison, or you
>> can send yours, I'll try to see I can replicate your result.
>
> I will give this config a try. One of the reports forwarded to me about the drop in uperf perf was an email from you I
> believe mentioning a 50% perf drop between v5 and v6?? I was actually setting out to duplicate your results. :-)

The first thing I did was to verify I built and tested the right bits; presumably I did, as I get the same numbers. I'm
still trying to tweak your config to get a root disk in my setup. Oh, one thing I missed in reading your first response:
I had 24 cores/48 CPUs. I think you had half that, though my guess is that it should have actually made the numbers even
worse. :-)

The following was forwarded to me; it was originally sent on Aug 3, by you I believe:

> We found uperf(in cgroup) throughput drops by ~50% with corescheduling.
>
> The problem is, uperf triggered a lot of softirq and offloaded softirq
> service to *ksoftirqd* thread.
>
> - default, ksoftirqd thread can run with uperf on the same core, we saw
> 100% CPU utilization.
> - coresched enabled, ksoftirqd's core cookie is different from uperf, so
> they can't run concurrently on the same core, we saw ~15% forced idle.
>
> I guess this kind of performance drop can be replicated by other similar
> (a lot of softirq activities) workloads.
>
> Currently core scheduler picks cookie-match tasks for all SMT siblings, does
> it make sense we add a policy to allow cookie-compatible task running together?
> For example, if a task is trusted(set by admin), it can work with kernel thread.
> The difference from corescheduling disabled is that we still have user to user
> isolation.
>
> Thanks,
> -Aubrey

Would you please elaborate on what this test was? In trying to duplicate this, I just kept adding uperf threads to my
setup until I started to see performance losses similar to what is reported above (and a second report about v7). Also,
I wasn't looking for absolute numbers per-se, just significant enough differences to try to track where the performance
went.

-chrish

2020-09-16 22:22:33

by Chris Hyser

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

On 9/16/20 8:57 AM, Li, Aubrey wrote:
>> Here are the uperf results of the various patchsets. Note, that disabling smt is better for these tests and that that presumably reflects the overall overhead of core scheduling which went from bad to really bad. The primary focus in this email is to start to understand what happened within core sched itself.
>>
>> patchset          smt=on/cs=off  smt=off    smt=on/cs=on
>> --------------------------------------------------------
>> v5-v5.6.y      :    1.78Gb/s     1.57Gb/s     1.07Gb/s
>> pre-v6-v5.6.y  :    1.75Gb/s     1.55Gb/s    822.16Mb/s
>> v6-5.7         :    1.87Gs/s     1.56Gb/s    561.6Mb/s
>> v6-5.7-hotplug :    1.75Gb/s     1.58Gb/s    438.21Mb/s
>> v7             :    1.80Gb/s     1.61Gb/s    440.44Mb/s
>
> I haven't had a chance to play with v7, but I got something different.
>
> branch smt=on/cs=on
> coresched/v5-v5.6.y 1.09Gb/s
> coresched/v6-v5.7.y 1.05Gb/s
>
> I attached my kernel config in case you want to make a comparison, or you
> can send yours, I'll try to see I can replicate your result.

I will give this config a try. One of the reports forwarded to me about the drop in uperf perf was an email from you I
believe mentioning a 50% perf drop between v5 and v6?? I was actually setting out to duplicate your results. :-)

-chrish


2020-09-17 01:18:41

by Li, Aubrey

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

On 2020/9/17 4:53, chris hyser wrote:
> On 9/16/20 10:24 AM, chris hyser wrote:
>> On 9/16/20 8:57 AM, Li, Aubrey wrote:
>>>> Here are the uperf results of the various patchsets. Note, that disabling smt is better for these tests and that that presumably reflects the overall overhead of core scheduling which went from bad to really bad. The primary focus in this email is to start to understand what happened within core sched itself.
>>>>
>>>> patchset          smt=on/cs=off  smt=off    smt=on/cs=on
>>>> --------------------------------------------------------
>>>> v5-v5.6.y      :    1.78Gb/s     1.57Gb/s     1.07Gb/s
>>>> pre-v6-v5.6.y  :    1.75Gb/s     1.55Gb/s    822.16Mb/s
>>>> v6-5.7         :    1.87Gs/s     1.56Gb/s    561.6Mb/s
>>>> v6-5.7-hotplug :    1.75Gb/s     1.58Gb/s    438.21Mb/s
>>>> v7             :    1.80Gb/s     1.61Gb/s    440.44Mb/s
>>>
>>> I haven't had a chance to play with v7, but I got something different.
>>>
>>>    branch        smt=on/cs=on
>>> coresched/v5-v5.6.y    1.09Gb/s
>>> coresched/v6-v5.7.y    1.05Gb/s
>>>
>>> I attached my kernel config in case you want to make a comparison, or you
>>> can send yours, I'll try to see I can replicate your result.
>>
>> I will give this config a try. One of the reports forwarded to me about the drop in uperf perf was an email from you I believe mentioning a 50% perf drop between v5 and v6?? I was actually setting out to duplicate your results. :-)
>
> The first thing I did was to verify I built and tested the right bits. Presumably as I get same numbers. I'm still trying to tweak your config to get a root disk in my setup. Oh, one thing I missed in reading your first response, I had 24 cores/48 cpus. I think you had half that, though my guess is that that should have actually made the numbers even worse. :-)
>
> The following was forwarded to me originally sent on Aug 3, by you I believe:
>
>> We found uperf(in cgroup) throughput drops by ~50% with corescheduling.
>>
>> The problem is, uperf triggered a lot of softirq and offloaded softirq
>> service to *ksoftirqd* thread.
>>
>> - default, ksoftirqd thread can run with uperf on the same core, we saw
>>   100% CPU utilization.
>> - coresched enabled, ksoftirqd's core cookie is different from uperf, so
>>   they can't run concurrently on the same core, we saw ~15% forced idle.
>>
>> I guess this kind of performance drop can be replicated by other similar
>> (a lot of softirq activities) workloads.
>>
>> Currently core scheduler picks cookie-match tasks for all SMT siblings, does
>> it make sense we add a policy to allow cookie-compatible task running together?
>> For example, if a task is trusted(set by admin), it can work with kernel thread.
>> The difference from corescheduling disabled is that we still have user to user
>> isolation.
>>
>> Thanks,
>> -Aubrey
>
> Would you please elaborate on what this test was? In trying to duplicate this, I just kept adding uperf threads to my setup until I started to see performance losses similar to what is reported above (and a second report about v7). Also, I wasn't looking for absolute numbers per-se, just significant enough differences to try to track where the performance went.
>

This test was smt-on/cs-on against smt-on/cs-off on the same core scheduling version;
we didn't find such a big regression between different versions as you encountered.

Thanks,
-Aubrey

2020-09-17 14:31:25

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

Hi Peter,

On 8/28/20 5:29 PM, Peter Zijlstra wrote:
>
> This is still a horrible patch..

I was working on your idea of using the lag as a tool to compare vruntime
across cpus:
https://lwn.net/ml/linux-kernel/[email protected]/
https://lwn.net/ml/linux-kernel/[email protected]/

Basically, I record the clock time when a sibling is force idled. And before
the comparison, if the sibling is still force idled, I take the weighted delta
of the time the sibling was force idled and then decrement it from the
vruntime. The lag is computed in pick_task_fair as we do the cross-cpu
comparison only after a task pick.

I think this can solve the SMTn issues since we are doing a one-sided
comparison.

I have done some initial testing and it looks good. I need to do some deep
analysis and check the fairness cases. I thought of sharing the prototype to
get feedback while I am doing further testing. Please have a look and let me
know your thoughts.

The git tree is here:
https://github.com/vineethrp/linux/tree/coresched/pre-v8-v5.9-rc-lagfix
It has  the review comments addressed since v8 was posted.

Thanks,
Vineeth

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cea9a63c2e7a..52b3e7b356e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -136,19 +136,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
        if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
                return !dl_time_before(a->dl.deadline, b->dl.deadline);

-       if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-               u64 vruntime = b->se.vruntime;
-
-               /*
-                * Normalize the vruntime if tasks are in different cpus.
-                */
-               if (task_cpu(a) != task_cpu(b)) {
-                       vruntime -= task_cfs_rq(b)->min_vruntime;
-                       vruntime += task_cfs_rq(a)->min_vruntime;
-               }
-
-               return !((s64)(a->se.vruntime - vruntime) <= 0);
-       }
+       if (pa == MAX_RT_PRIO + MAX_NICE)       /* fair */
+               return cfs_prio_less(a, b);

        return false;
 }
@@ -235,6 +224,13 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)

        rb_erase(&p->core_node, &rq->core_tree);
        RB_CLEAR_NODE(&p->core_node);
+
+    /*
+     * Reset last force idle time if this rq will be
+     * empty after this dequeue.
+     */
+    if (rq->nr_running == 1)
+        rq->core_fi_time = 0;
 }

 /*
@@ -310,8 +306,10 @@ static int __sched_core_stopper(void *data)
                         * cgroup tags. However, dying tasks could still be
                         * left in core queue. Flush them here.
                         */
-                       if (!enabled)
+                       if (!enabled) {
                                sched_core_flush(cpu);
+                               rq->core_fi_time = 0;
+            }

                        rq->core_enabled = enabled;
                }
@@ -5181,9 +5179,17 @@ next_class:;
                if (!rq_i->core_pick)
                        continue;

-               if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
-                   !rq_i->core->core_forceidle) {
-                       rq_i->core->core_forceidle = true;
+               if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running) {
+
+                       if (!rq_i->core->core_forceidle)
+                               rq_i->core->core_forceidle = true;
+
+                       /*
+                        * XXX: Should we record force idled time a bit later
+                        * on the pick_next_task of this sibling when it
+                        * forces itself idle?
+                        */
+                       rq_i->core_fi_time = rq_clock_task(rq_i);
+                       trace_printk("force idle(cpu=%d) TS=%llu\n", i, rq_i->core_fi_time);
                }

                rq_i->core_pick->core_occupation = occ;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eadc62930336..7f62a50e23e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -389,6 +389,12 @@ static inline struct sched_entity *parent_entity(struct sched_entity *se)
        return se->parent;
 }

+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+       return se->cfs_rq->tg == pse->cfs_rq->tg;
+}
+
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -453,6 +459,12 @@ static inline struct sched_entity *parent_entity(struct sched_entity *se)
        return NULL;
 }

+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+       return true;
+}
+
 static inline void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -6962,13 +6974,26 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 {
        struct cfs_rq *cfs_rq = &rq->cfs;
        struct sched_entity *se;
+       u64 lag = 0;

        if (!cfs_rq->nr_running)
                return NULL;

+       /*
+        * Calculate the lag caused due to force idle on this
+        * sibling. Since we do cross-cpu comparison of vruntime
+        * only after a task pick, it should be safe to compute
+        * the lag here.
+        */
+       if (rq->core_fi_time) {
+               lag = rq_clock_task(rq) - rq->core_fi_time;
+               if ((s64)lag < 0)
+                       lag = 0;
+       }
        do {
                struct sched_entity *curr = cfs_rq->curr;

+
                se = pick_next_entity(cfs_rq, NULL);

                if (curr) {
@@ -6977,11 +7002,15 @@ static struct task_struct *pick_task_fair(struct rq *rq)

                        if (!se || entity_before(curr, se))
                                se = curr;
+
+                       cfs_rq->core_lag += lag;
                }

                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

+       rq->core_fi_time = 0;
+
        return task_of(se);
 }
 #endif
@@ -10707,6 +10736,41 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
            __entity_slice_used(&curr->se))
                resched_curr(rq);
 }
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+       struct sched_entity *se_a = &a->se, *se_b = &b->se;
+       struct cfs_rq *cfs_rq_a, *cfs_rq_b;
+       u64 vruntime_a, vruntime_b;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+       while (!is_same_tg(se_a, se_b)) {
+               int se_a_depth = se_a->depth;
+               int se_b_depth = se_b->depth;
+
+               if (se_a_depth <= se_b_depth)
+                       se_b = parent_entity(se_b);
+               if (se_a_depth >= se_b_depth)
+                       se_a = parent_entity(se_a);
+       }
+#endif
+
+       cfs_rq_a = cfs_rq_of(se_a);
+       cfs_rq_b = cfs_rq_of(se_b);
+
+       vruntime_a = se_a->vruntime - cfs_rq_a->min_vruntime;
+       vruntime_b = se_b->vruntime - cfs_rq_b->min_vruntime;
+
+       trace_printk("(%s/%d;%Ld,%Lu) ?< (%s/%d;%Ld,%Lu)\n",
+                    a->comm, a->pid, vruntime_a, cfs_rq_a->core_lag,
+                    b->comm, b->pid, vruntime_b, cfs_rq_b->core_lag);
+       if (cfs_rq_a != cfs_rq_b) {
+               vruntime_a -= calc_delta_fair(cfs_rq_a->core_lag, &a->se);
+               vruntime_b -= calc_delta_fair(cfs_rq_b->core_lag, &b->se);
+       }
+
+       return !((s64)(vruntime_a - vruntime_b) <= 0);
+}
 #else
 static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5374ace195b9..a1c15edf8dd5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -607,6 +607,10 @@ struct cfs_rq {
        struct list_head        throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#ifdef CONFIG_SCHED_CORE
+       u64                     core_lag;
+#endif
 };

 static inline int rt_bandwidth_enabled(void)
@@ -1056,6 +1060,7 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
        /* per rq */
        struct rq               *core;
+       u64                     core_fi_time; /* Last forced idle TS */
        struct task_struct      *core_pick;
        unsigned int            core_enabled;
        unsigned int            core_sched_seq;

2020-09-17 20:42:00

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison


> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +       struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +       struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> +       u64 vruntime_a, vruntime_b;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       while (!is_same_tg(se_a, se_b)) {
> +               int se_a_depth = se_a->depth;
> +               int se_b_depth = se_b->depth;
> +
> +               if (se_a_depth <= se_b_depth)
> +                       se_b = parent_entity(se_b);
> +               if (se_a_depth >= se_b_depth)
> +                       se_a = parent_entity(se_a);
> +       }
> +#endif
> +
> +       cfs_rq_a = cfs_rq_of(se_a);
> +       cfs_rq_b = cfs_rq_of(se_b);
> +
> +       vruntime_a = se_a->vruntime - cfs_rq_a->min_vruntime;
> +       vruntime_b = se_b->vruntime - cfs_rq_b->min_vruntime;
> +
> +       trace_printk("(%s/%d;%Ld,%Lu) ?< (%s/%d;%Ld,%Lu)\n",
> +                    a->comm, a->pid, vruntime_a, cfs_rq_a->core_lag,
> +                    b->comm, b->pid, vruntime_b, cfs_rq_b->core_lag);
> +       if (cfs_rq_a != cfs_rq_b) {
> +               vruntime_a -= calc_delta_fair(cfs_rq_a->core_lag, &a->se);
> +               vruntime_b -= calc_delta_fair(cfs_rq_b->core_lag, &b->se);

This should be:
+               vruntime_a -= calc_delta_fair(cfs_rq_a->core_lag, se_a);
+               vruntime_b -= calc_delta_fair(cfs_rq_b->core_lag, se_b);


Thanks,
Vineeth
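
For context on why the entity passed in matters: calc_delta_fair() scales its
delta by the weight of the sched_entity it is given, so using the leaf task
entity (&a->se) instead of the matched ancestor (se_a) applies the wrong group
weight when the tasks sit in different cgroup hierarchies. A rough standalone
model of that scaling (TOY_NICE_0_LOAD and toy_calc_delta_fair() are made-up
stand-ins, not the kernel's fixed-point __calc_delta() math):

#include <stdint.h>

#define TOY_NICE_0_LOAD 1024UL	/* stand-in for the kernel's NICE_0_LOAD */

/* Weight-scale a lag delta before it is subtracted from a vruntime. */
static inline uint64_t toy_calc_delta_fair(uint64_t delta, unsigned long weight)
{
	return delta * TOY_NICE_0_LOAD / weight;
}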


2020-09-23 02:45:46

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

On Fri, Aug 28, 2020 at 11:29:27PM +0200, Peter Zijlstra wrote:
>
>
> This is still a horrible patch..

Hi Peter,
I wrote a new patch similar to this one and it fares much better in my tests.
It is based on Aaron's idea, but I do the sync only during force-idle, and not
during enqueue. Also, I yanked the whole 'core wide min_vruntime' crap. There
is a regressing test which improves quite a bit with my patch (results below):

Aaron, Vineeth, Chris, any other thoughts? This patch is based on Google's
4.19 device kernel, so it will require some massaging to apply to the
mainline/v7 series. I will provide an updated patch later based on the v7 series.

(Works only for SMT2, maybe we can generalize it more..)
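
To make the idea concrete, here is a minimal sketch of the sync, assuming
exactly two siblings (SMT2). The toy_* names are made-up stand-ins; the real
patch below also has to walk the rbtree and shift the queued entities'
vruntimes (coresched_adjust_vruntime()), which this sketch leaves out:

#include <stdint.h>

struct toy_rq {
	uint64_t cfs_min_vruntime;	/* root cfs_rq min_vruntime */
};

/* On entering force-idle, advance whichever sibling is behind. */
static void toy_sync_core_min_vruntime(struct toy_rq *a, struct toy_rq *b)
{
	/* vruntime only moves forward, so the slower side is brought up. */
	if (a->cfs_min_vruntime < b->cfs_min_vruntime)
		a->cfs_min_vruntime = b->cfs_min_vruntime;
	else
		b->cfs_min_vruntime = a->cfs_min_vruntime;
}
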
--------8<-----------

From: "Joel Fernandes (Google)" <[email protected]>
Subject: [PATCH] sched: Sync the min_vruntime of cores when the system enters
force-idle

This patch provides a vruntime-based way to compare the priority of two cfs
tasks, whether they are on the same cpu or on different threads of the same
core.

It is based on Aaron Lu's patch with some important differences. Namely,
the vruntime is sync'ed only when the CPU goes into force-idle. Also I removed
the notion of core-wide min_vruntime.

Also, I don't care how long a cpu in a core is force idled; I do the sync
whenever the force idle starts, essentially bringing both SMT siblings to a
common time base. After that point, selection can happen as usual.

When running an Android audio test, with patch the perf sched latency output:

-----------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |
-----------------------------------------------------------------------------------------------------------------
FinalizerDaemon:(2) | 23.969 ms | 969 | avg: 0.504 ms | max: 162.020 ms | max at: 1294.327339 s
HeapTaskDaemon:(3) | 2421.287 ms | 4733 | avg: 0.131 ms | max: 96.229 ms | max at: 1302.343366 s
adbd:(3) | 6.101 ms | 79 | avg: 1.105 ms | max: 84.923 ms | max at: 1294.431284 s

Without this patch and with Aubrey's initial patch (in v5 series), the max delay looks much better:

-----------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |
-----------------------------------------------------------------------------------------------------------------
HeapTaskDaemon:(2) | 2602.109 ms | 4025 | avg: 0.231 ms | max: 19.152 ms | max at: 522.903934 s
surfaceflinger:7478 | 18.994 ms | 1206 | avg: 0.189 ms | max: 17.375 ms | max at: 520.523061 s
ksoftirqd/3:30 | 0.093 ms | 5 | avg: 3.328 ms | max: 16.567 ms | max at: 522.903871 s

The comparison bits are the same as with Aaron's patch. When the two tasks
are on the same CPU, we just need to find a common cfs_rq both sched_entities
are on and then do the comparison.

When the two tasks are on different threads of the same core, the root
level sched_entities to which the two tasks belong will be used to do
the comparison.

An ugly illustration for the cross CPU case:

          cpu0                      cpu1
        /  |  \                   /  |  \
      se1 se2 se3               se4 se5 se6
          /  \                        /  \
       se21  se22                  se61  se62

Assume CPU0 and CPU1 are SMT siblings and task A's se is se21 while
task B's se is se61. To compare the priority of task A and B, we compare
the priority of se2 and se6: whichever has the smaller vruntime wins.

To make this work, the root level sched entities should share a common
cfs_rq min_vruntime, which I call the core cfs_rq min_vruntime.

When we adjust the min_vruntime of rq->core, we need to propagate
that down the tree so as to not cause starvation of existing tasks
based on their previous vruntime.

Inspired-by: Aaron Lu <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
Change-Id: Ida0083a2382a37f768030112ddf43bdf024a84b3
---
kernel/sched/core.c | 17 +++++++++++++-
kernel/sched/fair.c | 54 +++++++++++++++++++++++---------------------
kernel/sched/sched.h | 2 ++
3 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aeaf72acf063..d6f25fdd78ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4068,6 +4068,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
+ struct cfs_rq *cfs_rq[2];
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
@@ -4247,6 +4248,7 @@ next_class:;
* their task. This ensures there is no inter-sibling overlap between
* non-matching user state.
*/
+ need_sync = false;
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);

@@ -4255,8 +4257,10 @@ next_class:;

WARN_ON_ONCE(!rq_i->core_pick);

- if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+ if (is_idle_task(rq_i->core_pick) && rq_i->nr_running) {
rq_i->core_forceidle = true;
+ need_sync = true;
+ }

rq_i->core_pick->core_occupation = occ;

@@ -4270,6 +4274,17 @@ next_class:;
WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
}

+ /* Something got force idled, sync the vruntimes. Works only for SMT2. */
+ if (need_sync) {
+ j = 0;
+ for_each_cpu(i, smt_mask) {
+ struct rq *rq_i = cpu_rq(i);
+ cfs_rq[j++] = &rq_i->cfs;
+ }
+
+ if (j == 2)
+ update_core_cfs_rq_min_vruntime(cfs_rq[0], cfs_rq[1]);
+ }
done:
set_next_task(rq, next);
return next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ca3f6173c52..5bd220751d7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -479,13 +479,6 @@ static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)

static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
{
- if (!sched_core_enabled(rq_of(cfs_rq)))
- return cfs_rq->min_vruntime;
-
-#ifdef CONFIG_SCHED_CORE
- if (is_root_cfs_rq(cfs_rq))
- return core_cfs_rq(cfs_rq)->min_vruntime;
-#endif
return cfs_rq->min_vruntime;
}

@@ -497,40 +490,47 @@ static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
if (!cfs_rq)
return;

- cfs_rq->min_vruntime -= delta;
+ /*
+ * vruntime only goes forward. This can cause sleepers to be boosted
+ * but that's Ok for now.
+ */
+ cfs_rq->min_vruntime += delta;
rbtree_postorder_for_each_entry_safe(se, next,
&cfs_rq->tasks_timeline.rb_root, run_node) {
if (se->vruntime > delta)
- se->vruntime -= delta;
+ se->vruntime += delta;
if (se->my_q)
coresched_adjust_vruntime(se->my_q, delta);
}
}

-static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rqa,
+ struct cfs_rq *cfs_rqb)
{
- struct cfs_rq *cfs_rq_core;
-
- if (!sched_core_enabled(rq_of(cfs_rq)))
- return;
+ u64 delta;

- if (!is_root_cfs_rq(cfs_rq))
+ if (!is_root_cfs_rq(cfs_rqa) || !is_root_cfs_rq(cfs_rqb))
return;

- cfs_rq_core = core_cfs_rq(cfs_rq);
- if (cfs_rq_core != cfs_rq &&
- cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
- u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
- coresched_adjust_vruntime(cfs_rq_core, delta);
+ if (cfs_rqa->min_vruntime < cfs_rqb->min_vruntime) {
+ delta = cfs_rqb->min_vruntime - cfs_rqa->min_vruntime;
+ /* Bring a upto speed with b. */
+ coresched_adjust_vruntime(cfs_rqa, delta);
+ } else if (cfs_rqb->min_vruntime < cfs_rqa->min_vruntime) {
+ u64 delta = cfs_rqa->min_vruntime - cfs_rqb->min_vruntime;
+ /* Bring b upto speed with a. */
+ coresched_adjust_vruntime(cfs_rqb, delta);
}
}
#endif

bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
{
+ struct cfs_rq *cfs_rqa = &cpu_rq(task_cpu(a))->cfs;
+ struct cfs_rq *cfs_rqb = &cpu_rq(task_cpu(b))->cfs;
+ bool samecpu = task_cpu(a) == task_cpu(b);
struct sched_entity *sea = &a->se;
struct sched_entity *seb = &b->se;
- bool samecpu = task_cpu(a) == task_cpu(b);
struct task_struct *p;
s64 delta;

@@ -555,7 +555,13 @@ bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
sea = sea->parent;
while (seb->parent)
seb = seb->parent;
- delta = (s64)(sea->vruntime - seb->vruntime);
+
+ WARN_ON_ONCE(sea->vruntime < cfs_rqa->min_vruntime);
+ WARN_ON_ONCE(seb->vruntime < cfs_rqb->min_vruntime);
+
+ /* normalize vruntime WRT their rq's base */
+ delta = (s64)(sea->vruntime - seb->vruntime) +
+ (s64)(cfs_rqb->min_vruntime - cfs_rqa->min_vruntime);

out:
p = delta > 0 ? b : a;
@@ -624,10 +630,6 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
/* ensure we never gain time by being placed backwards. */
cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);

-#ifdef CONFIG_SCHED_CORE
- update_core_cfs_rq_min_vruntime(cfs_rq);
-#endif
-
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d09cfbd746e5..8559d0ee157f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2549,6 +2549,8 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
#endif

bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rqa,
+ struct cfs_rq *cfs_rqb);

#ifdef CONFIG_SMP
extern struct static_key_false sched_energy_present;
--
2.28.0.681.g6f77f65b4e-goog
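
A quick numeric check of the normalization added to cfs_prio_less() above,
with made-up values (a raw vruntime comparison across runqueues would pick
the wrong task here, since each runqueue has its own zero point):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Illustrative numbers only. */
	uint64_t sea_vr = 10100, rqa_min = 10000;	/* a is 100 past its base */
	uint64_t seb_vr =   700, rqb_min =   500;	/* b is 200 past its base */

	/*
	 * delta = (sea->vruntime - seb->vruntime)
	 *       + (cfs_rqb->min_vruntime - cfs_rqa->min_vruntime)
	 */
	int64_t delta = (int64_t)(sea_vr - seb_vr) + (int64_t)(rqb_min - rqa_min);

	/*
	 * The raw difference is +9400 (would prefer b); the normalized delta
	 * is -100, so a, which has actually run less, is preferred.
	 */
	printf("delta = %lld -> prefer %s\n", (long long)delta,
	       delta > 0 ? "b" : "a");
	return 0;
}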

2020-09-23 02:46:01

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

On Tue, Sep 22, 2020 at 09:46:22PM -0400, Joel Fernandes wrote:
> On Fri, Aug 28, 2020 at 11:29:27PM +0200, Peter Zijlstra wrote:
> >
> >
> > This is still a horrible patch..
>
> [... patch and test results snipped; quoted in full above ...]

I messed up the change log, just to clarify - the first result is without
patch (bad) and the second result is with patch (good).

thanks,

- Joel

2020-09-25 15:06:13

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v7 11/23] sched/fair: core wide cfs task priority comparison

On Tue, Sep 22, 2020 at 09:52:43PM -0400, Joel Fernandes wrote:
> On Tue, Sep 22, 2020 at 09:46:22PM -0400, Joel Fernandes wrote:
> > On Fri, Aug 28, 2020 at 11:29:27PM +0200, Peter Zijlstra wrote:
> > >
> > >
> > > This is still a horrible patch..
> >
> > [... patch and test results snipped; quoted in full above ...]
>
> I messed up the change log, just to clarify - the first result is without
> patch (bad) and the second result is with patch (good).

Here's another approach that might be worth considering; I was discussing it
with Vineeth: freeze the min_vruntime of the CPUs when the core enters
force-idle. I think this is similar to Peter's suggestion.

It is doing quite well in my real-world audio tests. This applies on top of
our ChromeOS 4.19 kernel tree [1] (which has the v5 series).

Any thoughts or review are most welcome, especially from Peter :)

[1] https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/chromeos-4.19
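
A minimal sketch of the snapshot idea, assuming SMT2. The toy_* names are
made-up stand-ins for what the patch below does with cfs_rq->min_vruntime_fi:

#include <stdint.h>

struct toy_cfs_rq {
	uint64_t min_vruntime;
	uint64_t min_vruntime_fi;	/* frozen when force-idle began */
};

/* Taken from the core-wide pick when a sibling becomes forced idle. */
static void toy_snapshot_on_force_idle(struct toy_cfs_rq *a, struct toy_cfs_rq *b)
{
	a->min_vruntime_fi = a->min_vruntime;
	b->min_vruntime_fi = b->min_vruntime;
}

/* Cross-sibling comparison, normalized against the frozen snapshots. */
static int toy_prio_less(uint64_t vr_a, const struct toy_cfs_rq *a,
			 uint64_t vr_b, const struct toy_cfs_rq *b)
{
	int64_t delta = (int64_t)(vr_a - vr_b) +
			(int64_t)(b->min_vruntime_fi - a->min_vruntime_fi);
	return delta > 0;	/* true: a has run more, prefer b */
}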

---8<-----------------------

From: Joel Fernandes <[email protected]>
Subject: [PATCH] Sync the min_vruntime of cores when the system enters force-idle

---
kernel/sched/core.c | 24 +++++++++++++++++-
kernel/sched/fair.c | 59 +++++++-------------------------------------
kernel/sched/sched.h | 1 +
3 files changed, 33 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 715391c418d8..4ab680319a6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4073,6 +4073,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
const struct cpumask *smt_mask;
int i, j, cpu, occ = 0;
bool need_sync = false;
+ bool fi_before = false;

cpu = cpu_of(rq);
if (cpu_is_offline(cpu))
@@ -4138,6 +4139,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
update_rq_clock(rq_i);
}

+ fi_before = need_sync;
+ if (!need_sync) {
+ for_each_cpu(i, smt_mask) {
+ struct rq *rq_i = cpu_rq(i);
+
+ /* Reset the snapshot if core is no longer in force-idle. */
+ rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+ }
+ }
+
/*
* Try and select tasks for each sibling in decending sched_class
* order.
@@ -4247,6 +4258,7 @@ next_class:;
* their task. This ensures there is no inter-sibling overlap between
* non-matching user state.
*/
+ need_sync = false;
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);

@@ -4255,8 +4267,10 @@ next_class:;

WARN_ON_ONCE(!rq_i->core_pick);

- if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+ if (is_idle_task(rq_i->core_pick) && rq_i->nr_running) {
rq_i->core_forceidle = true;
+ need_sync = true;
+ }

rq_i->core_pick->core_occupation = occ;

@@ -4270,6 +4284,14 @@ next_class:;
WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
}

+ if (!fi_before && need_sync) {
+ for_each_cpu(i, smt_mask) {
+ struct rq *rq_i = cpu_rq(i);
+
+ /* Snapshot if core is in force-idle. */
+ rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+ }
+ }
done:
set_next_task(rq, next);
return next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23d032ab62d8..3d7c822bb5fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -479,59 +479,17 @@ static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)

static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
{
- if (!sched_core_enabled(rq_of(cfs_rq)))
- return cfs_rq->min_vruntime;
-
-#ifdef CONFIG_SCHED_CORE
- if (is_root_cfs_rq(cfs_rq))
- return core_cfs_rq(cfs_rq)->min_vruntime;
-#endif
return cfs_rq->min_vruntime;
}

-#ifdef CONFIG_SCHED_CORE
-static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
-{
- struct sched_entity *se, *next;
-
- if (!cfs_rq)
- return;
-
- cfs_rq->min_vruntime -= delta;
- rbtree_postorder_for_each_entry_safe(se, next,
- &cfs_rq->tasks_timeline.rb_root, run_node) {
- if (se->vruntime > delta)
- se->vruntime -= delta;
- if (se->my_q)
- coresched_adjust_vruntime(se->my_q, delta);
- }
-}
-
-static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
-{
- struct cfs_rq *cfs_rq_core;
-
- if (!sched_core_enabled(rq_of(cfs_rq)))
- return;
-
- if (!is_root_cfs_rq(cfs_rq))
- return;
-
- cfs_rq_core = core_cfs_rq(cfs_rq);
- if (cfs_rq_core != cfs_rq &&
- cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
- u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
- coresched_adjust_vruntime(cfs_rq_core, delta);
- }
-}
-#endif
-
#ifdef CONFIG_FAIR_GROUP_SCHED
bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
{
+ bool samecpu = task_cpu(a) == task_cpu(b);
struct sched_entity *sea = &a->se;
struct sched_entity *seb = &b->se;
- bool samecpu = task_cpu(a) == task_cpu(b);
+ struct cfs_rq *cfs_rqa;
+ struct cfs_rq *cfs_rqb;
s64 delta;

if (samecpu) {
@@ -555,8 +513,13 @@ bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
sea = sea->parent;
while (seb->parent)
seb = seb->parent;
- delta = (s64)(sea->vruntime - seb->vruntime);

+ cfs_rqa = sea->cfs_rq;
+ cfs_rqb = seb->cfs_rq;
+
+ /* normalize vruntime WRT their rq's base */
+ delta = (s64)(sea->vruntime - seb->vruntime) +
+ (s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
out:
return delta > 0;
}
@@ -620,10 +583,6 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
/* ensure we never gain time by being placed backwards. */
cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);

-#ifdef CONFIG_SCHED_CORE
- update_core_cfs_rq_min_vruntime(cfs_rq);
-#endif
-
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d09cfbd746e5..45c8ce5c2333 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -499,6 +499,7 @@ struct cfs_rq {

u64 exec_clock;
u64 min_vruntime;
+ u64 min_vruntime_fi;
#ifndef CONFIG_64BIT
u64 min_vruntime_copy;
#endif
--
2.28.0.709.gb0816b6eb0-goog