2020-08-15 21:59:34

by Joel Fernandes

Subject: [PATCH RFC 00/12] Core-sched v6+: kernel protection and hotplug fixes

Hello!

This series is a continuation of the main core-sched v6 series [1] and adds
support for syscall and IRQ isolation from usermode processes and guests. It is
key to safely entering kernel mode on an HT while the other HT is in use by a
user or guest. The series also fixes CPU hotplug issues arising because of the
cpu_smt_mask changing while the next task is being picked. These hotplug fixes
are also needed for the kernel protection to work correctly.

The series is based on Thomas's x86/entry tree.

[1] https://lwn.net/Articles/824918/

Background:

Core-scheduling prevents hyperthreads in usermode from attacking each
other, but it does not do anything about one of the hyperthreads
entering the kernel for any reason. This leaves the door open for MDS
and L1TF attacks with concurrent execution sequences between
hyperthreads.

This series adds support for protecting all syscall and IRQ kernel mode entries
by tracking when any sibling in a core enters the kernel, and when all
the siblings have exited it. IPIs are sent to force siblings into the kernel.

Care is taken to avoid waiting in IRQ-disabled sections as Thomas suggested
thus avoiding stop_machine deadlocks. Every attempt is made to avoid
unnecessary IPIs.
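
At a high level, the mechanism works as in the hedged sketch below (the function
names match what the series adds; locking, nesting counters and TIF handling are
elided, so this is not the actual patch code):

/*
 * Sketch only: the protection added by this series, as seen from a
 * single CPU's syscall/IRQ path.
 */
static void kernel_protection_sketch(unsigned long ti_check)
{
        /* Outermost entry from user/guest: mark the core unsafe and IPI
         * siblings that are running tagged (cookied) tasks. */
        sched_core_unsafe_enter();

        /* ... kernel work: syscall, IRQ or softirq handling ... */

        /* Outermost return to user/guest: drop this CPU's contribution to
         * the core-wide unsafe count, then wait until no sibling is still
         * executing in the kernel (bailing out if TIF work shows up). */
        sched_core_unsafe_exit_wait(ti_check);
}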

Performance tests:
sysbench is used to test the performance of the patch series. Used an 8-CPU/4-core
VM and ran 2 sysbench tests in parallel. Each sysbench test runs 4 tasks:
sysbench --test=cpu --cpu-max-prime=100000 --num-threads=4 run

Compared the performance results for various combinations as below.
The metric below is 'events per second':

1. Coresched disabled
sysbench-1/sysbench-2 => 175.7/175.6

2. Coresched enabled, both sysbench tagged
sysbench-1/sysbench-2 => 168.8/165.6

3. Coresched enabled, sysbench-1 tagged and sysbench-2 untagged
sysbench-1/sysbench-2 => 96.4/176.9

4. smt off
sysbench-1/sysbench-2 => 97.9/98.8

When both sysbench instances are tagged, there is a perf drop of ~4%. In the
tagged/untagged case, the tagged one suffers because it always gets
stalled when the sibling enters the kernel. But this is no worse than SMT off.

A modified rcutorture was also used to heavily stress the kernel to make sure
there are no crashes or instability.

Joel Fernandes (Google) (5):
irq_work: Add support to detect if work is pending
entry/idle: Add a common function for activities during idle entry/exit
arch/x86: Add a new TIF flag for untrusted tasks
kernel/entry: Add support for core-wide protection of kernel-mode
entry/idle: Enter and exit kernel protection during idle entry and
exit

Vineeth Pillai (7):
entry/kvm: Protect the kernel when entering from guest
bitops: Introduce find_next_or_bit
cpumask: Introduce a new iterator for_each_cpu_wrap_or
sched/coresched: Use for_each_cpu(_wrap)_or for pick_next_task
sched/coresched: Make core_pick_seq per run-queue
sched/coresched: Check for dynamic changes in smt_mask
sched/coresched: rq->core should be set only if not previously set

arch/x86/include/asm/thread_info.h | 2 +
arch/x86/kvm/x86.c | 3 +
include/asm-generic/bitops/find.h | 16 ++
include/linux/cpumask.h | 42 +++++
include/linux/entry-common.h | 22 +++
include/linux/entry-kvm.h | 12 ++
include/linux/irq_work.h | 1 +
include/linux/sched.h | 12 ++
kernel/entry/common.c | 88 +++++----
kernel/entry/kvm.c | 12 ++
kernel/irq_work.c | 11 ++
kernel/sched/core.c | 281 ++++++++++++++++++++++++++---
kernel/sched/idle.c | 17 +-
kernel/sched/sched.h | 11 +-
lib/cpumask.c | 53 ++++++
lib/find_bit.c | 56 ++++--
16 files changed, 564 insertions(+), 75 deletions(-)

--
2.28.0.220.ged08abb693-goog


2020-08-15 21:59:56

by Joel Fernandes

Subject: [PATCH RFC 04/12] kernel/entry: Add support for core-wide protection of kernel-mode

Core-scheduling prevents hyperthreads in usermode from attacking each
other, but it does not do anything about one of the hyperthreads
entering the kernel for any reason. This leaves the door open for MDS
and L1TF attacks with concurrent execution sequences between
hyperthreads.

This patch therefore adds support for protecting all syscall and IRQ
kernel mode entries. Care is taken to track the outermost usermode exit
and entry using per-cpu counters. In cases where one of the hyperthreads
enters the kernel, no additional IPIs are sent. Further, IPIs are avoided
when not needed - for example, idle and non-cookie HTs do not need to be
forced into kernel mode.

More information about attacks:
For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
data to either host or guest attackers. For L1TF, it is possible to leak
to guest attackers. There is no possible mitigation involving flushing
of buffers to avoid this, since the execution of the attacker and the victim
happens concurrently on 2 or more HTs.
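
As a rough mental model, the per-CPU and core-wide counters added below interact
as in this hedged sketch (field names match the ones added to kernel/sched/sched.h;
locking, warnings and the IPI details are elided, so this is not the patch code):

/* Hedged sketch only - not the actual functions added below. */
static void unsafe_enter_sketch(struct rq *rq)
{
        rq->core_this_unsafe_nest++;    /* this CPU is now in the kernel */
        rq->core->core_unsafe_nest++;   /* one more HT of this core is unsafe;
                                         * the first one IPIs the siblings */
}

static void unsafe_exit_sketch(struct rq *rq)
{
        rq->core_this_unsafe_nest--;
        rq->core->core_unsafe_nest--;   /* once this reaches 0, siblings spinning
                                         * in sched_core_wait_till_safe() may
                                         * return to user or guest mode */
}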

Cc: Julien Desfossez <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Aubrey Li <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Co-developed-by: Vineeth Pillai <[email protected]>
Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/sched.h | 12 +++
kernel/entry/common.c | 88 +++++++++++-------
kernel/sched/core.c | 205 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 +
4 files changed, 277 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 295e3258c9bf..910274141ab2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2054,4 +2054,16 @@ int sched_trace_rq_cpu(struct rq *rq);

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+void sched_core_unsafe_exit_wait(unsigned long ti_check);
+void sched_core_wait_till_safe(unsigned long ti_check);
+#else
+#define sched_core_unsafe_enter(ignore) do { } while (0)
+#define sched_core_unsafe_exit(ignore) do { } while (0)
+#define sched_core_wait_till_safe(ignore) do { } while (0)
+#define sched_core_unsafe_exit_wait(ignore) do { } while (0)
+#endif
+
#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 495f5c051b03..3027ec474567 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -28,6 +28,7 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)

instrumentation_begin();
trace_hardirqs_off_finish();
+ sched_core_unsafe_enter();
instrumentation_end();
}

@@ -111,59 +112,84 @@ static __always_inline void exit_to_user_mode(void)
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal(struct pt_regs *regs) { }

-static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work)
+static inline bool exit_to_user_mode_work_pending(unsigned long ti_work)
{
- /*
- * Before returning to user space ensure that all pending work
- * items have been completed.
- */
- while (ti_work & EXIT_TO_USER_MODE_WORK) {
+ return (ti_work & EXIT_TO_USER_MODE_WORK);
+}

- local_irq_enable_exit_to_user(ti_work);
+static inline void exit_to_user_mode_work(struct pt_regs *regs,
+ unsigned long ti_work)
+{

- if (ti_work & _TIF_NEED_RESCHED)
- schedule();
+ local_irq_enable_exit_to_user(ti_work);

- if (ti_work & _TIF_UPROBE)
- uprobe_notify_resume(regs);
+ if (ti_work & _TIF_NEED_RESCHED)
+ schedule();

- if (ti_work & _TIF_PATCH_PENDING)
- klp_update_patch_state(current);
+ if (ti_work & _TIF_UPROBE)
+ uprobe_notify_resume(regs);

- if (ti_work & _TIF_SIGPENDING)
- arch_do_signal(regs);
+ if (ti_work & _TIF_PATCH_PENDING)
+ klp_update_patch_state(current);

- if (ti_work & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- rseq_handle_notify_resume(NULL, regs);
- }
+ if (ti_work & _TIF_SIGPENDING)
+ arch_do_signal(regs);
+
+ if (ti_work & _TIF_NOTIFY_RESUME) {
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ tracehook_notify_resume(regs);
+ rseq_handle_notify_resume(NULL, regs);
+ }
+
+ /* Architecture specific TIF work */
+ arch_exit_to_user_mode_work(regs, ti_work);
+
+ local_irq_disable_exit_to_user();
+}

- /* Architecture specific TIF work */
- arch_exit_to_user_mode_work(regs, ti_work);
+static unsigned long exit_to_user_mode_loop(struct pt_regs *regs)
+{
+ unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+
+ /*
+ * Before returning to user space ensure that all pending work
+ * items have been completed.
+ */
+ while (1) {
+ /* Both interrupts and preemption could be enabled here. */
+ if (exit_to_user_mode_work_pending(ti_work))
+ exit_to_user_mode_work(regs, ti_work);
+
+ /* Interrupts may be reenabled with preemption disabled. */
+ sched_core_unsafe_exit_wait(EXIT_TO_USER_MODE_WORK);

/*
- * Disable interrupts and reevaluate the work flags as they
- * might have changed while interrupts and preemption was
- * enabled above.
+ * Reevaluate the work flags as they might have changed
+ * while interrupts and preemption were enabled.
*/
- local_irq_disable_exit_to_user();
ti_work = READ_ONCE(current_thread_info()->flags);
+
+ /*
+ * We may be switching out to another task in kernel mode. That
+ * process is responsible for exiting the "unsafe" kernel mode
+ * when it returns to user or guest.
+ */
+ if (exit_to_user_mode_work_pending(ti_work))
+ sched_core_unsafe_enter();
+ else
+ break;
}

- /* Return the latest work state for arch_exit_to_user_mode() */
return ti_work;
}

static void exit_to_user_mode_prepare(struct pt_regs *regs)
{
- unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+ unsigned long ti_work;

lockdep_assert_irqs_disabled();

- if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
- ti_work = exit_to_user_mode_loop(regs, ti_work);
+ ti_work = exit_to_user_mode_loop(regs);

arch_exit_to_user_mode_prepare(regs, ti_work);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab6357223b32..ff13254ed317 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4352,6 +4352,210 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+/*
+ * Handler to attempt to enter kernel. It does nothing because the exit to
+ * usermode or guest mode will do the actual work (of waiting if needed).
+ */
+static void sched_core_irq_work(struct irq_work *work)
+{
+ return;
+}
+
+/*
+ * sched_core_wait_till_safe - Pause the caller's hyperthread until the core
+ * exits the core-wide unsafe state. Obviously the CPU calling this function
+ * should not be responsible for the core being in the core-wide unsafe state
+ * otherwise it will deadlock.
+ *
+ * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of
+ * the loop if TIF flags are set and notify caller about it.
+ *
+ * IRQs should be disabled.
+ */
+void sched_core_wait_till_safe(unsigned long ti_check)
+{
+ bool restart = false;
+ struct rq *rq;
+ int cpu;
+
+ /* Only untrusted tasks need to do any waiting. */
+ if (!test_tsk_thread_flag(current, TIF_UNSAFE_RET) || WARN_ON_ONCE(!current->core_cookie))
+ goto ret;
+
+ cpu = smp_processor_id();
+ rq = cpu_rq(cpu);
+
+ if (!sched_core_enabled(rq))
+ goto ret;
+
+ /* Downgrade to allow interrupts. */
+ preempt_disable();
+ local_irq_enable();
+
+ /*
+ * Wait till the core of this HT is not in an unsafe state.
+ *
+ * Pair with smp_store_release() in sched_core_unsafe_exit().
+ */
+ while (smp_load_acquire(&rq->core->core_unsafe_nest) > 0) {
+ cpu_relax();
+ if (READ_ONCE(current_thread_info()->flags) & ti_check) {
+ restart = true;
+ break;
+ }
+ }
+
+ /* Upgrade it back to the expectations of entry code. */
+ local_irq_disable();
+ preempt_enable();
+
+ret:
+ if (!restart)
+ clear_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+ return;
+}
+
+/*
+ * Enter the core-wide unsafe state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_unsafe_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_unsafe_enter(void)
+{
+ const struct cpumask *smt_mask;
+ unsigned long flags;
+ struct rq *rq;
+ int i, cpu;
+
+ /* Ensure that on return to user/guest, we check whether to wait. */
+ if (current->core_cookie)
+ set_tsk_thread_flag(current, TIF_UNSAFE_RET);
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ rq = cpu_rq(cpu);
+ if (!sched_core_enabled(rq))
+ goto ret;
+
+ /* Count unsafe_enter() calls received without unsafe_exit() on this CPU. */
+ rq->core_this_unsafe_nest++;
+
+ /* Should not nest: enter() should only pair with exit(). */
+ if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 1))
+ goto ret;
+
+ raw_spin_lock(rq_lockp(rq));
+ smt_mask = cpu_smt_mask(cpu);
+
+ /* Contribute this CPU's unsafe_enter() to core-wide unsafe_enter() count. */
+ WRITE_ONCE(rq->core->core_unsafe_nest, rq->core->core_unsafe_nest + 1);
+
+ if (WARN_ON_ONCE(rq->core->core_unsafe_nest == UINT_MAX))
+ goto unlock;
+
+ if (irq_work_pending(&rq->core_irq_work)) {
+ /*
+ * Do nothing more since we are in an IPI sent from another
+ * sibling to enforce safety. That sibling would have sent IPIs
+ * to all of the HTs.
+ */
+ goto unlock;
+ }
+
+ /*
+ * If we are not the first ones on the core to enter core-wide unsafe
+ * state, do nothing.
+ */
+ if (rq->core->core_unsafe_nest > 1)
+ goto unlock;
+
+ /* Do nothing more if the core is not tagged. */
+ if (!rq->core->core_cookie)
+ goto unlock;
+
+ for_each_cpu(i, smt_mask) {
+ struct rq *srq = cpu_rq(i);
+
+ if (i == cpu || cpu_is_offline(i))
+ continue;
+
+ if (!srq->curr->mm || is_idle_task(srq->curr))
+ continue;
+
+ /* Skip if HT is not running a tagged task. */
+ if (!srq->curr->core_cookie && !srq->core_pick)
+ continue;
+
+ /*
+ * Force sibling into the kernel by IPI. If work was already
+ * pending, no new IPIs are sent. This is Ok since the receiver
+ * would already be in the kernel, or on its way to it.
+ */
+ irq_work_queue_on(&srq->core_irq_work, i);
+ }
+unlock:
+ raw_spin_unlock(rq_lockp(rq));
+ret:
+ local_irq_restore(flags);
+}
+
+/*
+ * Process any work needed for either exiting the core-wide unsafe state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ *
+ * @idle: Are we called from the idle loop?
+ */
+void sched_core_unsafe_exit(void)
+{
+ unsigned long flags;
+ unsigned int nest;
+ struct rq *rq;
+ int cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ rq = cpu_rq(cpu);
+
+ /* Do nothing if core-sched disabled. */
+ if (!sched_core_enabled(rq))
+ goto ret;
+
+ /*
+ * Can happen when a process is forked and the first return to user
+ * mode is a syscall exit. Either way, there's nothing to do.
+ */
+ if (rq->core_this_unsafe_nest == 0)
+ goto ret;
+
+ rq->core_this_unsafe_nest--;
+
+ /* enter() should be paired with exit() only. */
+ if (WARN_ON_ONCE(rq->core_this_unsafe_nest != 0))
+ goto ret;
+
+ raw_spin_lock(rq_lockp(rq));
+ /*
+ * Core-wide nesting counter can never be 0 because we are
+ * still in it on this CPU.
+ */
+ nest = rq->core->core_unsafe_nest;
+ WARN_ON_ONCE(!nest);
+
+ /* Pair with smp_load_acquire() in sched_core_wait_till_safe(). */
+ smp_store_release(&rq->core->core_unsafe_nest, nest - 1);
+ raw_spin_unlock(rq_lockp(rq));
+ret:
+ local_irq_restore(flags);
+}
+
+void sched_core_unsafe_exit_wait(unsigned long ti_check)
+{
+ sched_core_unsafe_exit();
+ sched_core_wait_till_safe(ti_check);
+}
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -7295,6 +7499,7 @@ int sched_cpu_starting(unsigned int cpu)
rq = cpu_rq(i);
if (rq->core && rq->core == rq)
core_rq = rq;
+ init_irq_work(&rq->core_irq_work, sched_core_irq_work);
}

if (!core_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1901d11a6f41..2922e171a1f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1040,11 +1040,14 @@ struct rq {
unsigned int core_sched_seq;
struct rb_root core_tree;
unsigned char core_forceidle;
+ struct irq_work core_irq_work; /* To force HT into kernel */
+ unsigned int core_this_unsafe_nest;

/* shared state */
unsigned int core_task_seq;
unsigned int core_pick_seq;
unsigned long core_cookie;
+ unsigned int core_unsafe_nest;
#endif
};

--
2.28.0.220.ged08abb693-goog

2020-08-15 22:00:51

by Joel Fernandes

Subject: [PATCH RFC 06/12] entry/kvm: Protect the kernel when entering from guest

From: Vineeth Pillai <[email protected]>

Similar to how user-to-kernel mode transitions are protected in earlier
patches, protect the entry into the kernel from guest mode as well.
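
The placement of the two hooks in the vCPU run loop is roughly as in the hedged
sketch below (everything else in the real vcpu_enter_guest() is elided):

/* Sketch only: where the hooks added below sit in vcpu_enter_guest(). */
static int vcpu_enter_guest_sketch(struct kvm_vcpu *vcpu)
{
        /* Just before running the guest: leave the core-wide unsafe state
         * and wait until the whole core is safe (or guest-mode work pends). */
        kvm_exit_to_guest_mode(vcpu);

        /* ... guest executes ... */

        /* Right after the guest exits back into the kernel: mark the core
         * unsafe again and IPI siblings if needed. */
        kvm_enter_from_guest_mode(vcpu);

        return 0;
}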

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
arch/x86/kvm/x86.c | 3 +++
include/linux/entry-kvm.h | 12 ++++++++++++
kernel/entry/kvm.c | 12 ++++++++++++
3 files changed, 27 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 82d4a9e88908..b8a6faf78dc6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8484,6 +8484,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
smp_mb__after_srcu_read_unlock();

+ kvm_exit_to_guest_mode(vcpu);
+
/*
* This handles the case where a posted interrupt was
* notified with kvm_vcpu_kick.
@@ -8578,6 +8580,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
}

+ kvm_enter_from_guest_mode(vcpu);
local_irq_enable();
preempt_enable();

diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 0cef17afb41a..32aabb7f3e6d 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -77,4 +77,16 @@ static inline bool xfer_to_guest_mode_work_pending(void)
}
#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */

+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from guest.
+ * @vcpu: Pointer to the current VCPU data
+ */
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * @vcpu: Pointer to the current VCPU data
+ */
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu);
+
#endif
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index eb1a8a4c867c..994af4241646 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -49,3 +49,15 @@ int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
return xfer_to_guest_mode_work(vcpu, ti_work);
}
EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
+
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu)
+{
+ sched_core_unsafe_enter();
+}
+EXPORT_SYMBOL_GPL(kvm_enter_from_guest_mode);
+
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu)
+{
+ sched_core_unsafe_exit_wait(XFER_TO_GUEST_MODE_WORK);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_to_guest_mode);
--
2.28.0.220.ged08abb693-goog

2020-08-15 22:01:07

by Joel Fernandes

Subject: [PATCH RFC 01/12] irq_work: Add support to detect if work is pending

When an unsafe region is entered on an HT, an IPI needs to be sent to
siblings to ensure they enter the kernel.

Following are the reasons why we would like to use irq_work to implement
forcing of sibling into kernel mode:

1. The existing smp_call infrastructure cannot be used easily since we could
end up waiting on the CSD lock if a previous smp_call was not yet
serviced.

2. I'd like to use generic code, such that there is no need to add an
arch-specific IPI.

3. IRQ work already has support to detect that previous work was not yet
executed through the IRQ_WORK_PENDING bit.

4. The destination of the IPI should not send more IPIs of its own when
the IPI it received is itself what caused its entry into the unsafe region.

Support for 4. requires us to be able to detect that irq_work is
pending.

This commit therefore adds a way for irq_work users to know if a
previous per-HT irq_work is pending. If it is, we need not send new
IPIs.
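
For illustration, a caller could use the new helper like this (the function below
is a hypothetical example, not part of the patch; irq_work_pending() and
irq_work_queue_on() are the real APIs):

/* Hypothetical example: kick a sibling into the kernel, but skip the IPI
 * if a previous kick to it has not been serviced yet. */
static void kick_sibling_into_kernel(struct irq_work *sibling_work, int cpu)
{
        if (irq_work_pending(sibling_work))
                return;         /* sibling is already in, or headed into, the kernel */

        irq_work_queue_on(sibling_work, cpu);
}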

Memory ordering:

I was trying to handle the MP-pattern below. Consider the flag to be the
pending bit. P0() is the IRQ work handler. P1() is the code calling
irq_work_pending(). P0() already implicitly adds a memory barrier as a
part of the atomic_fetch_andnot() before calling work->func(). For P1(),
this patch adds the needed ordering via atomic_read_acquire(), since a plain
atomic_read() in irq_work_pending() would not be sufficient.

P0()
{
WRITE_ONCE(buf, 1);
WRITE_ONCE(flag, 1);
}

P1()
{
int r1;
int r2 = 0;

r1 = READ_ONCE(flag);
if (r1)
r2 = READ_ONCE(buf);
}

Cc: [email protected]
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/irq_work.h | 1 +
kernel/irq_work.c | 11 +++++++++++
2 files changed, 12 insertions(+)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 30823780c192..b26466f95d04 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -42,6 +42,7 @@ bool irq_work_queue_on(struct irq_work *work, int cpu);

void irq_work_tick(void);
void irq_work_sync(struct irq_work *work);
+bool irq_work_pending(struct irq_work *work);

#ifdef CONFIG_IRQ_WORK
#include <asm/irq_work.h>
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index eca83965b631..2d206d511aa0 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -24,6 +24,17 @@
static DEFINE_PER_CPU(struct llist_head, raised_list);
static DEFINE_PER_CPU(struct llist_head, lazy_list);

+bool irq_work_pending(struct irq_work *work)
+{
+ /*
+ * Provide ordering to callers who may read other stuff
+ * after the atomic read (MP-pattern).
+ */
+ bool ret = atomic_read_acquire(&work->flags) & IRQ_WORK_PENDING;
+
+ return ret;
+}
+
/*
* Claim the entry so that no one else will poke at it.
*/
--
2.28.0.220.ged08abb693-goog

2020-08-15 22:03:02

by Joel Fernandes

Subject: [PATCH RFC 10/12] sched/coresched: Make core_pick_seq per run-queue

From: Vineeth Pillai <[email protected]>

core_pick_seq is a core-wide counter used to identify whether a task pick has
been made for a CPU by its sibling. But during hotplug scenarios, a pick
cannot be made for CPUs that are offline, and when they come back up they are
tricked into consuming a stale pick because the core-wide counter has been
incremented in the meantime.

So, make core_pick_seq per run-queue.
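
With the counter per run-queue, the condition pick_next_task() checks before
consuming an earlier pick becomes (hedged sketch of the check in the hunk below):

/* Sketch: does this rq have a pick that is still valid and not yet scheduled? */
static bool have_unscheduled_pick_sketch(struct rq *rq)
{
        return rq->core_pick_seq == rq->core->core_task_seq &&
               rq->core_pick_seq != rq->core_sched_seq;
}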

Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 19 ++++++++++---------
kernel/sched/sched.h | 2 +-
2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e9df8221c62..48a49168e57f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4623,9 +4623,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* pointers are all still valid), and we haven't scheduled the last
* pick yet, do so now.
*/
- if (rq->core->core_pick_seq == rq->core->core_task_seq &&
- rq->core->core_pick_seq != rq->core_sched_seq) {
- WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+ if (rq->core_pick_seq == rq->core->core_task_seq &&
+ rq->core_pick_seq != rq->core_sched_seq) {
+ WRITE_ONCE(rq->core_sched_seq, rq->core_pick_seq);

next = rq->core_pick;
if (next != prev) {
@@ -4635,7 +4635,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
rq->core->core_task_seq,
- rq->core->core_pick_seq,
+ rq->core_pick_seq,
rq->core_sched_seq,
next->comm, next->pid,
next->core_cookie);
@@ -4649,7 +4649,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
smt_mask = cpu_smt_mask(cpu);

/*
- * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+ * core->core_task_seq, rq->core_pick_seq, rq->core_sched_seq
*
* @task_seq guards the task state ({en,de}queues)
* @pick_seq is the @task_seq we did a selection on
@@ -4667,8 +4667,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct rq *rq_i = cpu_rq(i);

trace_printk("CPU %d is in smt_mask, resetting\n", i);
-
- rq_i->core_pick = NULL;
+ if (rq_i->core_pick) {
+ WRITE_ONCE(rq_i->core_sched_seq, rq_i->core_pick_seq);
+ rq_i->core_pick = NULL;
+ }

if (rq_i->core_forceidle) {
need_sync = true;
@@ -4771,9 +4773,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
next_class:;
}

- rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
- rq->core_sched_seq = rq->core->core_pick_seq;

/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
@@ -4801,6 +4801,7 @@ next_class:;
continue;

if (rq_i->curr != rq_i->core_pick) {
+ WRITE_ONCE(rq_i->core_pick_seq, rq->core->core_task_seq);
trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2922e171a1f0..c7caece2df6e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1036,6 +1036,7 @@ struct rq {
/* per rq */
struct rq *core;
struct task_struct *core_pick;
+ unsigned int core_pick_seq;
unsigned int core_enabled;
unsigned int core_sched_seq;
struct rb_root core_tree;
@@ -1045,7 +1046,6 @@ struct rq {

/* shared state */
unsigned int core_task_seq;
- unsigned int core_pick_seq;
unsigned long core_cookie;
unsigned int core_unsafe_nest;
#endif
--
2.28.0.220.ged08abb693-goog

2020-08-15 22:04:54

by Joel Fernandes

Subject: [PATCH RFC 07/12] bitops: Introduce find_next_or_bit

From: Vineeth Pillai <[email protected]>

Hotplug fixes to core-scheduling require a new bitops API.

Introduce a new API, find_next_or_bit(), which returns the bit number of
the next bit that is set in the OR of the two given bitmasks.

Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/asm-generic/bitops/find.h | 16 +++++++++
lib/find_bit.c | 56 +++++++++++++++++++++++++------
2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 9fdf21302fdf..0b476ca0d665 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -32,6 +32,22 @@ extern unsigned long find_next_and_bit(const unsigned long *addr1,
unsigned long offset);
#endif

+#ifndef find_next_or_bit
+/**
+ * find_next_or_bit - find the next set bit in either memory region
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+extern unsigned long find_next_or_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long size,
+ unsigned long offset);
+#endif
+
#ifndef find_next_zero_bit
/**
* find_next_zero_bit - find the next cleared bit in a memory region
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 49f875f1baf7..2eca8e2b16b1 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -19,7 +19,14 @@

#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) || \
- !defined(find_next_and_bit)
+ !defined(find_next_and_bit) || !defined(find_next_or_bit)
+
+typedef enum {
+ FNB_AND = 0,
+ FNB_OR = 1,
+ FNB_MAX = 2
+} fnb_bwops_t;
+
/*
* This is a common helper function for find_next_bit, find_next_zero_bit, and
* find_next_and_bit. The differences are:
@@ -29,7 +36,8 @@
*/
static unsigned long _find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
+ unsigned long start, unsigned long invert,
+ fnb_bwops_t type, unsigned long le)
{
unsigned long tmp, mask;

@@ -37,8 +45,16 @@ static unsigned long _find_next_bit(const unsigned long *addr1,
return nbits;

tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
+ if (addr2) {
+ switch (type) {
+ case FNB_AND:
+ tmp &= addr2[start / BITS_PER_LONG];
+ break;
+ case FNB_OR:
+ tmp |= addr2[start / BITS_PER_LONG];
+ break;
+ }
+ }
tmp ^= invert;

/* Handle 1st word. */
@@ -56,8 +72,16 @@ static unsigned long _find_next_bit(const unsigned long *addr1,
return nbits;

tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
+ if (addr2) {
+ switch (type) {
+ case FNB_AND:
+ tmp &= addr2[start / BITS_PER_LONG];
+ break;
+ case FNB_OR:
+ tmp |= addr2[start / BITS_PER_LONG];
+ break;
+ }
+ }
tmp ^= invert;
}

@@ -75,7 +99,7 @@ static unsigned long _find_next_bit(const unsigned long *addr1,
unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, FNB_AND, 0);
}
EXPORT_SYMBOL(find_next_bit);
#endif
@@ -84,7 +108,7 @@ EXPORT_SYMBOL(find_next_bit);
unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, FNB_AND, 0);
}
EXPORT_SYMBOL(find_next_zero_bit);
#endif
@@ -94,11 +118,21 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size,
unsigned long offset)
{
- return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+ return _find_next_bit(addr1, addr2, size, offset, 0UL, FNB_AND, 0);
}
EXPORT_SYMBOL(find_next_and_bit);
#endif

+#if !defined(find_next_or_bit)
+unsigned long find_next_or_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long size,
+ unsigned long offset)
+{
+ return _find_next_bit(addr1, addr2, size, offset, 0UL, FNB_OR, 0);
+}
+EXPORT_SYMBOL(find_next_or_bit);
+#endif
+
#ifndef find_first_bit
/*
* Find the first set bit in a memory region.
@@ -161,7 +195,7 @@ EXPORT_SYMBOL(find_last_bit);
unsigned long find_next_zero_bit_le(const void *addr, unsigned
long size, unsigned long offset)
{
- return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, FNB_AND, 1);
}
EXPORT_SYMBOL(find_next_zero_bit_le);
#endif
@@ -170,7 +204,7 @@ EXPORT_SYMBOL(find_next_zero_bit_le);
unsigned long find_next_bit_le(const void *addr, unsigned
long size, unsigned long offset)
{
- return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, FNB_AND, 1);
}
EXPORT_SYMBOL(find_next_bit_le);
#endif
--
2.28.0.220.ged08abb693-goog

2020-08-15 22:06:24

by Joel Fernandes

Subject: [PATCH RFC 03/12] arch/x86: Add a new TIF flag for untrusted tasks

Add a new TIF flag to indicate whether the kernel needs to be careful
and take additional steps to mitigate micro-architectural issues during
entry into user or guest mode.

This new flag will be used by the series to determine whether waiting is
needed during exit to user or guest mode.
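
A hedged sketch of how patch 04 in this series ends up using the flag (the
wrapper function names here are illustrative, not from the patches):

/* Illustrative only: mark the current task so the exit-to-user/guest path
 * knows it must wait for the core to become safe again. */
static void mark_unsafe_return_sketch(void)
{
        if (current->core_cookie)
                set_tsk_thread_flag(current, TIF_UNSAFE_RET);
}

/* ... and later, on the exit path, the flag is consumed roughly as: */
static bool must_wait_sketch(void)
{
        return test_tsk_thread_flag(current, TIF_UNSAFE_RET);
}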

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
arch/x86/include/asm/thread_info.h | 2 ++
kernel/sched/sched.h | 6 ++++++
2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..42e63969acb3 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -98,6 +98,7 @@ struct thread_info {
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
+#define TIF_UNSAFE_RET 26 /* On return to process/guest, perform safety checks. */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
@@ -127,6 +128,7 @@ struct thread_info {
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
+#define _TIF_UNSAFE_RET (1 << TIF_UNSAFE_RET)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3575edc7dc43..1901d11a6f41 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2680,3 +2680,9 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)

void swake_up_all_locked(struct swait_queue_head *q);
void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+
+#ifdef CONFIG_SCHED_CORE
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif
+#endif
--
2.28.0.220.ged08abb693-goog

2020-08-15 22:08:21

by Joel Fernandes

Subject: [PATCH RFC 08/12] cpumask: Introduce a new iterator for_each_cpu_wrap_or

From: Vineeth Pillai <[email protected]>

Hotplug fixes to core-scheduling require a new cpumask iterator which iterates
through all CPUs set in either of the given cpumasks. This patch introduces it.
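
A hypothetical usage example (not part of the patch); the pick_next_task()
changes later in the series use the iterator with the SMT mask and
cpumask_of(cpu), so the current CPU is visited even if it has dropped out of
the SMT mask during hotplug:

/* Hypothetical example: visit every CPU in either mask, starting at
 * 'start' and wrapping around, the way pick_next_task() will. */
static void visit_core_cpus(const struct cpumask *smt_mask, int start)
{
        int cpu;

        for_each_cpu_wrap_or(cpu, smt_mask, cpumask_of(start), start)
                pr_info("considering CPU %d\n", cpu);
}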

Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/cpumask.h | 42 ++++++++++++++++++++++++++++++++
lib/cpumask.c | 53 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 95 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index f0d895d6ac39..03e8c57c6ca6 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -207,6 +207,10 @@ static inline int cpumask_any_and_distribute(const struct cpumask *src1p,
for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)(start))
#define for_each_cpu_and(cpu, mask1, mask2) \
for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
+#define for_each_cpu_or(cpu, mask1, mask2) \
+ for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
+#define for_each_cpu_wrap_or(cpu, mask1, mask2, start) \
+ for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2, (void)(start))
#else
/**
* cpumask_first - get the first cpu in a cpumask
@@ -248,6 +252,7 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
}

int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
+int cpumask_next_or(int n, const struct cpumask *, const struct cpumask *);
int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int cpumask_local_spread(unsigned int i, int node);
int cpumask_any_and_distribute(const struct cpumask *src1p,
@@ -278,6 +283,8 @@ int cpumask_any_and_distribute(const struct cpumask *src1p,
(cpu) < nr_cpu_ids;)

extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap);
+extern int cpumask_next_wrap_or(int n, const struct cpumask *mask1,
+ const struct cpumask *mask2, int start, bool wrap);

/**
* for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location
@@ -294,6 +301,22 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
(cpu) < nr_cpumask_bits; \
(cpu) = cpumask_next_wrap((cpu), (mask), (start), true))

+/**
+ * for_each_cpu_wrap_or - iterate over every cpu in either mask, starting at a specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in either mask is set (including @start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap_or(cpu, mask1, mask2, start) \
+ for ((cpu) = cpumask_next_wrap_or((start)-1, (mask1), (mask2), (start), false); \
+ (cpu) < nr_cpumask_bits; \
+ (cpu) = cpumask_next_wrap_or((cpu), (mask1), (mask2), (start), true))
+
/**
* for_each_cpu_and - iterate over every cpu in both masks
* @cpu: the (optionally unsigned) integer iterator
@@ -312,6 +335,25 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
for ((cpu) = -1; \
(cpu) = cpumask_next_and((cpu), (mask1), (mask2)), \
(cpu) < nr_cpu_ids;)
+
+/**
+ * for_each_cpu_or - iterate over every cpu in either mask
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * This saves a temporary CPU mask in many places. It is equivalent to:
+ * struct cpumask tmp;
+ * cpumask_or(&tmp, &mask1, &mask2);
+ * for_each_cpu(cpu, &tmp)
+ * ...
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_or(cpu, mask1, mask2) \
+ for ((cpu) = -1; \
+ (cpu) = cpumask_next_or((cpu), (mask1), (mask2)), \
+ (cpu) < nr_cpu_ids;)
#endif /* SMP */

#define CPU_BITS_NONE \
diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..0a5cdbd4eb6a 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -42,6 +42,25 @@ int cpumask_next_and(int n, const struct cpumask *src1p,
}
EXPORT_SYMBOL(cpumask_next_and);

+/**
+ * cpumask_next_or - get the next cpu in *src1p | *src2p
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @src1p: the first cpumask pointer
+ * @src2p: the second cpumask pointer
+ *
+ * Returns >= nr_cpu_ids if no further cpus set in either.
+ */
+int cpumask_next_or(int n, const struct cpumask *src1p,
+ const struct cpumask *src2p)
+{
+ /* -1 is a legal arg here. */
+ if (n != -1)
+ cpumask_check(n);
+ return find_next_or_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+ nr_cpumask_bits, n + 1);
+}
+EXPORT_SYMBOL(cpumask_next_or);
+
/**
* cpumask_any_but - return a "random" in a cpumask, but not this one.
* @mask: the cpumask to search
@@ -94,6 +113,40 @@ int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
}
EXPORT_SYMBOL(cpumask_next_wrap);

+/**
+ * cpumask_next_wrap_or - helper to implement for_each_cpu_wrap_or
+ * @n: the cpu prior to the place to search
+ * @mask1: first cpumask pointer
+ * @mask2: second cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in either mask.
+ */
+int cpumask_next_wrap_or(int n, const struct cpumask *mask1, const struct cpumask *mask2,
+ int start, bool wrap)
+{
+ int next;
+
+again:
+ next = cpumask_next_or(n, mask1, mask2);
+
+ if (wrap && n < start && next >= start) {
+ return nr_cpumask_bits;
+
+ } else if (next >= nr_cpumask_bits) {
+ wrap = true;
+ n = -1;
+ goto again;
+ }
+
+ return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap_or);
+
/* These are not inline because of header tangles. */
#ifdef CONFIG_CPUMASK_OFFSTACK
/**
--
2.28.0.220.ged08abb693-goog

2020-08-15 22:10:06

by Joel Fernandes

Subject: [PATCH RFC 11/12] sched/coresched: Check for dynamic changes in smt_mask

From: Vineeth Pillai <[email protected]>

There are multiple loops in pick_next_task that iterate over CPUs in
smt_mask. During a hotplug event, a sibling could be removed from the
smt_mask while pick_next_task is running, so we cannot trust the mask
across the different loops. This can confuse the logic.

Add retry logic to restart the selection if smt_mask changes between the
loops, as sketched below.
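
Hedged sketch of the retry pattern (simplified; not the actual pick_next_task()
code):

/* Sketch: remember how many siblings were online when selection started
 * and restart from scratch if that changes mid-selection. */
static void select_with_retry_sketch(const struct cpumask *smt_mask)
{
        int cpu, smt_weight;

retry:
        smt_weight = cpumask_weight(smt_mask);

        for_each_cpu(cpu, smt_mask) {
                /* A sibling came or went due to hotplug: start over. */
                if (cpumask_weight(smt_mask) != smt_weight)
                        goto retry;

                /* ... per-CPU task selection work ... */
        }
}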

Reported-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48a49168e57f..5da5b2317b21 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4613,6 +4613,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
const struct sched_class *class;
const struct cpumask *smt_mask;
int i, j, cpu, occ = 0;
+ int smt_weight;
bool need_sync;

if (!sched_core_enabled(rq))
@@ -4648,6 +4649,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
cpu = cpu_of(rq);
smt_mask = cpu_smt_mask(cpu);

+retry_select:
+ smt_weight = cpumask_weight(smt_mask);
+
/*
* core->core_task_seq, rq->core_pick_seq, rq->core_sched_seq
*
@@ -4691,6 +4695,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
struct rq *rq_i = cpu_rq(i);
struct task_struct *p;

+ /*
+ * During hotplug online, a sibling can be added to
+ * the smt_mask while we are here. If so, we would
+ * need to restart the selection from scratch.
+ */
+ if (unlikely(smt_weight != cpumask_weight(smt_mask)))
+ goto retry_select;
+
if (rq_i->core_pick)
continue;

@@ -4790,7 +4802,15 @@ next_class:;
for_each_cpu_or(i, smt_mask, cpumask_of(cpu)) {
struct rq *rq_i = cpu_rq(i);

- WARN_ON_ONCE(!rq_i->core_pick);
- WARN_ON_ONCE(smt_weight == cpumask_weight(smt_mask) && !rq_i->core_pick);
+
+ /*
+ * During hotplug online a sibling can be added in the smt_mask
+ * while we are here. We might have missed picking a task for it.
+ * Ignore it now as a schedule on that sibling will correct itself.
+ */
+ if (!rq_i->core_pick)
+ continue;

if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
rq_i->core_forceidle = true;
--
2.28.0.220.ged08abb693-goog

2020-08-15 22:10:24

by Joel Fernandes

Subject: [PATCH RFC 02/12] entry/idle: Add a common function for activities during idle entry/exit

Currently, only the RCU hooks for idle entry/exit are called. In later
patches, kernel-entry protection functionality will be added to this common
function.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/entry-common.h | 16 ++++++++++++++++
kernel/sched/idle.c | 17 +++++++++--------
2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index efebbffcd5cc..2ea0e09b00d5 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -369,4 +369,20 @@ void irqentry_exit_cond_resched(void);
*/
void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);

+/**
+ * generic_idle_enter - Called during entry into idle for housekeeping.
+ */
+static inline void generic_idle_enter(void)
+{
+ rcu_idle_enter();
+}
+
+/**
+ * generic_idle_exit - Called when exiting idle for housekeeping.
+ */
+static inline void generic_idle_exit(void)
+{
+ rcu_idle_exit();
+}
+
#endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 3b03bedee280..5db5a946aa3b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -8,6 +8,7 @@
*/
#include "sched.h"

+#include <linux/entry-common.h>
#include <trace/events/power.h>

/* Linker adds these: start and end of __cpuidle functions */
@@ -54,7 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);

static noinline int __cpuidle cpu_idle_poll(void)
{
- rcu_idle_enter();
+ generic_idle_enter();
trace_cpu_idle_rcuidle(0, smp_processor_id());
local_irq_enable();
stop_critical_timings();
@@ -64,7 +65,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
cpu_relax();
start_critical_timings();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
- rcu_idle_exit();
+ generic_idle_exit();

return 1;
}
@@ -158,7 +159,7 @@ static void cpuidle_idle_call(void)

if (cpuidle_not_available(drv, dev)) {
tick_nohz_idle_stop_tick();
- rcu_idle_enter();
+ generic_idle_enter();

default_idle_call();
goto exit_idle;
@@ -178,13 +179,13 @@ static void cpuidle_idle_call(void)
u64 max_latency_ns;

if (idle_should_enter_s2idle()) {
- rcu_idle_enter();
+ generic_idle_enter();

entered_state = call_cpuidle_s2idle(drv, dev);
if (entered_state > 0)
goto exit_idle;

- rcu_idle_exit();
+ generic_idle_exit();

max_latency_ns = U64_MAX;
} else {
@@ -192,7 +193,7 @@ static void cpuidle_idle_call(void)
}

tick_nohz_idle_stop_tick();
- rcu_idle_enter();
+ generic_idle_enter();

next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
call_cpuidle(drv, dev, next_state);
@@ -209,7 +210,7 @@ static void cpuidle_idle_call(void)
else
tick_nohz_idle_retain_tick();

- rcu_idle_enter();
+ generic_idle_enter();

entered_state = call_cpuidle(drv, dev, next_state);
/*
@@ -227,7 +228,7 @@ static void cpuidle_idle_call(void)
if (WARN_ON_ONCE(irqs_disabled()))
local_irq_enable();

- rcu_idle_exit();
+ generic_idle_exit();
}

/*
--
2.28.0.220.ged08abb693-goog

2020-08-16 01:46:23

by Joel Fernandes

Subject: [PATCH RFC 05/12] entry/idle: Enter and exit kernel protection during idle entry and exit

Make use of the generic_idle_{enter,exit} helper functions added in an
earlier patch to enter and exit kernel protection.

On exiting idle, protection will be reenabled.

Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/entry-common.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2ea0e09b00d5..c833f2fda542 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -374,6 +374,9 @@ void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
*/
static inline void generic_idle_enter(void)
{
+ /* Entering idle ends the protected kernel region. */
+ sched_core_unsafe_exit();
+
rcu_idle_enter();
}

@@ -383,6 +386,9 @@ static inline void generic_idle_enter(void)
static inline void generic_idle_exit(void)
{
rcu_idle_exit();
+
+ /* Exiting idle (re)starts the protected kernel region. */
+ sched_core_unsafe_enter();
}

#endif
--
2.28.0.220.ged08abb693-goog

2020-08-16 01:57:10

by Peter Zijlstra

Subject: Re: [PATCH RFC 01/12] irq_work: Add support to detect if work is pending

On Fri, Aug 14, 2020 at 11:18:57PM -0400, Joel Fernandes (Google) wrote:

https://lkml.kernel.org/r/[email protected]

2020-08-17 02:05:45

by Joel Fernandes

Subject: Re: [PATCH RFC 01/12] irq_work: Add support to detect if work is pending

On Sat, Aug 15, 2020 at 10:13:30AM +0200, [email protected] wrote:
> On Fri, Aug 14, 2020 at 11:18:57PM -0400, Joel Fernandes (Google) wrote:
>
> https://lkml.kernel.org/r/[email protected]

Thank you, so that means I can drop this patch then.

- Joel

2020-08-19 18:30:02

by Kees Cook

Subject: Re: [PATCH RFC 00/12] Core-sched v6+: kernel protection and hotplug fixes

On Fri, Aug 14, 2020 at 11:18:56PM -0400, Joel Fernandes (Google) wrote:
> This series is continuation of main core-sched v6 series [1] and adds support

- Is this really "RFC"? Seeing multiple authors implies this is looking
to get merged. :)

- Is this on top of the v6 core-sched series (or instead of)? (That
series doesn't seem to be RFC any more either, IMO, if it's at v6.) If
this replaces that series, isn't this v7? I don't see the v6
core-sched series in -next anywhere?

- Whatever the case, please include the series version in the "[PATCH vN ...]"
portion of the Subject, otherwise things like "b4" don't really know what's
happening.

Sorry for the drive-by comments, but I was really confused trying to
understand how this fit together. :)

--
Kees Cook

2020-08-20 01:45:59

by Joel Fernandes

Subject: Re: [PATCH RFC 00/12] Core-sched v6+: kernel protection and hotplug fixes

On Wed, Aug 19, 2020 at 11:26:48AM -0700, Kees Cook wrote:
> On Fri, Aug 14, 2020 at 11:18:56PM -0400, Joel Fernandes (Google) wrote:
> > This series is continuation of main core-sched v6 series [1] and adds support
>
> - Is this really "RFC"? Seeing multiple authors implies this is looking
> to get merged. :)

This mini-series was inspired by discussions/concerns mostly from Thomas,
which is why I kept it as RFC - I thought more comments would come.

> - Is this on top of the v6 core-sched series (or instead of)? (That
> series doesn't seem to be RFC any more either, IMO, if it's at v6.) If
> this replaces that series, isn't this v7? I don't see the v6
> core-sched series in -next anywhere?

Yes, it was just a continuation of that. Actually I screwed up; I should have
just posted the whole v6 again with these patches, with an appropriate
subject prefix. That would have prevented confusion.

> - Whatever the case, please include the series version in the "[PATCH vN ...]"
> portion of the Subject, otherwise things like "b4" don't really know what's
> happening.

Got it.

> Sorry for the drive-by comments, but I was really confused trying to
> understand how this fit together. :)

No problem, your comments are very valid so they are welcome :)

thanks,

- Joel


> --
> Kees Cook

2020-08-22 20:24:47

by Joel Fernandes

Subject: Re: [PATCH RFC 00/12] Core-sched v6+: kernel protection and hotplug fixes

On Fri, Aug 14, 2020 at 11:19 PM Joel Fernandes (Google)
<[email protected]> wrote:
>
> Hello!
>
> This series is continuation of main core-sched v6 series [1] and adds support
> for syscall and IRQ isolation from usermode processes and guests. It is key to
> safely entering kernel mode in an HT while the other HT is in use by a user or
> guest. The series also fixes CPU hotplug issues arising because of the
> cpu_smt_mask changing while the next task is being picked. These hotplug fixes
> are needed also for kernel protection to work correctly.
>
> The series is based on Thomas's x86/entry tree.
>
> [1] https://lwn.net/Articles/824918/

Hello,
Just wanted to mention that we are talking about this series during
the refereed talk on Monday at 16:00 UTC :
https://linuxplumbersconf.org/event/7/contributions/648/

The slides are here with some nice pictures showing the kernel protection stuff:
https://docs.google.com/presentation/d/1VzeQo3AyGTN35DJ3LKoPWBfiZHZJiF8q0NrX9eVYG70/edit?usp=sharing

And Julien has some promising data to share which he just collected
with this series (and will add it to the slides).

Looking forward to possibly seeing you there and your participation
for these topics both during the refereed talk and the scheduler MC,
thanks!

- Joel

