2020-11-27 08:54:01

by Singh, Balbir

Subject: [PATCH v3 0/5] Next revision of the L1D flush patches

Implement a mechanism that allows tasks to conditionally flush
their L1D cache (mitigation mechanism suggested in [2]). The previous
posts of these patches were sent for inclusion (see [3]) but were not
included because additional checks were deemed necessary. Those
checks were:

1. Implement this mechanism only for CPUs affected by the L1TF bug
2. Disable the software fallback
3. Provide an override to disable this mechanism completely
4. Be SMT aware in the implementation

The patches support a use case where SMT is not disabled across the
entire system; rather, a few CPUs can have their SMT turned off and
processes that want to opt in are expected to run on non-SMT cores.
This gives the administrator complete control over setting up the
mitigation for the issue. In addition, the administrator has a boot
time override (l1d_flush_out=off) to turn off the mechanism completely.

To implement this efficiently, a new per-CPU view of whether the core
is in SMT mode or not is added in patch 1. The code is refactored in
patch 2 so that the existing code can accommodate other speculation
related checks when switching mm between tasks; this mechanism has not
changed since the last post. The ability to flush L1D for tasks that
have the TIF_SPEC_L1D_FLUSH bit set and that context switch out on a
non-SMT core is provided by patch 3. Hooks for the user space API, so
that this feature can be invoked via prctl, are provided in patch 4,
along with the checks described above (1, 2 and 3). Documentation
updates are in patch 5, covering l1d_flush, the prctl changes and the
kernel-parameters addition (l1d_flush_out).

The checks for opting into L1D flushing are:
a. The CPU is affected by L1TF
b. The hardware L1D flush mechanism is available

A task that opts into this feature and subsequently runs on a core with
SMT enabled will receive a SIGBUS.
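
For illustration, opting in from user space would look roughly like the
sketch below. This is not part of the series; it assumes the
PR_SPEC_L1D_FLUSH_OUT constant (value 2) that patch 4 adds to the uapi
headers, and error handling is abbreviated:

  #include <errno.h>
  #include <sys/prctl.h>
  #include <linux/prctl.h>

  #ifndef PR_SPEC_L1D_FLUSH_OUT
  #define PR_SPEC_L1D_FLUSH_OUT  2       /* added by patch 4 */
  #endif

  static int opt_in_l1d_flush(void)
  {
          /*
           * Expected failures with this series: EPERM when booted with
           * l1d_flush_out=off, EINVAL when the CPU is not affected by
           * L1TF or has no hardware L1D flush support.
           */
          if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT,
                    PR_SPEC_ENABLE, 0, 0))
                  return -errno;
          return 0;
  }

A task that opts in and later runs on an SMT enabled core must be
prepared to be killed with SIGBUS, as described above.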

References
[1] https://software.intel.com/security-software-guidance/software-guidance/snoop-assisted-l1-data-sampling
[2] https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
[3] https://lkml.org/lkml/2020/6/2/1150
[4] https://lore.kernel.org/lkml/[email protected]/
[5] https://lore.kernel.org/lkml/[email protected]/

Changelog v3:
- Implement the SIGBUS mechanism
- Update and fix the documentation

Balbir Singh (5):
x86/mm: change l1d flush runtime prctl behaviour
x86/mm: Refactor cond_ibpb() to support other use cases
x86/mm: Optionally flush L1D on context switch
prctl: Hook L1D flushing in via prctl
Documentation: Add L1D flushing Documentation

Documentation/admin-guide/hw-vuln/index.rst | 1 +
.../admin-guide/hw-vuln/l1d_flush.rst | 69 ++++++++++++
.../admin-guide/kernel-parameters.txt | 17 +++
Documentation/userspace-api/spec_ctrl.rst | 8 ++
arch/Kconfig | 4 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/cacheflush.h | 8 ++
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/thread_info.h | 9 +-
arch/x86/include/asm/tlbflush.h | 2 +-
arch/x86/kernel/cpu/bugs.c | 54 +++++++++
arch/x86/kernel/smpboot.c | 11 +-
arch/x86/mm/tlb.c | 105 ++++++++++++++----
include/linux/sched.h | 10 ++
include/uapi/linux/prctl.h | 1 +
15 files changed, 274 insertions(+), 28 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

--
2.17.1


2020-11-27 08:54:03

by Singh, Balbir

Subject: [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour

Detection of task affinities at API opt-in time is not the best
approach; instead, the task is killed if it runs on an SMT enabled
core. This is better than not flushing the L1D cache when the task
switches from a non-SMT core to an SMT enabled core.

Signed-off-by: Balbir Singh <[email protected]>
---
arch/x86/include/asm/processor.h | 2 ++
arch/x86/kernel/smpboot.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..60dbcdcb833f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -136,6 +136,8 @@ struct cpuinfo_x86 {
u16 logical_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
+ /* Is SMT active on this core? */
+ bool smt_active;
u32 microcode;
/* Address space bits used by the cache internally */
u8 x86_cache_bits;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index de776b2e6046..9a94934fae5f 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -635,6 +635,9 @@ void set_cpu_sibling_map(int cpu)
threads = cpumask_weight(topology_sibling_cpumask(cpu));
if (threads > __max_smt_threads)
__max_smt_threads = threads;
+
+ for_each_cpu(i, topology_sibling_cpumask(cpu))
+ cpu_data(i).smt_active = threads > 1;
}

/* maps the cpu to the sched domain representing multi-core */
@@ -1548,10 +1551,16 @@ static void remove_siblinginfo(int cpu)

for_each_cpu(sibling, topology_die_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_die_cpumask(sibling));
- for_each_cpu(sibling, topology_sibling_cpumask(cpu))
+
+ for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+ if (cpumask_weight(topology_sibling_cpumask(sibling)) == 1)
+ cpu_data(sibling).smt_active = false;
+ }
+
for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
cpumask_clear(topology_core_cpumask(cpu));
--
2.17.1

2020-11-27 08:54:06

by Singh, Balbir

Subject: [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl

Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D
flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
PR_SPEC_DISABLE_NOEXEC are not supported.

Enabling L1D flush does not check if the task is running on an SMT
enabled core; instead, a check is done at runtime (at the time of
flush). If the task is found running on an SMT enabled core, it is
sent a SIGBUS (this is done prior to the task executing on the core,
so no data is leaked). This is better than the alternatives of:

a. Ensuring strict affinity of the task (hard to enforce
without further changes in the scheduler)
b. Silently skipping flush for tasks that move to SMT enabled
cores.

An arch config ARCH_HAS_PARANOID_L1D_FLUSH has been added, and struct
task_struct carries a callback_head for architectures that support this
config (currently only x86); this callback head is used to schedule
task work (SIGBUS delivery).

There is also no seccomp integration for the feature.
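
For reference, user space can read the per-task state back via
PR_GET_SPECULATION_CTRL; a small sketch (again assuming the
PR_SPEC_L1D_FLUSH_OUT uapi constant added by this patch), mirroring
l1d_flush_out_prctl_get() below:

  #include <sys/prctl.h>
  #include <linux/prctl.h>

  #ifndef PR_SPEC_L1D_FLUSH_OUT
  #define PR_SPEC_L1D_FLUSH_OUT  2
  #endif

  /*
   * Returns PR_SPEC_FORCE_DISABLE when booted with l1d_flush_out=off,
   * otherwise PR_SPEC_PRCTL | PR_SPEC_ENABLE or
   * PR_SPEC_PRCTL | PR_SPEC_DISABLE depending on TIF_SPEC_L1D_FLUSH.
   */
  static int l1d_flush_state(void)
  {
          return prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT,
                       0, 0, 0);
  }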

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/Kconfig | 4 +++
arch/x86/Kconfig | 1 +
arch/x86/kernel/cpu/bugs.c | 54 ++++++++++++++++++++++++++++++++++++++
arch/x86/mm/tlb.c | 30 ++++++++++++++++++++-
include/linux/sched.h | 10 +++++++
include/uapi/linux/prctl.h | 1 +
6 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..d4a0501ac7fc 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -311,6 +311,10 @@ config ARCH_32BIT_OFF_T
still support 32-bit off_t. This option is enabled for all such
architectures explicitly.

+config ARCH_HAS_PARANOID_L1D_FLUSH
+ bool
+ default n
+
config HAVE_ASM_MODVERSIONS
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..4f6caa6dae16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -101,6 +101,7 @@ config X86
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANTS_THP_SWAP if X86_64
+ select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
select CLKEVT_I8253
select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 581fb7223ad0..dece79e4d1e9 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -296,6 +296,13 @@ enum taa_mitigations {
TAA_MITIGATION_TSX_DISABLED,
};

+enum l1d_flush_out_mitigations {
+ L1D_FLUSH_OUT_OFF,
+ L1D_FLUSH_OUT_ON,
+};
+
+static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
+
/* Default mitigation for TAA-affected CPUs */
static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
static bool taa_nosmt __ro_after_init;
@@ -379,6 +386,18 @@ static void __init taa_select_mitigation(void)
pr_info("%s\n", taa_strings[taa_mitigation]);
}

+static int __init l1d_flush_out_parse_cmdline(char *str)
+{
+ if (!boot_cpu_has_bug(X86_BUG_L1TF))
+ return 0;
+
+ if (!strcmp(str, "off"))
+ l1d_flush_out_mitigation = L1D_FLUSH_OUT_OFF;
+
+ return 0;
+}
+early_param("l1d_flush_out", l1d_flush_out_parse_cmdline);
+
static int __init tsx_async_abort_parse_cmdline(char *str)
{
if (!boot_cpu_has_bug(X86_BUG_TAA))
@@ -1215,6 +1234,23 @@ static void task_update_spec_tif(struct task_struct *tsk)
speculation_ctrl_update_current();
}

+static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+
+ if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+ return -EPERM;
+
+ switch (ctrl) {
+ case PR_SPEC_ENABLE:
+ return enable_l1d_flush_for_task(task);
+ case PR_SPEC_DISABLE:
+ return disable_l1d_flush_for_task(task);
+ default:
+ return -ERANGE;
+ }
+ return 0;
+}
+
static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
{
if (ssb_mode != SPEC_STORE_BYPASS_PRCTL &&
@@ -1324,6 +1360,8 @@ int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which,
return ssb_prctl_set(task, ctrl);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_set(task, ctrl);
+ case PR_SPEC_L1D_FLUSH_OUT:
+ return l1d_flush_out_prctl_set(task, ctrl);
default:
return -ENODEV;
}
@@ -1340,6 +1378,20 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
}
#endif

+static int l1d_flush_out_prctl_get(struct task_struct *task)
+{
+ int ret;
+
+ if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+ return PR_SPEC_FORCE_DISABLE;
+
+ ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
+ if (ret)
+ return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+ else
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+}
+
static int ssb_prctl_get(struct task_struct *task)
{
switch (ssb_mode) {
@@ -1390,6 +1442,8 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
return ssb_prctl_get(task);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_get(task);
+ case PR_SPEC_L1D_FLUSH_OUT:
+ return l1d_flush_out_prctl_get(task);
default:
return -ENODEV;
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1531d98396a0..bdc399b86bc7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -315,6 +315,16 @@ EXPORT_SYMBOL_GPL(leave_mm);

int enable_l1d_flush_for_task(struct task_struct *tsk)
{
+ /*
+ * Do not enable L1D_FLUSH_OUT if
+ * b. The CPU is not affected by the L1TF bug
+ * c. The CPU does not have L1D FLUSH feature support
+ */
+
+ if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
+ !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
+ return -EINVAL;
+
set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
return 0;
}
@@ -335,13 +345,31 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
local_irq_restore(flags);
}

+/*
+ * Sent to a task that opts into L1D flushing via the prctl interface
+ * but ends up running on an SMT enabled core.
+ */
+static void l1d_flush_kill(struct callback_head *ch)
+{
+ force_sig(SIGBUS);
+}
+
static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
{
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
+ unsigned long next_mm;

BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
- return (unsigned long)next->mm | spec_bits;
+ next_mm = (unsigned long)next->mm | spec_bits;
+
+ if ((next_mm & LAST_USER_MM_L1D_FLUSH) && this_cpu_read(cpu_info.smt_active)) {
+ clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
+ next->l1d_flush_kill.func = l1d_flush_kill;
+ task_work_add(next, &next->l1d_flush_kill, true);
+ }
+
+ return next_mm;
}

static void cond_mitigation(struct task_struct *next)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 76cd21fa5501..f8c5b6833f14 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1348,6 +1348,16 @@ struct task_struct {
struct callback_head mce_kill_me;
#endif

+#ifdef CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH
+ /*
+ * If L1D flush is supported on mm context switch
+ * then we use this callback head to queue kill work
+ * to kill tasks that are not running on SMT disabled
+ * cores
+ */
+ struct callback_head l1d_flush_kill;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7f0827705c9a..c334e6a02e5f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -213,6 +213,7 @@ struct prctl_mm_map {
/* Speculation control variants */
# define PR_SPEC_STORE_BYPASS 0
# define PR_SPEC_INDIRECT_BRANCH 1
+# define PR_SPEC_L1D_FLUSH_OUT 2
/* Return and control values for PR_SET/GET_SPECULATION_CTRL */
# define PR_SPEC_NOT_AFFECTED 0
# define PR_SPEC_PRCTL (1UL << 0)
--
2.17.1

2020-11-27 08:54:13

by Singh, Balbir

Subject: [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch

Implement a mechanism to selectively flush the L1D cache. The goal is to
allow tasks that want to protect sensitive information, which could be
exposed by the recent snoop assisted data sampling vulnerabilities, to
flush their L1D on being switched out. This protects their data from
being snooped or leaked via side channels after the task has context
switched out.

There are two scenarios we might want to protect against: a task leaving
the CPU with data still in L1D (the main concern of this patch), and a
malicious (less trusted) task coming in, for which we want to clean up
the cache before it starts running. Only the former case is addressed.

A new thread_info flag TIF_SPEC_L1D_FLUSH is added to track tasks which
opt into L1D flushing. cpu_tlbstate.last_user_mm_spec is used to convert
the TIF flags into mm state (per CPU, via last_user_mm_spec) in
cond_mitigation(), which is then used to decide when to flush the
L1D cache.

A new inline helper function l1d_flush_hw() has been introduced.
Currently it returns an error code if hardware flushing is not
supported. The caller currently does not check the return value; in the
context of these patches, the routine is called only when HW-assisted
flushing is available.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/cacheflush.h | 8 ++++++++
arch/x86/include/asm/thread_info.h | 9 +++++++--
arch/x86/mm/tlb.c | 30 +++++++++++++++++++++++++++---
3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index b192d917a6d0..554eaf697f3f 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -10,4 +10,12 @@

void clflush_cache_range(void *addr, unsigned int size);

+static inline int l1d_flush_hw(void)
+{
+ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+ return 0;
+ }
+ return -EOPNOTSUPP;
+}
#endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 44733a4bfc42..36a11cfb1061 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,7 +84,7 @@ struct thread_info {
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
-#define TIF_SPEC_FORCE_UPDATE 10 /* Force speculation MSR update in context switch */
+#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
@@ -96,6 +96,7 @@ struct thread_info {
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
+#define TIF_SPEC_FORCE_UPDATE 23 /* Force speculation MSR update in context switch */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
@@ -113,7 +114,7 @@ struct thread_info {
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
-#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
+#define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
@@ -124,6 +125,7 @@ struct thread_info {
#define _TIF_SLD (1 << TIF_SLD)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
+#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
@@ -228,6 +230,9 @@ static inline int arch_within_stack_frames(const void * const stack,
current_thread_info()->status & TS_COMPAT)
#endif

+extern int enable_l1d_flush_for_task(struct task_struct *tsk);
+extern int disable_l1d_flush_for_task(struct task_struct *tsk);
+
extern void arch_task_cache_init(void);
extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
extern void arch_release_task_struct(struct task_struct *tsk);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ca021e039451..1531d98396a0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,11 +8,13 @@
#include <linux/export.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
+#include <linux/sched/smt.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/nospec-branch.h>
#include <asm/cache.h>
+#include <asm/cacheflush.h>
#include <asm/apic.h>

#include "mm_internal.h"
@@ -42,14 +44,15 @@
*/

/*
- * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
* stored in cpu_tlb_state.last_user_mm_spec.
*/
#define LAST_USER_MM_IBPB 0x1UL
-#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB)
+#define LAST_USER_MM_L1D_FLUSH 0x2UL
+#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)

/* Bits to set when tlbstate and flush is (re)initialized */
-#define LAST_USER_MM_INIT LAST_USER_MM_IBPB
+#define LAST_USER_MM_INIT (LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)

/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -310,6 +313,18 @@ void leave_mm(int cpu)
}
EXPORT_SYMBOL_GPL(leave_mm);

+int enable_l1d_flush_for_task(struct task_struct *tsk)
+{
+ set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+}
+
+int disable_l1d_flush_for_task(struct task_struct *tsk)
+{
+ clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+}
+
void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
@@ -325,6 +340,7 @@ static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;

+ BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
return (unsigned long)next->mm | spec_bits;
}

@@ -402,6 +418,14 @@ static void cond_mitigation(struct task_struct *next)
indirect_branch_prediction_barrier();
}

+ /*
+ * Flush only if SMT is disabled as per the contract, which is checked
+ * when the feature is enabled.
+ */
+ if (sched_smt_active() && !this_cpu_read(cpu_info.smt_active) &&
+ (prev_mm & LAST_USER_MM_L1D_FLUSH))
+ l1d_flush_hw();
+
this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}

--
2.17.1

2020-12-04 21:12:33

by Thomas Gleixner

Subject: Re: [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour

On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:

> Detection of task affinities at API opt-in time is not the best
> approach, the approach is to kill the task if it runs on a
> SMT enable core. This is better than not flushing the L1D cache
> when the task switches from a non-SMT core to an SMT enabled core.
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
> arch/x86/include/asm/processor.h | 2 ++
> arch/x86/kernel/smpboot.c | 11 ++++++++++-
> 2 files changed, 12 insertions(+), 1 deletion(-)

Subject, changelog match but patch content not so much :)

2020-12-04 21:25:01

by Thomas Gleixner

Subject: Re: [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch

On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
>
> + /*
> + * Flush only if SMT is disabled as per the contract, which is checked
> + * when the feature is enabled.
> + */
> + if (sched_smt_active() && !this_cpu_read(cpu_info.smt_active) &&
> + (prev_mm & LAST_USER_MM_L1D_FLUSH))
> + l1d_flush_hw();

So if SMT is completely disabled then no flush? Shouldn't the logic be:

if ((!sched_smt_active() || !this_cpu_read(cpu_info.smt_active) &&
(prev_mm & LAST_USER_MM_L1D_FLUSH))

Hmm?

But that's bad, because it's lots of conditions to evaluate for every
switch_mm where most of them are not interested in it at all.

Let me read through the rest of the pile.

Thanks,

tglx


2020-12-04 22:22:03

by Thomas Gleixner

Subject: Re: [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl

Balbir,

On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
> +enum l1d_flush_out_mitigations {
> + L1D_FLUSH_OUT_OFF,
> + L1D_FLUSH_OUT_ON,
> +};
> +
> +static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;

Why default on and why stays it on when the CPU is not affected by L1TF ...

> /* Default mitigation for TAA-affected CPUs */
> static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
> static bool taa_nosmt __ro_after_init;
> @@ -379,6 +386,18 @@ static void __init taa_select_mitigation(void)
> pr_info("%s\n", taa_strings[taa_mitigation]);
> }
>
> +static int __init l1d_flush_out_parse_cmdline(char *str)
> +{
> + if (!boot_cpu_has_bug(X86_BUG_L1TF))
> + return 0;

... while here you check for L1TF.

Also shouldn't it be default off and enabled via command line?

> +static int l1d_flush_out_prctl_get(struct task_struct *task)
> +{
> + int ret;
> +
> + if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> + return PR_SPEC_FORCE_DISABLE;
> +
> + ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);

That ret indirection is pointless. Just make it if (test_....)

> +static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
> +{
> +
> + if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> + return -EPERM;

So here you check for off and then...

> int enable_l1d_flush_for_task(struct task_struct *tsk)
> {
> + /*
> + * Do not enable L1D_FLUSH_OUT if
> + * b. The CPU is not affected by the L1TF bug
> + * c. The CPU does not have L1D FLUSH feature support
> + */
> +
> + if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
> + !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
> + return -EINVAL;

... you check for the feature bits with a malformatted condition at some
other place. It's not supported when these conditions are not there. So
why having this check here?

> +
> set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
> return 0;
> }

Aside of that, why is this in tlb.c and not in bugs.c? There is nothing
tlb specific in these enable/disable functions. They just fiddle with
the TIF bit.

> +/*
> + * Sent to a task that opts into L1D flushing via the prctl interface
> + * but ends up running on an SMT enabled core.
> + */
> +static void l1d_flush_kill(struct callback_head *ch)
> +{
> + force_sig(SIGBUS);
> +}
> +
> static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
> {
> unsigned long next_tif = task_thread_info(next)->flags;
> unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
> + unsigned long next_mm;
>
> BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
> - return (unsigned long)next->mm | spec_bits;
> + next_mm = (unsigned long)next->mm | spec_bits;
> +
> + if ((next_mm & LAST_USER_MM_L1D_FLUSH) && this_cpu_read(cpu_info.smt_active)) {

Wheeee. Yet more unconditional checks on every context switch.

> + clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
> + next->l1d_flush_kill.func = l1d_flush_kill;
> + task_work_add(next, &next->l1d_flush_kill, true);

int task_work_add(struct task_struct *task, struct callback_head *twork,
enum task_work_notify_mode mode);

true is truly not a valid enum constant ....

> + }

So you really want to have:

DEFINE_STATIC_KEY_FALSE(l1dflush_enabled);
static bool l1dflush_mitigation __init_data;

and then with the command line option you set l1dflush_mitigation and in
check_bugs() you invoke l1dflush_select_mitigation() which does:

if (!l1dflush_mitigation || !boot_cpu_has_bug(X86_BUG_L1TF) ||
!boot_cpu_has(X86_FEATURE_FLUSH_L1D))
return;

static_branch_enable(&l1dflush_enabled);

and then in l1d_flush_out_prctl_set()

if (!static_branch_unlikely(&l1dflush_enabled))
return -ENOTSUPP;

Then make the whole switch machinery do:

if (static_branch_unlikely(&l1dflush_enabled)) {
if (unlikely((next_mm | prev_mm) & LAST_USER_MM_L1D_FLUSH))
l1dflush_evaluate(next_mm, prev_mm);
}

and l1dflush_evaluate()

if (prev_mm & LAST_USER_MM_L1D_FLUSH)
l1d_flush();

if ((next_mm & LAST_USER_MM_L1D_FLUSH) &&
this_cpu_read(cpu_info.smt_active)) {

clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
next->l1d_flush_kill.func = l1d_flush_kill;
task_work_add(next, &next->l1d_flush_kill, TWA_RESUME);
}

That way the overhead is on systems where the admin decides to enable it
and if enabled the evaluation of prev_mm and next_mm is pushed out of
line.

Hmm?

Thanks,

tglx

2020-12-04 22:44:14

by Singh, Balbir

Subject: Re: [PATCH v3 3/5] x86/mm: Optionally flush L1D on context switch

On Fri, 2020-12-04 at 22:21 +0100, Thomas Gleixner wrote:
>
> On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
> >
> > + /*
> > + * Flush only if SMT is disabled as per the contract, which is checked
> > + * when the feature is enabled.
> > + */
> > + if (sched_smt_active() && !this_cpu_read(cpu_info.smt_active) &&
> > + (prev_mm & LAST_USER_MM_L1D_FLUSH))
> > + l1d_flush_hw();
>
> So if SMT is completely disabled then no flush? Shouldn't the logic be:
>
> if ((!sched_smt_active() || !this_cpu_read(cpu_info.smt_active) &&
> (prev_mm & LAST_USER_MM_L1D_FLUSH))
>
> Hmm?
>
> But that's bad, because it's lot's of conditions to evaluate for every
> switch_mm where most of them are not interested in it at all.
>
> Let me read through the rest of the pile.
>


We don't need this anymore with the new checks for preemptively killing
the task, so it can be removed.

Balbir

2020-12-04 22:47:08

by Singh, Balbir

Subject: Re: [PATCH v3 1/5] x86/mm: change l1d flush runtime prctl behaviour

On Fri, 2020-12-04 at 22:07 +0100, Thomas Gleixner wrote:
>
> On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
>
> > Detection of task affinities at API opt-in time is not the best
> > approach, the approach is to kill the task if it runs on a
> > SMT enable core. This is better than not flushing the L1D cache
> > when the task switches from a non-SMT core to an SMT enabled core.
> >
> > Signed-off-by: Balbir Singh <[email protected]>
> > ---
> > arch/x86/include/asm/processor.h | 2 ++
> > arch/x86/kernel/smpboot.c | 11 ++++++++++-
> > 2 files changed, 12 insertions(+), 1 deletion(-)
>
> Subject, changelog match but patch content not so much :)
>

The changelog jumped between 1/3 of my fixup and 1/5 of my
new post :)

The correct changelog is below, which I shall fix

x86/smp: Add a per-cpu view of SMT state

A new field smt_active in cpuinfo_x86 identifies if the current core/cpu
is in SMT mode or not. This can be very helpful if the system has threads
offlined on some of its cores, and can be used for cases where action is
taken based on the state of SMT. The follow-up patches use this feature.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---

2020-12-05 03:04:52

by Singh, Balbir

Subject: Re: [PATCH v3 4/5] prctl: Hook L1D flushing in via prctl

On Fri, Dec 04, 2020 at 11:19:17PM +0100, Thomas Gleixner wrote:
>
> Balbir,
>
> On Fri, Nov 27 2020 at 17:59, Balbir Singh wrote:
> > +enum l1d_flush_out_mitigations {
> > + L1D_FLUSH_OUT_OFF,
> > + L1D_FLUSH_OUT_ON,
> > +};
> > +
> > +static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
>
> Why default on and why stays it on when the CPU is not affected by L1TF ...
>

Because we don't set the PRCTL if the processor is not affected by the
bug.

> > /* Default mitigation for TAA-affected CPUs */
> > static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
> > static bool taa_nosmt __ro_after_init;
> > @@ -379,6 +386,18 @@ static void __init taa_select_mitigation(void)
> > pr_info("%s\n", taa_strings[taa_mitigation]);
> > }
> >
> > +static int __init l1d_flush_out_parse_cmdline(char *str)
> > +{
> > + if (!boot_cpu_has_bug(X86_BUG_L1TF))
> > + return 0;
>
> ... while here you check for L1TF.
>
> Also shouldn't it be default off and enabled via command line?
>

I chose the other way because the prctl is an opt-in as well

> > +static int l1d_flush_out_prctl_get(struct task_struct *task)
> > +{
> > + int ret;
> > +
> > + if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> > + return PR_SPEC_FORCE_DISABLE;
> > +
> > + ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
>
> That ret indirection is pointless. Just make it if (test_....)

Sure, will do

>
> > +static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
> > +{
> > +
> > + if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> > + return -EPERM;
>
> So here you check for off and then...
>

Yes

> > int enable_l1d_flush_for_task(struct task_struct *tsk)
> > {
> > + /*
> > + * Do not enable L1D_FLUSH_OUT if
> > + * b. The CPU is not affected by the L1TF bug
> > + * c. The CPU does not have L1D FLUSH feature support
> > + */
> > +
> > + if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
> > + !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
> > + return -EINVAL;
>
> ... you check for the feature bits with a malformatted condition at some
> other place. It's not supported when these conditions are not there. So
> why having this check here?
>
> > +
> > set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
> > return 0;
> > }
>
> Aside of that, why is this in tlb.c and not in bugs.c? There is nothing
> tlb specific in these enable/disable functions. They just fiddle with
> the TIF bit.
>

I can move them over.

> > +/*
> > + * Sent to a task that opts into L1D flushing via the prctl interface
> > + * but ends up running on an SMT enabled core.
> > + */
> > +static void l1d_flush_kill(struct callback_head *ch)
> > +{
> > + force_sig(SIGBUS);
> > +}
> > +
> > static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
> > {
> > unsigned long next_tif = task_thread_info(next)->flags;
> > unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
> > + unsigned long next_mm;
> >
> > BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
> > - return (unsigned long)next->mm | spec_bits;
> > + next_mm = (unsigned long)next->mm | spec_bits;
> > +
> > + if ((next_mm & LAST_USER_MM_L1D_FLUSH) && this_cpu_read(cpu_info.smt_active)) {
>
> Wheeee. Yet more unconditional checks on every context switch.

A task can only get here if it is affected by the bug (processor has
L1TF and L1D_FLUSH support) and the task opted in. I think what you're
suggesting is that we avoid the check for all tasks (the single next_mm
& LAST_USER_MM_L1D_FLUSH check) as well?

>
> > + clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
> > + next->l1d_flush_kill.func = l1d_flush_kill;
> > + task_work_add(next, &next->l1d_flush_kill, true);
>
> int task_work_add(struct task_struct *task, struct callback_head *twork,
> enum task_work_notify_mode mode);
>
> true is truly not a valid enum constant ....

:) I might really have added it when we were transitioning from true to
TWA_RESUME; I am surprised the compiler did not catch it. I'll move it
over.

>
> > + }
>
> So you really want to have:
>
> DEFINE_STATIC_KEY_FALSE(l1dflush_enabled);
> static bool l1dflush_mitigation __init_data;
>
> and then with the command line option you set l1dflush_mitigation and in
> check_bugs() you invoke l1dflush_select_mitigation() which does:
>
> if (!l1dflush_mitigation || !boot_cpu_has_bug(X86_BUG_L1TF) ||
> !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
> return;
>
> static_branch_enable(&l1dflush_enabled);
>
> and then in l1d_flush_out_prctl_set()
>
> if (!static_branch_unlikely(&l1dflush_enabled))
> return -ENOTSUPP;
>
> Then make the whole switch machinery do:
>
> if (static_branch_unlikely(&l1dflush_enabled)) {
> if (unlikely((next_mm | prev_mm) & LAST_USER_MM_L1D_FLUSH))
> l1dflush_evaluate(next_mm, prev_mm);
> }
>
> and l1dflush_evaluate()
>
> if (prev_mm & LAST_USER_MM_L1D_FLUSH)
> l1d_flush();
>
> if ((next_mm & LAST_USER_MM_L1D_FLUSH) &&
> this_cpu_read(cpu_info.smt_active)) {
>
> clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
> next->l1d_flush_kill.func = l1d_flush_kill;
> task_work_add(next, &next->l1d_flush_kill, TWA_RESUME);
> }
>
> That way the overhead is on systems where the admin decides to enable it
> and if enabled the evaluation of prev_mm and next_mm is pushed out of
> line.
>

OK, I'll rewrite it and see how it looks

Thanks for the review,
Balbir Singh