2020-07-29 00:12:08

by Singh, Balbir

Subject: [PATCH v2 0/5] Implement optional L1D flushing for

Implement a mechanism that allows tasks to conditionally flush
their L1D cache (the mitigation mechanism suggested in [2]). The
previous posting of these patches (see [3]) was not included due to
concerns that additional checks were needed; those checks were:

1. Implement this mechanism only for CPUs affected by the L1TF bug
2. Disable the software fallback
3. Provide an override to disable this mechanism completely

The patches support a use case where the system as a whole does not
run with SMT disabled; instead, a few CPUs have their SMT siblings
turned off, and processes that opt in are expected to run on those
non-SMT cores. This gives the administrator complete control over
setting up the mitigation for the issue. In addition, the
administrator has a boot time override (l1d_flush_out=off) to turn
off the mechanism completely.

To implement this efficiently, patch 1 adds a new per-CPU view of
whether the core is in SMT mode or not. Patch 2 refactors the
existing code so that it can accommodate other speculation-related
checks when switching mm between tasks; this mechanism has not
changed since the last post. Patch 3 adds the ability to flush the
L1D for a task if its TIF_SPEC_L1D_FLUSH bit is set and the task is
context switching out of a non-SMT core. Patch 4 provides the hooks
for the user space API, so the feature can be invoked via prctl,
along with the checks described above (1, 2, and 3).

The checks are:
a. The CPU is affected by L1TF
b. The hardware L1D flush mechanism is available
c. The task opting in has its affinity set to only non-SMT cores

Documentation updates are in patch 5, covering l1d_flush, the prctl
changes and the kernel-parameters addition (l1d_flush_out).
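
As a reference for reviewers, a minimal user space sketch of the
opt-in flow (PR_SPEC_L1D_FLUSH_OUT comes from patch 4; the CPU number
used for the affinity is purely illustrative and assumed to belong to
a non-SMT core):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SPEC_L1D_FLUSH_OUT
#define PR_SPEC_L1D_FLUSH_OUT 2	/* from patch 4, for older headers */
#endif

int main(void)
{
	cpu_set_t set;

	/* Restrict affinity to a core assumed to run with SMT off;
	 * CPU 3 is illustrative only. */
	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* Opt in to L1D flushing on context switch out. */
	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT,
		  PR_SPEC_ENABLE, 0, 0)) {
		perror("prctl");	/* fails per the checks above */
		return 1;
	}
	return 0;
}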

Balbir Singh (5):
Add a per-cpu view of SMT state
x86/mm: Refactor cond_ibpb() to support other use cases
x86/mm: Optionally flush L1D on context switch
prctl: Hook L1D flushing in via prctl
Documentation: Add L1D flushing Documentation

References:
[1] https://software.intel.com/security-software-guidance/software-guidance/snoop-assisted-l1-data-sampling
[2] https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
[3] https://lkml.org/lkml/2020/6/2/1150

Documentation/admin-guide/hw-vuln/index.rst | 1 +
.../admin-guide/hw-vuln/l1d_flush.rst | 70 ++++++++++++
.../admin-guide/kernel-parameters.txt | 17 +++
Documentation/userspace-api/spec_ctrl.rst | 8 ++
arch/x86/include/asm/cacheflush.h | 8 ++
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/thread_info.h | 9 +-
arch/x86/include/asm/tlbflush.h | 2 +-
arch/x86/kernel/cpu/bugs.c | 54 +++++++++
arch/x86/kernel/smpboot.c | 11 +-
arch/x86/mm/tlb.c | 104 +++++++++++++-----
include/uapi/linux/prctl.h | 1 +
12 files changed, 258 insertions(+), 29 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

--
2.17.1


2020-07-29 00:12:21

by Singh, Balbir

Subject: [PATCH v2 1/5] Add a per-cpu view of SMT state

A new field, smt_active, in cpuinfo_x86 identifies whether the current
core/CPU is in SMT mode. This is helpful on systems where only some
cores have their sibling threads offlined, and can be used wherever
action is taken based on the state of SMT. The follow-up patches use
this feature.
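
A minimal sketch of how a consumer might use the new field (the
helper name below is hypothetical; patch 3 performs the equivalent
check in cond_mitigation()):

	/* Act only on cores whose sibling threads are offline. */
	if (!this_cpu_read(cpu_info.smt_active))
		do_non_smt_mitigation();	/* hypothetical helper */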

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
---
arch/x86/include/asm/processor.h | 2 ++
arch/x86/kernel/smpboot.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 03b7c4ca425a..23ea45dd6f14 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -136,6 +136,8 @@ struct cpuinfo_x86 {
u16 logical_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
+ /* Is SMT active on this core? */
+ bool smt_active;
u32 microcode;
/* Address space bits used by the cache internally */
u8 x86_cache_bits;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ffbd9a3d78d8..0b12f24ebfbb 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -645,6 +645,9 @@ void set_cpu_sibling_map(int cpu)
threads = cpumask_weight(topology_sibling_cpumask(cpu));
if (threads > __max_smt_threads)
__max_smt_threads = threads;
+
+ for_each_cpu(i, topology_sibling_cpumask(cpu))
+ cpu_data(i).smt_active = threads > 1;
}

/* maps the cpu to the sched domain representing multi-core */
@@ -1557,10 +1560,16 @@ static void remove_siblinginfo(int cpu)

for_each_cpu(sibling, topology_die_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_die_cpumask(sibling));
- for_each_cpu(sibling, topology_sibling_cpumask(cpu))
+
+ for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+ if (cpumask_weight(topology_sibling_cpumask(sibling)) == 1)
+ cpu_data(sibling).smt_active = false;
+ }
+
for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
cpumask_clear(topology_core_cpumask(cpu));
--
2.17.1

2020-07-29 00:12:30

by Singh, Balbir

Subject: [PATCH v2 2/5] x86/mm: Refactor cond_ibpb() to support other use cases

cond_ibpb() has the necessary bits required to track the previous mm in
switch_mm_irqs_off(). This can be reused for other use cases like L1D
flushing on context switch.
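
The trick being generalized is that mm_struct pointers are at least
word-aligned, so the low bits of the saved pointer are free to carry
per-task speculation state. A sketch of the encoding as this patch
implements it (TIF_SPEC_IB is bit 9 of the thread flags):

	/* Shift the TIF word so TIF_SPEC_IB lands at bit 0, where it
	 * lines up with LAST_USER_MM_IBPB (0x1), then OR the result
	 * into the otherwise-aligned mm pointer. */
	spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
	saved_mm  = (unsigned long)next->mm | spec_bits;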

[ tglx: Moved comment, added a separate define for state (re)initialization ]

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/tlbflush.h | 2 +-
arch/x86/mm/tlb.c | 53 ++++++++++++++++++---------------
2 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..a927d40664df 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -83,7 +83,7 @@ struct tlb_state {
/* Last user mm for optimizing IBPB */
union {
struct mm_struct *last_user_mm;
- unsigned long last_user_mm_ibpb;
+ unsigned long last_user_mm_spec;
};

u16 loaded_mm_asid;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1a3569b43aa5..e031b324c704 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -43,10 +43,14 @@
*/

/*
- * Use bit 0 to mangle the TIF_SPEC_IB state into the mm pointer which is
- * stored in cpu_tlb_state.last_user_mm_ibpb.
+ * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * stored in cpu_tlb_state.last_user_mm_spec.
*/
#define LAST_USER_MM_IBPB 0x1UL
+#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB)
+
+/* Bits to set when tlbstate and flush is (re)initialized */
+#define LAST_USER_MM_INIT LAST_USER_MM_IBPB

/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -317,20 +321,29 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
local_irq_restore(flags);
}

-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
{
unsigned long next_tif = task_thread_info(next)->flags;
- unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
+ unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;

- return (unsigned long)next->mm | ibpb;
+ return (unsigned long)next->mm | spec_bits;
}

-static void cond_ibpb(struct task_struct *next)
+static void cond_mitigation(struct task_struct *next)
{
+ unsigned long prev_mm, next_mm;
+
if (!next || !next->mm)
return;

+ next_mm = mm_mangle_tif_spec_bits(next);
+ prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);
+
/*
+ * Avoid user/user BTB poisoning by flushing the branch predictor
+ * when switching between processes. This stops one process from
+ * doing Spectre-v2 attacks on another.
+ *
* Both, the conditional and the always IBPB mode use the mm
* pointer to avoid the IBPB when switching between tasks of the
* same process. Using the mm pointer instead of mm->context.ctx_id
@@ -340,8 +353,6 @@ static void cond_ibpb(struct task_struct *next)
* exposed data is not really interesting.
*/
if (static_branch_likely(&switch_mm_cond_ibpb)) {
- unsigned long prev_mm, next_mm;
-
/*
* This is a bit more complex than the always mode because
* it has to handle two cases:
@@ -371,20 +382,14 @@ static void cond_ibpb(struct task_struct *next)
* Optimize this with reasonably small overhead for the
* above cases. Mangle the TIF_SPEC_IB bit into the mm
* pointer of the incoming task which is stored in
- * cpu_tlbstate.last_user_mm_ibpb for comparison.
- */
- next_mm = mm_mangle_tif_spec_ib(next);
- prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
-
- /*
+ * cpu_tlbstate.last_user_mm_spec for comparison.
+ *
* Issue IBPB only if the mm's are different and one or
* both have the IBPB bit set.
*/
if (next_mm != prev_mm &&
(next_mm | prev_mm) & LAST_USER_MM_IBPB)
indirect_branch_prediction_barrier();
-
- this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
}

if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -393,11 +398,12 @@ static void cond_ibpb(struct task_struct *next)
* different context than the user space task which ran
* last on this CPU.
*/
- if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+ if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
+ (unsigned long)next->mm)
indirect_branch_prediction_barrier();
- this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
- }
}
+
+ this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}

#ifdef CONFIG_PERF_EVENTS
@@ -519,11 +525,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
need_flush = true;
} else {
/*
- * Avoid user/user BTB poisoning by flushing the branch
- * predictor when switching between processes. This stops
- * one process from doing Spectre-v2 attacks on another.
+ * Apply process to process speculation vulnerability
+ * mitigations if applicable.
*/
- cond_ibpb(tsk);
+ cond_mitigation(tsk);

/*
* Stop remote flushes for the previous mm.
@@ -640,7 +645,7 @@ void initialize_tlbstate_and_flush(void)
write_cr3(build_cr3(mm->pgd, 0));

/* Reinitialize tlbstate. */
- this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
+ this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
--
2.17.1

2020-07-29 00:13:53

by Singh, Balbir

Subject: [PATCH v2 3/5] x86/mm: Optionally flush L1D on context switch

Implement a mechanism to selectively flush the L1D cache. The goal is to
allow tasks that want to protect sensitive information, targeted by the
recently found snoop-assisted data sampling vulnerabilities, to flush
their L1D on being switched out. This protects their data from being
snooped or leaked via side channels after the task has context switched
out.

There are two scenarios we might want to protect against: a task leaving
the CPU with data still in L1D (which is the main concern of this patch),
and a malicious (less trusted) task coming in, for which we would want to
clean up the cache before it starts. Only the former case is addressed.

A new thread_info flag, TIF_SPEC_L1D_FLUSH, is added to track tasks which
opt into L1D flushing. cpu_tlbstate.last_user_mm_spec is used to convert
the TIF flags into mm state (per CPU, via last_user_mm_spec) in
cond_mitigation(), which is then used to decide when to flush the
L1D cache.
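
As a worked example of that conversion: TIF_SPEC_IB is bit 9 and the
new TIF_SPEC_L1D_FLUSH is bit 10, so (flags >> TIF_SPEC_IB) lands the
two bits at positions 0 and 1, exactly matching LAST_USER_MM_IBPB
(0x1) and LAST_USER_MM_L1D_FLUSH (0x2). The BUILD_BUG_ON() added
below enforces that adjacency.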

A new inline helper, l1d_flush_hw(), has been introduced. It returns an
error code if hardware flushing is not supported. The caller currently
does not check the return value; in the context of these patches, the
routine is called only when HW-assisted flushing is available.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
---
arch/x86/include/asm/cacheflush.h | 8 ++++++++
arch/x86/include/asm/thread_info.h | 9 +++++++--
arch/x86/mm/tlb.c | 30 +++++++++++++++++++++++++++---
3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index b192d917a6d0..554eaf697f3f 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -10,4 +10,12 @@

void clflush_cache_range(void *addr, unsigned int size);

+static inline int l1d_flush_hw(void)
+{
+ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+ return 0;
+ }
+ return -EOPNOTSUPP;
+}
#endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..1655347f11b9 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,7 +84,7 @@ struct thread_info {
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
-#define TIF_SPEC_FORCE_UPDATE 10 /* Force speculation MSR update in context switch */
+#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
@@ -96,6 +96,7 @@ struct thread_info {
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
+#define TIF_SPEC_FORCE_UPDATE 23 /* Force speculation MSR update in context switch */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
@@ -114,7 +115,7 @@ struct thread_info {
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
-#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
+#define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
@@ -125,6 +126,7 @@ struct thread_info {
#define _TIF_SLD (1 << TIF_SLD)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
+#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
@@ -235,6 +237,9 @@ static inline int arch_within_stack_frames(const void * const stack,
current_thread_info()->status & TS_COMPAT)
#endif

+extern int enable_l1d_flush_for_task(struct task_struct *tsk);
+extern int disable_l1d_flush_for_task(struct task_struct *tsk);
+
extern void arch_task_cache_init(void);
extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
extern void arch_release_task_struct(struct task_struct *tsk);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e031b324c704..48ccc3dd1492 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,11 +8,13 @@
#include <linux/export.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
+#include <linux/sched/smt.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/nospec-branch.h>
#include <asm/cache.h>
+#include <asm/cacheflush.h>
#include <asm/apic.h>
#include <asm/uv/uv.h>

@@ -43,14 +45,15 @@
*/

/*
- * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
* stored in cpu_tlb_state.last_user_mm_spec.
*/
#define LAST_USER_MM_IBPB 0x1UL
-#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB)
+#define LAST_USER_MM_L1D_FLUSH 0x2UL
+#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)

/* Bits to set when tlbstate and flush is (re)initialized */
-#define LAST_USER_MM_INIT LAST_USER_MM_IBPB
+#define LAST_USER_MM_INIT (LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)

/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -311,6 +314,18 @@ void leave_mm(int cpu)
}
EXPORT_SYMBOL_GPL(leave_mm);

+int enable_l1d_flush_for_task(struct task_struct *tsk)
+{
+ set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+}
+
+int disable_l1d_flush_for_task(struct task_struct *tsk)
+{
+ clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+}
+
void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
@@ -326,6 +341,7 @@ static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;

+ BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
return (unsigned long)next->mm | spec_bits;
}

@@ -403,6 +419,14 @@ static void cond_mitigation(struct task_struct *next)
indirect_branch_prediction_barrier();
}

+ /*
+ * Flush only if SMT is disabled as per the contract, which is checked
+ * when the feature is enabled.
+ */
+ if (sched_smt_active() && !this_cpu_read(cpu_info.smt_active) &&
+ (prev_mm & LAST_USER_MM_L1D_FLUSH))
+ l1d_flush_hw();
+
this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}

--
2.17.1

2020-07-29 00:14:33

by Singh, Balbir

Subject: [PATCH v2 5/5] Documentation: Add L1D flushing Documentation

Add documentation for L1D flushing: explain the need for the
feature and how it can be used.

Signed-off-by: Balbir Singh <[email protected]>
---
Documentation/admin-guide/hw-vuln/index.rst | 1 +
.../admin-guide/hw-vuln/l1d_flush.rst | 70 +++++++++++++++++++
.../admin-guide/kernel-parameters.txt | 17 +++++
Documentation/userspace-api/spec_ctrl.rst | 8 +++
4 files changed, 96 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd9016d..21710f8609fe 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
tsx_async_abort
multihit.rst
special-register-buffer-data-sampling.rst
+ l1d_flush.rst
diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
new file mode 100644
index 000000000000..adc4ecc72361
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
@@ -0,0 +1,70 @@
+L1D Flushing
+============
+
+With an increasing number of vulnerabilities being reported around data
+leaks from the Level 1 Data cache (L1D) the kernel provides an opt-in
+mechanism to flush the L1D cache on context switch.
+
+This mechanism can be used to address e.g. CVE-2020-0550. For
+applications, the mechanism keeps them safe from vulnerabilities
+related to leaks (snooping) from the L1D cache.
+
+
+Related CVEs
+------------
+The following CVEs can be addressed by this
+mechanism:
+
+ ============= ======================== ==================
+ CVE-2020-0550 Improper Data Forwarding OS related aspects
+ ============= ======================== ==================
+
+Usage Guidelines
+----------------
+
+Please see document: :ref:`Documentation/userspace-api/spec_ctrl.rst` for
+details.
+
+**NOTE**: The feature is disabled by default; applications need to
+specifically opt into the feature to enable it.
+
+Mitigation
+----------
+
+When L1D flushing is enabled for a task (PR_SPEC_L1D_FLUSH_OUT), the L1D
+cache is flushed when the task is scheduled out and the incoming task
+belongs to a different process and therefore to a different address space.
+
+If the underlying CPU supports L1D flushing in hardware, the hardware
+mechanism is used; a software fallback for the mitigation is not supported.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows controlling the L1D flush mitigation at boot
+time with the option "l1d_flush_out=". The valid arguments for this option are:
+
+ ============ =============================================================
+ off Disables the prctl interface, applications trying to use
+ the prctl() will fail with an error
+ ============ =============================================================
+
+By default the API is enabled and applications opt in by using the prctl
+API.
+
+Limitations
+-----------
+
+The mechanism does not mitigate L1D data leaks between tasks belonging to
+different processes which are concurrently executing on sibling threads of
+a physical CPU core when SMT is enabled on the system.
+
+This can be addressed by controlled placement of processes on physical CPU
+cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
+document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+**NOTE**: Checks have been added to ensure that the prctl API associated
+with the opt-in will work only when the affinity of the task opting in is
+limited to cores running in non-SMT mode. The same checks are made when
+the L1D is flushed. Changing the affinity after opting in would result in
+flushes not being performed on cores that are in SMT mode.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..59ea09095b7c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2272,6 +2272,23 @@
feature (tagged TLBs) on capable Intel chips.
Default is 1 (enabled)

+ l1d_flush_out= [X86,INTEL]
+ Control mitigation for L1D based snooping vulnerability.
+
+ Certain CPUs are vulnerable to an exploit against CPU
+ internal buffers which can forward information to a
+ disclosure gadget under certain conditions.
+
+ In vulnerable processors, the speculatively
+ forwarded data can be used in a cache side channel
+ attack, to access data to which the attacker does
+ not have direct access.
+
+ This parameter controls the mitigation. The
+ options are:
+
+ off - Unconditionally disable the mitigation
+
l1tf= [X86] Control mitigation of the L1TF vulnerability on
affected CPUs

diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst
index 7ddd8f667459..f39744ef8810 100644
--- a/Documentation/userspace-api/spec_ctrl.rst
+++ b/Documentation/userspace-api/spec_ctrl.rst
@@ -106,3 +106,11 @@ Speculation misfeature controls
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
+
+- PR_SPEC_L1D_FLUSH_OUT: Flush L1D Cache on context switch out of the task
+ (works only when the task runs on non-SMT cores)
+
+ Invocations:
+ * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, 0, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_ENABLE, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_DISABLE, 0, 0);
--
2.17.1

2020-07-29 00:16:03

by Singh, Balbir

Subject: [PATCH v2 4/5] prctl: Hook L1D flushing in via prctl

Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D
flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
PR_SPEC_DISABLE_NOEXEC are not supported.

There is also no seccomp integration for the feature.
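
For completeness, a sketch of what a caller observes for each failure
mode (the errno values follow from the returns in this patch; the
snippet is illustrative, not part of the series):

	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT,
		  PR_SPEC_ENABLE, 0, 0)) {
		/* errno == EPERM:  booted with l1d_flush_out=off
		 * errno == EINVAL: CPU not affected by L1TF, no hardware
		 *                  L1D flush support, or the task's
		 *                  affinity includes SMT cores
		 * errno == ERANGE: unsupported ctrl value, e.g.
		 *                  PR_SPEC_FORCE_DISABLE */
	}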

Signed-off-by: Balbir Singh <[email protected]>
---
arch/x86/kernel/cpu/bugs.c | 54 ++++++++++++++++++++++++++++++++++++++
arch/x86/mm/tlb.c | 25 +++++++++++++++++-
include/uapi/linux/prctl.h | 1 +
3 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0b71970d2d3d..935ea88313ab 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -295,6 +295,13 @@ enum taa_mitigations {
TAA_MITIGATION_TSX_DISABLED,
};

+enum l1d_flush_out_mitigations {
+ L1D_FLUSH_OUT_OFF,
+ L1D_FLUSH_OUT_ON,
+};
+
+static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
+
/* Default mitigation for TAA-affected CPUs */
static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
static bool taa_nosmt __ro_after_init;
@@ -378,6 +385,18 @@ static void __init taa_select_mitigation(void)
pr_info("%s\n", taa_strings[taa_mitigation]);
}

+static int __init l1d_flush_out_parse_cmdline(char *str)
+{
+ if (!boot_cpu_has_bug(X86_BUG_L1TF))
+ return 0;
+
+ if (!strcmp(str, "off"))
+ l1d_flush_out_mitigation = L1D_FLUSH_OUT_OFF;
+
+ return 0;
+}
+early_param("l1d_flush_out", l1d_flush_out_parse_cmdline);
+
static int __init tsx_async_abort_parse_cmdline(char *str)
{
if (!boot_cpu_has_bug(X86_BUG_TAA))
@@ -1220,6 +1239,23 @@ static void task_update_spec_tif(struct task_struct *tsk)
speculation_ctrl_update_current();
}

+static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+
+ if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+ return -EPERM;
+
+ switch (ctrl) {
+ case PR_SPEC_ENABLE:
+ return enable_l1d_flush_for_task(task);
+ case PR_SPEC_DISABLE:
+ return disable_l1d_flush_for_task(task);
+ default:
+ return -ERANGE;
+ }
+ return 0;
+}
+
static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
{
if (ssb_mode != SPEC_STORE_BYPASS_PRCTL &&
@@ -1312,6 +1348,8 @@ int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which,
return ssb_prctl_set(task, ctrl);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_set(task, ctrl);
+ case PR_SPEC_L1D_FLUSH_OUT:
+ return l1d_flush_out_prctl_set(task, ctrl);
default:
return -ENODEV;
}
@@ -1328,6 +1366,20 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
}
#endif

+static int l1d_flush_out_prctl_get(struct task_struct *task)
+{
+ int ret;
+
+ if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+ return PR_SPEC_FORCE_DISABLE;
+
+ ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
+ if (ret)
+ return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+ else
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+}
+
static int ssb_prctl_get(struct task_struct *task)
{
switch (ssb_mode) {
@@ -1381,6 +1433,8 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
return ssb_prctl_get(task);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_get(task);
+ case PR_SPEC_L1D_FLUSH_OUT:
+ return l1d_flush_out_prctl_get(task);
default:
return -ENODEV;
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 48ccc3dd1492..77b739929ad2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -316,8 +316,31 @@ EXPORT_SYMBOL_GPL(leave_mm);

int enable_l1d_flush_for_task(struct task_struct *tsk)
{
+ int cpu, ret = 0, i;
+
+ /*
+ * Do not enable L1D_FLUSH_OUT if
+ * a. The CPU is not affected by the L1TF bug
+ * b. The CPU does not have L1D FLUSH feature support
+ * c. The task's affinity is on cores with SMT on.
+ */
+
+ if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
+ !static_cpu_has(X86_FEATURE_FLUSH_L1D))
+ return -EINVAL;
+
+ cpu = get_cpu();
+
+ for_each_cpu(i, &tsk->cpus_mask) {
+ if (cpu_data(i).smt_active == true) {
+ put_cpu();
+ return -EINVAL;
+ }
+ }
+
set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
- return 0;
+ put_cpu();
+ return ret;
}

int disable_l1d_flush_for_task(struct task_struct *tsk)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..1e864867a367 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -213,6 +213,7 @@ struct prctl_mm_map {
/* Speculation control variants */
# define PR_SPEC_STORE_BYPASS 0
# define PR_SPEC_INDIRECT_BRANCH 1
+# define PR_SPEC_L1D_FLUSH_OUT 2
/* Return and control values for PR_SET/GET_SPECULATION_CTRL */
# define PR_SPEC_NOT_AFFECTED 0
# define PR_SPEC_PRCTL (1UL << 0)
--
2.17.1

2020-07-29 13:15:41

by Tom Lendacky

Subject: Re: [PATCH v2 4/5] prctl: Hook L1D flushing in via prctl

On 7/28/20 7:11 PM, Balbir Singh wrote:
> Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D
> flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
> PR_SPEC_DISABLE_NOEXEC are not supported.
>
> There is also no seccomp integration for the feature.
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
> arch/x86/kernel/cpu/bugs.c | 54 ++++++++++++++++++++++++++++++++++++++
> arch/x86/mm/tlb.c | 25 +++++++++++++++++-
> include/uapi/linux/prctl.h | 1 +
> 3 files changed, 79 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 0b71970d2d3d..935ea88313ab 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -295,6 +295,13 @@ enum taa_mitigations {
> TAA_MITIGATION_TSX_DISABLED,
> };
>
> +enum l1d_flush_out_mitigations {
> + L1D_FLUSH_OUT_OFF,
> + L1D_FLUSH_OUT_ON,
> +};
> +
> +static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
> +
> /* Default mitigation for TAA-affected CPUs */
> static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
> static bool taa_nosmt __ro_after_init;
> @@ -378,6 +385,18 @@ static void __init taa_select_mitigation(void)
> pr_info("%s\n", taa_strings[taa_mitigation]);
> }
>
> +static int __init l1d_flush_out_parse_cmdline(char *str)
> +{
> + if (!boot_cpu_has_bug(X86_BUG_L1TF))
> + return 0;

Shouldn't this set the l1d_flush_out_mitigation to L1D_FLUSH_OUT_OFF since
it is set to L1D_FLUSH_OUT_ON by default? Or does it not matter because
the enable_l1d_flush_for_task() will return -EINVAL if the cpu doesn't
have the L1TF bug?

I guess it depends on what you want l1d_flush_out_prctl_set() and
l1d_flush_out_prctl_get() to return in this case.

Thanks,
Tom

> +
> + if (!strcmp(str, "off"))
> + l1d_flush_out_mitigation = L1D_FLUSH_OUT_OFF;
> +
> + return 0;
> +}
> +early_param("l1d_flush_out", l1d_flush_out_parse_cmdline);
> +
> static int __init tsx_async_abort_parse_cmdline(char *str)
> {
> if (!boot_cpu_has_bug(X86_BUG_TAA))
> @@ -1220,6 +1239,23 @@ static void task_update_spec_tif(struct task_struct *tsk)
> speculation_ctrl_update_current();
> }
>
> +static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
> +{
> +
> + if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> + return -EPERM;
> +
> + switch (ctrl) {
> + case PR_SPEC_ENABLE:
> + return enable_l1d_flush_for_task(task);
> + case PR_SPEC_DISABLE:
> + return disable_l1d_flush_for_task(task);
> + default:
> + return -ERANGE;
> + }
> + return 0;
> +}
> +
> static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
> {
> if (ssb_mode != SPEC_STORE_BYPASS_PRCTL &&
> @@ -1312,6 +1348,8 @@ int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which,
> return ssb_prctl_set(task, ctrl);
> case PR_SPEC_INDIRECT_BRANCH:
> return ib_prctl_set(task, ctrl);
> + case PR_SPEC_L1D_FLUSH_OUT:
> + return l1d_flush_out_prctl_set(task, ctrl);
> default:
> return -ENODEV;
> }
> @@ -1328,6 +1366,20 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
> }
> #endif
>
> +static int l1d_flush_out_prctl_get(struct task_struct *task)
> +{
> + int ret;
> +
> + if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
> + return PR_SPEC_FORCE_DISABLE;
> +
> + ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
> + if (ret)
> + return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
> + else
> + return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
> +}
> +
> static int ssb_prctl_get(struct task_struct *task)
> {
> switch (ssb_mode) {
> @@ -1381,6 +1433,8 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
> return ssb_prctl_get(task);
> case PR_SPEC_INDIRECT_BRANCH:
> return ib_prctl_get(task);
> + case PR_SPEC_L1D_FLUSH_OUT:
> + return l1d_flush_out_prctl_get(task);
> default:
> return -ENODEV;
> }
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 48ccc3dd1492..77b739929ad2 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -316,8 +316,31 @@ EXPORT_SYMBOL_GPL(leave_mm);
>
> int enable_l1d_flush_for_task(struct task_struct *tsk)
> {
> + int cpu, ret = 0, i;
> +
> + /*
> + * Do not enable L1D_FLUSH_OUT if
> + * a. The CPU is not affected by the L1TF bug
> + * b. The CPU does not have L1D FLUSH feature support
> + * c. The task's affinity is on cores with SMT on.
> + */
> +
> + if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
> + !static_cpu_has(X86_FEATURE_FLUSH_L1D))
> + return -EINVAL;
> +
> + cpu = get_cpu();
> +
> + for_each_cpu(i, &tsk->cpus_mask) {
> + if (cpu_data(i).smt_active == true) {
> + put_cpu();
> + return -EINVAL;
> + }
> + }
> +
> set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
> - return 0;
> + put_cpu();
> + return ret;
> }
>
> int disable_l1d_flush_for_task(struct task_struct *tsk)
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..1e864867a367 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -213,6 +213,7 @@ struct prctl_mm_map {
> /* Speculation control variants */
> # define PR_SPEC_STORE_BYPASS 0
> # define PR_SPEC_INDIRECT_BRANCH 1
> +# define PR_SPEC_L1D_FLUSH_OUT 2
> /* Return and control values for PR_SET/GET_SPECULATION_CTRL */
> # define PR_SPEC_NOT_AFFECTED 0
> # define PR_SPEC_PRCTL (1UL << 0)
>

2020-07-30 00:14:46

by Singh, Balbir

Subject: Re: [PATCH v2 4/5] prctl: Hook L1D flushing in via prctl

On 29/7/20 11:14 pm, Tom Lendacky wrote:
>
>
> On 7/28/20 7:11 PM, Balbir Singh wrote:
>> Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D
>> flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
>> PR_SPEC_DISABLE_NOEXEC are not supported.
>>
>> There is also no seccomp integration for the feature.
>>
>> Signed-off-by: Balbir Singh <[email protected]>
>> ---
>> arch/x86/kernel/cpu/bugs.c | 54 ++++++++++++++++++++++++++++++++++++++
>> arch/x86/mm/tlb.c | 25 +++++++++++++++++-
>> include/uapi/linux/prctl.h | 1 +
>> 3 files changed, 79 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
>> index 0b71970d2d3d..935ea88313ab 100644
>> --- a/arch/x86/kernel/cpu/bugs.c
>> +++ b/arch/x86/kernel/cpu/bugs.c
>> @@ -295,6 +295,13 @@ enum taa_mitigations {
>> TAA_MITIGATION_TSX_DISABLED,
>> };
>>
>> +enum l1d_flush_out_mitigations {
>> + L1D_FLUSH_OUT_OFF,
>> + L1D_FLUSH_OUT_ON,
>> +};
>> +
>> +static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
>> +
>> /* Default mitigation for TAA-affected CPUs */
>> static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
>> static bool taa_nosmt __ro_after_init;
>> @@ -378,6 +385,18 @@ static void __init taa_select_mitigation(void)
>> pr_info("%s\n", taa_strings[taa_mitigation]);
>> }
>>
>> +static int __init l1d_flush_out_parse_cmdline(char *str)
>> +{
>> + if (!boot_cpu_has_bug(X86_BUG_L1TF))
>> + return 0;
>
> Shouldn't this set the l1d_flush_out_mitigation to L1D_FLUSH_OUT_OFF since
> it is set to L1D_FLUSH_OUT_ON by default? Or does it not matter because
> the enable_l1d_flush_for_task() will return -EINVAL if the cpu doesn't
> have the L1TF bug?
>
> I guess it depends on what you want l1d_flush_out_prctl_set() and
> l1d_flush_out_prctl_get() to return in this case.
>

Exactly! We want to differentiate between force disabled and not applicable.
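
(For the archive, a sketch of how user space tells the two apart with
the code as posted; the probe below is purely illustrative:)

	int st = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT,
		       0, 0, 0);
	if (st == PR_SPEC_FORCE_DISABLE) {
		/* force disabled: booted with l1d_flush_out=off */
	} else if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT,
			 PR_SPEC_ENABLE, 0, 0) && errno == EINVAL) {
		/* not applicable: L1TF/FLUSH_L1D/affinity checks failed */
	}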


Thanks for the review,
Balbir Singh.

Subject: [tip: x86/pti] Documentation: Add L1D flushing Documentation

The following commit has been merged into the x86/pti branch of tip:

Commit-ID: 767d46ab566dd489733666efe48732d523c8c332
Gitweb: https://git.kernel.org/tip/767d46ab566dd489733666efe48732d523c8c332
Author: Balbir Singh <[email protected]>
AuthorDate: Wed, 29 Jul 2020 10:11:03 +10:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 16 Sep 2020 15:08:03 +02:00

Documentation: Add L1D flushing Documentation

Add documentation for L1D flushing: explain the need for the
feature and how it can be used.

Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
Documentation/admin-guide/hw-vuln/index.rst | 1 +-
Documentation/admin-guide/hw-vuln/l1d_flush.rst | 70 ++++++++++++++++-
Documentation/admin-guide/kernel-parameters.txt | 17 ++++-
Documentation/userspace-api/spec_ctrl.rst | 8 ++-
4 files changed, 96 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst

diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd..21710f8 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
tsx_async_abort
multihit.rst
special-register-buffer-data-sampling.rst
+ l1d_flush.rst
diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
new file mode 100644
index 0000000..adc4ecc
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
@@ -0,0 +1,70 @@
+L1D Flushing
+============
+
+With an increasing number of vulnerabilities being reported around data
+leaks from the Level 1 Data cache (L1D) the kernel provides an opt-in
+mechanism to flush the L1D cache on context switch.
+
+This mechanism can be used to address e.g. CVE-2020-0550. For
+applications, the mechanism keeps them safe from vulnerabilities
+related to leaks (snooping) from the L1D cache.
+
+
+Related CVEs
+------------
+The following CVEs can be addressed by this
+mechanism:
+
+ ============= ======================== ==================
+ CVE-2020-0550 Improper Data Forwarding OS related aspects
+ ============= ======================== ==================
+
+Usage Guidelines
+----------------
+
+Please see document: :ref:`Documentation/userspace-api/spec_ctrl.rst` for
+details.
+
+**NOTE**: The feature is disabled by default; applications need to
+specifically opt into the feature to enable it.
+
+Mitigation
+----------
+
+When L1D flushing is enabled for a task (PR_SPEC_L1D_FLUSH_OUT), the L1D
+cache is flushed when the task is scheduled out and the incoming task
+belongs to a different process and therefore to a different address space.
+
+If the underlying CPU supports L1D flushing in hardware, the hardware
+mechanism is used; a software fallback for the mitigation is not supported.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows controlling the L1D flush mitigation at boot
+time with the option "l1d_flush_out=". The valid arguments for this option are:
+
+ ============ =============================================================
+ off Disables the prctl interface, applications trying to use
+ the prctl() will fail with an error
+ ============ =============================================================
+
+By default the API is enabled and applications opt in by using the prctl
+API.
+
+Limitations
+-----------
+
+The mechanism does not mitigate L1D data leaks between tasks belonging to
+different processes which are concurrently executing on sibling threads of
+a physical CPU core when SMT is enabled on the system.
+
+This can be addressed by controlled placement of processes on physical CPU
+cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
+document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+**NOTE**: Checks have been added to ensure that the prctl API associated
+with the opt-in will work only when the affinity of the task opting in is
+limited to cores running in non-SMT mode. The same checks are made when
+the L1D is flushed. Changing the affinity after opting in would result in
+flushes not being performed on cores that are in SMT mode.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a106874..0c3f315 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2295,6 +2295,23 @@
feature (tagged TLBs) on capable Intel chips.
Default is 1 (enabled)

+ l1d_flush_out= [X86,INTEL]
+ Control mitigation for L1D based snooping vulnerability.
+
+ Certain CPUs are vulnerable to an exploit against CPU
+ internal buffers which can forward information to a
+ disclosure gadget under certain conditions.
+
+ In vulnerable processors, the speculatively
+ forwarded data can be used in a cache side channel
+ attack, to access data to which the attacker does
+ not have direct access.
+
+ This parameter controls the mitigation. The
+ options are:
+
+ off - Unconditionally disable the mitigation
+
l1tf= [X86] Control mitigation of the L1TF vulnerability on
affected CPUs

diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst
index 7ddd8f6..f39744e 100644
--- a/Documentation/userspace-api/spec_ctrl.rst
+++ b/Documentation/userspace-api/spec_ctrl.rst
@@ -106,3 +106,11 @@ Speculation misfeature controls
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
+
+- PR_SPEC_L1D_FLUSH_OUT: Flush L1D Cache on context switch out of the task
+ (works only when the task runs on non-SMT cores)
+
+ Invocations:
+ * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, 0, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_ENABLE, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_DISABLE, 0, 0);

Subject: [tip: x86/pti] x86/smp: Add a per-cpu view of SMT state

The following commit has been merged into the x86/pti branch of tip:

Commit-ID: 0a260b1c5867863121b044afa8087d6b37e4fb7d
Gitweb: https://git.kernel.org/tip/0a260b1c5867863121b044afa8087d6b37e4fb7d
Author: Balbir Singh <[email protected]>
AuthorDate: Wed, 29 Jul 2020 10:10:59 +10:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 16 Sep 2020 15:08:02 +02:00

x86/smp: Add a per-cpu view of SMT state

A new field, smt_active, in cpuinfo_x86 identifies whether the current
core/CPU is in SMT mode. This is helpful on systems where only some
cores have their sibling threads offlined, and can be used wherever
action is taken based on the state of SMT. The follow-up patches use
this feature.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
arch/x86/include/asm/processor.h | 2 ++
arch/x86/kernel/smpboot.c | 11 ++++++++++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 97143d8..d9eb20f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -136,6 +136,8 @@ struct cpuinfo_x86 {
u16 logical_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
+ /* Is SMT active on this core? */
+ bool smt_active;
u32 microcode;
/* Address space bits used by the cache internally */
u8 x86_cache_bits;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f5ef689..5fc7e0e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -635,6 +635,9 @@ void set_cpu_sibling_map(int cpu)
threads = cpumask_weight(topology_sibling_cpumask(cpu));
if (threads > __max_smt_threads)
__max_smt_threads = threads;
+
+ for_each_cpu(i, topology_sibling_cpumask(cpu))
+ cpu_data(i).smt_active = threads > 1;
}

/* maps the cpu to the sched domain representing multi-core */
@@ -1548,10 +1551,16 @@ static void remove_siblinginfo(int cpu)

for_each_cpu(sibling, topology_die_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_die_cpumask(sibling));
- for_each_cpu(sibling, topology_sibling_cpumask(cpu))
+
+ for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+ if (cpumask_weight(topology_sibling_cpumask(sibling)) == 1)
+ cpu_data(sibling).smt_active = false;
+ }
+
for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
cpumask_clear(topology_core_cpumask(cpu));

Subject: [tip: x86/pti] x86/mm: Optionally flush L1D on context switch

The following commit has been merged into the x86/pti branch of tip:

Commit-ID: a9210620ec360f7375282ff1d35c8f8016ccc986
Gitweb: https://git.kernel.org/tip/a9210620ec360f7375282ff1d35c8f8016ccc986
Author: Balbir Singh <[email protected]>
AuthorDate: Wed, 29 Jul 2020 10:11:01 +10:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 16 Sep 2020 15:08:02 +02:00

x86/mm: Optionally flush L1D on context switch

Implement a mechanism to selectively flush the L1D cache. The goal is to
allow tasks that want to protect sensitive information, targeted by the
recently found snoop-assisted data sampling vulnerabilities, to flush
their L1D on being switched out. This protects their data from being
snooped or leaked via side channels after the task has context switched
out.

There are two scenarios we might want to protect against: a task leaving
the CPU with data still in L1D (which is the main concern of this patch),
and a malicious (less trusted) task coming in, for which we would want to
clean up the cache before it starts. Only the former case is addressed.

A new thread_info flag, TIF_SPEC_L1D_FLUSH, is added to track tasks which
opt into L1D flushing. cpu_tlbstate.last_user_mm_spec is used to convert
the TIF flags into mm state (per CPU, via last_user_mm_spec) in
cond_mitigation(), which is then used to decide when to flush the
L1D cache.

A new inline helper, l1d_flush_hw(), has been introduced. It returns an
error code if hardware flushing is not supported. The caller currently
does not check the return value; in the context of these patches, the
routine is called only when HW-assisted flushing is available.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
arch/x86/include/asm/cacheflush.h | 8 ++++++++-
arch/x86/include/asm/thread_info.h | 9 +++++++--
arch/x86/mm/tlb.c | 30 ++++++++++++++++++++++++++---
3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index b192d91..554eaf6 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -10,4 +10,12 @@

void clflush_cache_range(void *addr, unsigned int size);

+static inline int l1d_flush_hw(void)
+{
+ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+ return 0;
+ }
+ return -EOPNOTSUPP;
+}
#endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701a..c448fcf 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,7 +84,7 @@ struct thread_info {
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
-#define TIF_SPEC_FORCE_UPDATE 10 /* Force speculation MSR update in context switch */
+#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
@@ -96,6 +96,7 @@ struct thread_info {
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
+#define TIF_SPEC_FORCE_UPDATE 23 /* Force speculation MSR update in context switch */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
@@ -114,7 +115,7 @@ struct thread_info {
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
-#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
+#define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
@@ -125,6 +126,7 @@ struct thread_info {
#define _TIF_SLD (1 << TIF_SLD)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
+#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
@@ -230,6 +232,9 @@ static inline int arch_within_stack_frames(const void * const stack,
current_thread_info()->status & TS_COMPAT)
#endif

+extern int enable_l1d_flush_for_task(struct task_struct *tsk);
+extern int disable_l1d_flush_for_task(struct task_struct *tsk);
+
extern void arch_task_cache_init(void);
extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
extern void arch_release_task_struct(struct task_struct *tsk);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6bbd758..6369a54 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,11 +8,13 @@
#include <linux/export.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
+#include <linux/sched/smt.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/nospec-branch.h>
#include <asm/cache.h>
+#include <asm/cacheflush.h>
#include <asm/apic.h>
#include <asm/uv/uv.h>

@@ -43,14 +45,15 @@
*/

/*
- * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
* stored in cpu_tlb_state.last_user_mm_spec.
*/
#define LAST_USER_MM_IBPB 0x1UL
-#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB)
+#define LAST_USER_MM_L1D_FLUSH 0x2UL
+#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)

/* Bits to set when tlbstate and flush is (re)initialized */
-#define LAST_USER_MM_INIT LAST_USER_MM_IBPB
+#define LAST_USER_MM_INIT (LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)

/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -311,6 +314,18 @@ void leave_mm(int cpu)
}
EXPORT_SYMBOL_GPL(leave_mm);

+int enable_l1d_flush_for_task(struct task_struct *tsk)
+{
+ set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+}
+
+int disable_l1d_flush_for_task(struct task_struct *tsk)
+{
+ clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+}
+
void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
@@ -326,6 +341,7 @@ static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;

+ BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
return (unsigned long)next->mm | spec_bits;
}

@@ -403,6 +419,14 @@ static void cond_mitigation(struct task_struct *next)
indirect_branch_prediction_barrier();
}

+ /*
+ * Flush only if SMT is disabled as per the contract, which is checked
+ * when the feature is enabled.
+ */
+ if (sched_smt_active() && !this_cpu_read(cpu_info.smt_active) &&
+ (prev_mm & LAST_USER_MM_L1D_FLUSH))
+ l1d_flush_hw();
+
this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}

Subject: [tip: x86/pti] prctl: Hook L1D flushing in via prctl

The following commit has been merged into the x86/pti branch of tip:

Commit-ID: b6724f118d44606fddde391ba7527526b3cad211
Gitweb: https://git.kernel.org/tip/b6724f118d44606fddde391ba7527526b3cad211
Author: Balbir Singh <[email protected]>
AuthorDate: Wed, 29 Jul 2020 10:11:02 +10:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 16 Sep 2020 15:08:03 +02:00

prctl: Hook L1D flushing in via prctl

Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D
flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
PR_SPEC_DISABLE_NOEXEC are not supported.

There is also no seccomp integration for the feature.

Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
arch/x86/kernel/cpu/bugs.c | 54 +++++++++++++++++++++++++++++++++++++-
arch/x86/mm/tlb.c | 25 ++++++++++++++++-
include/uapi/linux/prctl.h | 1 +-
3 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d3f0db4..3923e48 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -296,6 +296,13 @@ enum taa_mitigations {
TAA_MITIGATION_TSX_DISABLED,
};

+enum l1d_flush_out_mitigations {
+ L1D_FLUSH_OUT_OFF,
+ L1D_FLUSH_OUT_ON,
+};
+
+static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
+
/* Default mitigation for TAA-affected CPUs */
static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
static bool taa_nosmt __ro_after_init;
@@ -379,6 +386,18 @@ out:
pr_info("%s\n", taa_strings[taa_mitigation]);
}

+static int __init l1d_flush_out_parse_cmdline(char *str)
+{
+ if (!boot_cpu_has_bug(X86_BUG_L1TF))
+ return 0;
+
+ if (!strcmp(str, "off"))
+ l1d_flush_out_mitigation = L1D_FLUSH_OUT_OFF;
+
+ return 0;
+}
+early_param("l1d_flush_out", l1d_flush_out_parse_cmdline);
+
static int __init tsx_async_abort_parse_cmdline(char *str)
{
if (!boot_cpu_has_bug(X86_BUG_TAA))
@@ -1215,6 +1234,23 @@ static void task_update_spec_tif(struct task_struct *tsk)
speculation_ctrl_update_current();
}

+static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+
+ if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+ return -EPERM;
+
+ switch (ctrl) {
+ case PR_SPEC_ENABLE:
+ return enable_l1d_flush_for_task(task);
+ case PR_SPEC_DISABLE:
+ return disable_l1d_flush_for_task(task);
+ default:
+ return -ERANGE;
+ }
+ return 0;
+}
+
static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
{
if (ssb_mode != SPEC_STORE_BYPASS_PRCTL &&
@@ -1306,6 +1342,8 @@ int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which,
return ssb_prctl_set(task, ctrl);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_set(task, ctrl);
+ case PR_SPEC_L1D_FLUSH_OUT:
+ return l1d_flush_out_prctl_set(task, ctrl);
default:
return -ENODEV;
}
@@ -1322,6 +1360,20 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
}
#endif

+static int l1d_flush_out_prctl_get(struct task_struct *task)
+{
+ int ret;
+
+ if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+ return PR_SPEC_FORCE_DISABLE;
+
+ ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
+ if (ret)
+ return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+ else
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+}
+
static int ssb_prctl_get(struct task_struct *task)
{
switch (ssb_mode) {
@@ -1375,6 +1427,8 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
return ssb_prctl_get(task);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_get(task);
+ case PR_SPEC_L1D_FLUSH_OUT:
+ return l1d_flush_out_prctl_get(task);
default:
return -ENODEV;
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6369a54..6b0f4c8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -316,8 +316,31 @@ EXPORT_SYMBOL_GPL(leave_mm);

int enable_l1d_flush_for_task(struct task_struct *tsk)
{
+ int cpu, ret = 0, i;
+
+ /*
+ * Do not enable L1D_FLUSH_OUT if
+ * a. The CPU is not affected by the L1TF bug
+ * b. The CPU does not have L1D FLUSH feature support
+ * c. The task's affinity is on cores with SMT on.
+ */
+
+ if (!boot_cpu_has_bug(X86_BUG_L1TF) ||
+ !static_cpu_has(X86_FEATURE_FLUSH_L1D))
+ return -EINVAL;
+
+ cpu = get_cpu();
+
+ for_each_cpu(i, &tsk->cpus_mask) {
+ if (cpu_data(i).smt_active == true) {
+ put_cpu();
+ return -EINVAL;
+ }
+ }
+
set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
- return 0;
+ put_cpu();
+ return ret;
}

int disable_l1d_flush_for_task(struct task_struct *tsk)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f81..1e86486 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -213,6 +213,7 @@ struct prctl_mm_map {
/* Speculation control variants */
# define PR_SPEC_STORE_BYPASS 0
# define PR_SPEC_INDIRECT_BRANCH 1
+# define PR_SPEC_L1D_FLUSH_OUT 2
/* Return and control values for PR_SET/GET_SPECULATION_CTRL */
# define PR_SPEC_NOT_AFFECTED 0
# define PR_SPEC_PRCTL (1UL << 0)