Implement a mechanism that allows tasks to conditionally flush
their L1D cache (mitigation mechanism suggested in [2]). The previous
posts of these patches were sent for inclusion (see [3]) but were not
included due to concerns about the need for additional checks;
those checks were:
1. Implement this mechanism only for CPUs affected by the L1TF bug
2. Disable the software fallback
3. Provide an override to enable this mechanism
4. Be SMT aware in the implementation
The patches support a use case where the entire system does not need
to run with SMT disabled; instead, a few CPUs can have their SMT
turned off and processes that want to opt in are expected to run on
those non-SMT cores. This gives the administrator complete control
over setting up the mitigation for the issue. In addition, the
administrator has a boot time override (l1d_flush=on) to turn on the
mechanism, without which the mechanism will not work.
To implement these efficiently, a new per-CPU view of whether the core
is in SMT mode or not is added in patch 1. The code is refactored in
patch 2 so that the existing code can allow for other speculation
related checks when switching mm between tasks; this mechanism has not
changed since the last post. The ability to flush L1D for tasks that
have the TIF_SPEC_L1D_FLUSH bit set and have context switched out of a
non-SMT core is provided by patch 3. Hooks for the user space API, so
this feature can be invoked via prctl, are provided in patch 4, along
with the checks described above (1, 2, and 3). Documentation updates
are in patch 5, covering l1d_flush, the prctl changes and updates to
the kernel-parameters documentation (l1d_flush).
The checks for opting into L1D flushing are:
a. The CPU is affected by L1TF
b. A hardware L1D flush mechanism is available
A task running on a core with SMT enabled and opting into this feature will
receive a SIGBUS.
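For illustration, below is a minimal userspace sketch of the opt-in
described above. The prctl values match the uapi additions in patch 4;
the fallback defines are only needed when building against older
headers, and the error path assumes the -EPERM returned when
l1d_flush=on was not given.

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_SPECULATION_CTRL
	# define PR_SET_SPECULATION_CTRL	53
	# define PR_SPEC_ENABLE			(1UL << 1)
	#endif
	#ifndef PR_SPEC_L1D_FLUSH
	# define PR_SPEC_L1D_FLUSH		2
	#endif

	int main(void)
	{
		/* Ask for an L1D flush whenever this task is switched out. */
		if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH,
			  PR_SPEC_ENABLE, 0, 0)) {
			perror("PR_SPEC_L1D_FLUSH (is l1d_flush=on set?)");
			return 1;
		}

		/* ... work on sensitive data here ... */
		return 0;
	}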
References
[1] https://software.intel.com/security-software-guidance/software-guidance/snoop-assisted-l1-data-sampling
[2] https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
[3] https://lkml.org/lkml/2020/6/2/1150
[4] https://lore.kernel.org/lkml/[email protected]/
[5] https://lore.kernel.org/lkml/[email protected]/
Reviewer's guide to v4
- The key patch in the series and most of the changes in this
  revision are to patch 4. Patches 3 and 5 have been modified
  to keep them consistent with the changes to patch 4.
Changelog v4:
- Use a static key to enable the mechanism (removes overhead)
- By default the mechanism is turned off, so two opt-ins are
  needed: one by the administrator at boot time, a second by
  the application
- Rename l1d_flush_out/L1D_FLUSH_OUT to l1d_flush/L1D_FLUSH
- Implement other review recommendations
Changelog v3:
- Implement the SIGBUS mechanism
- Update and fix the documentation
Balbir Singh (5):
x86/smp: Add a per-cpu view of SMT state
x86/mm: Refactor cond_ibpb() to support other use cases
x86/mm: Optionally flush L1D on context switch
prctl: Hook L1D flushing in via prctl
Documentation: Add L1D flushing Documentation
Documentation/admin-guide/hw-vuln/index.rst | 1 +
.../admin-guide/hw-vuln/l1d_flush.rst | 70 +++++++++++++++
.../admin-guide/kernel-parameters.txt | 17 ++++
Documentation/userspace-api/spec_ctrl.rst | 8 ++
arch/Kconfig | 4 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/cacheflush.h | 8 ++
arch/x86/include/asm/nospec-branch.h | 2 +
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/thread_info.h | 6 +-
arch/x86/include/asm/tlbflush.h | 2 +-
arch/x86/kernel/cpu/bugs.c | 71 +++++++++++++++
arch/x86/kernel/smpboot.c | 10 ++-
arch/x86/mm/tlb.c | 88 ++++++++++++++-----
include/linux/sched.h | 10 +++
include/uapi/linux/prctl.h | 1 +
16 files changed, 273 insertions(+), 28 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst
--
2.17.1
A new field, smt_active, in cpuinfo_x86 identifies whether the current
core/CPU is in SMT mode or not. This can be very helpful when the
system has some of its cores with sibling threads offlined, and can be
used for cases where action is taken based on the state of SMT. The
follow-up patches use this field.
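As a condensed sketch of how a later patch in this series consults the
new field (the real call site is in the switch_mm() changes of patch 3):

	/* Flush only when the current core is not running with SMT. */
	if (!this_cpu_read(cpu_info.smt_active) &&
	    (prev_mm & LAST_USER_MM_L1D_FLUSH))
		l1d_flush_hw();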
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
---
arch/x86/include/asm/processor.h | 2 ++
arch/x86/kernel/smpboot.c | 10 +++++++++-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c20a52b5534b..a411466a6e74 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -136,6 +136,8 @@ struct cpuinfo_x86 {
u16 logical_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
+ /* Is SMT active on this core? */
+ bool smt_active;
u32 microcode;
/* Address space bits used by the cache internally */
u8 x86_cache_bits;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8ca66af96a54..5f6df298d785 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -640,6 +640,9 @@ void set_cpu_sibling_map(int cpu)
threads = cpumask_weight(topology_sibling_cpumask(cpu));
if (threads > __max_smt_threads)
__max_smt_threads = threads;
+
+ for_each_cpu(i, topology_sibling_cpumask(cpu))
+ cpu_data(i).smt_active = threads > 1;
}
/* maps the cpu to the sched domain representing multi-core */
@@ -1551,8 +1554,13 @@ static void remove_siblinginfo(int cpu)
for_each_cpu(sibling, topology_die_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_die_cpumask(sibling));
- for_each_cpu(sibling, topology_sibling_cpumask(cpu))
+
+ for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+ if (cpumask_weight(topology_sibling_cpumask(sibling)) == 1)
+ cpu_data(sibling).smt_active = false;
+ }
+
for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
--
2.17.1
Add documentation for L1D flushing, explaining the need for the
feature and how it can be used.
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
Documentation/admin-guide/hw-vuln/index.rst | 1 +
.../admin-guide/hw-vuln/l1d_flush.rst | 70 +++++++++++++++++++
.../admin-guide/kernel-parameters.txt | 17 +++++
Documentation/userspace-api/spec_ctrl.rst | 8 +++
4 files changed, 96 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index ca4dbdd9016d..21710f8609fe 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -15,3 +15,4 @@ are configurable at compile, boot or run time.
tsx_async_abort
multihit.rst
special-register-buffer-data-sampling.rst
+ l1d_flush.rst
diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
new file mode 100644
index 000000000000..d9bd931641b3
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
@@ -0,0 +1,70 @@
+L1D Flushing
+============
+
+With an increasing number of vulnerabilities being reported around data
+leaks from the Level 1 Data cache (L1D) the kernel provides an opt-in
+mechanism to flush the L1D cache on context switch.
+
+This mechanism can be used to address e.g. CVE-2020-0550. For
+applications that opt in, the mechanism protects sensitive data from
+being leaked or snooped from the L1D cache.
+
+
+Related CVEs
+------------
+The following CVEs can be addressed by this
+mechanism
+
+ ============= ======================== ==================
+ CVE-2020-0550 Improper Data Forwarding OS related aspects
+ ============= ======================== ==================
+
+Usage Guidelines
+----------------
+
+Please see document: :ref:`Documentation/userspace-api/spec_ctrl.rst
+<set_spec_ctrl>` for details.
+
+**NOTE**: The feature is disabled by default, applications need to
+specifically opt into the feature to enable it.
+
+Mitigation
+----------
+
+When PR_SET_L1D_FLUSH is enabled for a task a flush of the L1D cache is
+performed when the task is scheduled out and the incoming task belongs to a
+different process and therefore to a different address space.
+
+If the underlying CPU supports L1D flushing in hardware, the hardware
+mechanism is used; a software fallback for the mitigation is not supported.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the L1D flush mitigations at boot
+time with the option "l1d_flush=". The valid arguments for this option are:
+
+ ============ =============================================================
+ on Enables the prctl interface, applications trying to use
+ the prctl() will fail with an error if l1d_flush is not
+ enabled
+ ============ =============================================================
+
+By default the mechanism is disabled; once enabled on the kernel command
+line, applications opt in by using the prctl API.
+
+Limitations
+-----------
+
+The mechanism does not mitigate L1D data leaks between tasks belonging to
+different processes which are concurrently executing on sibling threads of
+a physical CPU core when SMT is enabled on the system.
+
+This can be addressed by controlled placement of processes on physical CPU
+cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
+document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+**NOTE** : The opt-in of a task for L1D flushing works only when the
+task's affinity is limited to cores running in non-SMT mode. Running the
+task on a CPU with SMT enabled results in the task receiving a SIGBUS
+when it next executes on that core.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bc20e2f4677f..bd1e8e329727 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2356,6 +2356,23 @@
feature (tagged TLBs) on capable Intel chips.
Default is 1 (enabled)
+ l1d_flush= [X86,INTEL]
+ Control mitigation for L1D based snooping vulnerability.
+
+ Certain CPUs are vulnerable to an exploit against CPU
+ internal buffers which can forward information to a
+ disclosure gadget under certain conditions.
+
+ In vulnerable processors, the speculatively
+ forwarded data can be used in a cache side channel
+ attack, to access data to which the attacker does
+ not have direct access.
+
+ This parameter controls the mitigation. The
+ options are:
+
+ on - enable the interface for the mitigation
+
l1tf= [X86] Control mitigation of the L1TF vulnerability on
affected CPUs
diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst
index 7ddd8f667459..5e8ed9eef9aa 100644
--- a/Documentation/userspace-api/spec_ctrl.rst
+++ b/Documentation/userspace-api/spec_ctrl.rst
@@ -106,3 +106,11 @@ Speculation misfeature controls
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
+
+- PR_SPEC_L1D_FLUSH: Flush L1D Cache on context switch out of the task
+ (works only when tasks run on non SMT cores)
+
+ Invocations:
+ * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_ENABLE, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_DISABLE, 0, 0);
--
2.17.1
Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D
flush capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
PR_SPEC_DISABLE_NOEXEC are not supported.
Enabling L1D flush does not check if the task is running on
an SMT enabled core; rather, a check is done at runtime (at the
time of flush). If the task runs on an SMT enabled core then the
task is sent a SIGBUS (this is done before the task executes user
code on the core, so no data is leaked). This is better
than the other alternatives of:
a. Ensuring strict affinity of the task (hard to enforce
without further changes in the scheduler)
b. Silently skipping the flush for tasks that move to SMT enabled
cores.
An arch config, ARCH_HAS_PARANOID_L1D_FLUSH, has been added, and
struct task_struct carries a callback_head for architectures that
support this config (currently only x86); this callback head is used
to schedule task work (SIGBUS delivery).
There is also no seccomp integration for the feature.
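For reference, the SIGBUS delivery uses the generic task_work
machinery; a condensed sketch of the arch side added by this patch
(the full version is in the arch/x86/mm/tlb.c hunk below):

	static void l1d_flush_kill(struct callback_head *ch)
	{
		force_sig(SIGBUS);
	}

	/*
	 * On switch-in: the incoming task asked for L1D flushing but this
	 * core has SMT enabled, so queue the kill to run on return to user.
	 */
	clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
	next->l1d_flush_kill.func = l1d_flush_kill;
	task_work_add(next, &next->l1d_flush_kill, TWA_RESUME);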
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/Kconfig | 4 ++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/nospec-branch.h | 2 +
arch/x86/include/asm/thread_info.h | 3 --
arch/x86/kernel/cpu/bugs.c | 71 ++++++++++++++++++++++++++++
arch/x86/mm/tlb.c | 29 ++++++++++--
include/linux/sched.h | 10 ++++
include/uapi/linux/prctl.h | 1 +
8 files changed, 115 insertions(+), 6 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 7091f7187951..0a0701e0a1ed 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -327,6 +327,10 @@ config ARCH_32BIT_OFF_T
still support 32-bit off_t. This option is enabled for all such
architectures explicitly.
+config ARCH_HAS_PARANOID_L1D_FLUSH
+ bool
+ default n
+
config HAVE_ASM_MODVERSIONS
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bd4993b276fd..2bb53bfaea02 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -107,6 +107,7 @@ config X86
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANTS_THP_SWAP if X86_64
+ select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
select CLKEVT_I8253
select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index cb9ad6b73973..cd60934c6075 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -253,6 +253,8 @@ DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
DECLARE_STATIC_KEY_FALSE(mds_user_clear);
DECLARE_STATIC_KEY_FALSE(mds_idle_clear);
+DECLARE_STATIC_KEY_FALSE(l1d_flush_enabled);
+
#include <asm/segment.h>
/**
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 33b637442b9e..054dc0f58ac4 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -219,9 +219,6 @@ static inline int arch_within_stack_frames(const void * const stack,
current_thread_info()->status & TS_COMPAT)
#endif
-extern int enable_l1d_flush_for_task(struct task_struct *tsk);
-extern int disable_l1d_flush_for_task(struct task_struct *tsk);
-
extern void arch_task_cache_init(void);
extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
extern void arch_release_task_struct(struct task_struct *tsk);
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d41b70fe4918..e07d2a1d5eb2 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -76,6 +76,20 @@ EXPORT_SYMBOL_GPL(mds_user_clear);
DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
EXPORT_SYMBOL_GPL(mds_idle_clear);
+/*
+ * Controls whether l1d flush based mitigations are enabled,
+ * based on hw features and admin setting via boot parameter
+ * defaults to false
+ */
+DEFINE_STATIC_KEY_FALSE(l1d_flush_enabled);
+
+enum l1d_flush_mitigations {
+ L1D_FLUSH_OFF = 0,
+ L1D_FLUSH_ON,
+};
+
+static enum l1d_flush_mitigations l1d_flush_mitigation __initdata = L1D_FLUSH_OFF;
+
void __init check_bugs(void)
{
identify_boot_cpu();
@@ -150,6 +164,10 @@ void __init check_bugs(void)
if (!direct_gbpages)
set_memory_4k((unsigned long)__va(0), 1);
#endif
+ if (!l1d_flush_mitigation || !boot_cpu_has_bug(X86_BUG_L1TF) ||
+ !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
+ return;
+ static_branch_enable(&l1d_flush_enabled);
}
void
@@ -379,6 +397,15 @@ static void __init taa_select_mitigation(void)
pr_info("%s\n", taa_strings[taa_mitigation]);
}
+static int __init l1d_flush_parse_cmdline(char *str)
+{
+ if (!strcmp(str, "on"))
+ l1d_flush_mitigation = L1D_FLUSH_ON;
+
+ return 0;
+}
+early_param("l1d_flush", l1d_flush_parse_cmdline);
+
static int __init tsx_async_abort_parse_cmdline(char *str)
{
if (!boot_cpu_has_bug(X86_BUG_TAA))
@@ -1215,6 +1242,35 @@ static void task_update_spec_tif(struct task_struct *tsk)
speculation_ctrl_update_current();
}
+static inline int enable_l1d_flush_for_task(struct task_struct *tsk)
+{
+ set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+}
+
+static inline int disable_l1d_flush_for_task(struct task_struct *tsk)
+{
+ clear_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+}
+
+static int l1d_flush_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+
+ if (!static_branch_unlikely(&l1d_flush_enabled))
+ return -EPERM;
+
+ switch (ctrl) {
+ case PR_SPEC_ENABLE:
+ return enable_l1d_flush_for_task(task);
+ case PR_SPEC_DISABLE:
+ return disable_l1d_flush_for_task(task);
+ default:
+ return -ERANGE;
+ }
+ return 0;
+}
+
static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
{
if (ssb_mode != SPEC_STORE_BYPASS_PRCTL &&
@@ -1324,6 +1380,8 @@ int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which,
return ssb_prctl_set(task, ctrl);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_set(task, ctrl);
+ case PR_SPEC_L1D_FLUSH:
+ return l1d_flush_prctl_set(task, ctrl);
default:
return -ENODEV;
}
@@ -1340,6 +1398,17 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
}
#endif
+static int l1d_flush_prctl_get(struct task_struct *task)
+{
+ if (!static_branch_unlikely(&l1d_flush_enabled))
+ return PR_SPEC_FORCE_DISABLE;
+
+ if (test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH))
+ return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+ else
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+}
+
static int ssb_prctl_get(struct task_struct *task)
{
switch (ssb_mode) {
@@ -1390,6 +1459,8 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
return ssb_prctl_get(task);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_get(task);
+ case PR_SPEC_L1D_FLUSH:
+ return l1d_flush_prctl_get(task);
default:
return -ENODEV;
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f67c5bd58158..aa9286b83f8f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -323,6 +323,28 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
local_irq_restore(flags);
}
+/*
+ * Sent to a task that opts into L1D flushing via the prctl interface
+ * but ends up running on an SMT enabled core.
+ */
+static void l1d_flush_kill(struct callback_head *ch)
+{
+ force_sig(SIGBUS);
+}
+
+static void l1d_flush_evaluate(unsigned long prev_mm, unsigned long next_mm,
+ struct task_struct *next)
+{
+ if (prev_mm & LAST_USER_MM_L1D_FLUSH)
+ l1d_flush_hw();
+
+ if ((next_mm & LAST_USER_MM_L1D_FLUSH) && this_cpu_read(cpu_info.smt_active)) {
+ clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
+ next->l1d_flush_kill.func = l1d_flush_kill;
+ task_work_add(next, &next->l1d_flush_kill, TWA_RESUME);
+ }
+}
+
static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
{
unsigned long next_tif = task_thread_info(next)->flags;
@@ -410,9 +432,10 @@ static void cond_mitigation(struct task_struct *next)
* Flush only if SMT is disabled as per the contract, which is checked
* when the feature is enabled.
*/
- if (!this_cpu_read(cpu_info.smt_active) &&
- (prev_mm & LAST_USER_MM_L1D_FLUSH))
- l1d_flush_hw();
+ if (static_branch_unlikely(&l1d_flush_enabled)) {
+ if (unlikely((prev_mm | next_mm) & LAST_USER_MM_L1D_FLUSH))
+ l1d_flush_evaluate(prev_mm, next_mm, next);
+ }
this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e5ad6d354b7b..77e9d32d70ca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1376,6 +1376,16 @@ struct task_struct {
unsigned long getblk_bh_state;
#endif
+#ifdef CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH
+ /*
+ * If L1D flush is supported on mm context switch
+ * then we use this callback head to queue kill work
+ * to kill tasks that are not running on SMT disabled
+ * cores
+ */
+ struct callback_head l1d_flush_kill;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 90deb41c8a34..44adcae6641c 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -213,6 +213,7 @@ struct prctl_mm_map {
/* Speculation control variants */
# define PR_SPEC_STORE_BYPASS 0
# define PR_SPEC_INDIRECT_BRANCH 1
+# define PR_SPEC_L1D_FLUSH 2
/* Return and control values for PR_SET/GET_SPECULATION_CTRL */
# define PR_SPEC_NOT_AFFECTED 0
# define PR_SPEC_PRCTL (1UL << 0)
--
2.17.1
Implement a mechanism to selectively flush the L1D cache. The goal is to
allow tasks that want to protect sensitive information, exposed by the
recent snoop assisted data sampling vulnerabilities, to flush their L1D on
being switched out. This protects their data from being snooped or leaked
via side channels after the task has context switched out.
There are two scenarios we might want to protect against: a task leaving
the CPU with data still in L1D (which is the main concern of this patch),
and a malicious, not so well trusted task coming in, for which we want to
clean up the cache before it starts. Only the former case is addressed
here.
A new thread_info flag, TIF_SPEC_L1D_FLUSH, is added to track tasks which
opt into L1D flushing. cpu_tlbstate.last_user_mm_spec is used to convert
the TIF flags into mm state (per CPU via last_user_mm_spec) in
cond_mitigation(), which is then used to decide when to flush the
L1D cache.
A new inline helper, l1d_flush_hw(), has been introduced. Currently it
returns an error code if hardware flushing is not supported. The caller
currently does not check the return value; in the context of these
patches, the routine is called only when HW assisted flushing is
available.
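The TIF-to-mm-state conversion relies on mm_struct pointers being at
least word aligned, so the low bits of the pointer are free to carry the
per-task speculation bits; a condensed sketch of the mangling used by
cond_mitigation() (the full version is in the diff below, and it depends
on TIF_SPEC_L1D_FLUSH being TIF_SPEC_IB + 1):

	#define LAST_USER_MM_IBPB	0x1UL
	#define LAST_USER_MM_L1D_FLUSH	0x2UL
	#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)

	/* One shift extracts both TIF_SPEC_IB and TIF_SPEC_L1D_FLUSH. */
	static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
	{
		unsigned long next_tif = task_thread_info(next)->flags;
		unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) &
					  LAST_USER_MM_SPEC_MASK;

		return (unsigned long)next->mm | spec_bits;
	}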
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/cacheflush.h | 8 ++++++++
arch/x86/include/asm/thread_info.h | 9 +++++++--
arch/x86/mm/tlb.c | 16 ++++++++++++++--
3 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index b192d917a6d0..554eaf697f3f 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -10,4 +10,12 @@
void clflush_cache_range(void *addr, unsigned int size);
+static inline int l1d_flush_hw(void)
+{
+ if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+ return 0;
+ }
+ return -EOPNOTSUPP;
+}
#endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 0d751d5da702..33b637442b9e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -81,7 +81,7 @@ struct thread_info {
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
#define TIF_SSBD 5 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
-#define TIF_SPEC_FORCE_UPDATE 10 /* Force speculation MSR update in context switch */
+#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
@@ -93,6 +93,7 @@ struct thread_info {
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
+#define TIF_SPEC_FORCE_UPDATE 23 /* Force speculation MSR update in context switch */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
@@ -104,7 +105,7 @@ struct thread_info {
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
-#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
+#define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
@@ -115,6 +116,7 @@ struct thread_info {
#define _TIF_SLD (1 << TIF_SLD)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
+#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
@@ -217,6 +219,9 @@ static inline int arch_within_stack_frames(const void * const stack,
current_thread_info()->status & TS_COMPAT)
#endif
+extern int enable_l1d_flush_for_task(struct task_struct *tsk);
+extern int disable_l1d_flush_for_task(struct task_struct *tsk);
+
extern void arch_task_cache_init(void);
extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
extern void arch_release_task_struct(struct task_struct *tsk);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7320348b5a61..f67c5bd58158 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,11 +8,13 @@
#include <linux/export.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
+#include <linux/sched/smt.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/nospec-branch.h>
#include <asm/cache.h>
+#include <asm/cacheflush.h>
#include <asm/apic.h>
#include "mm_internal.h"
@@ -42,11 +44,12 @@
*/
/*
- * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
* stored in cpu_tlb_state.last_user_mm_spec.
*/
#define LAST_USER_MM_IBPB 0x1UL
-#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB)
+#define LAST_USER_MM_L1D_FLUSH 0x2UL
+#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)
/* Bits to set when tlbstate and flush is (re)initialized */
#define LAST_USER_MM_INIT LAST_USER_MM_IBPB
@@ -325,6 +328,7 @@ static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
+ BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
return (unsigned long)next->mm | spec_bits;
}
@@ -402,6 +406,14 @@ static void cond_mitigation(struct task_struct *next)
indirect_branch_prediction_barrier();
}
+ /*
+ * Flush only if SMT is disabled as per the contract, which is checked
+ * when the feature is enabled.
+ */
+ if (!this_cpu_read(cpu_info.smt_active) &&
+ (prev_mm & LAST_USER_MM_L1D_FLUSH))
+ l1d_flush_hw();
+
this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}
--
2.17.1
cond_ibpb() has the necessary bits required to track the previous mm in
switch_mm_irqs_off(). This can be reused for other use cases like L1D
flushing on context switch.
[ tglx: Moved comment, added a separate define for state (re)initialization ]
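The refactor keeps the per-CPU tracking in a union so the same storage
can be read either as the raw mm pointer or as the mm pointer with the
speculation bits mangled into its low bits; a condensed view of the
tlb_state change in this patch:

	/* Last user mm for optimizing IBPB and other per-mm mitigations */
	union {
		struct mm_struct	*last_user_mm;
		unsigned long		last_user_mm_spec;
	};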
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/tlbflush.h | 2 +-
arch/x86/mm/tlb.c | 53 ++++++++++++++++++---------------
2 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..a927d40664df 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -83,7 +83,7 @@ struct tlb_state {
/* Last user mm for optimizing IBPB */
union {
struct mm_struct *last_user_mm;
- unsigned long last_user_mm_ibpb;
+ unsigned long last_user_mm_spec;
};
u16 loaded_mm_asid;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 569ac1d57f55..7320348b5a61 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -42,10 +42,14 @@
*/
/*
- * Use bit 0 to mangle the TIF_SPEC_IB state into the mm pointer which is
- * stored in cpu_tlb_state.last_user_mm_ibpb.
+ * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * stored in cpu_tlb_state.last_user_mm_spec.
*/
#define LAST_USER_MM_IBPB 0x1UL
+#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB)
+
+/* Bits to set when tlbstate and flush is (re)initialized */
+#define LAST_USER_MM_INIT LAST_USER_MM_IBPB
/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -316,20 +320,29 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
local_irq_restore(flags);
}
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static inline unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
{
unsigned long next_tif = task_thread_info(next)->flags;
- unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
+ unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
- return (unsigned long)next->mm | ibpb;
+ return (unsigned long)next->mm | spec_bits;
}
-static void cond_ibpb(struct task_struct *next)
+static void cond_mitigation(struct task_struct *next)
{
+ unsigned long prev_mm, next_mm;
+
if (!next || !next->mm)
return;
+ next_mm = mm_mangle_tif_spec_bits(next);
+ prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);
+
/*
+ * Avoid user/user BTB poisoning by flushing the branch predictor
+ * when switching between processes. This stops one process from
+ * doing Spectre-v2 attacks on another.
+ *
* Both, the conditional and the always IBPB mode use the mm
* pointer to avoid the IBPB when switching between tasks of the
* same process. Using the mm pointer instead of mm->context.ctx_id
@@ -339,8 +352,6 @@ static void cond_ibpb(struct task_struct *next)
* exposed data is not really interesting.
*/
if (static_branch_likely(&switch_mm_cond_ibpb)) {
- unsigned long prev_mm, next_mm;
-
/*
* This is a bit more complex than the always mode because
* it has to handle two cases:
@@ -370,20 +381,14 @@ static void cond_ibpb(struct task_struct *next)
* Optimize this with reasonably small overhead for the
* above cases. Mangle the TIF_SPEC_IB bit into the mm
* pointer of the incoming task which is stored in
- * cpu_tlbstate.last_user_mm_ibpb for comparison.
- */
- next_mm = mm_mangle_tif_spec_ib(next);
- prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
-
- /*
+ * cpu_tlbstate.last_user_mm_spec for comparison.
+ *
* Issue IBPB only if the mm's are different and one or
* both have the IBPB bit set.
*/
if (next_mm != prev_mm &&
(next_mm | prev_mm) & LAST_USER_MM_IBPB)
indirect_branch_prediction_barrier();
-
- this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
}
if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -392,11 +397,12 @@ static void cond_ibpb(struct task_struct *next)
* different context than the user space task which ran
* last on this CPU.
*/
- if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+ if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
+ (unsigned long)next->mm)
indirect_branch_prediction_barrier();
- this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
- }
}
+
+ this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}
#ifdef CONFIG_PERF_EVENTS
@@ -524,11 +530,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
need_flush = true;
} else {
/*
- * Avoid user/user BTB poisoning by flushing the branch
- * predictor when switching between processes. This stops
- * one process from doing Spectre-v2 attacks on another.
+ * Apply process to process speculation vulnerability
+ * mitigations if applicable.
*/
- cond_ibpb(tsk);
+ cond_mitigation(tsk);
/*
* Stop remote flushes for the previous mm.
@@ -636,7 +641,7 @@ void initialize_tlbstate_and_flush(void)
write_cr3(build_cr3(mm->pgd, 0));
/* Reinitialize tlbstate. */
- this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
+ this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
--
2.17.1
On Fri, 2021-01-08 at 23:10 +1100, Balbir Singh wrote:
> Implement a mechanism that allows tasks to conditionally flush
> their L1D cache (mitigation mechanism suggested in [2]). The previous
> posts of these patches were sent for inclusion (see [3]) and were not
> included due to the concern for the need for additional checks,
> those checks were:
>
> 1. Implement this mechanism only for CPUs affected by the L1TF bug
> 2. Disable the software fallback
> 3. Provide an override to enable this mechanism
> 4. Be SMT aware in the implementation
>
> [...]
Ping on any review comments? Suggested refactoring?
Balbir Singh
*thread necromancy*
https://lore.kernel.org/lkml/[email protected]/
On Mon, Jan 25, 2021 at 09:27:38AM +0000, Singh, Balbir wrote:
> On Fri, 2021-01-08 at 23:10 +1100, Balbir Singh wrote:
> > Implement a mechanism that allows tasks to conditionally flush
> > their L1D cache (mitigation mechanism suggested in [2]). The previous
> > posts of these patches were sent for inclusion (see [3]) and were not
> > included due to the concern for the need for additional checks,
> > those checks were:
> >
> > 1. Implement this mechanism only for CPUs affected by the L1TF bug
> > 2. Disable the software fallback
> > 3. Provide an override to enable this mechanism
> > 4. Be SMT aware in the implementation
> > [...]
> Ping on any review comments? Suggested refactoring?
Hi!
I'd still really like to see this -- it's a big hammer, but that's the
point for cases where some new flaw appears and we can point to the
toolbox and say "you can mitigate it with this while you wait for new
kernel/CPU."
Any further thoughts from x86 maintainers? This seems like it addressed
all of tglx's review comments.
--
Kees Cook
On Mon, Apr 26 2021 at 10:31, Thomas Gleixner wrote:
> On Thu, Apr 08 2021 at 13:23, Kees Cook wrote:
>>
>> I'd still really like to see this -- it's a big hammer, but that's the
>> point for cases where some new flaw appears and we can point to the
>> toolbox and say "you can mitigate it with this while you wait for new
>> kernel/CPU."
>>
>> Any further thoughts from x86 maintainers? This seems like it addressed
>> all of tglx's review comments.
>
> Sorry for dropping the ball on this. It's in my list of things to deal
> with. Starting to look at it now.
So I went through the pile and for remorse I sat down and made the
tweaks I think are necessary myself.
I've pushed out the result to
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/l1dflush
The only thing I did not address yet is that the documentation lacks any
mention of the SIGBUS mechanism, which is invoked when a task that
asked for L1D flush protection ends up on an SMT sibling for whatever
reason. That's essential to have because it's part of the contract of
that prctl.
Balbir, can you please double check the result and prepare an updated
version from there?
If you don't have cycles, please let me know.
Thanks,
tglx
On Tue, Apr 27, 2021 at 12:24:16AM +0200, Thomas Gleixner wrote:
> On Mon, Apr 26 2021 at 10:31, Thomas Gleixner wrote:
> > On Thu, Apr 08 2021 at 13:23, Kees Cook wrote:
> >>
> >> I'd still really like to see this -- it's a big hammer, but that's the
> >> point for cases where some new flaw appears and we can point to the
> >> toolbox and say "you can mitigate it with this while you wait for new
> >> kernel/CPU."
> >>
> >> Any further thoughts from x86 maintainers? This seems like it addressed
> >> all of tglx's review comments.
> >
> > Sorry for dropping the ball on this. It's in my list of things to deal
> > with. Starting to look at it now.
>
> So I went through the pile and for remorse I sat down and made the
> tweaks I think are necessary myself.
>
> I've pushed out the result to
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/l1dflush
Oh excellent; thank you for doing this!
--
Kees Cook
On Tue, Apr 27, 2021 at 12:24:16AM +0200, Thomas Gleixner wrote:
> On Mon, Apr 26 2021 at 10:31, Thomas Gleixner wrote:
> > On Thu, Apr 08 2021 at 13:23, Kees Cook wrote:
> >>
> >> I'd still really like to see this -- it's a big hammer, but that's the
> >> point for cases where some new flaw appears and we can point to the
> >> toolbox and say "you can mitigate it with this while you wait for new
> >> kernel/CPU."
> >>
> >> Any further thoughts from x86 maintainers? This seems like it addressed
> >> all of tglx's review comments.
> >
> > Sorry for dropping the ball on this. It's in my list of things to deal
> > with. Starting to look at it now.
>
> So I went through the pile and for remorse I sat down and made the
> tweaks I think are necessary myself.
>
> I've pushed out the result to
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/l1dflush
>
Thank you I'll take a look and test it.
> The only thing I did not address yet is that the documentation lacks any
> mention of the SIGBUS mechanism, which is invoked when a task that
> asked for L1D flush protection ends up on an SMT sibling for whatever
> reason. That's essential to have because it's part of the contract of
> that prctl.
IIRC I documented it, I'll double check.
>
> Balbir, can you please double check the result and prepare an updated
> version from there?
>
> If you don't have cycles, please let me know.
>
I might have some cycles for testing and re-review. Thanks for all the
hard work on this
Balbir Singh.
On Wed, Apr 28, 2021 at 01:08:05PM -0700, Kees Cook wrote:
> On Tue, Apr 27, 2021 at 12:24:16AM +0200, Thomas Gleixner wrote:
> > On Mon, Apr 26 2021 at 10:31, Thomas Gleixner wrote:
> > > On Thu, Apr 08 2021 at 13:23, Kees Cook wrote:
> > >>
> > >> I'd still really like to see this -- it's a big hammer, but that's the
> > >> point for cases where some new flaw appears and we can point to the
> > >> toolbox and say "you can mitigate it with this while you wait for new
> > >> kernel/CPU."
> > >>
> > >> Any further thoughts from x86 maintainers? This seems like it addressed
> > >> all of tglx's review comments.
> > >
> > > Sorry for dropping the ball on this. It's in my list of things to deal
> > > with. Starting to look at it now.
> >
> > So I went through the pile and for remorse I sat down and made the
> > tweaks I think are necessary myself.
> >
> > I've pushed out the result to
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/l1dflush
>
> Oh excellent; thank you for doing this!
>
Thanks again Thomas!
I no longer have access to the bare metal hardware, but I was able to test
this under qemu with some emulation changes. The changes worked as expected.
Folks on the list/cc, appreciate any tested-by or additional reviewed-by
tags if you do happen to review/test.
Thanks,
Balbir Singh.
On Fri, Jun 04, 2021 at 08:06:14PM +1000, Balbir Singh wrote:
> On Wed, Apr 28, 2021 at 01:08:05PM -0700, Kees Cook wrote:
> > On Tue, Apr 27, 2021 at 12:24:16AM +0200, Thomas Gleixner wrote:
> > > On Mon, Apr 26 2021 at 10:31, Thomas Gleixner wrote:
> > > > On Thu, Apr 08 2021 at 13:23, Kees Cook wrote:
> > > >>
> > > >> I'd still really like to see this -- it's a big hammer, but that's the
> > > >> point for cases where some new flaw appears and we can point to the
> > > >> toolbox and say "you can mitigate it with this while you wait for new
> > > >> kernel/CPU."
> > > >>
> > > >> Any further thoughts from x86 maintainers? This seems like it addressed
> > > >> all of tglx's review comments.
> > > >
> > > > Sorry for dropping the ball on this. It's in my list of things to deal
> > > > with. Starting to look at it now.
> > >
> > > So I went through the pile and for remorse I sat down and made the
> > > tweaks I think are necessary myself.
> > >
> > > I've pushed out the result to
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/l1dflush
> >
> > Oh excellent; thank you for doing this!
> >
>
> Thanks again Thomas!
>
> I no longer have access to the bare metal hardware, but I was able to test
> this under qemu with some emulation changes. The changes worked as expected.
>
> Folks on the list/cc, appreciate any tested-by or additional reviewed-by
> tags if you do happen to review/test.
I can't test the behavior (no access to CPU), but I wrote a simple prctl
tester. Perhaps this can be expanded on?
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 333980375bc7..50c150d35962 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -13,7 +13,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
check_initial_reg_state sigreturn iopl ioperm \
test_vsyscall mov_ss_trap \
- syscall_arg_fault fsgsbase_restore
+ syscall_arg_fault fsgsbase_restore l1d_flush
TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
diff --git a/tools/testing/selftests/x86/l1d_flush.c b/tools/testing/selftests/x86/l1d_flush.c
new file mode 100644
index 000000000000..ef4e73679d58
--- /dev/null
+++ b/tools/testing/selftests/x86/l1d_flush.c
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * l1d_flush.c: Exercise the L1D flushing behaviors
+ */
+#define _GNU_SOURCE
+
+#include <stdlib.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/prctl.h>
+
+#ifndef PR_SET_SPECULATION_CTRL
+# define PR_GET_SPECULATION_CTRL 52
+# define PR_SET_SPECULATION_CTRL 53
+# define PR_SPEC_ENABLE (1UL << 1)
+# define PR_SPEC_DISABLE (1UL << 2)
+#endif
+
+#ifndef PR_SPEC_L1D_FLUSH
+# define PR_SPEC_L1D_FLUSH 2
+#endif
+
+#include "../kselftest_harness.h"
+
+TEST(toggle)
+{
+ int ret;
+
+ ret = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0);
+ ASSERT_GE(ret, 0) {
+ TH_LOG("PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH failed: %d (%s)", errno, strerror(errno));
+ }
+ EXPECT_EQ((ret & (PR_SPEC_DISABLE | PR_SPEC_ENABLE)), 0) {
+ TH_LOG("PR_SPEC_L1D_FLUSH is already enabled!?");
+ }
+
+ /* Enable */
+ ret = prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_ENABLE, 0, 0);
+ EXPECT_EQ(ret, 0) {
+ if (errno == EPERM)
+ SKIP(return, "Kernel does not support PR_SPEC_L1D_FLUSH (boot with l1d_flush=on with a supported CPU)");
+ TH_LOG("PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_ENABLE failed: %d (%s)", errno, strerror(errno));
+ }
+
+ /* Check Enable */
+ ret = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0);
+ EXPECT_EQ((ret & (PR_SPEC_DISABLE | PR_SPEC_ENABLE)), PR_SPEC_ENABLE) {
+ TH_LOG("PR_SPEC_L1D_FLUSH did not stay enabled");
+ }
+
+ /* Disable */
+ ret = prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_DISABLE, 0, 0);
+ EXPECT_EQ(ret, 0) {
+ TH_LOG("PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_DISABLE failed: %d (%s)", errno, strerror(errno));
+ }
+
+ /* Check Disable */
+ ret = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0);
+ EXPECT_EQ((ret & (PR_SPEC_DISABLE | PR_SPEC_ENABLE)), PR_SPEC_DISABLE) {
+ TH_LOG("PR_SPEC_L1D_FLUSH did not stay disabled");
+ }
+}
+
+TEST_HARNESS_MAIN
--
Kees Cook
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: 8aacd1eab53ec853c2d29cdc9b64e9dc87d2a519
Gitweb: https://git.kernel.org/tip/8aacd1eab53ec853c2d29cdc9b64e9dc87d2a519
Author: Balbir Singh <[email protected]>
AuthorDate: Mon, 26 Apr 2021 22:09:43 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 28 Jul 2021 11:42:24 +02:00
x86/process: Make room for TIF_SPEC_L1D_FLUSH
The upcoming support for paranoid L1D flush in switch_mm() requires that
TIF_SPEC_IB and the new TIF_SPEC_L1D_FLUSH are two consecutive bits in
thread_info::flags.
Move TIF_SPEC_FORCE_UPDATE to a spare bit to make room for the new one.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/thread_info.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index de406d9..d9afd35 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -81,7 +81,6 @@ struct thread_info {
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
#define TIF_SSBD 5 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
-#define TIF_SPEC_FORCE_UPDATE 10 /* Force speculation MSR update in context switch */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
@@ -93,6 +92,7 @@ struct thread_info {
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
+#define TIF_SPEC_FORCE_UPDATE 23 /* Force speculation MSR update in context switch */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
@@ -104,7 +104,6 @@ struct thread_info {
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
-#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
@@ -115,6 +114,7 @@ struct thread_info {
#define _TIF_SLD (1 << TIF_SLD)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
+#define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: e893bb1bb4d2eb635eba61e5d9c5135d96855773
Gitweb: https://git.kernel.org/tip/e893bb1bb4d2eb635eba61e5d9c5135d96855773
Author: Balbir Singh <[email protected]>
AuthorDate: Fri, 08 Jan 2021 23:10:55 +11:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 28 Jul 2021 11:42:25 +02:00
x86, prctl: Hook L1D flushing in via prctl
Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D flush
capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
PR_SPEC_DISABLE_NOEXEC are not supported.
Enabling L1D flush does not check if the task is running on an SMT enabled
core, rather a check is done at runtime (at the time of flush), if the task
runs on a SMT sibling then the task is sent a SIGBUS which is executed
before the task returns to user space or to a guest.
This is better than the other alternatives of:
a. Ensuring strict affinity of the task (hard to enforce without further
changes in the scheduler)
b. Silently skipping flush for tasks that move to SMT enabled cores.
Hook up the core prctl and implement the x86 specific parts which in turn
makes it functional.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/bugs.c | 33 +++++++++++++++++++++++++++++++++
include/uapi/linux/prctl.h | 1 +
2 files changed, 34 insertions(+)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 1a5a1b0..ecfca3b 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1252,6 +1252,24 @@ static void task_update_spec_tif(struct task_struct *tsk)
speculation_ctrl_update_current();
}
+static int l1d_flush_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+
+ if (!static_branch_unlikely(&switch_mm_cond_l1d_flush))
+ return -EPERM;
+
+ switch (ctrl) {
+ case PR_SPEC_ENABLE:
+ set_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+ case PR_SPEC_DISABLE:
+ clear_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+ default:
+ return -ERANGE;
+ }
+}
+
static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
{
if (ssb_mode != SPEC_STORE_BYPASS_PRCTL &&
@@ -1361,6 +1379,8 @@ int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which,
return ssb_prctl_set(task, ctrl);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_set(task, ctrl);
+ case PR_SPEC_L1D_FLUSH:
+ return l1d_flush_prctl_set(task, ctrl);
default:
return -ENODEV;
}
@@ -1377,6 +1397,17 @@ void arch_seccomp_spec_mitigate(struct task_struct *task)
}
#endif
+static int l1d_flush_prctl_get(struct task_struct *task)
+{
+ if (!static_branch_unlikely(&switch_mm_cond_l1d_flush))
+ return PR_SPEC_FORCE_DISABLE;
+
+ if (test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH))
+ return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+ else
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+}
+
static int ssb_prctl_get(struct task_struct *task)
{
switch (ssb_mode) {
@@ -1427,6 +1458,8 @@ int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
return ssb_prctl_get(task);
case PR_SPEC_INDIRECT_BRANCH:
return ib_prctl_get(task);
+ case PR_SPEC_L1D_FLUSH:
+ return l1d_flush_prctl_get(task);
default:
return -ENODEV;
}
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 967d9c5..964c41e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -213,6 +213,7 @@ struct prctl_mm_map {
/* Speculation control variants */
# define PR_SPEC_STORE_BYPASS 0
# define PR_SPEC_INDIRECT_BRANCH 1
+# define PR_SPEC_L1D_FLUSH 2
/* Return and control values for PR_SET/GET_SPECULATION_CTRL */
# define PR_SPEC_NOT_AFFECTED 0
# define PR_SPEC_PRCTL (1UL << 0)
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: b5f06f64e269f9820cd5ad9e9a98afa6c8914b7a
Gitweb: https://git.kernel.org/tip/b5f06f64e269f9820cd5ad9e9a98afa6c8914b7a
Author: Balbir Singh <[email protected]>
AuthorDate: Mon, 26 Apr 2021 21:42:30 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 28 Jul 2021 11:42:24 +02:00
x86/mm: Prepare for opt-in based L1D flush in switch_mm()
The goal of this is to allow tasks that want to protect sensitive
information, against e.g. the recently found snoop assisted data sampling
vulnerabilities, to flush their L1D on being switched out. This protects
their data from being snooped or leaked via side channels after the task
has context switched out.
This could also be used to wipe L1D when an untrusted task is switched in,
but that's not a really well defined scenario while the opt-in variant is
clearly defined.
The mechanism is default disabled and can be enabled on the kernel command
line.
Prepare for the actual prctl based opt-in:
1) Provide the necessary setup functionality similar to the other
mitigations and enable the static branch when the command line option
is set and the CPU provides support for hardware assisted L1D
flushing. Software based L1D flush is not supported because it's CPU
model specific and not really well defined.
This does not come with a sysfs file like the other mitigations
because it is not bound to any specific vulnerability.
Support has to be queried via the prctl(2) interface.
2) Add TIF_SPEC_L1D_FLUSH next to L1D_SPEC_IB so the two bits can be
mangled into the mm pointer in one go which allows to reuse the
existing mechanism in switch_mm() for the conditional IBPB speculation
barrier efficiently.
3) Add the L1D flush specific functionality which flushes L1D when the
outgoing task opted in.
Also check whether the incoming task has requested L1D flush and if so
validate that it is not accidentally running on an SMT sibling as this
makes the whole exercise moot because SMT siblings share L1D which
opens tons of other attack vectors. If that happens schedule task work
which signals the incoming task on return to user/guest with SIGBUS as
this is part of the paranoid L1D flush contract.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/Kconfig | 1 +-
arch/x86/include/asm/nospec-branch.h | 2 +-
arch/x86/include/asm/thread_info.h | 2 +-
arch/x86/kernel/cpu/bugs.c | 37 +++++++++++++++++-
arch/x86/mm/tlb.c | 58 ++++++++++++++++++++++++++-
5 files changed, 98 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4927065..d8a2c3f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -119,6 +119,7 @@ config X86
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANTS_THP_SWAP if X86_64
+ select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
select CLKEVT_I8253
select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 3ad8c6d..ec2d5c8 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -252,6 +252,8 @@ DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
DECLARE_STATIC_KEY_FALSE(mds_user_clear);
DECLARE_STATIC_KEY_FALSE(mds_idle_clear);
+DECLARE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
+
#include <asm/segment.h>
/**
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index d9afd35..cf13266 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -81,6 +81,7 @@ struct thread_info {
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
#define TIF_SSBD 5 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
+#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
@@ -104,6 +105,7 @@ struct thread_info {
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
+#define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d41b70f..1a5a1b0 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -43,6 +43,7 @@ static void __init mds_select_mitigation(void);
static void __init mds_print_mitigation(void);
static void __init taa_select_mitigation(void);
static void __init srbds_select_mitigation(void);
+static void __init l1d_flush_select_mitigation(void);
/* The base value of the SPEC_CTRL MSR that always has to be preserved. */
u64 x86_spec_ctrl_base;
@@ -76,6 +77,13 @@ EXPORT_SYMBOL_GPL(mds_user_clear);
DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
EXPORT_SYMBOL_GPL(mds_idle_clear);
+/*
+ * Controls whether l1d flush based mitigations are enabled,
+ * based on hw features and admin setting via boot parameter;
+ * defaults to false
+ */
+DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
+
void __init check_bugs(void)
{
identify_boot_cpu();
@@ -111,6 +119,7 @@ void __init check_bugs(void)
mds_select_mitigation();
taa_select_mitigation();
srbds_select_mitigation();
+ l1d_flush_select_mitigation();
/*
* As MDS and TAA mitigations are inter-related, print MDS
@@ -492,6 +501,34 @@ static int __init srbds_parse_cmdline(char *str)
early_param("srbds", srbds_parse_cmdline);
#undef pr_fmt
+#define pr_fmt(fmt) "L1D Flush : " fmt
+
+enum l1d_flush_mitigations {
+ L1D_FLUSH_OFF = 0,
+ L1D_FLUSH_ON,
+};
+
+static enum l1d_flush_mitigations l1d_flush_mitigation __initdata = L1D_FLUSH_OFF;
+
+static void __init l1d_flush_select_mitigation(void)
+{
+ if (!l1d_flush_mitigation || !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
+ return;
+
+ static_branch_enable(&switch_mm_cond_l1d_flush);
+ pr_info("Conditional flush on switch_mm() enabled\n");
+}
+
+static int __init l1d_flush_parse_cmdline(char *str)
+{
+ if (!strcmp(str, "on"))
+ l1d_flush_mitigation = L1D_FLUSH_ON;
+
+ return 0;
+}
+early_param("l1d_flush", l1d_flush_parse_cmdline);
+
+#undef pr_fmt
#define pr_fmt(fmt) "Spectre V1 : " fmt
enum spectre_v1_mitigation {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c98bc84..59ba296 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,11 +8,13 @@
#include <linux/export.h>
#include <linux/cpu.h>
#include <linux/debugfs.h>
+#include <linux/sched/smt.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/nospec-branch.h>
#include <asm/cache.h>
+#include <asm/cacheflush.h>
#include <asm/apic.h>
#include <asm/perf_event.h>
@@ -43,11 +45,12 @@
*/
/*
- * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * Bits to mangle the TIF_SPEC_* state into the mm pointer which is
* stored in cpu_tlb_state.last_user_mm_spec.
*/
#define LAST_USER_MM_IBPB 0x1UL
-#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB)
+#define LAST_USER_MM_L1D_FLUSH 0x2UL
+#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)
/* Bits to set when tlbstate and flush is (re)initialized */
#define LAST_USER_MM_INIT LAST_USER_MM_IBPB
@@ -321,11 +324,52 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
local_irq_restore(flags);
}
+/*
+ * Invoked from return to user/guest by a task that opted-in to L1D
+ * flushing but ended up running on an SMT enabled core due to wrong
+ * affinity settings or CPU hotplug. This is part of the paranoid L1D flush
+ * contract which this task requested.
+ */
+static void l1d_flush_force_sigbus(struct callback_head *ch)
+{
+ force_sig(SIGBUS);
+}
+
+static void l1d_flush_evaluate(unsigned long prev_mm, unsigned long next_mm,
+ struct task_struct *next)
+{
+ /* Flush L1D if the outgoing task requests it */
+ if (prev_mm & LAST_USER_MM_L1D_FLUSH)
+ wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+
+ /* Check whether the incoming task opted in for L1D flush */
+ if (likely(!(next_mm & LAST_USER_MM_L1D_FLUSH)))
+ return;
+
+ /*
+ * Validate that it is not running on an SMT sibling as this would
+ * make the exercise pointless because the siblings share L1D. If
+ * it runs on a SMT sibling, notify it with SIGBUS on return to
+ * user/guest
+ */
+ if (this_cpu_read(cpu_info.smt_active)) {
+ clear_ti_thread_flag(&next->thread_info, TIF_SPEC_L1D_FLUSH);
+ next->l1d_flush_kill.func = l1d_flush_force_sigbus;
+ task_work_add(next, &next->l1d_flush_kill, TWA_RESUME);
+ }
+}
+
static unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
{
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
+ /*
+ * Ensure that the bit shift above works as expected and the two flags
+ * end up in bit 0 and 1.
+ */
+ BUILD_BUG_ON(TIF_SPEC_L1D_FLUSH != TIF_SPEC_IB + 1);
+
return (unsigned long)next->mm | spec_bits;
}
@@ -403,6 +447,16 @@ static void cond_mitigation(struct task_struct *next)
indirect_branch_prediction_barrier();
}
+ if (static_branch_unlikely(&switch_mm_cond_l1d_flush)) {
+ /*
+ * Flush L1D when the outgoing task requested it and/or
+ * check whether the incoming task requested L1D flushing
+ * and ended up on an SMT sibling.
+ */
+ if (unlikely((prev_mm | next_mm) & LAST_USER_MM_L1D_FLUSH))
+ l1d_flush_evaluate(prev_mm, next_mm, next);
+ }
+
this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}
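The mm pointer mangling described in point 2 above works because mm_struct
allocations are at least word aligned, so the low bits of the pointer are
always zero and can carry the per-task speculation flags. The following
stand-alone userspace sketch (not kernel code; the constants mirror the
hunks above and dummy_mm is a hypothetical stand-in for a struct mm_struct)
illustrates how the shift puts TIF_SPEC_IB into bit 0 and
TIF_SPEC_L1D_FLUSH into bit 1, and how a single OR of the previous and next
mangled values decides whether the L1D flush path needs to be evaluated.

#include <stdio.h>

#define TIF_SPEC_IB		 9
#define TIF_SPEC_L1D_FLUSH	10	/* must be TIF_SPEC_IB + 1 */

#define LAST_USER_MM_IBPB	0x1UL
#define LAST_USER_MM_L1D_FLUSH	0x2UL
#define LAST_USER_MM_SPEC_MASK	(LAST_USER_MM_IBPB | LAST_USER_MM_L1D_FLUSH)

/* Mirrors mm_mangle_tif_spec_bits(): fold the TIF bits into the mm pointer */
static unsigned long mangle(unsigned long tif_flags, void *mm)
{
	unsigned long spec_bits = (tif_flags >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;

	return (unsigned long)mm | spec_bits;
}

int main(void)
{
	static long dummy_mm;	/* hypothetical stand-in for a struct mm_struct */
	unsigned long prev = mangle(0, &dummy_mm);
	unsigned long next = mangle(1UL << TIF_SPEC_L1D_FLUSH, &dummy_mm);

	/* Same mm, but the L1D flush bit differs -> the flush path is evaluated */
	if ((prev | next) & LAST_USER_MM_L1D_FLUSH)
		printf("l1d_flush_evaluate() would be called for this switch\n");
	return 0;
}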
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: 371b09c6fdc436f2c7bb67fc90df5eec8ce90f06
Gitweb: https://git.kernel.org/tip/371b09c6fdc436f2c7bb67fc90df5eec8ce90f06
Author: Balbir Singh <[email protected]>
AuthorDate: Fri, 08 Jan 2021 23:10:53 +11:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 28 Jul 2021 11:42:24 +02:00
x86/mm: Refactor cond_ibpb() to support other use cases
cond_ibpb() already has the logic required to track the previous mm in
switch_mm_irqs_off(). This can be reused for other use cases, such as L1D
flushing on context switch.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/tlbflush.h | 2 +-
arch/x86/mm/tlb.c | 53 +++++++++++++++++---------------
2 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index fa952ea..b587a9e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -83,7 +83,7 @@ struct tlb_state {
/* Last user mm for optimizing IBPB */
union {
struct mm_struct *last_user_mm;
- unsigned long last_user_mm_ibpb;
+ unsigned long last_user_mm_spec;
};
u16 loaded_mm_asid;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index cfe6b1e..c98bc84 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -43,10 +43,14 @@
*/
/*
- * Use bit 0 to mangle the TIF_SPEC_IB state into the mm pointer which is
- * stored in cpu_tlb_state.last_user_mm_ibpb.
+ * Bits to mangle the TIF_SPEC_IB state into the mm pointer which is
+ * stored in cpu_tlb_state.last_user_mm_spec.
*/
#define LAST_USER_MM_IBPB 0x1UL
+#define LAST_USER_MM_SPEC_MASK (LAST_USER_MM_IBPB)
+
+/* Bits to set when tlbstate and flush is (re)initialized */
+#define LAST_USER_MM_INIT LAST_USER_MM_IBPB
/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -317,20 +321,29 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
local_irq_restore(flags);
}
-static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
{
unsigned long next_tif = task_thread_info(next)->flags;
- unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
+ unsigned long spec_bits = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_SPEC_MASK;
- return (unsigned long)next->mm | ibpb;
+ return (unsigned long)next->mm | spec_bits;
}
-static void cond_ibpb(struct task_struct *next)
+static void cond_mitigation(struct task_struct *next)
{
+ unsigned long prev_mm, next_mm;
+
if (!next || !next->mm)
return;
+ next_mm = mm_mangle_tif_spec_bits(next);
+ prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);
+
/*
+ * Avoid user/user BTB poisoning by flushing the branch predictor
+ * when switching between processes. This stops one process from
+ * doing Spectre-v2 attacks on another.
+ *
* Both, the conditional and the always IBPB mode use the mm
* pointer to avoid the IBPB when switching between tasks of the
* same process. Using the mm pointer instead of mm->context.ctx_id
@@ -340,8 +353,6 @@ static void cond_ibpb(struct task_struct *next)
* exposed data is not really interesting.
*/
if (static_branch_likely(&switch_mm_cond_ibpb)) {
- unsigned long prev_mm, next_mm;
-
/*
* This is a bit more complex than the always mode because
* it has to handle two cases:
@@ -371,20 +382,14 @@ static void cond_ibpb(struct task_struct *next)
* Optimize this with reasonably small overhead for the
* above cases. Mangle the TIF_SPEC_IB bit into the mm
* pointer of the incoming task which is stored in
- * cpu_tlbstate.last_user_mm_ibpb for comparison.
- */
- next_mm = mm_mangle_tif_spec_ib(next);
- prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
-
- /*
+ * cpu_tlbstate.last_user_mm_spec for comparison.
+ *
* Issue IBPB only if the mm's are different and one or
* both have the IBPB bit set.
*/
if (next_mm != prev_mm &&
(next_mm | prev_mm) & LAST_USER_MM_IBPB)
indirect_branch_prediction_barrier();
-
- this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
}
if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -393,11 +398,12 @@ static void cond_ibpb(struct task_struct *next)
* different context than the user space task which ran
* last on this CPU.
*/
- if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+ if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
+ (unsigned long)next->mm)
indirect_branch_prediction_barrier();
- this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
- }
}
+
+ this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}
#ifdef CONFIG_PERF_EVENTS
@@ -531,11 +537,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
need_flush = true;
} else {
/*
- * Avoid user/user BTB poisoning by flushing the branch
- * predictor when switching between processes. This stops
- * one process from doing Spectre-v2 attacks on another.
+ * Apply process to process speculation vulnerability
+ * mitigations if applicable.
*/
- cond_ibpb(tsk);
+ cond_mitigation(tsk);
/*
* Stop remote flushes for the previous mm.
@@ -643,7 +648,7 @@ void initialize_tlbstate_and_flush(void)
write_cr3(build_cr3(mm->pgd, 0));
/* Reinitialize tlbstate. */
- this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, LAST_USER_MM_IBPB);
+ this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: b7fe54f6c2d437082dcbecfbd832f38edd9caaf4
Gitweb: https://git.kernel.org/tip/b7fe54f6c2d437082dcbecfbd832f38edd9caaf4
Author: Balbir Singh <[email protected]>
AuthorDate: Fri, 08 Jan 2021 23:10:56 +11:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 28 Jul 2021 11:42:25 +02:00
Documentation: Add L1D flushing Documentation
Add documentation for L1D flushing, explaining the need for the
feature and how it can be used.
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/admin-guide/hw-vuln/index.rst | 1 +-
Documentation/admin-guide/hw-vuln/l1d_flush.rst | 69 ++++++++++++++++-
Documentation/admin-guide/kernel-parameters.txt | 17 ++++-
Documentation/userspace-api/spec_ctrl.rst | 8 ++-
4 files changed, 95 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index f12cda5..8cbc711 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
multihit.rst
special-register-buffer-data-sampling.rst
core-scheduling.rst
+ l1d_flush.rst
diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
new file mode 100644
index 0000000..210020b
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst
@@ -0,0 +1,69 @@
+L1D Flushing
+============
+
+With an increasing number of vulnerabilities being reported around data
+leaks from the Level 1 Data cache (L1D) the kernel provides an opt-in
+mechanism to flush the L1D cache on context switch.
+
+This mechanism can be used to address e.g. CVE-2020-0550. For applications
+the mechanism provides protection against vulnerabilities related to
+leaks (snooping) of data from the L1D cache.
+
+
+Related CVEs
+------------
+The following CVEs can be addressed by this
+mechanism
+
+ ============= ======================== ==================
+ CVE-2020-0550 Improper Data Forwarding OS related aspects
+ ============= ======================== ==================
+
+Usage Guidelines
+----------------
+
+Please see document: :ref:`Documentation/userspace-api/spec_ctrl.rst
+<set_spec_ctrl>` for details.
+
+**NOTE**: The feature is disabled by default, applications need to
+specifically opt into the feature to enable it.
+
+Mitigation
+----------
+
+When L1D flushing (PR_SPEC_L1D_FLUSH) is enabled for a task, a flush of the
+L1D cache is performed when the task is scheduled out and the incoming task
+belongs to a different process and therefore to a different address space.
+
+If the underlying CPU supports L1D flushing in hardware, the hardware
+mechanism is used; a software fallback for the mitigation is not supported.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows to control the L1D flush mitigations at boot
+time with the option "l1d_flush=". The valid arguments for this option are:
+
+ ============ =============================================================
+ on Enables the prctl interface, applications trying to use
+ the prctl() will fail with an error if l1d_flush is not
+ enabled
+ ============ =============================================================
+
+By default the mechanism is disabled.
+
+Limitations
+-----------
+
+The mechanism does not mitigate L1D data leaks between tasks belonging to
+different processes which are concurrently executing on sibling threads of
+a physical CPU core when SMT is enabled on the system.
+
+This can be addressed by controlled placement of processes on physical CPU
+cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
+document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+**NOTE** : The opt-in of a task for L1D flushing works only when the task's
+affinity is limited to cores running in non-SMT mode. If a task which
+requested L1D flushing is scheduled on a SMT-enabled core the kernel sends
+a SIGBUS to the task.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb2200..b105db2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2421,6 +2421,23 @@
feature (tagged TLBs) on capable Intel chips.
Default is 1 (enabled)
+ l1d_flush= [X86,INTEL]
+ Control mitigation for L1D based snooping vulnerability.
+
+ Certain CPUs are vulnerable to an exploit against CPU
+ internal buffers which can forward information to a
+ disclosure gadget under certain conditions.
+
+ In vulnerable processors, the speculatively
+ forwarded data can be used in a cache side channel
+ attack, to access data to which the attacker does
+ not have direct access.
+
+ This parameter controls the mitigation. The
+ options are:
+
+ on - enable the interface for the mitigation
+
l1tf= [X86] Control mitigation of the L1TF vulnerability on
affected CPUs
diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst
index 7ddd8f6..5e8ed9e 100644
--- a/Documentation/userspace-api/spec_ctrl.rst
+++ b/Documentation/userspace-api/spec_ctrl.rst
@@ -106,3 +106,11 @@ Speculation misfeature controls
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
* prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
+
+- PR_SPEC_L1D_FLUSH: Flush L1D Cache on context switch out of the task
+ (works only when tasks run on non SMT cores)
+
+ Invocations:
+ * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_ENABLE, 0, 0);
+ * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_DISABLE, 0, 0);
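The prctl(2) invocations listed above can be exercised with a small
userspace program. The sketch below is illustrative only: it assumes a
kernel booted with l1d_flush=on and hardware assisted L1D flush; the
fallback #defines are provided in case the installed headers predate
this series.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

#ifndef PR_SET_SPECULATION_CTRL
# define PR_GET_SPECULATION_CTRL	52
# define PR_SET_SPECULATION_CTRL	53
#endif
#ifndef PR_SPEC_L1D_FLUSH
# define PR_SPEC_L1D_FLUSH		2
#endif
#ifndef PR_SPEC_ENABLE
# define PR_SPEC_ENABLE			(1UL << 1)
#endif

int main(void)
{
	/* Fails if l1d_flush=on was not set at boot or the CPU has no
	 * hardware assisted L1D flush. */
	if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH,
		  PR_SPEC_ENABLE, 0, 0)) {
		fprintf(stderr, "L1D flush opt-in failed: %s\n", strerror(errno));
		return 1;
	}

	printf("PR_SPEC_L1D_FLUSH state: 0x%x\n",
	       prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0));

	/* The task must also keep its affinity restricted to non-SMT cores,
	 * otherwise it receives SIGBUS on the next switch onto an SMT sibling. */
	return 0;
}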
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: 58e106e725eed59896b9141a1c9a917d2f67962a
Gitweb: https://git.kernel.org/tip/58e106e725eed59896b9141a1c9a917d2f67962a
Author: Balbir Singh <[email protected]>
AuthorDate: Mon, 26 Apr 2021 21:59:11 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 28 Jul 2021 11:42:24 +02:00
sched: Add task_work callback for paranoid L1D flush
The upcoming paranoid L1D flush infrastructure allows tasks to conditionally
(opt-in) flush L1D in switch_mm() as a defense against potential new side
channels or for paranoia reasons. As the flush only makes sense when a task
runs on a non-SMT enabled core, because SMT siblings share L1D, the
switch_mm() logic will kill a task which is flagged for L1D flush when it
is running on an SMT thread.
Add a task work callback so switch_mm() can queue work which sends SIGBUS
to the task when it tries to return to user space.
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/Kconfig | 3 +++
include/linux/sched.h | 10 ++++++++++
2 files changed, 13 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index 129df49..98db634 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1282,6 +1282,9 @@ config ARCH_SPLIT_ARG64
config ARCH_HAS_ELFCORE_COMPAT
bool
+config ARCH_HAS_PARANOID_L1D_FLUSH
+ bool
+
source "kernel/gcov/Kconfig"
source "scripts/gcc-plugins/Kconfig"
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec8d07d..c048e59 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,16 @@ struct task_struct {
struct llist_head kretprobe_instances;
#endif
+#ifdef CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH
+ /*
+ * If L1D flush is supported on mm context switch
+ * then we use this callback head to queue kill work
+ * to kill tasks that are not running on SMT disabled
+ * cores
+ */
+ struct callback_head l1d_flush_kill;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
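Because the signal is delivered via task work on return to user space, an
opted-in task can, if it prefers, install a SIGBUS handler to report the
contract violation before exiting rather than dying silently. The sketch
below is a hypothetical userspace illustration, not part of this series.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void sigbus_handler(int sig)
{
	static const char msg[] =
		"SIGBUS: ran on an SMT sibling after opting into L1D flushing\n";

	(void)sig;
	/* Only async-signal-safe calls in the handler */
	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = { .sa_handler = sigbus_handler };

	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, NULL)) {
		perror("sigaction");
		return 1;
	}

	/* ... opt in via prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, ...)
	 * (see the sketch after the spec_ctrl.rst hunk above) and run the
	 * sensitive workload here ... */
	pause();
	return 0;
}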
The following commit has been merged into the x86/cpu branch of tip:
Commit-ID: c52787b590634646d4da3d8f23c4532ba050d40d
Gitweb: https://git.kernel.org/tip/c52787b590634646d4da3d8f23c4532ba050d40d
Author: Balbir Singh <[email protected]>
AuthorDate: Fri, 08 Jan 2021 23:10:52 +11:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 28 Jul 2021 11:42:23 +02:00
x86/smp: Add a per-cpu view of SMT state
A new field smt_active in cpuinfo_x86 identifies whether the current
core/CPU is in SMT mode or not.
This is helpful when the system has the sibling threads of some of its
cores offlined and can be used for cases where action is taken based on
the per-core SMT state.
The upcoming support for paranoid L1D flush will make use of this information.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/processor.h | 2 ++
arch/x86/kernel/smpboot.c | 10 +++++++++-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f3020c5..1e0d13c 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -136,6 +136,8 @@ struct cpuinfo_x86 {
u16 logical_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
+ /* Is SMT active on this core? */
+ bool smt_active;
u32 microcode;
/* Address space bits used by the cache internally */
u8 x86_cache_bits;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 9320285..85f6e24 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -610,6 +610,9 @@ void set_cpu_sibling_map(int cpu)
if (threads > __max_smt_threads)
__max_smt_threads = threads;
+ for_each_cpu(i, topology_sibling_cpumask(cpu))
+ cpu_data(i).smt_active = threads > 1;
+
/*
* This needs a separate iteration over the cpus because we rely on all
* topology_sibling_cpumask links to be set-up.
@@ -1552,8 +1555,13 @@ static void remove_siblinginfo(int cpu)
for_each_cpu(sibling, topology_die_cpumask(cpu))
cpumask_clear_cpu(cpu, topology_die_cpumask(sibling));
- for_each_cpu(sibling, topology_sibling_cpumask(cpu))
+
+ for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+ if (cpumask_weight(topology_sibling_cpumask(sibling)) == 1)
+ cpu_data(sibling).smt_active = false;
+ }
+
for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
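Applications which opt in are expected to confine themselves to non-SMT
cores. One way to do that from userspace, sketched below under the
assumption of standard sysfs topology files and contiguous CPU numbering,
is to select CPUs whose thread_siblings_list contains only the CPU itself
and pin the task there with sched_setaffinity() before issuing the prctl().

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);

	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		char path[128], buf[64];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
			 cpu);
		f = fopen(path, "r");
		if (!f)
			break;	/* assume contiguous CPU numbering */
		/* A single entry (no ',' or '-') means this core has no
		 * online SMT sibling. */
		if (fgets(buf, sizeof(buf), f) &&
		    !strchr(buf, ',') && !strchr(buf, '-'))
			CPU_SET(cpu, &set);
		fclose(f);
	}

	if (!CPU_COUNT(&set)) {
		fprintf(stderr, "no non-SMT CPUs found\n");
		return 1;
	}
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("pinned to %d non-SMT CPU(s); safe to opt into L1D flushing\n",
	       CPU_COUNT(&set));
	return 0;
}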