Andy Lutomirski recently noted a probable bug in write_pkru(), but
it was unclear if it was user-visible. A recent bug report in
related code[1] forced me to take a look.
Basically, manipulation of XSAVE state is too unstructured.
get_xsave_addr() gives callers the impression they can read and
write XSAVE state when there are a lot of pitfalls, like updates
to xstate.features bits.
As a result, more than one call site screws up the modification
of PKRU in the XSAVE buffer. This series fixes that problem up
and also hopefully carves out a less error-prone path that can
be reused for other XSAVE features.
This series:
* Moves the PKRU manipulation to a more appropriate location,
away from the page table code
* Wraps get_xsave_addr() with more structured, less error-prone
interfaces.
* Conditionally hides a pkey debugfs file, eliminating the need
for new runtime checks to work with the new interface.
* Add a selftest to make it more likely to catch bugs like this
in the future. This improved selftest catches this issue on
Intel CPUs. Without the improvement, it only triggers on AMD.
This has been lightly tested so far. It probably needs a
tested-by from at least the AMD folks before anyone applies it.
1. https://lore.kernel.org/linux-kselftest/[email protected]/
--
arch/x86/include/asm/fpu/internal.h | 8 -
arch/x86/include/asm/fpu/xstate.h | 130 +++++++++++++++++++++++++++
arch/x86/include/asm/pgtable.h | 29 ------
arch/x86/include/asm/processor.h | 7 +
arch/x86/kernel/cpu/common.c | 6 -
arch/x86/kernel/fpu/xstate.c | 2
arch/x86/kernel/process_64.c | 1
arch/x86/kvm/svm/sev.c | 1
arch/x86/kvm/x86.c | 1
arch/x86/mm/pkeys.c | 13 +-
tools/testing/selftests/vm/Makefile | 4
tools/testing/selftests/vm/pkey-x86.h | 1
tools/testing/selftests/vm/protection_keys.c | 73 +++++++++++++++
13 files changed, 227 insertions(+), 49 deletions(-)
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Babu Moger <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
From: Dave Hansen <[email protected]>
There are two points in the kernel which write to PKRU in a buggy way:
* In switch_fpu_finish(), where having xfeatures[PKRU]=0 will result
in PKRU being assigned 'init_pkru_value' instead of 0x0.
* In write_pkru(), xfeatures[PKRU]=0 will result in PKRU having the
correct value, but the XSAVE buffer will remain stale because
xfeatures is not updated.
Both of them screw up the fact that get_xsave_addr() will return NULL
for PKRU when it is in the XSAVE "init state". This went unnoticed
until now because on Intel hardware XINUSE[PKRU] is never 0 because
of the kernel policy around 'init_pkru_value'. AMD hardware, on the
other hand, can set XINUSE[PKRU]=0 via a normal WRPKRU(0). The
handy selftests somewhat accidentally produced a case[2] where
WRPKRU(0) occurs.
get_xsave_addr() is a horrible interface[1], especially when used for
writing state. It is too easy for callers to be tricked into thinking:
1. On a NULL return that they have no work to do
2. On a valid pointer return that they *can* safely write state
without doing more work like setting an xfeatures bit.
Wrap get_xsave_addr() with some additional infrastructure. Ensure
that callers must declare their intent to read or write to the state.
Inject the init state into both reads *and* writes. This ensures
that writers never have to deal with detritus from previous state.
The new common xstate infrastructure:
xstatebuf_get_write_ptr()
and
xfeature_init_space()
should be quite usable for other xfeatures with trivial updates to
xfeature_init_space(). My hope is that we can move away from
all use of get_xsave_addr(), replacing it with things like
xstate_read_pkru().
The new BUG_ON()s are not great. But, they do represent a severe
violation of expectations and XSAVE state can be security-sensitive
and these represent a truly dazed-and-confused situation.
1. I know, I wrote it. I'm really sorry.
2. https://lore.kernel.org/linux-kselftest/[email protected]/
Signed-off-by: Dave Hansen <[email protected]>
Fixes: 0d714dba1626 ("x86/fpu: Update xstate's PKRU value on write_pkru()")
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Babu Moger <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
---
b/arch/x86/include/asm/fpu/internal.h | 8 --
b/arch/x86/include/asm/fpu/xstate.h | 111 +++++++++++++++++++++++++++++++---
b/arch/x86/include/asm/processor.h | 7 ++
b/arch/x86/kernel/cpu/common.c | 6 -
b/arch/x86/mm/pkeys.c | 6 -
5 files changed, 115 insertions(+), 23 deletions(-)
diff -puN arch/x86/include/asm/fpu/internal.h~write_pkru arch/x86/include/asm/fpu/internal.h
--- a/arch/x86/include/asm/fpu/internal.h~write_pkru 2021-05-27 16:40:26.903705463 -0700
+++ b/arch/x86/include/asm/fpu/internal.h 2021-05-27 16:40:26.919705463 -0700
@@ -564,7 +564,6 @@ static inline void switch_fpu_prepare(st
static inline void switch_fpu_finish(struct fpu *new_fpu)
{
u32 pkru_val = init_pkru_value;
- struct pkru_state *pk;
if (!static_cpu_has(X86_FEATURE_FPU))
return;
@@ -578,11 +577,8 @@ static inline void switch_fpu_finish(str
* PKRU state is switched eagerly because it needs to be valid before we
* return to userland e.g. for a copy_to_user() operation.
*/
- if (current->mm) {
- pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
- if (pk)
- pkru_val = pk->pkru;
- }
+ if (current->mm)
+ pkru_val = xstate_read_pkru(&new_fpu->state.xsave);
__write_pkru(pkru_val);
/*
diff -puN arch/x86/include/asm/fpu/xstate.h~write_pkru arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~write_pkru 2021-05-27 16:40:26.906705463 -0700
+++ b/arch/x86/include/asm/fpu/xstate.h 2021-05-27 16:40:26.919705463 -0700
@@ -124,27 +124,124 @@ static inline u32 read_pkru(void)
return 0;
}
+static inline void xfeature_mark_non_init(struct xregs_state *xstate,
+ int xfeature_nr)
+{
+ /*
+ * Caller will place data in the @xstate buffer.
+ * Mark the xfeature as non-init:
+ */
+ xstate->header.xfeatures |= BIT_ULL(xfeature_nr);
+}
+
+
+/* Set the contents of @xfeature_nr to the hardware init state */
+static inline void xfeature_init_space(struct xregs_state *xstate,
+ int xfeature_nr)
+{
+ void *state = get_xsave_addr(xstate, xfeature_nr);
+
+ switch (xfeature_nr) {
+ case XFEATURE_PKRU:
+ /* zero the whole state, including reserved bits */
+ memset(state, 0, sizeof(struct pkru_state));
+ break;
+ default:
+ BUG();
+ }
+}
+
+/*
+ * Called when it is necessary to write to an XSAVE
+ * component feature. Guarantees that a future
+ * XRSTOR of the 'xstate' buffer will not consider
+ * @xfeature_nr to be in its init state.
+ *
+ * The returned buffer may contain old state. The
+ * caller must be prepared to fill the entire buffer.
+ *
+ * Caller must first ensure that @xfeature_nr is
+ * enabled and present in the @xstate buffer.
+ */
+static inline void *xstatebuf_get_write_ptr(struct xregs_state *xstate,
+ int xfeature_nr)
+{
+ bool feature_was_init = xstate->header.xfeatures & BIT_ULL(xfeature_nr);
+
+ /*
+ * xcomp_bv represents whether 'xstate' has space for
+ * features. If not, something is horribly wrong and
+ * a write would corrupt memory. Perhaps xfeature_nr
+ * was not enabled.
+ */
+ BUG_ON(!(xstate->header.xcomp_bv & BIT_ULL(xfeature_nr)));
+
+ /*
+ * Ensure a sane xfeature_nr, including avoiding
+ * confusion with XCOMP_BV_COMPACTED_FORMAT.
+ */
+ BUG_ON(xfeature_nr >= XFEATURE_MAX);
+
+ /* Prepare xstate for a write to the xfeature: */
+ xfeature_mark_non_init(xstate, xfeature_nr);
+
+ /*
+ * If xfeature_nr was in the init state, update the buffer
+ * to match the state. Ensures that callers can safely
+ * write only a part of the state, they are not forced to
+ * write it in its entirety.
+ */
+ if (feature_was_init)
+ xfeature_init_space(xstate, xfeature_nr);
+
+ return get_xsave_addr(xstate, xfeature_nr);
+}
+
+/* Caller must ensure X86_FEATURE_OSPKE is enabled. */
+static inline void xstate_write_pkru(struct xregs_state *xstate, u32 pkru)
+{
+ struct pkru_state *pk;
+
+ pk = xstatebuf_get_write_ptr(xstate, XFEATURE_PKRU);
+ pk->pkru = pkru;
+}
+
+/*
+ * What PKRU value is represented in the 'xstate'? Note,
+ * this returns the *architecturally* represented value,
+ * not the literal in-memory value. They may be different.
+ */
+static inline u32 xstate_read_pkru(struct xregs_state *xstate)
+{
+ struct pkru_state *pk;
+
+ pk = get_xsave_addr(xstate, XFEATURE_PKRU);
+ /*
+ * If present, pull PKRU out of the XSAVE buffer.
+ * Otherwise, use the hardware init value.
+ */
+ if (pk)
+ return pk->pkru;
+ else
+ return PKRU_HW_INIT_VALUE;
+}
+
/*
* Update all of the PKRU state for the current task:
* PKRU register and PKRU xstate.
*/
static inline void current_write_pkru(u32 pkru)
{
- struct pkru_state *pk;
-
if (!boot_cpu_has(X86_FEATURE_OSPKE))
return;
- pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU);
-
+ fpregs_lock();
/*
* The PKRU value in xstate needs to be in sync with the value that is
* written to the CPU. The FPU restore on return to userland would
* otherwise load the previous value again.
*/
- fpregs_lock();
- if (pk)
- pk->pkru = pkru;
+ xstate_write_pkru(¤t->thread.fpu.state.xsave, pkru);
__write_pkru(pkru);
fpregs_unlock();
}
diff -puN arch/x86/include/asm/processor.h~write_pkru arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~write_pkru 2021-05-27 16:40:26.908705463 -0700
+++ b/arch/x86/include/asm/processor.h 2021-05-27 16:40:26.921705463 -0700
@@ -854,4 +854,11 @@ enum mds_mitigations {
MDS_MITIGATION_VMWERV,
};
+/*
+ * The XSAVE architecture defines an "init state" for
+ * PKRU. PKRU is set to this value by XRSTOR when it
+ * tries to restore PKRU but has on value in the buffer.
+ */
+#define PKRU_HW_INIT_VALUE 0x0
+
#endif /* _ASM_X86_PROCESSOR_H */
diff -puN arch/x86/kernel/cpu/common.c~write_pkru arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~write_pkru 2021-05-27 16:40:26.912705463 -0700
+++ b/arch/x86/kernel/cpu/common.c 2021-05-27 16:40:26.924705463 -0700
@@ -466,8 +466,6 @@ static bool pku_disabled;
static __always_inline void setup_pku(struct cpuinfo_x86 *c)
{
- struct pkru_state *pk;
-
/* check the boot processor, plus compile options for PKU: */
if (!cpu_feature_enabled(X86_FEATURE_PKU))
return;
@@ -478,9 +476,7 @@ static __always_inline void setup_pku(st
return;
cr4_set_bits(X86_CR4_PKE);
- pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
- if (pk)
- pk->pkru = init_pkru_value;
+ xstate_write_pkru(¤t->thread.fpu.state.xsave, init_pkru_value);
/*
* Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
* cpuid bit to be set. We need to ensure that we
diff -puN arch/x86/mm/pkeys.c~write_pkru arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~write_pkru 2021-05-27 16:40:26.914705463 -0700
+++ b/arch/x86/mm/pkeys.c 2021-05-27 16:40:26.926705463 -0700
@@ -155,7 +155,6 @@ static ssize_t init_pkru_read_file(struc
static ssize_t init_pkru_write_file(struct file *file,
const char __user *user_buf, size_t count, loff_t *ppos)
{
- struct pkru_state *pk;
char buf[32];
ssize_t len;
u32 new_init_pkru;
@@ -178,10 +177,7 @@ static ssize_t init_pkru_write_file(stru
return -EINVAL;
WRITE_ONCE(init_pkru_value, new_init_pkru);
- pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
- if (!pk)
- return -EINVAL;
- pk->pkru = new_init_pkru;
+ xstate_write_pkru(&init_fpstate.xsave, new_init_pkru);
return count;
}
_
From: Dave Hansen <[email protected]>
The PKRU hardware is permissive by default: all reads and writes are
allowed. The in-kernel policy is restrictive by default: deny all
unnecessary access until explicitly requested.
That policy can be modified with a debugfs file: "x86/init_pkru".
This file is created unconditionally, regardless of PKRU support in
the hardware, which is a little silly.
Avoid creating the file when pkeys are not available. This also
removes the need to check for pkey support at runtime, which would be
required once the new pkey modification infrastructure is put in place
later in this series.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Babu Moger <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
---
b/arch/x86/mm/pkeys.c | 4 ++++
1 file changed, 4 insertions(+)
diff -puN arch/x86/mm/pkeys.c~x86-pkeys-skip-debugfs-file arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~x86-pkeys-skip-debugfs-file 2021-05-27 16:40:25.847705465 -0700
+++ b/arch/x86/mm/pkeys.c 2021-05-27 16:40:25.852705465 -0700
@@ -193,6 +193,10 @@ static const struct file_operations fops
static int __init create_init_pkru_value(void)
{
+ /* Do not expose the file if pkeys are not supported. */
+ if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
+ return 0;
+
debugfs_create_file("init_pkru", S_IRUSR | S_IWUSR,
arch_debugfs_dir, NULL, &fops_init_pkru);
return 0;
_
From: Dave Hansen <[email protected]>
write_pkru() was once concerned purely with writing to the PKRU register.
However, the current task XSAVE buffer must also be updated in a
coordinated way.
Change the naming to reflect that this is an operation which applies to
a task (current) and its state in addition to the register itself.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Babu Moger <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
---
b/arch/x86/include/asm/fpu/xstate.h | 6 +++++-
b/arch/x86/kernel/fpu/xstate.c | 2 +-
b/arch/x86/mm/pkeys.c | 2 +-
3 files changed, 7 insertions(+), 3 deletions(-)
diff -puN arch/x86/include/asm/fpu/xstate.h~rename-write_pkru arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~rename-write_pkru 2021-05-27 16:40:24.618705468 -0700
+++ b/arch/x86/include/asm/fpu/xstate.h 2021-05-27 16:40:24.631705468 -0700
@@ -124,7 +124,11 @@ static inline u32 read_pkru(void)
return 0;
}
-static inline void write_pkru(u32 pkru)
+/*
+ * Update all of the PKRU state for the current task:
+ * PKRU register and PKRU xstate.
+ */
+static inline void current_write_pkru(u32 pkru)
{
struct pkru_state *pk;
diff -puN arch/x86/kernel/fpu/xstate.c~rename-write_pkru arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~rename-write_pkru 2021-05-27 16:40:24.620705468 -0700
+++ b/arch/x86/kernel/fpu/xstate.c 2021-05-27 16:40:24.633705468 -0700
@@ -1026,7 +1026,7 @@ int arch_set_user_pkey_access(struct tas
old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
/* Write old part along with new part: */
- write_pkru(old_pkru | new_pkru_bits);
+ current_write_pkru(old_pkru | new_pkru_bits);
return 0;
}
diff -puN arch/x86/mm/pkeys.c~rename-write_pkru arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~rename-write_pkru 2021-05-27 16:40:24.626705468 -0700
+++ b/arch/x86/mm/pkeys.c 2021-05-27 16:40:24.634705468 -0700
@@ -139,7 +139,7 @@ void copy_init_pkru_to_fpregs(void)
* Override the PKRU state that came from 'init_fpstate'
* with the baseline from the process.
*/
- write_pkru(init_pkru_value_snapshot);
+ current_write_pkru(init_pkru_value_snapshot);
}
static ssize_t init_pkru_read_file(struct file *file, char __user *user_buf,
_
From: Dave Hansen <[email protected]>
On x86, there is a set of instructions used to save and restore register
state collectively known as the XSAVE architecture. There are about a
dozen different features managed with XSAVE. The protection keys
register, PKRU, is one of those features.
The hardware optimizes XSAVE by tracking when the state has not changed
from its initial (init) state. In this case, it can avoid the cost of
writing state to memory (it would usually just be a bunch of 0's).
When the pkey register is 0x0 the hardware optionally choose to track
the register as being in the init state (optimize away the writes).
AMD CPUs do this more aggressively compared to Intel.
On x86, PKRU is rarely in its (very permissive) init state. Instead,
the value defaults to something very restrictive. It is not surprising
that bugs have popped up in the rare cases when PKRU reaches its init
state.
Add a protection key selftest which gets the protection keys register
into its init state in a way that should work on Intel and AMD. Then,
do a bunch of pkey register reads to watch for inadvertent changes.
This adds "-mxsave" to CFLAGS for all the x86 vm selftests in order
to allow use of the XSAVE instruction __builtin functions. This will
make the builtins available on all of the vm selftests, but is
expected to be harmless.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Babu Moger <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
---
b/tools/testing/selftests/vm/Makefile | 4 -
b/tools/testing/selftests/vm/pkey-x86.h | 1
b/tools/testing/selftests/vm/protection_keys.c | 71 +++++++++++++++++++++++++
3 files changed, 74 insertions(+), 2 deletions(-)
diff -puN tools/testing/selftests/vm/Makefile~init-pkru-selftest tools/testing/selftests/vm/Makefile
--- a/tools/testing/selftests/vm/Makefile~init-pkru-selftest 2021-05-27 16:40:28.299705459 -0700
+++ b/tools/testing/selftests/vm/Makefile 2021-05-27 16:40:28.315705459 -0700
@@ -99,7 +99,7 @@ $(1) $(1)_64: $(OUTPUT)/$(1)_64
endef
ifeq ($(CAN_BUILD_I386),1)
-$(BINARIES_32): CFLAGS += -m32
+$(BINARIES_32): CFLAGS += -m32 -mxsave
$(BINARIES_32): LDLIBS += -lrt -ldl -lm
$(BINARIES_32): $(OUTPUT)/%_32: %.c
$(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(notdir $^) $(LDLIBS) -o $@
@@ -107,7 +107,7 @@ $(foreach t,$(TARGETS),$(eval $(call gen
endif
ifeq ($(CAN_BUILD_X86_64),1)
-$(BINARIES_64): CFLAGS += -m64
+$(BINARIES_64): CFLAGS += -m64 -mxsave
$(BINARIES_64): LDLIBS += -lrt -ldl
$(BINARIES_64): $(OUTPUT)/%_64: %.c
$(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(notdir $^) $(LDLIBS) -o $@
diff -puN tools/testing/selftests/vm/pkey-x86.h~init-pkru-selftest tools/testing/selftests/vm/pkey-x86.h
--- a/tools/testing/selftests/vm/pkey-x86.h~init-pkru-selftest 2021-05-27 16:40:28.301705459 -0700
+++ b/tools/testing/selftests/vm/pkey-x86.h 2021-05-27 16:40:28.315705459 -0700
@@ -126,6 +126,7 @@ static inline u32 pkey_bit_position(int
#define XSTATE_PKEY_BIT (9)
#define XSTATE_PKEY 0x200
+#define XSTATE_BV_OFFSET 512
int pkey_reg_xstate_offset(void)
{
diff -puN tools/testing/selftests/vm/protection_keys.c~init-pkru-selftest tools/testing/selftests/vm/protection_keys.c
--- a/tools/testing/selftests/vm/protection_keys.c~init-pkru-selftest 2021-05-27 16:40:28.303705459 -0700
+++ b/tools/testing/selftests/vm/protection_keys.c 2021-05-27 16:40:28.314705459 -0700
@@ -1278,6 +1278,76 @@ void test_pkey_alloc_exhaust(int *ptr, u
}
}
+void arch_force_pkey_reg_init(void)
+{
+#if defined(__i386__) || defined(__x86_64__) /* arch */
+ u64 *buf;
+
+ /*
+ * All keys should be allocated and set to allow reads and
+ * writes, so the register should be all 0. If not, just
+ * skip the test.
+ */
+ if (read_pkey_reg())
+ return;
+
+ /*
+ * Just allocate an absurd about of memory rather than
+ * doing the XSAVE size enumeration dance.
+ */
+ buf = mmap(NULL, 1*MB, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+
+ /* These __builtins require compiling with -mxsave */
+
+ /* XSAVE to build a valid buffer: */
+ __builtin_ia32_xsave(buf, XSTATE_PKEY);
+ /* Clear XSTATE_BV[PKRU]: */
+ buf[XSTATE_BV_OFFSET/sizeof(u64)] &= ~XSTATE_PKEY;
+ /* XRSTOR will likely get PKRU back to the init state: */
+ __builtin_ia32_xrstor(buf, XSTATE_PKEY);
+
+ munmap(buf, 1*MB);
+#endif
+}
+
+
+/*
+ * This is mostly useless on ppc for now. But it will not
+ * hurt anything and should give some better coverage as
+ * a long-running test that continually checks the pkey
+ * register.
+ */
+void test_pkey_init_state(int *ptr, u16 pkey)
+{
+ int err;
+ int allocated_pkeys[NR_PKEYS] = {0};
+ int nr_allocated_pkeys = 0;
+ int i;
+
+ for (i = 0; i < NR_PKEYS*3; i++) {
+ int new_pkey = alloc_pkey();
+
+ allocated_pkeys[nr_allocated_pkeys++] = new_pkey;
+ }
+
+ dprintf3("%s()::%d\n", __func__, __LINE__);
+
+ arch_force_pkey_reg_init();
+
+ /*
+ * Loop for a bit, hoping to get exercise the kernel
+ * context switch code.
+ */
+ for (i = 0; i < 1000000; i++)
+ read_pkey_reg();
+
+ for (i = 0; i < nr_allocated_pkeys; i++) {
+ err = sys_pkey_free(allocated_pkeys[i]);
+ pkey_assert(!err);
+ read_pkey_reg(); /* for shadow checking */
+ }
+}
+
/*
* pkey 0 is special. It is allocated by default, so you do not
* have to call pkey_alloc() to use it first. Make sure that it
@@ -1502,6 +1572,7 @@ void (*pkey_tests[])(int *ptr, u16 pkey)
test_implicit_mprotect_exec_only_memory,
test_mprotect_with_pkey_0,
test_ptrace_of_child,
+ test_pkey_init_state,
test_pkey_syscalls_on_non_allocated_pkey,
test_pkey_syscalls_bad_args,
test_pkey_alloc_exhaust,
_
From: Dave Hansen <[email protected]>
write_pkru() was originally used just to write to the PKRU register. It
was mercifully short and sweet and was not out of place in pgtable.h with
some other pkey-related code.
But, later work included a requirement to also modify the task XSAVE
buffer when updating the register. This really is more related to the
XSAVE architecture than to paging.
Move the read/write_pkru() to asm/fpu/xstate.h. pgtable.h won't miss them.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Babu Moger <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: Ram Pai <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
---
b/arch/x86/include/asm/fpu/xstate.h | 29 +++++++++++++++++++++++++++++
b/arch/x86/include/asm/pgtable.h | 29 -----------------------------
b/arch/x86/kernel/process_64.c | 1 +
b/arch/x86/kvm/svm/sev.c | 1 +
b/arch/x86/kvm/x86.c | 1 +
b/arch/x86/mm/pkeys.c | 1 +
6 files changed, 33 insertions(+), 29 deletions(-)
diff -puN arch/x86/include/asm/fpu/xstate.h~move-write_pkru arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~move-write_pkru 2021-05-27 16:40:23.110705472 -0700
+++ b/arch/x86/include/asm/fpu/xstate.h 2021-05-27 16:40:23.132705472 -0700
@@ -6,6 +6,7 @@
#include <linux/types.h>
#include <asm/processor.h>
+#include <asm/fpu/api.h>
#include <asm/user.h>
/* Bit 63 of XCR0 is reserved for future expansion */
@@ -116,4 +117,32 @@ void copy_kernel_to_dynamic_supervisor(s
/* Validate an xstate header supplied by userspace (ptrace or sigreturn) */
int validate_user_xstate_header(const struct xstate_header *hdr);
+static inline u32 read_pkru(void)
+{
+ if (boot_cpu_has(X86_FEATURE_OSPKE))
+ return rdpkru();
+ return 0;
+}
+
+static inline void write_pkru(u32 pkru)
+{
+ struct pkru_state *pk;
+
+ if (!boot_cpu_has(X86_FEATURE_OSPKE))
+ return;
+
+ pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU);
+
+ /*
+ * The PKRU value in xstate needs to be in sync with the value that is
+ * written to the CPU. The FPU restore on return to userland would
+ * otherwise load the previous value again.
+ */
+ fpregs_lock();
+ if (pk)
+ pk->pkru = pkru;
+ __write_pkru(pkru);
+ fpregs_unlock();
+}
+
#endif
diff -puN arch/x86/include/asm/pgtable.h~move-write_pkru arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~move-write_pkru 2021-05-27 16:40:23.114705472 -0700
+++ b/arch/x86/include/asm/pgtable.h 2021-05-27 16:40:23.135705472 -0700
@@ -126,35 +126,6 @@ static inline int pte_dirty(pte_t pte)
return pte_flags(pte) & _PAGE_DIRTY;
}
-
-static inline u32 read_pkru(void)
-{
- if (boot_cpu_has(X86_FEATURE_OSPKE))
- return rdpkru();
- return 0;
-}
-
-static inline void write_pkru(u32 pkru)
-{
- struct pkru_state *pk;
-
- if (!boot_cpu_has(X86_FEATURE_OSPKE))
- return;
-
- pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU);
-
- /*
- * The PKRU value in xstate needs to be in sync with the value that is
- * written to the CPU. The FPU restore on return to userland would
- * otherwise load the previous value again.
- */
- fpregs_lock();
- if (pk)
- pk->pkru = pkru;
- __write_pkru(pkru);
- fpregs_unlock();
-}
-
static inline int pte_young(pte_t pte)
{
return pte_flags(pte) & _PAGE_ACCESSED;
diff -puN arch/x86/kernel/process_64.c~move-write_pkru arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c~move-write_pkru 2021-05-27 16:40:23.117705472 -0700
+++ b/arch/x86/kernel/process_64.c 2021-05-27 16:40:23.138705472 -0700
@@ -41,6 +41,7 @@
#include <linux/syscalls.h>
#include <asm/processor.h>
+#include <asm/fpu/xstate.h>
#include <asm/fpu/internal.h>
#include <asm/mmu_context.h>
#include <asm/prctl.h>
diff -puN arch/x86/kvm/svm/sev.c~move-write_pkru arch/x86/kvm/svm/sev.c
--- a/arch/x86/kvm/svm/sev.c~move-write_pkru 2021-05-27 16:40:23.121705472 -0700
+++ b/arch/x86/kvm/svm/sev.c 2021-05-27 16:40:23.142705472 -0700
@@ -16,6 +16,7 @@
#include <linux/swap.h>
#include <linux/processor.h>
#include <linux/trace_events.h>
+#include <asm/fpu/xstate.h>
#include <asm/fpu/internal.h>
#include <asm/trapnr.h>
diff -puN arch/x86/kvm/x86.c~move-write_pkru arch/x86/kvm/x86.c
--- a/arch/x86/kvm/x86.c~move-write_pkru 2021-05-27 16:40:23.125705472 -0700
+++ b/arch/x86/kvm/x86.c 2021-05-27 16:40:23.150705472 -0700
@@ -66,6 +66,7 @@
#include <asm/desc.h>
#include <asm/mce.h>
#include <linux/kernel_stat.h>
+#include <asm/fpu/xstate.h>
#include <asm/fpu/internal.h> /* Ugh! */
#include <asm/pvclock.h>
#include <asm/div64.h>
diff -puN arch/x86/mm/pkeys.c~move-write_pkru arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~move-write_pkru 2021-05-27 16:40:23.128705472 -0700
+++ b/arch/x86/mm/pkeys.c 2021-05-27 16:40:23.151705472 -0700
@@ -10,6 +10,7 @@
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/mmu_context.h> /* vma_pkey() */
+#include <asm/fpu/xstate.h> /* read/write_pkru() */
#include <asm/fpu/internal.h> /* init_fpstate */
int __execute_only_pkey(struct mm_struct *mm)
_
On 28.5.2021 2.51, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> There are two points in the kernel which write to PKRU in a buggy way:
>
> * In switch_fpu_finish(), where having xfeatures[PKRU]=0 will result
> in PKRU being assigned 'init_pkru_value' instead of 0x0.
> * In write_pkru(), xfeatures[PKRU]=0 will result in PKRU having the
> correct value, but the XSAVE buffer will remain stale because
> xfeatures is not updated.
>
> Both of them screw up the fact that get_xsave_addr() will return NULL
> for PKRU when it is in the XSAVE "init state". This went unnoticed
> until now because on Intel hardware XINUSE[PKRU] is never 0 because
> of the kernel policy around 'init_pkru_value'. AMD hardware, on the
> other hand, can set XINUSE[PKRU]=0 via a normal WRPKRU(0). The
> handy selftests somewhat accidentally produced a case[2] where
> WRPKRU(0) occurs.
>
> get_xsave_addr() is a horrible interface[1], especially when used for
> writing state. It is too easy for callers to be tricked into thinking:
> 1. On a NULL return that they have no work to do
> 2. On a valid pointer return that they *can* safely write state
> without doing more work like setting an xfeatures bit.
>
> Wrap get_xsave_addr() with some additional infrastructure. Ensure
> that callers must declare their intent to read or write to the state.
> Inject the init state into both reads *and* writes. This ensures
> that writers never have to deal with detritus from previous state.
>
> The new common xstate infrastructure:
>
> xstatebuf_get_write_ptr()
> and
> xfeature_init_space()
>
> should be quite usable for other xfeatures with trivial updates to
> xfeature_init_space(). My hope is that we can move away from
> all use of get_xsave_addr(), replacing it with things like
> xstate_read_pkru().
>
> The new BUG_ON()s are not great. But, they do represent a severe
> violation of expectations and XSAVE state can be security-sensitive
> and these represent a truly dazed-and-confused situation.
>
> 1. I know, I wrote it. I'm really sorry.
> 2. https://lore.kernel.org/linux-kselftest/[email protected]/
>
> Signed-off-by: Dave Hansen <[email protected]>
> Fixes: 0d714dba1626 ("x86/fpu: Update xstate's PKRU value on write_pkru()")
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: [email protected]
> Cc: Andy Lutomirski <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Cc: Babu Moger <[email protected]>
> Cc: Dave Kleikamp <[email protected]>
> Cc: Ram Pai <[email protected]>
> Cc: Thiago Jung Bauermann <[email protected]>
> Cc: Sebastian Andrzej Siewior <[email protected]>
> ---
>
> b/arch/x86/include/asm/fpu/internal.h | 8 --
> b/arch/x86/include/asm/fpu/xstate.h | 111 +++++++++++++++++++++++++++++++---
> b/arch/x86/include/asm/processor.h | 7 ++
> b/arch/x86/kernel/cpu/common.c | 6 -
> b/arch/x86/mm/pkeys.c | 6 -
> 5 files changed, 115 insertions(+), 23 deletions(-)
>
> diff -puN arch/x86/include/asm/fpu/internal.h~write_pkru arch/x86/include/asm/fpu/internal.h
> --- a/arch/x86/include/asm/fpu/internal.h~write_pkru 2021-05-27 16:40:26.903705463 -0700
> +++ b/arch/x86/include/asm/fpu/internal.h 2021-05-27 16:40:26.919705463 -0700
> @@ -564,7 +564,6 @@ static inline void switch_fpu_prepare(st
> static inline void switch_fpu_finish(struct fpu *new_fpu)
> {
> u32 pkru_val = init_pkru_value;
> - struct pkru_state *pk;
>
> if (!static_cpu_has(X86_FEATURE_FPU))
> return;
> @@ -578,11 +577,8 @@ static inline void switch_fpu_finish(str
> * PKRU state is switched eagerly because it needs to be valid before we
> * return to userland e.g. for a copy_to_user() operation.
> */
> - if (current->mm) {
> - pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> - if (pk)
> - pkru_val = pk->pkru;
> - }
> + if (current->mm)
> + pkru_val = xstate_read_pkru(&new_fpu->state.xsave);
> __write_pkru(pkru_val);
>
> /*
> diff -puN arch/x86/include/asm/fpu/xstate.h~write_pkru arch/x86/include/asm/fpu/xstate.h
> --- a/arch/x86/include/asm/fpu/xstate.h~write_pkru 2021-05-27 16:40:26.906705463 -0700
> +++ b/arch/x86/include/asm/fpu/xstate.h 2021-05-27 16:40:26.919705463 -0700
> @@ -124,27 +124,124 @@ static inline u32 read_pkru(void)
> return 0;
> }
>
> +static inline void xfeature_mark_non_init(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + /*
> + * Caller will place data in the @xstate buffer.
> + * Mark the xfeature as non-init:
> + */
> + xstate->header.xfeatures |= BIT_ULL(xfeature_nr);
> +}
> +
> +
> +/* Set the contents of @xfeature_nr to the hardware init state */
> +static inline void xfeature_init_space(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + void *state = get_xsave_addr(xstate, xfeature_nr);
> +
> + switch (xfeature_nr) {
> + case XFEATURE_PKRU:
> + /* zero the whole state, including reserved bits */
> + memset(state, 0, sizeof(struct pkru_state));
> + break;
> + default:
> + BUG();
> + }
> +}
> +
> +/*
> + * Called when it is necessary to write to an XSAVE
> + * component feature. Guarantees that a future
> + * XRSTOR of the 'xstate' buffer will not consider
> + * @xfeature_nr to be in its init state.
> + *
> + * The returned buffer may contain old state. The
> + * caller must be prepared to fill the entire buffer.
> + *
> + * Caller must first ensure that @xfeature_nr is
> + * enabled and present in the @xstate buffer.
> + */
> +static inline void *xstatebuf_get_write_ptr(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + bool feature_was_init = xstate->header.xfeatures & BIT_ULL(xfeature_nr);
> +
Maybe
bool feature_was_init = !(xstate->header.xfeatures & BIT_ULL(xfeature_nr));
> + /*
> + * xcomp_bv represents whether 'xstate' has space for
> + * features. If not, something is horribly wrong and
> + * a write would corrupt memory. Perhaps xfeature_nr
> + * was not enabled.
> + */
> + BUG_ON(!(xstate->header.xcomp_bv & BIT_ULL(xfeature_nr)));
> +
> + /*
> + * Ensure a sane xfeature_nr, including avoiding
> + * confusion with XCOMP_BV_COMPACTED_FORMAT.
> + */
> + BUG_ON(xfeature_nr >= XFEATURE_MAX);
> +
> + /* Prepare xstate for a write to the xfeature: */
> + xfeature_mark_non_init(xstate, xfeature_nr);
> +
> + /*
> + * If xfeature_nr was in the init state, update the buffer
> + * to match the state. Ensures that callers can safely
> + * write only a part of the state, they are not forced to
> + * write it in its entirety.
> + */
> + if (feature_was_init)
> + xfeature_init_space(xstate, xfeature_nr);
> +
> + return get_xsave_addr(xstate, xfeature_nr);
> +}
> +
> +/* Caller must ensure X86_FEATURE_OSPKE is enabled. */
> +static inline void xstate_write_pkru(struct xregs_state *xstate, u32 pkru)
> +{
> + struct pkru_state *pk;
> +
> + pk = xstatebuf_get_write_ptr(xstate, XFEATURE_PKRU);
> + pk->pkru = pkru;
> +}
> +
> +/*
> + * What PKRU value is represented in the 'xstate'? Note,
> + * this returns the *architecturally* represented value,
> + * not the literal in-memory value. They may be different.
> + */
> +static inline u32 xstate_read_pkru(struct xregs_state *xstate)
> +{
> + struct pkru_state *pk;
> +
> + pk = get_xsave_addr(xstate, XFEATURE_PKRU);
> + /*
> + * If present, pull PKRU out of the XSAVE buffer.
> + * Otherwise, use the hardware init value.
> + */
> + if (pk)
> + return pk->pkru;
> + else
> + return PKRU_HW_INIT_VALUE;
> +}
> +
> /*
> * Update all of the PKRU state for the current task:
> * PKRU register and PKRU xstate.
> */
> static inline void current_write_pkru(u32 pkru)
> {
> - struct pkru_state *pk;
> -
> if (!boot_cpu_has(X86_FEATURE_OSPKE))
> return;
>
> - pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU);
> -
> + fpregs_lock();
> /*
> * The PKRU value in xstate needs to be in sync with the value that is
> * written to the CPU. The FPU restore on return to userland would
> * otherwise load the previous value again.
> */
> - fpregs_lock();
> - if (pk)
> - pk->pkru = pkru;
> + xstate_write_pkru(¤t->thread.fpu.state.xsave, pkru);
> __write_pkru(pkru);
> fpregs_unlock();
> }
> diff -puN arch/x86/include/asm/processor.h~write_pkru arch/x86/include/asm/processor.h
> --- a/arch/x86/include/asm/processor.h~write_pkru 2021-05-27 16:40:26.908705463 -0700
> +++ b/arch/x86/include/asm/processor.h 2021-05-27 16:40:26.921705463 -0700
> @@ -854,4 +854,11 @@ enum mds_mitigations {
> MDS_MITIGATION_VMWERV,
> };
>
> +/*
> + * The XSAVE architecture defines an "init state" for
> + * PKRU. PKRU is set to this value by XRSTOR when it
> + * tries to restore PKRU but has on value in the buffer.
> + */
> +#define PKRU_HW_INIT_VALUE 0x0
> +
> #endif /* _ASM_X86_PROCESSOR_H */
> diff -puN arch/x86/kernel/cpu/common.c~write_pkru arch/x86/kernel/cpu/common.c
> --- a/arch/x86/kernel/cpu/common.c~write_pkru 2021-05-27 16:40:26.912705463 -0700
> +++ b/arch/x86/kernel/cpu/common.c 2021-05-27 16:40:26.924705463 -0700
> @@ -466,8 +466,6 @@ static bool pku_disabled;
>
> static __always_inline void setup_pku(struct cpuinfo_x86 *c)
> {
> - struct pkru_state *pk;
> -
> /* check the boot processor, plus compile options for PKU: */
> if (!cpu_feature_enabled(X86_FEATURE_PKU))
> return;
> @@ -478,9 +476,7 @@ static __always_inline void setup_pku(st
> return;
>
> cr4_set_bits(X86_CR4_PKE);
> - pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
> - if (pk)
> - pk->pkru = init_pkru_value;
> + xstate_write_pkru(¤t->thread.fpu.state.xsave, init_pkru_value);
> /*
> * Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
> * cpuid bit to be set. We need to ensure that we
> diff -puN arch/x86/mm/pkeys.c~write_pkru arch/x86/mm/pkeys.c
> --- a/arch/x86/mm/pkeys.c~write_pkru 2021-05-27 16:40:26.914705463 -0700
> +++ b/arch/x86/mm/pkeys.c 2021-05-27 16:40:26.926705463 -0700
> @@ -155,7 +155,6 @@ static ssize_t init_pkru_read_file(struc
> static ssize_t init_pkru_write_file(struct file *file,
> const char __user *user_buf, size_t count, loff_t *ppos)
> {
> - struct pkru_state *pk;
> char buf[32];
> ssize_t len;
> u32 new_init_pkru;
> @@ -178,10 +177,7 @@ static ssize_t init_pkru_write_file(stru
> return -EINVAL;
>
> WRITE_ONCE(init_pkru_value, new_init_pkru);
> - pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
> - if (!pk)
> - return -EINVAL;
> - pk->pkru = new_init_pkru;
> + xstate_write_pkru(&init_fpstate.xsave, new_init_pkru);
> return count;
> }
>
> _
On 5/28/21 8:32 AM, Thomas Gleixner wrote:
>>
>> This series:
>> * Moves the PKRU manipulation to a more appropriate location,
>> away from the page table code
>> * Wraps get_xsave_addr() with more structured, less error-prone
>> interfaces.
>> * Conditionally hides a pkey debugfs file, eliminating the need
>> for new runtime checks to work with the new interface.
>> * Add a selftest to make it more likely to catch bugs like this
>> in the future. This improved selftest catches this issue on
>> Intel CPUs. Without the improvement, it only triggers on AMD.
> I think all of this is fundamentaly wrong.
>
> Contrary to FPU state, PKRU has to be updated at context switch
> time. There is absolutely no point in having PKRU XSAVES managed.
>
> It's broken in several ways. Anything which clears and loads the FPU
> will load the wrong PKRU value. Go figure...
>
> So the right thing is to disable PKRU in XCR0 and on sched out simply do
>
> task->thread.pkru = read_pkru();
>
> and on sched in
>
> write_pkru(task->thread.pkru);
>
> Simple, trivial and not going to be wreckaged by anything which fiddles
> with xstates. We all know by now that xstates is a trainwreck and not
> having stuff like that in there is making the fixes I'm doing way
> simpler.
As for the general sentiment that PKRU is not suitable for management
with XSAVE, I'm with you.
I have a few concerns about moving away from XSAVE management, though.
I'm not nixing the whole idea, but there are some things we need to resolve.
First is that there _may_ be ABI concerns. The pkey selftest, for
instance, manipulates the PKRU state on the signal stack and expects
PKRU to be set in XCR0 so that it can do this. I wouldn't be shocked if
some other pkey user depended on the XSAVE signal stack ABI this way.
There are also the usual concerns that folks doing user-level context
switching or other insanity get PKRU context switching for "free" when
it's XSAVE-managed. Moving away from that could break them.
I'll ask around. We could also pretty trivially put some surveillance
in the sigreturn code to look for PKRU changes.
Second, the XSAVE/FPU abomination is actually really handy for pkeys:
1. It establishes a *different* state upon signal delivery. Like
sigaltstack, this means that the signal handler can recover from
what would normally be a fatal condition like WRPKRU(0x3). I *think*
this is OK today even if the signal entry XRSTOR did not touch PKRU
since there's another write_pkru() in this path.
2. It allows the signal handler to inspect the interrupted context's
PKRU value. (used in the selftest)
3. It allows the signal handler to *override* the PKRU value of the
interrupted context. This is used in the selftest as an easy way
to let a memory access instruction execute that initially causes
a pkey-induced page fault as opposed to messing with RIP.
None of this is insurmountable. For the selftest, I need to go looking
at how important that functionality is and look for some alternatives.
Noticed a typo in a comment, but I haven't reviewed these thoroughly.
Shaggy
On 5/27/21 6:51 PM, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> There are two points in the kernel which write to PKRU in a buggy way:
>
> * In switch_fpu_finish(), where having xfeatures[PKRU]=0 will result
> in PKRU being assigned 'init_pkru_value' instead of 0x0.
> * In write_pkru(), xfeatures[PKRU]=0 will result in PKRU having the
> correct value, but the XSAVE buffer will remain stale because
> xfeatures is not updated.
>
> Both of them screw up the fact that get_xsave_addr() will return NULL
> for PKRU when it is in the XSAVE "init state". This went unnoticed
> until now because on Intel hardware XINUSE[PKRU] is never 0 because
> of the kernel policy around 'init_pkru_value'. AMD hardware, on the
> other hand, can set XINUSE[PKRU]=0 via a normal WRPKRU(0). The
> handy selftests somewhat accidentally produced a case[2] where
> WRPKRU(0) occurs.
>
> get_xsave_addr() is a horrible interface[1], especially when used for
> writing state. It is too easy for callers to be tricked into thinking:
> 1. On a NULL return that they have no work to do
> 2. On a valid pointer return that they *can* safely write state
> without doing more work like setting an xfeatures bit.
>
> Wrap get_xsave_addr() with some additional infrastructure. Ensure
> that callers must declare their intent to read or write to the state.
> Inject the init state into both reads *and* writes. This ensures
> that writers never have to deal with detritus from previous state.
>
> The new common xstate infrastructure:
>
> xstatebuf_get_write_ptr()
> and
> xfeature_init_space()
>
> should be quite usable for other xfeatures with trivial updates to
> xfeature_init_space(). My hope is that we can move away from
> all use of get_xsave_addr(), replacing it with things like
> xstate_read_pkru().
>
> The new BUG_ON()s are not great. But, they do represent a severe
> violation of expectations and XSAVE state can be security-sensitive
> and these represent a truly dazed-and-confused situation.
>
> 1. I know, I wrote it. I'm really sorry.
> 2. https://lore.kernel.org/linux-kselftest/[email protected]/
>
> Signed-off-by: Dave Hansen <[email protected]>
> Fixes: 0d714dba1626 ("x86/fpu: Update xstate's PKRU value on write_pkru()")
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: [email protected]
> Cc: Andy Lutomirski <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Cc: Babu Moger <[email protected]>
> Cc: Dave Kleikamp <[email protected]>
> Cc: Ram Pai <[email protected]>
> Cc: Thiago Jung Bauermann <[email protected]>
> Cc: Sebastian Andrzej Siewior <[email protected]>
> ---
>
> b/arch/x86/include/asm/fpu/internal.h | 8 --
> b/arch/x86/include/asm/fpu/xstate.h | 111 +++++++++++++++++++++++++++++++---
> b/arch/x86/include/asm/processor.h | 7 ++
> b/arch/x86/kernel/cpu/common.c | 6 -
> b/arch/x86/mm/pkeys.c | 6 -
> 5 files changed, 115 insertions(+), 23 deletions(-)
>
> diff -puN arch/x86/include/asm/fpu/internal.h~write_pkru arch/x86/include/asm/fpu/internal.h
> --- a/arch/x86/include/asm/fpu/internal.h~write_pkru 2021-05-27 16:40:26.903705463 -0700
> +++ b/arch/x86/include/asm/fpu/internal.h 2021-05-27 16:40:26.919705463 -0700
> @@ -564,7 +564,6 @@ static inline void switch_fpu_prepare(st
> static inline void switch_fpu_finish(struct fpu *new_fpu)
> {
> u32 pkru_val = init_pkru_value;
> - struct pkru_state *pk;
>
> if (!static_cpu_has(X86_FEATURE_FPU))
> return;
> @@ -578,11 +577,8 @@ static inline void switch_fpu_finish(str
> * PKRU state is switched eagerly because it needs to be valid before we
> * return to userland e.g. for a copy_to_user() operation.
> */
> - if (current->mm) {
> - pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> - if (pk)
> - pkru_val = pk->pkru;
> - }
> + if (current->mm)
> + pkru_val = xstate_read_pkru(&new_fpu->state.xsave);
> __write_pkru(pkru_val);
>
> /*
> diff -puN arch/x86/include/asm/fpu/xstate.h~write_pkru arch/x86/include/asm/fpu/xstate.h
> --- a/arch/x86/include/asm/fpu/xstate.h~write_pkru 2021-05-27 16:40:26.906705463 -0700
> +++ b/arch/x86/include/asm/fpu/xstate.h 2021-05-27 16:40:26.919705463 -0700
> @@ -124,27 +124,124 @@ static inline u32 read_pkru(void)
> return 0;
> }
>
> +static inline void xfeature_mark_non_init(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + /*
> + * Caller will place data in the @xstate buffer.
> + * Mark the xfeature as non-init:
> + */
> + xstate->header.xfeatures |= BIT_ULL(xfeature_nr);
> +}
> +
> +
> +/* Set the contents of @xfeature_nr to the hardware init state */
> +static inline void xfeature_init_space(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + void *state = get_xsave_addr(xstate, xfeature_nr);
> +
> + switch (xfeature_nr) {
> + case XFEATURE_PKRU:
> + /* zero the whole state, including reserved bits */
> + memset(state, 0, sizeof(struct pkru_state));
> + break;
> + default:
> + BUG();
> + }
> +}
> +
> +/*
> + * Called when it is necessary to write to an XSAVE
> + * component feature. Guarantees that a future
> + * XRSTOR of the 'xstate' buffer will not consider
> + * @xfeature_nr to be in its init state.
> + *
> + * The returned buffer may contain old state. The
> + * caller must be prepared to fill the entire buffer.
> + *
> + * Caller must first ensure that @xfeature_nr is
> + * enabled and present in the @xstate buffer.
> + */
> +static inline void *xstatebuf_get_write_ptr(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + bool feature_was_init = xstate->header.xfeatures & BIT_ULL(xfeature_nr);
> +
> + /*
> + * xcomp_bv represents whether 'xstate' has space for
> + * features. If not, something is horribly wrong and
> + * a write would corrupt memory. Perhaps xfeature_nr
> + * was not enabled.
> + */
> + BUG_ON(!(xstate->header.xcomp_bv & BIT_ULL(xfeature_nr)));
> +
> + /*
> + * Ensure a sane xfeature_nr, including avoiding
> + * confusion with XCOMP_BV_COMPACTED_FORMAT.
> + */
> + BUG_ON(xfeature_nr >= XFEATURE_MAX);
> +
> + /* Prepare xstate for a write to the xfeature: */
> + xfeature_mark_non_init(xstate, xfeature_nr);
> +
> + /*
> + * If xfeature_nr was in the init state, update the buffer
> + * to match the state. Ensures that callers can safely
> + * write only a part of the state, they are not forced to
> + * write it in its entirety.
> + */
> + if (feature_was_init)
> + xfeature_init_space(xstate, xfeature_nr);
> +
> + return get_xsave_addr(xstate, xfeature_nr);
> +}
> +
> +/* Caller must ensure X86_FEATURE_OSPKE is enabled. */
> +static inline void xstate_write_pkru(struct xregs_state *xstate, u32 pkru)
> +{
> + struct pkru_state *pk;
> +
> + pk = xstatebuf_get_write_ptr(xstate, XFEATURE_PKRU);
> + pk->pkru = pkru;
> +}
> +
> +/*
> + * What PKRU value is represented in the 'xstate'? Note,
> + * this returns the *architecturally* represented value,
> + * not the literal in-memory value. They may be different.
> + */
> +static inline u32 xstate_read_pkru(struct xregs_state *xstate)
> +{
> + struct pkru_state *pk;
> +
> + pk = get_xsave_addr(xstate, XFEATURE_PKRU);
> + /*
> + * If present, pull PKRU out of the XSAVE buffer.
> + * Otherwise, use the hardware init value.
> + */
> + if (pk)
> + return pk->pkru;
> + else
> + return PKRU_HW_INIT_VALUE;
> +}
> +
> /*
> * Update all of the PKRU state for the current task:
> * PKRU register and PKRU xstate.
> */
> static inline void current_write_pkru(u32 pkru)
> {
> - struct pkru_state *pk;
> -
> if (!boot_cpu_has(X86_FEATURE_OSPKE))
> return;
>
> - pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU);
> -
> + fpregs_lock();
> /*
> * The PKRU value in xstate needs to be in sync with the value that is
> * written to the CPU. The FPU restore on return to userland would
> * otherwise load the previous value again.
> */
> - fpregs_lock();
> - if (pk)
> - pk->pkru = pkru;
> + xstate_write_pkru(¤t->thread.fpu.state.xsave, pkru);
> __write_pkru(pkru);
> fpregs_unlock();
> }
> diff -puN arch/x86/include/asm/processor.h~write_pkru arch/x86/include/asm/processor.h
> --- a/arch/x86/include/asm/processor.h~write_pkru 2021-05-27 16:40:26.908705463 -0700
> +++ b/arch/x86/include/asm/processor.h 2021-05-27 16:40:26.921705463 -0700
> @@ -854,4 +854,11 @@ enum mds_mitigations {
> MDS_MITIGATION_VMWERV,
> };
>
> +/*
> + * The XSAVE architecture defines an "init state" for
> + * PKRU. PKRU is set to this value by XRSTOR when it
> + * tries to restore PKRU but has on value in the buffer.
... but has *no* value ...
> + */
> +#define PKRU_HW_INIT_VALUE 0x0
> +
> #endif /* _ASM_X86_PROCESSOR_H */
> diff -puN arch/x86/kernel/cpu/common.c~write_pkru arch/x86/kernel/cpu/common.c
> --- a/arch/x86/kernel/cpu/common.c~write_pkru 2021-05-27 16:40:26.912705463 -0700
> +++ b/arch/x86/kernel/cpu/common.c 2021-05-27 16:40:26.924705463 -0700
> @@ -466,8 +466,6 @@ static bool pku_disabled;
>
> static __always_inline void setup_pku(struct cpuinfo_x86 *c)
> {
> - struct pkru_state *pk;
> -
> /* check the boot processor, plus compile options for PKU: */
> if (!cpu_feature_enabled(X86_FEATURE_PKU))
> return;
> @@ -478,9 +476,7 @@ static __always_inline void setup_pku(st
> return;
>
> cr4_set_bits(X86_CR4_PKE);
> - pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
> - if (pk)
> - pk->pkru = init_pkru_value;
> + xstate_write_pkru(¤t->thread.fpu.state.xsave, init_pkru_value);
> /*
> * Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
> * cpuid bit to be set. We need to ensure that we
> diff -puN arch/x86/mm/pkeys.c~write_pkru arch/x86/mm/pkeys.c
> --- a/arch/x86/mm/pkeys.c~write_pkru 2021-05-27 16:40:26.914705463 -0700
> +++ b/arch/x86/mm/pkeys.c 2021-05-27 16:40:26.926705463 -0700
> @@ -155,7 +155,6 @@ static ssize_t init_pkru_read_file(struc
> static ssize_t init_pkru_write_file(struct file *file,
> const char __user *user_buf, size_t count, loff_t *ppos)
> {
> - struct pkru_state *pk;
> char buf[32];
> ssize_t len;
> u32 new_init_pkru;
> @@ -178,10 +177,7 @@ static ssize_t init_pkru_write_file(stru
> return -EINVAL;
>
> WRITE_ONCE(init_pkru_value, new_init_pkru);
> - pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
> - if (!pk)
> - return -EINVAL;
> - pk->pkru = new_init_pkru;
> + xstate_write_pkru(&init_fpstate.xsave, new_init_pkru);
> return count;
> }
>
> _
>
On Fri, May 28, 2021, at 9:11 AM, Dave Hansen wrote:
> On 5/28/21 8:32 AM, Thomas Gleixner wrote:
> >>
> >> This series:
> >> * Moves the PKRU manipulation to a more appropriate location,
> >> away from the page table code
> >> * Wraps get_xsave_addr() with more structured, less error-prone
> >> interfaces.
> >> * Conditionally hides a pkey debugfs file, eliminating the need
> >> for new runtime checks to work with the new interface.
> >> * Add a selftest to make it more likely to catch bugs like this
> >> in the future. This improved selftest catches this issue on
> >> Intel CPUs. Without the improvement, it only triggers on AMD.
> > I think all of this is fundamentaly wrong.
> >
> > Contrary to FPU state, PKRU has to be updated at context switch
> > time. There is absolutely no point in having PKRU XSAVES managed.
> >
> > It's broken in several ways. Anything which clears and loads the FPU
> > will load the wrong PKRU value. Go figure...
> >
> > So the right thing is to disable PKRU in XCR0 and on sched out simply do
> >
> > task->thread.pkru = read_pkru();
> >
> > and on sched in
> >
> > write_pkru(task->thread.pkru);
> >
> > Simple, trivial and not going to be wreckaged by anything which fiddles
> > with xstates. We all know by now that xstates is a trainwreck and not
> > having stuff like that in there is making the fixes I'm doing way
> > simpler.
>
> As for the general sentiment that PKRU is not suitable for management
> with XSAVE, I'm with you.
>
> I have a few concerns about moving away from XSAVE management, though.
> I'm not nixing the whole idea, but there are some things we need to resolve.
>
> First is that there _may_ be ABI concerns.
I tend to think that, for -stable, we should fix the bug without an ABI change.
On Fri, May 28 2021 at 09:11, Dave Hansen wrote:
> On 5/28/21 8:32 AM, Thomas Gleixner wrote:
> There are also the usual concerns that folks doing user-level context
> switching or other insanity get PKRU context switching for "free" when
> it's XSAVE-managed. Moving away from that could break them.
Both issues are trivial to solve.
We can have pkru enabled in xcr0 and just do not restore it when
returning to user space (clear the mask bit).
When we restore it in sigrestore via xrstor then we read it via rdpkru
afterwards and update task->thread.pkru.
Thanks,
tglx
On Fri, May 28 2021 at 10:13, Andy Lutomirski wrote:
> On Fri, May 28, 2021, at 9:11 AM, Dave Hansen wrote:
>> I have a few concerns about moving away from XSAVE management, though.
>> I'm not nixing the whole idea, but there are some things we need to resolve.
>>
>> First is that there _may_ be ABI concerns.
>
> I tend to think that, for -stable, we should fix the bug without an ABI change.
See my other mail. It's trivial enough to do.
On Thu, May 27 2021 at 16:51, Dave Hansen wrote:
> Andy Lutomirski recently noted a probable bug in write_pkru(), but
> it was unclear if it was user-visible. A recent bug report in
> related code[1] forced me to take a look.
>
> Basically, manipulation of XSAVE state is too unstructured.
> get_xsave_addr() gives callers the impression they can read and
> write XSAVE state when there are a lot of pitfalls, like updates
> to xstate.features bits.
>
> As a result, more than one call site screws up the modification
> of PKRU in the XSAVE buffer. This series fixes that problem up
> and also hopefully carves out a less error-prone path that can
> be reused for other XSAVE features.
>
> This series:
> * Moves the PKRU manipulation to a more appropriate location,
> away from the page table code
> * Wraps get_xsave_addr() with more structured, less error-prone
> interfaces.
> * Conditionally hides a pkey debugfs file, eliminating the need
> for new runtime checks to work with the new interface.
> * Add a selftest to make it more likely to catch bugs like this
> in the future. This improved selftest catches this issue on
> Intel CPUs. Without the improvement, it only triggers on AMD.
I think all of this is fundamentaly wrong.
Contrary to FPU state, PKRU has to be updated at context switch
time. There is absolutely no point in having PKRU XSAVES managed.
It's broken in several ways. Anything which clears and loads the FPU
will load the wrong PKRU value. Go figure...
So the right thing is to disable PKRU in XCR0 and on sched out simply do
task->thread.pkru = read_pkru();
and on sched in
write_pkru(task->thread.pkru);
Simple, trivial and not going to be wreckaged by anything which fiddles
with xstates. We all know by now that xstates is a trainwreck and not
having stuff like that in there is making the fixes I'm doing way
simpler.
CET will have a similar issue, but we'll discuss that once we have the
existing horrors sorted.
Thanks,
tglx
Hi Dave,
Thanks for the patches and trying to address the issues.
I know these patches are in early stages and there is still a discussion
on different ways to address these issues. But I wanted to give a try anyway.
Tried to boot the system with these patches. But system did not come up
after this patch(#4). System fails very early in the boot process. So, I
could'nt collect much logs. It failed both on AMD and Intel machines.
I will try to collect more logs.
Thanks
Babu
On 5/27/21 6:51 PM, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> There are two points in the kernel which write to PKRU in a buggy way:
>
> * In switch_fpu_finish(), where having xfeatures[PKRU]=0 will result
> in PKRU being assigned 'init_pkru_value' instead of 0x0.
> * In write_pkru(), xfeatures[PKRU]=0 will result in PKRU having the
> correct value, but the XSAVE buffer will remain stale because
> xfeatures is not updated.
>
> Both of them screw up the fact that get_xsave_addr() will return NULL
> for PKRU when it is in the XSAVE "init state". This went unnoticed
> until now because on Intel hardware XINUSE[PKRU] is never 0 because
> of the kernel policy around 'init_pkru_value'. AMD hardware, on the
> other hand, can set XINUSE[PKRU]=0 via a normal WRPKRU(0). The
> handy selftests somewhat accidentally produced a case[2] where
> WRPKRU(0) occurs.
>
> get_xsave_addr() is a horrible interface[1], especially when used for
> writing state. It is too easy for callers to be tricked into thinking:
> 1. On a NULL return that they have no work to do
> 2. On a valid pointer return that they *can* safely write state
> without doing more work like setting an xfeatures bit.
>
> Wrap get_xsave_addr() with some additional infrastructure. Ensure
> that callers must declare their intent to read or write to the state.
> Inject the init state into both reads *and* writes. This ensures
> that writers never have to deal with detritus from previous state.
>
> The new common xstate infrastructure:
>
> xstatebuf_get_write_ptr()
> and
> xfeature_init_space()
>
> should be quite usable for other xfeatures with trivial updates to
> xfeature_init_space(). My hope is that we can move away from
> all use of get_xsave_addr(), replacing it with things like
> xstate_read_pkru().
>
> The new BUG_ON()s are not great. But, they do represent a severe
> violation of expectations and XSAVE state can be security-sensitive
> and these represent a truly dazed-and-confused situation.
>
> 1. I know, I wrote it. I'm really sorry.
> 2. https://lore.kernel.org/linux-kselftest/[email protected]/
>
> Signed-off-by: Dave Hansen <[email protected]>
> Fixes: 0d714dba1626 ("x86/fpu: Update xstate's PKRU value on write_pkru()")
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: [email protected]
> Cc: Andy Lutomirski <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Cc: Babu Moger <[email protected]>
> Cc: Dave Kleikamp <[email protected]>
> Cc: Ram Pai <[email protected]>
> Cc: Thiago Jung Bauermann <[email protected]>
> Cc: Sebastian Andrzej Siewior <[email protected]>
> ---
>
> b/arch/x86/include/asm/fpu/internal.h | 8 --
> b/arch/x86/include/asm/fpu/xstate.h | 111 +++++++++++++++++++++++++++++++---
> b/arch/x86/include/asm/processor.h | 7 ++
> b/arch/x86/kernel/cpu/common.c | 6 -
> b/arch/x86/mm/pkeys.c | 6 -
> 5 files changed, 115 insertions(+), 23 deletions(-)
>
> diff -puN arch/x86/include/asm/fpu/internal.h~write_pkru arch/x86/include/asm/fpu/internal.h
> --- a/arch/x86/include/asm/fpu/internal.h~write_pkru 2021-05-27 16:40:26.903705463 -0700
> +++ b/arch/x86/include/asm/fpu/internal.h 2021-05-27 16:40:26.919705463 -0700
> @@ -564,7 +564,6 @@ static inline void switch_fpu_prepare(st
> static inline void switch_fpu_finish(struct fpu *new_fpu)
> {
> u32 pkru_val = init_pkru_value;
> - struct pkru_state *pk;
>
> if (!static_cpu_has(X86_FEATURE_FPU))
> return;
> @@ -578,11 +577,8 @@ static inline void switch_fpu_finish(str
> * PKRU state is switched eagerly because it needs to be valid before we
> * return to userland e.g. for a copy_to_user() operation.
> */
> - if (current->mm) {
> - pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> - if (pk)
> - pkru_val = pk->pkru;
> - }
> + if (current->mm)
> + pkru_val = xstate_read_pkru(&new_fpu->state.xsave);
> __write_pkru(pkru_val);
>
> /*
> diff -puN arch/x86/include/asm/fpu/xstate.h~write_pkru arch/x86/include/asm/fpu/xstate.h
> --- a/arch/x86/include/asm/fpu/xstate.h~write_pkru 2021-05-27 16:40:26.906705463 -0700
> +++ b/arch/x86/include/asm/fpu/xstate.h 2021-05-27 16:40:26.919705463 -0700
> @@ -124,27 +124,124 @@ static inline u32 read_pkru(void)
> return 0;
> }
>
> +static inline void xfeature_mark_non_init(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + /*
> + * Caller will place data in the @xstate buffer.
> + * Mark the xfeature as non-init:
> + */
> + xstate->header.xfeatures |= BIT_ULL(xfeature_nr);
> +}
> +
> +
> +/* Set the contents of @xfeature_nr to the hardware init state */
> +static inline void xfeature_init_space(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + void *state = get_xsave_addr(xstate, xfeature_nr);
> +
> + switch (xfeature_nr) {
> + case XFEATURE_PKRU:
> + /* zero the whole state, including reserved bits */
> + memset(state, 0, sizeof(struct pkru_state));
> + break;
> + default:
> + BUG();
> + }
> +}
> +
> +/*
> + * Called when it is necessary to write to an XSAVE
> + * component feature. Guarantees that a future
> + * XRSTOR of the 'xstate' buffer will not consider
> + * @xfeature_nr to be in its init state.
> + *
> + * The returned buffer may contain old state. The
> + * caller must be prepared to fill the entire buffer.
> + *
> + * Caller must first ensure that @xfeature_nr is
> + * enabled and present in the @xstate buffer.
> + */
> +static inline void *xstatebuf_get_write_ptr(struct xregs_state *xstate,
> + int xfeature_nr)
> +{
> + bool feature_was_init = xstate->header.xfeatures & BIT_ULL(xfeature_nr);
> +
> + /*
> + * xcomp_bv represents whether 'xstate' has space for
> + * features. If not, something is horribly wrong and
> + * a write would corrupt memory. Perhaps xfeature_nr
> + * was not enabled.
> + */
> + BUG_ON(!(xstate->header.xcomp_bv & BIT_ULL(xfeature_nr)));
> +
> + /*
> + * Ensure a sane xfeature_nr, including avoiding
> + * confusion with XCOMP_BV_COMPACTED_FORMAT.
> + */
> + BUG_ON(xfeature_nr >= XFEATURE_MAX);
> +
> + /* Prepare xstate for a write to the xfeature: */
> + xfeature_mark_non_init(xstate, xfeature_nr);
> +
> + /*
> + * If xfeature_nr was in the init state, update the buffer
> + * to match the state. Ensures that callers can safely
> + * write only a part of the state, they are not forced to
> + * write it in its entirety.
> + */
> + if (feature_was_init)
> + xfeature_init_space(xstate, xfeature_nr);
> +
> + return get_xsave_addr(xstate, xfeature_nr);
> +}
> +
> +/* Caller must ensure X86_FEATURE_OSPKE is enabled. */
> +static inline void xstate_write_pkru(struct xregs_state *xstate, u32 pkru)
> +{
> + struct pkru_state *pk;
> +
> + pk = xstatebuf_get_write_ptr(xstate, XFEATURE_PKRU);
> + pk->pkru = pkru;
> +}
> +
> +/*
> + * What PKRU value is represented in the 'xstate'? Note,
> + * this returns the *architecturally* represented value,
> + * not the literal in-memory value. They may be different.
> + */
> +static inline u32 xstate_read_pkru(struct xregs_state *xstate)
> +{
> + struct pkru_state *pk;
> +
> + pk = get_xsave_addr(xstate, XFEATURE_PKRU);
> + /*
> + * If present, pull PKRU out of the XSAVE buffer.
> + * Otherwise, use the hardware init value.
> + */
> + if (pk)
> + return pk->pkru;
> + else
> + return PKRU_HW_INIT_VALUE;
> +}
> +
> /*
> * Update all of the PKRU state for the current task:
> * PKRU register and PKRU xstate.
> */
> static inline void current_write_pkru(u32 pkru)
> {
> - struct pkru_state *pk;
> -
> if (!boot_cpu_has(X86_FEATURE_OSPKE))
> return;
>
> - pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU);
> -
> + fpregs_lock();
> /*
> * The PKRU value in xstate needs to be in sync with the value that is
> * written to the CPU. The FPU restore on return to userland would
> * otherwise load the previous value again.
> */
> - fpregs_lock();
> - if (pk)
> - pk->pkru = pkru;
> + xstate_write_pkru(¤t->thread.fpu.state.xsave, pkru);
> __write_pkru(pkru);
> fpregs_unlock();
> }
> diff -puN arch/x86/include/asm/processor.h~write_pkru arch/x86/include/asm/processor.h
> --- a/arch/x86/include/asm/processor.h~write_pkru 2021-05-27 16:40:26.908705463 -0700
> +++ b/arch/x86/include/asm/processor.h 2021-05-27 16:40:26.921705463 -0700
> @@ -854,4 +854,11 @@ enum mds_mitigations {
> MDS_MITIGATION_VMWERV,
> };
>
> +/*
> + * The XSAVE architecture defines an "init state" for
> + * PKRU. PKRU is set to this value by XRSTOR when it
> + * tries to restore PKRU but has on value in the buffer.
> + */
> +#define PKRU_HW_INIT_VALUE 0x0
> +
> #endif /* _ASM_X86_PROCESSOR_H */
> diff -puN arch/x86/kernel/cpu/common.c~write_pkru arch/x86/kernel/cpu/common.c
> --- a/arch/x86/kernel/cpu/common.c~write_pkru 2021-05-27 16:40:26.912705463 -0700
> +++ b/arch/x86/kernel/cpu/common.c 2021-05-27 16:40:26.924705463 -0700
> @@ -466,8 +466,6 @@ static bool pku_disabled;
>
> static __always_inline void setup_pku(struct cpuinfo_x86 *c)
> {
> - struct pkru_state *pk;
> -
> /* check the boot processor, plus compile options for PKU: */
> if (!cpu_feature_enabled(X86_FEATURE_PKU))
> return;
> @@ -478,9 +476,7 @@ static __always_inline void setup_pku(st
> return;
>
> cr4_set_bits(X86_CR4_PKE);
> - pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
> - if (pk)
> - pk->pkru = init_pkru_value;
> + xstate_write_pkru(¤t->thread.fpu.state.xsave, init_pkru_value);
> /*
> * Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
> * cpuid bit to be set. We need to ensure that we
> diff -puN arch/x86/mm/pkeys.c~write_pkru arch/x86/mm/pkeys.c
> --- a/arch/x86/mm/pkeys.c~write_pkru 2021-05-27 16:40:26.914705463 -0700
> +++ b/arch/x86/mm/pkeys.c 2021-05-27 16:40:26.926705463 -0700
> @@ -155,7 +155,6 @@ static ssize_t init_pkru_read_file(struc
> static ssize_t init_pkru_write_file(struct file *file,
> const char __user *user_buf, size_t count, loff_t *ppos)
> {
> - struct pkru_state *pk;
> char buf[32];
> ssize_t len;
> u32 new_init_pkru;
> @@ -178,10 +177,7 @@ static ssize_t init_pkru_write_file(stru
> return -EINVAL;
>
> WRITE_ONCE(init_pkru_value, new_init_pkru);
> - pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
> - if (!pk)
> - return -EINVAL;
> - pk->pkru = new_init_pkru;
> + xstate_write_pkru(&init_fpstate.xsave, new_init_pkru);
> return count;
> }
>
> _
>
On 6/1/21 3:43 PM, Dave Kleikamp wrote:
> On 6/1/21 4:54 PM, Babu Moger wrote:
>> Hi Dave,
>> Thanks for the patches and trying to address the issues.
>>
>> I know these patches are in early stages and there is still a discussion
>> on different ways to address these issues. But I wanted to give a try
>> anyway.
>>
>> Tried to boot the system with these patches. But system did not come up
>> after this patch(#4). System fails very early in the boot process. So, I
>> could'nt collect much logs. It failed both on AMD and Intel machines.
>> I will try to collect more logs.
>> Thanks
>> Babu
>
> Same here. Hitting this line in arch/x86/include/asm/fpu/xstate.h
>
> BUG_ON(!(xstate->header.xcomp_bv & BIT_ULL(xfeature_nr)));
>
> [ 0.384929] kernel BUG at arch/x86/include/asm/fpu/xstate.h:177!
Thanks for the report. I am, indeed, reworking this code a bit. I'll
pay close attention if this code remains in some form.
On 6/1/21 4:54 PM, Babu Moger wrote:
> Hi Dave,
> Thanks for the patches and trying to address the issues.
>
> I know these patches are in early stages and there is still a discussion
> on different ways to address these issues. But I wanted to give a try anyway.
>
> Tried to boot the system with these patches. But system did not come up
> after this patch(#4). System fails very early in the boot process. So, I
> could'nt collect much logs. It failed both on AMD and Intel machines.
> I will try to collect more logs.
> Thanks
> Babu
Same here. Hitting this line in arch/x86/include/asm/fpu/xstate.h
BUG_ON(!(xstate->header.xcomp_bv & BIT_ULL(xfeature_nr)));
[ 0.384929] kernel BUG at arch/x86/include/asm/fpu/xstate.h:177!
[ 0.385529] invalid opcode: 0000 [#1] SMP NOPTI
[ 0.386519] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc4+ #1
[ 0.386519] RIP: 0010:identify_cpu+0x5ee/0x5f0
[ 0.386519] Code: ff 0f 0b ff 14 25 e0 72 e4 bc eb 84 48 89 cf be 09 00 00 00 48 89 4d c0 e8 ef 97 ff ff 48 8b 4d c0 48 c7 00 00 00 00 00 eb a5 <0f> 0b 48 8b 05 01 3e b7 01 48 c1 e8 1d 83 e0 01 74 0b 55 48 89 e5
[ 0.386519] RSP: 0000:ffffffffbce03e28 EFLAGS: 00010246
[ 0.386519] RAX: ffffffffbce14940 RBX: ffffffffbd3c0220 RCX: ffffffffbce15e80
[ 0.386519] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000770eb0
[ 0.386519] RBP: ffffffffbce03e68 R08: 0000000000000000 R09: c0000000fffeffff
[ 0.386519] R10: 0000000000000001 R11: ffffffffbce03bf8 R12: ffffffffbd6bcd00
[ 0.386519] R13: ffffffffbd6bcd58 R14: 0000000000000000 R15: 0000000000000246
[ 0.386519] FS: 0000000000000000(0000) GS:ffff91280d200000(0000) knlGS:0000000000000000
[ 0.386519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.386519] CR2: ffff924b3fbff000 CR3: 0000017261a0a001 CR4: 0000000000770eb0
[ 0.386519] Call Trace:
[ 0.386519] identify_boot_cpu+0x10/0x9a
[ 0.386519] check_bugs+0x2a/0xa19
[ 0.386519] start_kernel+0x4c7/0x4fa
[ 0.386519] x86_64_start_reservations+0x32/0x34
[ 0.386519] x86_64_start_kernel+0x8a/0x8d
[ 0.386519] secondary_startup_64_no_verify+0xc2/0xcb
[ 0.386519] Modules linked in:
[ 0.386521] ---[ end trace 12db536e6a291746 ]---
[ 0.387520] RIP: 0010:identify_cpu+0x5ee/0x5f0
[ 0.388519] Code: ff 0f 0b ff 14 25 e0 72 e4 bc eb 84 48 89 cf be 09 00 00 00 48 89 4d c0 e8 ef 97 ff ff 48 8b 4d c0 48 c7 00 00 00 00 00 eb a5 <0f> 0b 48 8b 05 01 3e b7 01 48 c1 e8 1d 83 e0 01 74 0b 55 48 89 e5
[ 0.389519] RSP: 0000:ffffffffbce03e28 EFLAGS: 00010246
[ 0.390519] RAX: ffffffffbce14940 RBX: ffffffffbd3c0220 RCX: ffffffffbce15e80
[ 0.391519] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000770eb0
[ 0.392519] RBP: ffffffffbce03e68 R08: 0000000000000000 R09: c0000000fffeffff
[ 0.393519] R10: 0000000000000001 R11: ffffffffbce03bf8 R12: ffffffffbd6bcd00
[ 0.394519] R13: ffffffffbd6bcd58 R14: 0000000000000000 R15: 0000000000000246
[ 0.395519] FS: 0000000000000000(0000) GS:ffff91280d200000(0000) knlGS:0000000000000000
[ 0.396519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.397519] CR2: ffff924b3fbff000 CR3: 0000017261a0a001 CR4: 0000000000770eb0
[ 0.398520] Kernel panic - not syncing: Fatal exception
[ 0.399519] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> On 5/27/21 6:51 PM, Dave Hansen wrote:
>>
>> From: Dave Hansen <[email protected]>
>>
>> There are two points in the kernel which write to PKRU in a buggy way:
>>
>> * In switch_fpu_finish(), where having xfeatures[PKRU]=0 will result
>> in PKRU being assigned 'init_pkru_value' instead of 0x0.
>> * In write_pkru(), xfeatures[PKRU]=0 will result in PKRU having the
>> correct value, but the XSAVE buffer will remain stale because
>> xfeatures is not updated.
>>
>> Both of them screw up the fact that get_xsave_addr() will return NULL
>> for PKRU when it is in the XSAVE "init state". This went unnoticed
>> until now because on Intel hardware XINUSE[PKRU] is never 0 because
>> of the kernel policy around 'init_pkru_value'. AMD hardware, on the
>> other hand, can set XINUSE[PKRU]=0 via a normal WRPKRU(0). The
>> handy selftests somewhat accidentally produced a case[2] where
>> WRPKRU(0) occurs.
>>
>> get_xsave_addr() is a horrible interface[1], especially when used for
>> writing state. It is too easy for callers to be tricked into thinking:
>> 1. On a NULL return that they have no work to do
>> 2. On a valid pointer return that they *can* safely write state
>> without doing more work like setting an xfeatures bit.
>>
>> Wrap get_xsave_addr() with some additional infrastructure. Ensure
>> that callers must declare their intent to read or write to the state.
>> Inject the init state into both reads *and* writes. This ensures
>> that writers never have to deal with detritus from previous state.
>>
>> The new common xstate infrastructure:
>>
>> xstatebuf_get_write_ptr()
>> and
>> xfeature_init_space()
>>
>> should be quite usable for other xfeatures with trivial updates to
>> xfeature_init_space(). My hope is that we can move away from
>> all use of get_xsave_addr(), replacing it with things like
>> xstate_read_pkru().
>>
>> The new BUG_ON()s are not great. But, they do represent a severe
>> violation of expectations and XSAVE state can be security-sensitive
>> and these represent a truly dazed-and-confused situation.
>>
>> 1. I know, I wrote it. I'm really sorry.
>> 2. https://lore.kernel.org/linux-kselftest/[email protected]/
>>
>> Signed-off-by: Dave Hansen <[email protected]>
>> Fixes: 0d714dba1626 ("x86/fpu: Update xstate's PKRU value on write_pkru()")
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: [email protected]
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Shuah Khan <[email protected]>
>> Cc: Babu Moger <[email protected]>
>> Cc: Dave Kleikamp <[email protected]>
>> Cc: Ram Pai <[email protected]>
>> Cc: Thiago Jung Bauermann <[email protected]>
>> Cc: Sebastian Andrzej Siewior <[email protected]>
>> ---
>>
>> b/arch/x86/include/asm/fpu/internal.h | 8 --
>> b/arch/x86/include/asm/fpu/xstate.h | 111 +++++++++++++++++++++++++++++++---
>> b/arch/x86/include/asm/processor.h | 7 ++
>> b/arch/x86/kernel/cpu/common.c | 6 -
>> b/arch/x86/mm/pkeys.c | 6 -
>> 5 files changed, 115 insertions(+), 23 deletions(-)
>>
>> diff -puN arch/x86/include/asm/fpu/internal.h~write_pkru arch/x86/include/asm/fpu/internal.h
>> --- a/arch/x86/include/asm/fpu/internal.h~write_pkru 2021-05-27 16:40:26.903705463 -0700
>> +++ b/arch/x86/include/asm/fpu/internal.h 2021-05-27 16:40:26.919705463 -0700
>> @@ -564,7 +564,6 @@ static inline void switch_fpu_prepare(st
>> static inline void switch_fpu_finish(struct fpu *new_fpu)
>> {
>> u32 pkru_val = init_pkru_value;
>> - struct pkru_state *pk;
>>
>> if (!static_cpu_has(X86_FEATURE_FPU))
>> return;
>> @@ -578,11 +577,8 @@ static inline void switch_fpu_finish(str
>> * PKRU state is switched eagerly because it needs to be valid before we
>> * return to userland e.g. for a copy_to_user() operation.
>> */
>> - if (current->mm) {
>> - pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
>> - if (pk)
>> - pkru_val = pk->pkru;
>> - }
>> + if (current->mm)
>> + pkru_val = xstate_read_pkru(&new_fpu->state.xsave);
>> __write_pkru(pkru_val);
>>
>> /*
>> diff -puN arch/x86/include/asm/fpu/xstate.h~write_pkru arch/x86/include/asm/fpu/xstate.h
>> --- a/arch/x86/include/asm/fpu/xstate.h~write_pkru 2021-05-27 16:40:26.906705463 -0700
>> +++ b/arch/x86/include/asm/fpu/xstate.h 2021-05-27 16:40:26.919705463 -0700
>> @@ -124,27 +124,124 @@ static inline u32 read_pkru(void)
>> return 0;
>> }
>>
>> +static inline void xfeature_mark_non_init(struct xregs_state *xstate,
>> + int xfeature_nr)
>> +{
>> + /*
>> + * Caller will place data in the @xstate buffer.
>> + * Mark the xfeature as non-init:
>> + */
>> + xstate->header.xfeatures |= BIT_ULL(xfeature_nr);
>> +}
>> +
>> +
>> +/* Set the contents of @xfeature_nr to the hardware init state */
>> +static inline void xfeature_init_space(struct xregs_state *xstate,
>> + int xfeature_nr)
>> +{
>> + void *state = get_xsave_addr(xstate, xfeature_nr);
>> +
>> + switch (xfeature_nr) {
>> + case XFEATURE_PKRU:
>> + /* zero the whole state, including reserved bits */
>> + memset(state, 0, sizeof(struct pkru_state));
>> + break;
>> + default:
>> + BUG();
>> + }
>> +}
>> +
>> +/*
>> + * Called when it is necessary to write to an XSAVE
>> + * component feature. Guarantees that a future
>> + * XRSTOR of the 'xstate' buffer will not consider
>> + * @xfeature_nr to be in its init state.
>> + *
>> + * The returned buffer may contain old state. The
>> + * caller must be prepared to fill the entire buffer.
>> + *
>> + * Caller must first ensure that @xfeature_nr is
>> + * enabled and present in the @xstate buffer.
>> + */
>> +static inline void *xstatebuf_get_write_ptr(struct xregs_state *xstate,
>> + int xfeature_nr)
>> +{
>> + bool feature_was_init = xstate->header.xfeatures & BIT_ULL(xfeature_nr);
>> +
>> + /*
>> + * xcomp_bv represents whether 'xstate' has space for
>> + * features. If not, something is horribly wrong and
>> + * a write would corrupt memory. Perhaps xfeature_nr
>> + * was not enabled.
>> + */
>> + BUG_ON(!(xstate->header.xcomp_bv & BIT_ULL(xfeature_nr)));
>> +
>> + /*
>> + * Ensure a sane xfeature_nr, including avoiding
>> + * confusion with XCOMP_BV_COMPACTED_FORMAT.
>> + */
>> + BUG_ON(xfeature_nr >= XFEATURE_MAX);
>> +
>> + /* Prepare xstate for a write to the xfeature: */
>> + xfeature_mark_non_init(xstate, xfeature_nr);
>> +
>> + /*
>> + * If xfeature_nr was in the init state, update the buffer
>> + * to match the state. Ensures that callers can safely
>> + * write only a part of the state, they are not forced to
>> + * write it in its entirety.
>> + */
>> + if (feature_was_init)
>> + xfeature_init_space(xstate, xfeature_nr);
>> +
>> + return get_xsave_addr(xstate, xfeature_nr);
>> +}
>> +
>> +/* Caller must ensure X86_FEATURE_OSPKE is enabled. */
>> +static inline void xstate_write_pkru(struct xregs_state *xstate, u32 pkru)
>> +{
>> + struct pkru_state *pk;
>> +
>> + pk = xstatebuf_get_write_ptr(xstate, XFEATURE_PKRU);
>> + pk->pkru = pkru;
>> +}
>> +
>> +/*
>> + * What PKRU value is represented in the 'xstate'? Note,
>> + * this returns the *architecturally* represented value,
>> + * not the literal in-memory value. They may be different.
>> + */
>> +static inline u32 xstate_read_pkru(struct xregs_state *xstate)
>> +{
>> + struct pkru_state *pk;
>> +
>> + pk = get_xsave_addr(xstate, XFEATURE_PKRU);
>> + /*
>> + * If present, pull PKRU out of the XSAVE buffer.
>> + * Otherwise, use the hardware init value.
>> + */
>> + if (pk)
>> + return pk->pkru;
>> + else
>> + return PKRU_HW_INIT_VALUE;
>> +}
>> +
>> /*
>> * Update all of the PKRU state for the current task:
>> * PKRU register and PKRU xstate.
>> */
>> static inline void current_write_pkru(u32 pkru)
>> {
>> - struct pkru_state *pk;
>> -
>> if (!boot_cpu_has(X86_FEATURE_OSPKE))
>> return;
>>
>> - pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU);
>> -
>> + fpregs_lock();
>> /*
>> * The PKRU value in xstate needs to be in sync with the value that is
>> * written to the CPU. The FPU restore on return to userland would
>> * otherwise load the previous value again.
>> */
>> - fpregs_lock();
>> - if (pk)
>> - pk->pkru = pkru;
>> + xstate_write_pkru(¤t->thread.fpu.state.xsave, pkru);
>> __write_pkru(pkru);
>> fpregs_unlock();
>> }
>> diff -puN arch/x86/include/asm/processor.h~write_pkru arch/x86/include/asm/processor.h
>> --- a/arch/x86/include/asm/processor.h~write_pkru 2021-05-27 16:40:26.908705463 -0700
>> +++ b/arch/x86/include/asm/processor.h 2021-05-27 16:40:26.921705463 -0700
>> @@ -854,4 +854,11 @@ enum mds_mitigations {
>> MDS_MITIGATION_VMWERV,
>> };
>>
>> +/*
>> + * The XSAVE architecture defines an "init state" for
>> + * PKRU. PKRU is set to this value by XRSTOR when it
>> + * tries to restore PKRU but has on value in the buffer.
>> + */
>> +#define PKRU_HW_INIT_VALUE 0x0
>> +
>> #endif /* _ASM_X86_PROCESSOR_H */
>> diff -puN arch/x86/kernel/cpu/common.c~write_pkru arch/x86/kernel/cpu/common.c
>> --- a/arch/x86/kernel/cpu/common.c~write_pkru 2021-05-27 16:40:26.912705463 -0700
>> +++ b/arch/x86/kernel/cpu/common.c 2021-05-27 16:40:26.924705463 -0700
>> @@ -466,8 +466,6 @@ static bool pku_disabled;
>>
>> static __always_inline void setup_pku(struct cpuinfo_x86 *c)
>> {
>> - struct pkru_state *pk;
>> -
>> /* check the boot processor, plus compile options for PKU: */
>> if (!cpu_feature_enabled(X86_FEATURE_PKU))
>> return;
>> @@ -478,9 +476,7 @@ static __always_inline void setup_pku(st
>> return;
>>
>> cr4_set_bits(X86_CR4_PKE);
>> - pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
>> - if (pk)
>> - pk->pkru = init_pkru_value;
>> + xstate_write_pkru(¤t->thread.fpu.state.xsave, init_pkru_value);
>> /*
>> * Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
>> * cpuid bit to be set. We need to ensure that we
>> diff -puN arch/x86/mm/pkeys.c~write_pkru arch/x86/mm/pkeys.c
>> --- a/arch/x86/mm/pkeys.c~write_pkru 2021-05-27 16:40:26.914705463 -0700
>> +++ b/arch/x86/mm/pkeys.c 2021-05-27 16:40:26.926705463 -0700
>> @@ -155,7 +155,6 @@ static ssize_t init_pkru_read_file(struc
>> static ssize_t init_pkru_write_file(struct file *file,
>> const char __user *user_buf, size_t count, loff_t *ppos)
>> {
>> - struct pkru_state *pk;
>> char buf[32];
>> ssize_t len;
>> u32 new_init_pkru;
>> @@ -178,10 +177,7 @@ static ssize_t init_pkru_write_file(stru
>> return -EINVAL;
>>
>> WRITE_ONCE(init_pkru_value, new_init_pkru);
>> - pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
>> - if (!pk)
>> - return -EINVAL;
>> - pk->pkru = new_init_pkru;
>> + xstate_write_pkru(&init_fpstate.xsave, new_init_pkru);
>> return count;
>> }
>>
>> _
>>