2021-05-23 19:39:38

by Chang S. Bae

Subject: [PATCH v5 00/28] x86: Support Intel Advanced Matrix Extensions

Intel Advanced Matrix Extensions (AMX)[1][2] will be shipping on servers
soon. AMX consists of configurable TMM "TILE" registers plus new
accelerator instructions that operate on them. TMUL (Tile matrix MULtiply)
is the first accelerator instruction set to use the new registers, and we
anticipate additional instructions in the future.

Neither AMX state nor TMUL instructions depend on AVX. However, AMX and
AVX do share common challenges. The TMM registers are 8KB today, and can
architecturally grow as large as 64KB, which merits updates to hardware and
software state management.

Further, both technologies run faster when they are not simultaneously
running on SMT siblings, and both technologies' use of power and bandwidth
impacts the power and performance available to neighboring cores. (This
impact has measurably improved in recent hardware.)

If the existing kernel approach for managing XSAVE state were applied to
AMX, 8KB would be added to every task's buffer even though it might rarely
be used. So Linux support is optimized by using a new XSAVE feature:
eXtended Feature Disabling (XFD). The kernel arms XFD to deliver a #NM
exception upon a task's first access to TILE state. The kernel exception
handler installs the appropriate XSAVE context-switch buffer, and the task
behaves as if the kernel had done that for all tasks. Using XFD, AMX space
is allocated only when needed, eliminating the memory waste for unused
state components.
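
For illustration only -- this is not code from the series -- a minimal
sketch of that flow. The helper names, the mask argument, and the
MSR_IA32_XFD define are placeholders; only alloc_xstate_buffer() comes
from the series (PATCH 09), and the real arming/error handling lives in
the patches themselves:

static void xfd_arm(u64 dynamic_mask)
{
	/* Disable the dynamic features so their first use raises #NM. */
	wrmsrl(MSR_IA32_XFD, dynamic_mask);
}

static void handle_xfd_nm(struct fpu *fpu, u64 touched_mask)
{
	/* Grow the per-task XSAVE buffer to cover the touched state. */
	if (alloc_xstate_buffer(fpu, touched_mask))
		return;	/* allocation failed; the caller raises a signal */

	/* Allow this task to use the feature from now on. */
	wrmsrl(MSR_IA32_XFD, 0);
}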

This series requires glibc 2.34 or later [8][9], or the new minimum signal
stack size export [3]. The patches are based on mainline (5.13-rc2).
The series is composed of three parts:
* Patch 01-18: Foundation to support dynamic user state management
* Patch 19-23: AMX enablement, including unit tests
* Patch 24-28: Optimizations for signal frame and for power/performance

Note: the per-task system call with GET/PUT semantics in PATCH 15 is
currently under active discussion on LKML [10]. This is its v1 proposal.
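
As a purely hypothetical userspace sketch of the GET/PUT idea: the command
names come from the PATCH 15 title, while the mask argument, the
XFEATURE_MASK_XTILE value, and the availability of the defines in the
patched <asm/prctl.h> are assumptions, and the final semantics may change
with the LKML discussion:

#include <asm/prctl.h>		/* assumed to carry the new commands */
#include <sys/syscall.h>
#include <unistd.h>

static int use_amx_region(void)
{
	/* Ask the kernel to permit and pre-allocate the TILE state. */
	if (syscall(SYS_arch_prctl, ARCH_GET_XSTATE, XFEATURE_MASK_XTILE))
		return -1;

	/* ... AMX/TMUL work happens here ... */

	/* Tell the kernel this task is done with the TILE state. */
	return syscall(SYS_arch_prctl, ARCH_PUT_XSTATE, XFEATURE_MASK_XTILE);
}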

Thanks to Len Brown and Dave Hansen for help with the cover letter.

Changes from v4 [7]:
* Changed the buffer expansion policy to the access-request based approach
from the transparent #NM-based approach. (Andy Lutomirski, Thomas
Gleixner, et al.)
* Removed the boot parameter patch. (Thomas Gleixner)
* Included code to explicitly initialize AMX state during a context switch.
(Thomas Gleixner)
* Added a new arch_prctl to pre-allocate a buffer for dynamic state. (Andy
Lutomirski)
* Updated the fork() path to initialize all the AMX state.
* Improved ptracer's dynamic user state injection path.
* Added an optimization to skip tile data in the sigframe when an AMX
thread's state is initialized.
* Updated to treat the mismatched state size as an error. (Thomas Gleixner)
* Simplified the xstate feature check routine. (Thomas Gleixner)
* Simplified and updated the selftest.
* Updated some changelogs. (Thomas Gleixner)
* Updated a function description. (Borislav Petkov)

Changes from v3 [6]:
* Updated some commit messages and code comments. (Borislav Petkov)
* Added and removed some helpers. (Borislav Petkov)
* Revised the buffer allocation function. (Borislav Petkov)
* Simplified buffer access. (Borislav Petkov)
* Re-organized some code changes to be more reviewable. (PATCH 9/10)
* Reverted unnecessary changes. (PATCH4)
* Fixed typo in the documentation. (Randy Dunlap)

Changes from v2 [5]:
* Removed the patch for the tile data inheritance. Also, updated the
selftest patch. (Andy Lutomirski)
* Changed to taint the kernel when any unknown state is enabled. (Andy
Lutomirski)
* Changed to use the XFD feature only when the compacted format is in use.
* Improved the test code.
* Simplified the cmdline handling.
* Removed 'task->fpu' in changelogs. (Boris Petkov)
* Updated the variable name / comments / changelogs for clarification.

Changes from v1 [4]:
* Added vmalloc() error tracing (Dave Hansen, PeterZ, and Andy Lutomirski)
* Inlined the #NM handling code (Andy Lutomirski)
* Made signal handling optimization revertible
* Revised the new parameter handling code (Andy Lutomirski and Dave Hansen)
* Rebased on the upstream kernel

[1]: Intel Architecture Instruction Set Extension Programming Reference
February 2021, https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
[2]: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html
[3]: https://lore.kernel.org/lkml/[email protected]/
[4]: https://lore.kernel.org/lkml/[email protected]/
[5]: https://lore.kernel.org/lkml/[email protected]/
[6]: https://lore.kernel.org/lkml/[email protected]/
[7]: https://lore.kernel.org/lkml/[email protected]/
[8]: https://sourceware.org/git/?p=glibc.git;a=commit;h=6c57d320484988e87e446e2e60ce42816bf51d53
[9]: https://sourceware.org/git/?p=glibc.git;a=blob;f=NEWS;h=aa0f10a891f8f9b4e6f0f6d25b6a307898c07d82;hb=HEAD#l12
[10]: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com/

Chang S. Bae (28):
x86/fpu/xstate: Modify the initialization helper to handle both static
and dynamic buffers
x86/fpu/xstate: Modify state copy helpers to handle both static and
dynamic buffers
x86/fpu/xstate: Modify address finders to handle both static and
dynamic buffers
x86/fpu/xstate: Modify the context restore helper to handle both
static and dynamic buffers
x86/fpu/xstate: Add a new variable to indicate dynamic user states
x86/fpu/xstate: Add new variables to indicate dynamic xstate buffer
size
x86/fpu/xstate: Calculate and remember dynamic xstate buffer sizes
x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer
x86/fpu/xstate: Introduce helpers to manage the xstate buffer
dynamically
x86/fpu/xstate: Define the scope of the initial xstate data
x86/fpu/xstate: Update the xstate save function to support dynamic
states
x86/fpu/xstate: Update the xstate buffer address finder to support
dynamic states
x86/fpu/xstate: Update the xstate context copy function to support
dynamic states
x86/fpu/xstate: Prevent unauthorised use of dynamic user state
x86/arch_prctl: Create ARCH_GET_XSTATE/ARCH_PUT_XSTATE
x86/fpu/xstate: Support ptracer-induced xstate buffer expansion
x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in
state component numbers
x86/fpu/xstate: Disable xstate support if an inconsistent state is
detected
x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature
bits
x86/fpu/amx: Define AMX state components and have it used for
boot-time checks
x86/fpu/amx: Initialize child's AMX state
x86/fpu/amx: Enable the AMX feature in 64-bit mode
selftest/x86/amx: Test cases for the AMX state management
x86/fpu/xstate: Use per-task xstate mask for saving xstate in signal
frame
x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user
states if in INIT-state
selftest/x86/amx: Test case for AMX state copy optimization in signal
delivery
x86/insn/amx: Add TILERELEASE instruction to the opcode map
x86/fpu/amx: Clear the AMX state when appropriate

arch/x86/include/asm/cpufeatures.h | 4 +
arch/x86/include/asm/fpu/internal.h | 93 +++-
arch/x86/include/asm/fpu/types.h | 69 ++-
arch/x86/include/asm/fpu/xstate.h | 42 +-
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/processor.h | 10 +-
arch/x86/include/asm/proto.h | 2 +-
arch/x86/include/asm/special_insns.h | 6 +
arch/x86/include/asm/trace/fpu.h | 9 +-
arch/x86/include/uapi/asm/prctl.h | 3 +
arch/x86/kernel/cpu/common.c | 2 +-
arch/x86/kernel/cpu/cpuid-deps.c | 4 +
arch/x86/kernel/fpu/core.c | 100 +++-
arch/x86/kernel/fpu/init.c | 42 +-
arch/x86/kernel/fpu/regset.c | 66 ++-
arch/x86/kernel/fpu/signal.c | 61 ++-
arch/x86/kernel/fpu/xstate.c | 725 +++++++++++++++++++++----
arch/x86/kernel/process.c | 20 +-
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kernel/traps.c | 12 +
arch/x86/kvm/x86.c | 48 +-
arch/x86/lib/x86-opcode-map.txt | 8 +-
arch/x86/math-emu/fpu_aux.c | 2 +-
arch/x86/math-emu/fpu_entry.c | 4 +-
arch/x86/math-emu/fpu_system.h | 2 +-
arch/x86/mm/pkeys.c | 2 +-
tools/arch/x86/lib/x86-opcode-map.txt | 8 +-
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/amx.c | 738 ++++++++++++++++++++++++++
31 files changed, 1825 insertions(+), 267 deletions(-)
create mode 100644 tools/testing/selftests/x86/amx.c


base-commit: d07f6ca923ea0927a1024dfccafc5b53b61cfecc
--
2.17.1


2021-05-23 19:39:40

by Chang S. Bae

Subject: [PATCH v5 10/28] x86/fpu/xstate: Define the scope of the initial xstate data

init_fpstate is used to record the initial xstate value and covers all the
states. But it is wasteful to cover large states that carry only trivial
initial data.

Limit init_fpstate by clarifying its size and coverage to everything except
the dynamic user states. The dynamic states are assumed to be large but to
have all-zero initial data.

Expand copy_xregs_to_kernel_booting() to receive a mask argument of which
states to save.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Removed the helper functions. (Borislav Petkov)
* Removed 'no functional change' in the changelog. (Borislav Petkov)
* Updated the code comment.
* Moved out the other initialization changes into the previous patch.

Changes from v2:
* Updated the changelog for clarification.
* Updated the code comments.
---
arch/x86/include/asm/fpu/internal.h | 3 +--
arch/x86/kernel/fpu/core.c | 13 ++++++++++---
arch/x86/kernel/fpu/xstate.c | 11 +++++++++--
3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 46cb51ef4d17..e4afc1831e29 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -272,9 +272,8 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
* This function is called only during boot time when x86 caps are not set
* up and alternative can not be used yet.
*/
-static inline void copy_xregs_to_kernel_booting(struct xregs_state *xstate)
+static inline void copy_xregs_to_kernel_booting(struct xregs_state *xstate, u64 mask)
{
- u64 mask = xfeatures_mask_all;
u32 lmask = mask;
u32 hmask = mask >> 32;
int err;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 918930553290..2584a2922fea 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -21,7 +21,10 @@

/*
* Represents the initial FPU state. It's mostly (but not completely) zeroes,
- * depending on the FPU hardware format:
+ * depending on the FPU hardware format.
+ *
+ * The dynamic user states are excluded as they are large but having initial
+ * values with zeros.
*/
union fpregs_state init_fpstate __read_mostly;

@@ -213,9 +216,13 @@ void fpstate_init(struct fpu *fpu)
mask = fpu->state_mask;
size = get_xstate_size(fpu->state_mask);
} else {
+ /*
+ * init_fpstate excludes the dynamic user states as they are
+ * large but having initial values with zeros.
+ */
state = &init_fpstate;
- mask = xfeatures_mask_all;
- size = get_xstate_config(XSTATE_MAX_SIZE);
+ mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+ size = get_xstate_config(XSTATE_MIN_SIZE);
}

if (!static_cpu_has(X86_FEATURE_FPU)) {
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 0e3f93b03b3f..773f594bd730 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -552,6 +552,7 @@ static void __init print_xstate_offset_size(void)
static void __init setup_init_fpu_buf(void)
{
static int on_boot_cpu __initdata = 1;
+ u64 mask;

WARN_ON_FPU(!on_boot_cpu);
on_boot_cpu = 0;
@@ -562,8 +563,14 @@ static void __init setup_init_fpu_buf(void)
setup_xstate_features();
print_xstate_features();

+ /*
+ * Exclude the dynamic user states as they are large but having
+ * initial values with zeros.
+ */
+ mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
+
if (boot_cpu_has(X86_FEATURE_XSAVES))
- fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);
+ fpstate_init_xstate(&init_fpstate.xsave, mask);

/*
* Init all the features state with header.xfeatures being 0x0
@@ -574,7 +581,7 @@ static void __init setup_init_fpu_buf(void)
* Dump the init state again. This is to identify the init state
* of any feature which is not represented by all zero's.
*/
- copy_xregs_to_kernel_booting(&init_fpstate.xsave);
+ copy_xregs_to_kernel_booting(&init_fpstate.xsave, mask);
}

static int xfeature_uncompacted_offset(int xfeature_nr)
--
2.17.1

2021-05-23 19:39:41

by Chang S. Bae

Subject: [PATCH v5 02/28] x86/fpu/xstate: Modify state copy helpers to handle both static and dynamic buffers

Have all the functions copying xstate take a struct fpu * pointer in
preparation for dynamic state buffer support.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
arch/x86/include/asm/fpu/xstate.h | 8 ++++----
arch/x86/kernel/fpu/regset.c | 6 +++---
arch/x86/kernel/fpu/signal.c | 16 +++++++---------
arch/x86/kernel/fpu/xstate.c | 19 +++++++++++++++----
4 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 47a92232d595..e0f1b22f53ce 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -105,10 +105,10 @@ const void *get_xsave_field_ptr(int xfeature_nr);
int using_compacted_format(void);
int xfeature_size(int xfeature_nr);
struct membuf;
-void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave);
-int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf);
-int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf);
-void copy_supervisor_to_kernel(struct xregs_state *xsave);
+void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu);
+int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
+int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
+void copy_supervisor_to_kernel(struct fpu *fpu);
void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask);
void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask);

diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 4c4d9059ff36..5e13e58d11d4 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -85,7 +85,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
fpu__prepare_read(fpu);

if (using_compacted_format()) {
- copy_xstate_to_kernel(to, xsave);
+ copy_xstate_to_kernel(to, fpu);
return 0;
} else {
fpstate_sanitize_xstate(fpu);
@@ -126,9 +126,9 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,

if (using_compacted_format()) {
if (kbuf)
- ret = copy_kernel_to_xstate(xsave, kbuf);
+ ret = copy_kernel_to_xstate(fpu, kbuf);
else
- ret = copy_user_to_xstate(xsave, ubuf);
+ ret = copy_user_to_xstate(fpu, ubuf);
} else {
ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, xsave, 0, -1);
if (!ret)
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index a4ec65317a7f..0d6deb75c507 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -212,11 +212,11 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
}

static inline void
-sanitize_restored_user_xstate(union fpregs_state *state,
+sanitize_restored_user_xstate(struct fpu *fpu,
struct user_i387_ia32_struct *ia32_env,
u64 user_xfeatures, int fx_only)
{
- struct xregs_state *xsave = &state->xsave;
+ struct xregs_state *xsave = &fpu->state.xsave;
struct xstate_header *header = &xsave->header;

if (use_xsave()) {
@@ -253,7 +253,7 @@ sanitize_restored_user_xstate(union fpregs_state *state,
xsave->i387.mxcsr &= mxcsr_feature_mask;

if (ia32_env)
- convert_to_fxsr(&state->fxsave, ia32_env);
+ convert_to_fxsr(&fpu->state.fxsave, ia32_env);
}
}

@@ -396,7 +396,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
* current supervisor states first and invalidate the FPU regs.
*/
if (xfeatures_mask_supervisor())
- copy_supervisor_to_kernel(&fpu->state.xsave);
+ copy_supervisor_to_kernel(fpu);
set_thread_flag(TIF_NEED_FPU_LOAD);
}
__fpu_invalidate_fpregs_state(fpu);
@@ -406,7 +406,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
u64 init_bv = xfeatures_mask_user() & ~user_xfeatures;

if (using_compacted_format()) {
- ret = copy_user_to_xstate(&fpu->state.xsave, buf_fx);
+ ret = copy_user_to_xstate(fpu, buf_fx);
} else {
ret = __copy_from_user(&fpu->state.xsave, buf_fx, state_size);

@@ -416,8 +416,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
if (ret)
goto err_out;

- sanitize_restored_user_xstate(&fpu->state, envp, user_xfeatures,
- fx_only);
+ sanitize_restored_user_xstate(fpu, envp, user_xfeatures, fx_only);

fpregs_lock();
if (unlikely(init_bv))
@@ -437,8 +436,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
goto err_out;
}

- sanitize_restored_user_xstate(&fpu->state, envp, user_xfeatures,
- fx_only);
+ sanitize_restored_user_xstate(fpu, envp, user_xfeatures, fx_only);

fpregs_lock();
if (use_xsave()) {
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 767ad6b008c2..cb634c6afbb2 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1071,14 +1071,17 @@ static void copy_part(struct membuf *to, unsigned *last, unsigned offset,
* It supports partial copy but pos always starts from zero. This is called
* from xstateregs_get() and there we check the CPU has XSAVES.
*/
-void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave)
+void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
{
struct xstate_header header;
const unsigned off_mxcsr = offsetof(struct fxregs_state, mxcsr);
+ struct xregs_state *xsave;
unsigned size = to.left;
unsigned last = 0;
int i;

+ xsave = &fpu->state.xsave;
+
/*
* The destination is a ptrace buffer; we put in only user xstates:
*/
@@ -1127,8 +1130,9 @@ void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave)
* Convert from a ptrace standard-format kernel buffer to kernel XSAVES format
* and copy to the target thread. This is called from xstateregs_set().
*/
-int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
+int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
{
+ struct xregs_state *xsave;
unsigned int offset, size;
int i;
struct xstate_header hdr;
@@ -1141,6 +1145,8 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
if (validate_user_xstate_header(&hdr))
return -EINVAL;

+ xsave = &fpu->state.xsave;
+
for (i = 0; i < XFEATURE_MAX; i++) {
u64 mask = ((u64)1 << i);

@@ -1180,8 +1186,9 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
* xstateregs_set(), as well as potentially from the sigreturn() and
* rt_sigreturn() system calls.
*/
-int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
+int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
{
+ struct xregs_state *xsave;
unsigned int offset, size;
int i;
struct xstate_header hdr;
@@ -1195,6 +1202,8 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
if (validate_user_xstate_header(&hdr))
return -EINVAL;

+ xsave = &fpu->state.xsave;
+
for (i = 0; i < XFEATURE_MAX; i++) {
u64 mask = ((u64)1 << i);

@@ -1235,9 +1244,10 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
* old states, and is intended to be used only in __fpu__restore_sig(), where
* user states are restored from the user buffer.
*/
-void copy_supervisor_to_kernel(struct xregs_state *xstate)
+void copy_supervisor_to_kernel(struct fpu *fpu)
{
struct xstate_header *header;
+ struct xregs_state *xstate;
u64 max_bit, min_bit;
u32 lmask, hmask;
int err, i;
@@ -1251,6 +1261,7 @@ void copy_supervisor_to_kernel(struct xregs_state *xstate)
max_bit = __fls(xfeatures_mask_supervisor());
min_bit = __ffs(xfeatures_mask_supervisor());

+ xstate = &fpu->state.xsave;
lmask = xfeatures_mask_supervisor();
hmask = xfeatures_mask_supervisor() >> 32;
XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
--
2.17.1

2021-05-23 19:39:46

by Chang S. Bae

Subject: [PATCH v5 07/28] x86/fpu/xstate: Calculate and remember dynamic xstate buffer sizes

The xstate per-task buffer is currently embedded into struct fpu with
static size. To accommodate dynamic user xstates, record the maximum and
minimum buffer sizes.

Rename the size calculation function. It calculates the maximum xstate size
and sanity checks it with CPUID. It also calculates the static embedded
buffer size by excluding the dynamic user states from the maximum size.
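
As a rough worked example (the figures here are approximate and not taken
from this patch): with AMX as the only dynamic user state, XSTATE_MIN_SIZE
covers the legacy area, the XSAVE header, and the enabled non-dynamic
components, while XSTATE_MAX_SIZE adds the ~8KB of TILE data (plus TILECFG
and any 64-byte alignment padding) on top of that minimum.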

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Massaged the function description, in preparation for the change
with a return value.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the code comment. (Borislav Petkov)
* Adjusted the calculation function naming.
* Moved out the new variable addition into a new patch.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Renamed the in-line size variable.
* Updated some code comments.
---
arch/x86/kernel/fpu/xstate.c | 56 ++++++++++++++++++++++--------------
1 file changed, 35 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 1d6613751625..74e0892bd8ec 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -654,24 +654,32 @@ static void check_xstate_against_struct(int nr)
}
}

-/*
- * This essentially double-checks what the cpu told us about
- * how large the XSAVE buffer needs to be. We are recalculating
- * it to be safe.
+/**
+ * calculate_xstate_sizes() - Calculate the xstate per-task buffer sizes -- maximum and minimum.
+ *
+ * Record the minimum and double-check the maximum against what the CPU told.
*
- * Dynamic supervisor XSAVE features allocate their own buffers and are
- * not covered by these checks. Only the size of the buffer for task->fpu
- * is checked here.
+ * Dynamic user states are stored in this per-task buffer. They account for the delta between the
+ * maximum and the minimum.
+ *
+ * Dynamic supervisor XSAVE features allocate their own buffers and are not covered by these checks.
+ *
+ * Return: nothing.
*/
-static void do_extra_xstate_size_checks(void)
+static void calculate_xstate_sizes(void)
{
- int paranoid_xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+ int paranoid_min_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+ int paranoid_max_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
int i;

for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ bool user_dynamic;
+
if (!xfeature_enabled(i))
continue;

+ user_dynamic = (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? true : false;
+
check_xstate_against_struct(i);
/*
* Supervisor state components can be managed only by
@@ -681,23 +689,32 @@ static void do_extra_xstate_size_checks(void)
XSTATE_WARN_ON(xfeature_is_supervisor(i));

/* Align from the end of the previous feature */
- if (xfeature_is_aligned(i))
- paranoid_xstate_size = ALIGN(paranoid_xstate_size, 64);
+ if (xfeature_is_aligned(i)) {
+ paranoid_max_size = ALIGN(paranoid_max_size, 64);
+ if (!user_dynamic)
+ paranoid_min_size = ALIGN(paranoid_min_size, 64);
+ }
/*
* The offset of a given state in the non-compacted
* format is given to us in a CPUID leaf. We check
* them for being ordered (increasing offsets) in
* setup_xstate_features().
*/
- if (!using_compacted_format())
- paranoid_xstate_size = xfeature_uncompacted_offset(i);
+ if (!using_compacted_format()) {
+ paranoid_max_size = xfeature_uncompacted_offset(i);
+ if (!user_dynamic)
+ paranoid_min_size = xfeature_uncompacted_offset(i);
+ }
/*
* The compacted-format offset always depends on where
* the previous state ended.
*/
- paranoid_xstate_size += xfeature_size(i);
+ paranoid_max_size += xfeature_size(i);
+ if (!user_dynamic)
+ paranoid_min_size += xfeature_size(i);
}
- XSTATE_WARN_ON(paranoid_xstate_size != get_xstate_config(XSTATE_MAX_SIZE));
+ XSTATE_WARN_ON(paranoid_max_size != get_xstate_config(XSTATE_MAX_SIZE));
+ set_xstate_config(XSTATE_MIN_SIZE, paranoid_min_size);
}


@@ -798,14 +815,11 @@ static int __init init_xstate_size(void)
*/
set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);

- /* Perform an extra check for the maximum size. */
- do_extra_xstate_size_checks();
-
/*
- * Set the minimum to be the same as the maximum. The dynamic
- * user states are not supported yet.
+ * Calculate and double-check the maximum size. Calculate and record
+ * the minimum size.
*/
- set_xstate_config(XSTATE_MIN_SIZE, possible_xstate_size);
+ calculate_xstate_sizes();

/* Ensure the minimum size fits in the statically-alocated buffer: */
if (!is_supported_xstate_size(get_xstate_config(XSTATE_MIN_SIZE)))
--
2.17.1

2021-05-23 19:39:51

by Chang S. Bae

Subject: [PATCH v5 03/28] x86/fpu/xstate: Modify address finders to handle both static and dynamic buffers

Have all the functions finding xstate address take a struct fpu * pointer
in preparation for dynamic state buffer support.

init_fpstate is a special case, which is indicated by a null pointer
parameter to get_xsave_addr() and __raw_xsave_addr().

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the function comment to use kernel-doc style. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
arch/x86/include/asm/fpu/internal.h | 2 +-
arch/x86/include/asm/fpu/xstate.h | 2 +-
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/kernel/cpu/common.c | 2 +-
arch/x86/kernel/fpu/xstate.c | 46 ++++++++++++++++++++++-------
arch/x86/kvm/x86.c | 10 +++----
arch/x86/mm/pkeys.c | 2 +-
7 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index d81d8c407dc0..0153c4d4ca77 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -579,7 +579,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
* return to userland e.g. for a copy_to_user() operation.
*/
if (current->mm) {
- pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
+ pk = get_xsave_addr(new_fpu, XFEATURE_PKRU);
if (pk)
pkru_val = pk->pkru;
}
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index e0f1b22f53ce..24bf8d3f559a 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -100,7 +100,7 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
extern void __init update_regset_xstate_info(unsigned int size,
u64 xstate_mask);

-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
const void *get_xsave_field_ptr(int xfeature_nr);
int using_compacted_format(void);
int xfeature_size(int xfeature_nr);
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..024699f19f96 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -141,7 +141,7 @@ static inline void write_pkru(u32 pkru)
if (!boot_cpu_has(X86_FEATURE_OSPKE))
return;

- pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
+ pk = get_xsave_addr(&current->thread.fpu, XFEATURE_PKRU);

/*
* The PKRU value in xstate needs to be in sync with the value that is
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a1b756c49a93..0a6bd4e2c098 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -477,7 +477,7 @@ static __always_inline void setup_pku(struct cpuinfo_x86 *c)
return;

cr4_set_bits(X86_CR4_PKE);
- pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
+ pk = get_xsave_addr(NULL, XFEATURE_PKRU);
if (pk)
pk->pkru = init_pkru_value;
/*
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index cb634c6afbb2..b55771eb739f 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -890,19 +890,34 @@ void fpu__resume_cpu(void)
}
}

-/*
+/**
+ * __raw_xsave_addr() - Find the address where the feature state is saved.
+ *
* Given an xstate feature nr, calculate where in the xsave
* buffer the state is. Callers should ensure that the buffer
* is valid.
+ *
+ * If @fpu is NULL, use init_fpstate.
+ *
+ * @fpu: A struct fpu * pointer
+ *
+ * Return: An address of the feature state in the buffer
*/
-static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
{
+ void *xsave;
+
if (!xfeature_enabled(xfeature_nr)) {
WARN_ON_FPU(1);
return NULL;
}

- return (void *)xsave + xstate_comp_offsets[xfeature_nr];
+ if (fpu)
+ xsave = &fpu->state.xsave;
+ else
+ xsave = &init_fpstate.xsave;
+
+ return xsave + xstate_comp_offsets[xfeature_nr];
}
/*
* Given the xsave area and a state inside, this function returns the
@@ -915,15 +930,18 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
* this will return NULL.
*
* Inputs:
- * xstate: the thread's storage area for all FPU data
+ * fpu: the thread's FPU data to reference xstate buffer(s).
+ * (A null pointer parameter indicates init_fpstate.)
* xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
* XFEATURE_SSE, etc...)
* Output:
* address of the state in the xsave area, or NULL if the
* field is not present in the xsave buffer.
*/
-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
{
+ struct xregs_state *xsave;
+
/*
* Do we even *have* xsave state?
*/
@@ -936,6 +954,12 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
*/
WARN_ONCE(!(xfeatures_mask_all & BIT_ULL(xfeature_nr)),
"get of unsupported state");
+
+ if (fpu)
+ xsave = &fpu->state.xsave;
+ else
+ xsave = &init_fpstate.xsave;
+
/*
* This assumes the last 'xsave*' instruction to
* have requested that 'xfeature_nr' be saved.
@@ -950,7 +974,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
return NULL;

- return __raw_xsave_addr(xsave, xfeature_nr);
+ return __raw_xsave_addr(fpu, xfeature_nr);
}
EXPORT_SYMBOL_GPL(get_xsave_addr);

@@ -981,7 +1005,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
*/
fpu__save(fpu);

- return get_xsave_addr(&fpu->state.xsave, xfeature_nr);
+ return get_xsave_addr(fpu, xfeature_nr);
}

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -1116,7 +1140,7 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
* Copy only in-use xstates:
*/
if ((header.xfeatures >> i) & 1) {
- void *src = __raw_xsave_addr(xsave, i);
+ void *src = __raw_xsave_addr(fpu, i);

copy_part(&to, &last, xstate_offsets[i],
xstate_sizes[i], src);
@@ -1151,7 +1175,7 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
u64 mask = ((u64)1 << i);

if (hdr.xfeatures & mask) {
- void *dst = __raw_xsave_addr(xsave, i);
+ void *dst = __raw_xsave_addr(fpu, i);

offset = xstate_offsets[i];
size = xstate_sizes[i];
@@ -1208,7 +1232,7 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
u64 mask = ((u64)1 << i);

if (hdr.xfeatures & mask) {
- void *dst = __raw_xsave_addr(xsave, i);
+ void *dst = __raw_xsave_addr(fpu, i);

offset = xstate_offsets[i];
size = xstate_sizes[i];
@@ -1450,7 +1474,7 @@ void update_pasid(void)
*/
xsave = &fpu->state.xsave;
xsave->header.xfeatures |= XFEATURE_MASK_PASID;
- ppasid_state = get_xsave_addr(xsave, XFEATURE_PASID);
+ ppasid_state = get_xsave_addr(fpu, XFEATURE_PASID);
/*
* Since XFEATURE_MASK_PASID is set in xfeatures, ppasid_state
* won't be NULL and no need to check its value.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9af0a3c52b62..01c8642cb56c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4589,7 +4589,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
while (valid) {
u64 xfeature_mask = valid & -valid;
int xfeature_nr = fls64(xfeature_mask) - 1;
- void *src = get_xsave_addr(xsave, xfeature_nr);
+ void *src = get_xsave_addr(vcpu->arch.guest_fpu, xfeature_nr);

if (src) {
u32 size, offset, ecx, edx;
@@ -4632,7 +4632,7 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
while (valid) {
u64 xfeature_mask = valid & -valid;
int xfeature_nr = fls64(xfeature_mask) - 1;
- void *dest = get_xsave_addr(xsave, xfeature_nr);
+ void *dest = get_xsave_addr(vcpu->arch.guest_fpu, xfeature_nr);

if (dest) {
u32 size, offset, ecx, edx;
@@ -10473,12 +10473,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
if (init_event)
kvm_put_guest_fpu(vcpu);
- mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
- XFEATURE_BNDREGS);
+ mpx_state_buffer = get_xsave_addr(vcpu->arch.guest_fpu, XFEATURE_BNDREGS);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndreg_state));
- mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
- XFEATURE_BNDCSR);
+ mpx_state_buffer = get_xsave_addr(vcpu->arch.guest_fpu, XFEATURE_BNDCSR);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
if (init_event)
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index a2332eef66e9..5f2609af4223 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -177,7 +177,7 @@ static ssize_t init_pkru_write_file(struct file *file,
return -EINVAL;

WRITE_ONCE(init_pkru_value, new_init_pkru);
- pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
+ pk = get_xsave_addr(NULL, XFEATURE_PKRU);
if (!pk)
return -EINVAL;
pk->pkru = new_init_pkru;
--
2.17.1

2021-05-23 19:40:01

by Chang S. Bae

Subject: [PATCH v5 09/28] x86/fpu/xstate: Introduce helpers to manage the xstate buffer dynamically

The static xstate per-task buffer contains the extended register states --
but it is not expandable at runtime. Introduce runtime methods and a new
fpu struct field to support the expansion.

fpu->state_mask indicates which state components are reserved to be
saved in the xstate buffer.

alloc_xstate_buffer() uses vmalloc(). If use of this mechanism grows to
allocate buffers larger than 64KB, a more sophisticated allocation scheme
that includes purpose-built reclaim capability might be justified.

Introduce a new helper -- get_xstate_size() to calculate the buffer size.

Also, use the new field and helper to initialize the buffer.
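
As an illustrative, hypothetical caller (not part of this patch), assuming
a context that already knows which dynamic features were granted to the
task; alloc_xstate_buffer() and xfeatures_mask_user_dynamic are from the
series, the wrapper itself is made up:

static int grow_xstate_for(struct task_struct *tsk, u64 granted)
{
	struct fpu *fpu = &tsk->thread.fpu;

	/*
	 * alloc_xstate_buffer() is a no-op if the current buffer already
	 * covers the requested mask; otherwise it vzalloc()s a larger
	 * buffer and updates fpu->state and fpu->state_mask.
	 */
	return alloc_xstate_buffer(fpu, granted & xfeatures_mask_user_dynamic);
}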

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Updated code comments. (Borislav Petkov)
* Used vzalloc() instead of vmalloc() with memset(). (Borislav Petkov)
* Removed the max size check for >64KB. (Borislav Petkov)
* Removed the allocation size check in the helper. (Borislav Petkov)
* Switched the function description in the kernel-doc style.
* Used them for buffer initialization -- moved from the next patch.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Replaced 'area' with 'buffer' in the comments and the changelog.
* Updated the code comments.

Changes from v1:
* Removed unneeded interrupt masking (Andy Lutomirski)
* Added vmalloc() error tracing (Dave Hansen, PeterZ, and Andy Lutomirski)
---
arch/x86/include/asm/fpu/types.h | 7 ++
arch/x86/include/asm/fpu/xstate.h | 4 +
arch/x86/include/asm/trace/fpu.h | 5 ++
arch/x86/kernel/fpu/core.c | 14 ++--
arch/x86/kernel/fpu/xstate.c | 125 ++++++++++++++++++++++++++++++
5 files changed, 148 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index dcd28a545377..6fc707c14350 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -336,6 +336,13 @@ struct fpu {
*/
unsigned long avx512_timestamp;

+ /*
+ * @state_mask:
+ *
+ * The bitmap represents state components reserved to be saved in ->state.
+ */
+ u64 state_mask;
+
/*
* @state:
*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 1fba2ca15874..cbb4795d2b45 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -112,6 +112,10 @@ extern unsigned int get_xstate_config(enum xstate_config cfg);
void set_xstate_config(enum xstate_config cfg, unsigned int value);

void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
+unsigned int get_xstate_size(u64 mask);
+int alloc_xstate_buffer(struct fpu *fpu, u64 mask);
+void free_xstate_buffer(struct fpu *fpu);
+
const void *get_xsave_field_ptr(int xfeature_nr);
int using_compacted_format(void);
int xfeature_size(int xfeature_nr);
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index ef82f4824ce7..b691c2db47c7 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -89,6 +89,11 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
TP_ARGS(fpu)
);

+DEFINE_EVENT(x86_fpu, x86_fpu_xstate_alloc_failed,
+ TP_PROTO(struct fpu *fpu),
+ TP_ARGS(fpu)
+);
+
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH asm/trace/
#undef TRACE_INCLUDE_FILE
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 44d41251b474..918930553290 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -210,9 +210,8 @@ void fpstate_init(struct fpu *fpu)

if (likely(fpu)) {
state = fpu->state;
- /* The dynamic user states are not prepared yet. */
- mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
- size = get_xstate_config(XSTATE_MIN_SIZE);
+ mask = fpu->state_mask;
+ size = get_xstate_size(fpu->state_mask);
} else {
state = &init_fpstate;
mask = xfeatures_mask_all;
@@ -248,14 +247,15 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)

WARN_ON_FPU(src_fpu != &current->thread.fpu);

+ /*
+ * The child does not inherit the dynamic states. Thus, use the buffer
+ * embedded in struct task_struct, which has the minimum size.
+ */
+ dst_fpu->state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
dst_fpu->state = &dst_fpu->__default_state;
-
/*
* Don't let 'init optimized' areas of the XSAVE area
* leak into the child task:
- *
- * The child does not inherit the dynamic states. So,
- * the xstate buffer has the minimum size.
*/
memset(&dst_fpu->state->xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c3c7c57616eb..0e3f93b03b3f 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -10,6 +10,7 @@
#include <linux/pkeys.h>
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>

#include <asm/fpu/api.h>
#include <asm/fpu/internal.h>
@@ -19,6 +20,7 @@

#include <asm/tlbflush.h>
#include <asm/cpufeature.h>
+#include <asm/trace/fpu.h>

/*
* Although we spell it out in here, the Processor Trace
@@ -71,6 +73,11 @@ static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] =
static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
+/*
+ * True if the buffer of the corresponding XFEATURE is located on the next 64
+ * byte boundary. Otherwise, it follows the preceding component immediately.
+ */
+static bool xstate_aligns[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = false};

/**
* struct fpu_xstate_buffer_config - xstate per-task buffer configuration
@@ -168,6 +175,58 @@ static bool xfeature_is_supervisor(int xfeature_nr)
return ecx & 1;
}

+/**
+ * get_xstate_size() - calculate an xstate buffer size
+ * @mask: This bitmap tells which components reserved in the buffer.
+ *
+ * Available once those arrays for the offset, size, and alignment info are set up,
+ * by setup_xstate_features().
+ *
+ * Returns: The buffer size
+ */
+unsigned int get_xstate_size(u64 mask)
+{
+ unsigned int size;
+ u64 xmask;
+ int i, nr;
+
+ if (!mask)
+ return 0;
+
+ /*
+ * The minimum buffer size excludes the dynamic user state. When a task
+ * uses the state, the buffer can grow up to the max size.
+ */
+ if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
+ return get_xstate_config(XSTATE_MIN_SIZE);
+ else if (mask == xfeatures_mask_all)
+ return get_xstate_config(XSTATE_MAX_SIZE);
+
+ nr = fls64(mask) - 1;
+
+ if (!using_compacted_format())
+ return xstate_offsets[nr] + xstate_sizes[nr];
+
+ xmask = BIT_ULL(nr + 1) - 1;
+
+ if (mask == (xmask & xfeatures_mask_all))
+ return xstate_comp_offsets[nr] + xstate_sizes[nr];
+
+ /*
+ * With the given mask, no relevant size is found so far. So, calculate
+ * it by summing up each state size.
+ */
+ for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
+ if (!(mask & BIT_ULL(i)))
+ continue;
+
+ if (xstate_aligns[i])
+ size = ALIGN(size, 64);
+ size += xstate_sizes[i];
+ }
+ return size;
+}
+
/*
* When executing XSAVEOPT (or other optimized XSAVE instructions), if
* a processor implementation detects that an FPU state component is still
@@ -308,10 +367,12 @@ static void __init setup_xstate_features(void)
xstate_offsets[XFEATURE_FP] = 0;
xstate_sizes[XFEATURE_FP] = offsetof(struct fxregs_state,
xmm_space);
+ xstate_aligns[XFEATURE_FP] = true;

xstate_offsets[XFEATURE_SSE] = xstate_sizes[XFEATURE_FP];
xstate_sizes[XFEATURE_SSE] = sizeof_field(struct fxregs_state,
xmm_space);
+ xstate_aligns[XFEATURE_SSE] = true;

for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
if (!xfeature_enabled(i))
@@ -329,6 +390,7 @@ static void __init setup_xstate_features(void)
continue;

xstate_offsets[i] = ebx;
+ xstate_aligns[i] = (ecx & 2) ? true : false;

/*
* In our xstate size checks, we assume that the highest-numbered
@@ -915,6 +977,9 @@ void __init fpu__init_system_xstate(void)
if (err)
goto out_disable;

+ /* Make sure init_task does not include the dynamic user states. */
+ current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+
/*
* Update info used for ptrace frames; use standard-format size and no
* supervisor xstates:
@@ -1141,6 +1206,66 @@ static inline bool xfeatures_mxcsr_quirk(u64 xfeatures)
return true;
}

+void free_xstate_buffer(struct fpu *fpu)
+{
+ /* Free up only the dynamically-allocated memory. */
+ if (fpu->state != &fpu->__default_state)
+ vfree(fpu->state);
+}
+
+/**
+ * alloc_xstate_buffer() - allocate an xstate buffer with the size calculated based on @mask.
+ *
+ * @fpu: A struct fpu * pointer
+ * @mask: The bitmap tells which components to be reserved in the new buffer.
+ *
+ * Use vmalloc() simply here. If the task with a vmalloc()-allocated buffer tends
+ * to terminate quickly, vfree()-induced IPIs may be a concern. Caching may be
+ * helpful for this. But the task with large state is likely to live longer.
+ *
+ * Also, this method does not shrink or reclaim the buffer.
+ *
+ * Returns 0 on success, -ENOMEM on allocation error.
+ */
+int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
+{
+ union fpregs_state *state;
+ unsigned int oldsz, newsz;
+ u64 state_mask;
+
+ state_mask = fpu->state_mask | mask;
+
+ oldsz = get_xstate_size(fpu->state_mask);
+ newsz = get_xstate_size(state_mask);
+
+ if (oldsz >= newsz)
+ return 0;
+
+ state = vzalloc(newsz);
+ if (!state) {
+ /*
+ * When allocation requested from #NM, the error code may not be
+ * populated well. Then, this tracepoint is useful for providing
+ * the failure context.
+ */
+ trace_x86_fpu_xstate_alloc_failed(fpu);
+ return -ENOMEM;
+ }
+
+ if (using_compacted_format())
+ fpstate_init_xstate(&state->xsave, state_mask);
+
+ /*
+ * As long as the register state is intact, save the xstate in the new buffer
+ * at the next context copy/switch or potentially ptrace-driven xstate writing.
+ */
+
+ free_xstate_buffer(fpu);
+ fpu->state = state;
+ fpu->state_mask = state_mask;
+ return 0;
+}
+
static void fill_gap(struct membuf *to, unsigned *last, unsigned offset)
{
if (*last >= offset)
--
2.17.1

2021-05-23 19:40:07

by Chang S. Bae

Subject: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

When AMX is enabled and an AMX-using task's state is saved, explicitly
initialize the AMX state after the XSAVE.

This ensures that the kernel will only request idle states with clean AMX
state. In the case of the C6 idle state, this allows the hardware to reach
a deeper power-saving condition.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch. (Thomas Gleixner)
---
arch/x86/include/asm/special_insns.h | 6 ++++++
arch/x86/kernel/fpu/core.c | 8 ++++++++
2 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2acd6cb62328..f0ed063035eb 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -306,6 +306,12 @@ static inline int enqcmds(void __iomem *dst, const void *src)
return 0;
}

+static inline void tile_release(void)
+{
+ /* Instruction opcode for TILERELEASE; supported in binutils >= 2.36. */
+ asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
+}
+
#endif /* __KERNEL__ */

#endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index cccfeafe81e5..53a5869078b8 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -106,6 +106,14 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
*/
if (fpu->state->xsave.header.xfeatures & XFEATURE_MASK_AVX512)
fpu->avx512_timestamp = jiffies;
+
+ /*
+ * Since the current task's state is safely in the XSAVE buffer, TILERELEASE
+ * the TILE registers to guarantee that dirty state will not interfere with the
+ * hardware's ability to enter the core C6 idle state.
+ */
+ if (fpu->state_mask & XFEATURE_MASK_XTILE_DATA)
+ tile_release();
return 1;
}

--
2.17.1

2021-05-23 19:40:09

by Chang S. Bae

Subject: [PATCH v5 05/28] x86/fpu/xstate: Add a new variable to indicate dynamic user states

The xstate per-task buffer is being prepared to become dynamic for user
states. Introduce a new mask variable to indicate the 'dynamic' user
states. The value is determined at boot time.

The perf subsystem has a separate buffer to save some state only when
needed, not on every context switch. Those states are named 'dynamic'
supervisor states. Some defines and helpers are not named after dynamic
supervisor states, so rename them.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the code comment. (Borislav Petkov)

Changes from v2:
* Updated the changelog for clarification.
---
arch/x86/include/asm/fpu/xstate.h | 12 ++++++-----
arch/x86/kernel/fpu/xstate.c | 33 ++++++++++++++++++++-----------
2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 24bf8d3f559a..6ce8350672c2 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -56,7 +56,7 @@
* - Don't set the bit corresponding to the dynamic supervisor feature in
* IA32_XSS at run time, since it has been set at boot time.
*/
-#define XFEATURE_MASK_DYNAMIC (XFEATURE_MASK_LBR)
+#define XFEATURE_MASK_SUPERVISOR_DYNAMIC (XFEATURE_MASK_LBR)

/*
* Unsupported supervisor features. When a supervisor feature in this mask is
@@ -66,7 +66,7 @@

/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
- XFEATURE_MASK_DYNAMIC | \
+ XFEATURE_MASK_SUPERVISOR_DYNAMIC | \
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED)

#ifdef CONFIG_X86_64
@@ -87,14 +87,16 @@ static inline u64 xfeatures_mask_user(void)
return xfeatures_mask_all & XFEATURE_MASK_USER_SUPPORTED;
}

-static inline u64 xfeatures_mask_dynamic(void)
+static inline u64 xfeatures_mask_supervisor_dynamic(void)
{
if (!boot_cpu_has(X86_FEATURE_ARCH_LBR))
- return XFEATURE_MASK_DYNAMIC & ~XFEATURE_MASK_LBR;
+ return XFEATURE_MASK_SUPERVISOR_DYNAMIC & ~XFEATURE_MASK_LBR;

- return XFEATURE_MASK_DYNAMIC;
+ return XFEATURE_MASK_SUPERVISOR_DYNAMIC;
}

+extern u64 xfeatures_mask_user_dynamic;
+
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];

extern void __init update_regset_xstate_info(unsigned int size,
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index b55771eb739f..34755612019c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -61,6 +61,12 @@ static short xsave_cpuid_features[] __initdata = {
*/
u64 xfeatures_mask_all __read_mostly;

+/*
+ * This represents user xstates, a subset of xfeatures_mask_all, saved in a
+ * dynamic kernel XSAVE buffer.
+ */
+u64 xfeatures_mask_user_dynamic __read_mostly;
+
static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
@@ -237,7 +243,7 @@ void fpu__init_cpu_xstate(void)
*/
if (boot_cpu_has(X86_FEATURE_XSAVES)) {
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
- xfeatures_mask_dynamic());
+ xfeatures_mask_supervisor_dynamic());
}
}

@@ -615,8 +621,8 @@ static void check_xstate_against_struct(int nr)
* how large the XSAVE buffer needs to be. We are recalculating
* it to be safe.
*
- * Dynamic XSAVE features allocate their own buffers and are not
- * covered by these checks. Only the size of the buffer for task->fpu
+ * Dynamic supervisor XSAVE features allocate their own buffers and are
+ * not covered by these checks. Only the size of the buffer for task->fpu
* is checked here.
*/
static void do_extra_xstate_size_checks(void)
@@ -686,7 +692,7 @@ static unsigned int __init get_xsaves_size(void)
*/
static unsigned int __init get_xsaves_size_no_dynamic(void)
{
- u64 mask = xfeatures_mask_dynamic();
+ u64 mask = xfeatures_mask_supervisor_dynamic();
unsigned int size;

if (!mask)
@@ -773,6 +779,7 @@ static int __init init_xstate_size(void)
static void fpu__init_disable_system_xstate(void)
{
xfeatures_mask_all = 0;
+ xfeatures_mask_user_dynamic = 0;
cr4_clear_bits(X86_CR4_OSXSAVE);
setup_clear_cpu_cap(X86_FEATURE_XSAVE);
}
@@ -839,6 +846,8 @@ void __init fpu__init_system_xstate(void)
}

xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
+ /* Do not support the dynamically allocated buffer yet. */
+ xfeatures_mask_user_dynamic = 0;

/* Enable xstate instructions to be able to continue with initialization: */
fpu__init_cpu_xstate();
@@ -886,7 +895,7 @@ void fpu__resume_cpu(void)
*/
if (boot_cpu_has(X86_FEATURE_XSAVES)) {
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
- xfeatures_mask_dynamic());
+ xfeatures_mask_supervisor_dynamic());
}
}

@@ -1327,8 +1336,8 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
* @mask: Represent the dynamic supervisor features saved into the xsave area
*
* Only the dynamic supervisor states sets in the mask are saved into the xsave
- * area (See the comment in XFEATURE_MASK_DYNAMIC for the details of dynamic
- * supervisor feature). Besides the dynamic supervisor states, the legacy
+ * area (See the comment in XFEATURE_MASK_SUPERVISOR_DYNAMIC for the details of
+ * dynamic supervisor feature). Besides the dynamic supervisor states, the legacy
* region and XSAVE header are also saved into the xsave area. The supervisor
* features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
* XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not saved.
@@ -1337,7 +1346,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
*/
void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
{
- u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
+ u64 dynamic_mask = xfeatures_mask_supervisor_dynamic() & mask;
u32 lmask, hmask;
int err;

@@ -1363,9 +1372,9 @@ void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
* @mask: Represent the dynamic supervisor features restored from the xsave area
*
* Only the dynamic supervisor states sets in the mask are restored from the
- * xsave area (See the comment in XFEATURE_MASK_DYNAMIC for the details of
- * dynamic supervisor feature). Besides the dynamic supervisor states, the
- * legacy region and XSAVE header are also restored from the xsave area. The
+ * xsave area (See the comment in XFEATURE_MASK_SUPERVISOR_DYNAMIC for the
+ * details of dynamic supervisor feature). Besides the dynamic supervisor states,
+ * the legacy region and XSAVE header are also restored from the xsave area. The
* supervisor features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
* XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not restored.
*
@@ -1373,7 +1382,7 @@ void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
*/
void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask)
{
- u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
+ u64 dynamic_mask = xfeatures_mask_supervisor_dynamic() & mask;
u32 lmask, hmask;
int err;

--
2.17.1

2021-05-23 19:40:18

by Chang S. Bae

Subject: [PATCH v5 08/28] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer

The xstate per-task buffer is embedded into struct fpu, and the field
'state' represents that buffer. When dynamic user state is in use, the
buffer may be dynamically allocated.

Convert the 'state' field to point either to the embedded buffer or to the
dynamically-allocated buffer. Also, add a new field to represent the
embedded buffer.

The initial task sets it before dealing with soft FPU. Every child process
will set the pointer on its creation.

No functional change.

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Fixed KVM's user_fpu and guest_fpu to initialize the 'state' field correctly.
* Massaged the changelog.

Changes from v3:
* Added as a new patch to simplify the buffer access. (Borislav Petkov)
---
arch/x86/include/asm/fpu/internal.h | 6 +++---
arch/x86/include/asm/fpu/types.h | 27 +++++++++++++++++++++------
arch/x86/include/asm/trace/fpu.h | 4 ++--
arch/x86/kernel/fpu/core.c | 26 ++++++++++++++------------
arch/x86/kernel/fpu/init.c | 6 ++++--
arch/x86/kernel/fpu/regset.c | 22 +++++++++++-----------
arch/x86/kernel/fpu/signal.c | 22 +++++++++++-----------
arch/x86/kernel/fpu/xstate.c | 18 +++++++++---------
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/x86.c | 20 +++++++++++---------
arch/x86/math-emu/fpu_aux.c | 2 +-
arch/x86/math-emu/fpu_entry.c | 4 ++--
arch/x86/math-emu/fpu_system.h | 2 +-
13 files changed, 91 insertions(+), 70 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index b34d0d29e4b8..46cb51ef4d17 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -199,9 +199,9 @@ static inline int copy_user_to_fregs(struct fregs_state __user *fx)
static inline void copy_fxregs_to_kernel(struct fpu *fpu)
{
if (IS_ENABLED(CONFIG_X86_32))
- asm volatile( "fxsave %[fx]" : [fx] "=m" (fpu->state.fxsave));
+ asm volatile( "fxsave %[fx]" : [fx] "=m" (fpu->state->fxsave));
else
- asm volatile("fxsaveq %[fx]" : [fx] "=m" (fpu->state.fxsave));
+ asm volatile("fxsaveq %[fx]" : [fx] "=m" (fpu->state->fxsave));
}

/* These macros all use (%edi)/(%rdi) as the single memory argument. */
@@ -427,7 +427,7 @@ static inline void __copy_kernel_to_fpregs(union fpregs_state *fpstate, u64 mask

static inline void copy_kernel_to_fpregs(struct fpu *fpu)
{
- union fpregs_state *fpstate = &fpu->state;
+ union fpregs_state *fpstate = fpu->state;

/*
* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f5a38a5f3ae1..dcd28a545377 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -339,13 +339,28 @@ struct fpu {
/*
* @state:
*
- * In-memory copy of all FPU registers that we save/restore
- * over context switches. If the task is using the FPU then
- * the registers in the FPU are more recent than this state
- * copy. If the task context-switches away then they get
- * saved here and represent the FPU state.
+ * A pointer to indicate the in-memory copy of all FPU registers that are
+ * saved/restored over context switches.
+ *
+ * Initially @state points to @__default_state. When dynamic states get
+ * used, memory is allocated for the larger state copy and @state is
+ * updated to point to it. The state in ->state then supersedes and
+ * invalidates the state in @__default_state.
+ *
+ * In general, if the task is using the FPU then the registers in the FPU
+ * are more recent than the state copy. If the task context-switches away
+ * then they get saved in ->state and represent the FPU state.
+ */
+ union fpregs_state *state;
+
+ /*
+ * @__default_state:
+ *
+ * Initial in-memory copy of all FPU registers that are saved/restored
+ * over context switches. When the task is switched to dynamic states,
+ * this copy is replaced with the new in-memory copy in ->state.
*/
- union fpregs_state state;
+ union fpregs_state __default_state;
/*
* WARNING: 'state' is dynamically-sized. Do not put
* anything after it here.
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 879b77792f94..ef82f4824ce7 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -22,8 +22,8 @@ DECLARE_EVENT_CLASS(x86_fpu,
__entry->fpu = fpu;
__entry->load_fpu = test_thread_flag(TIF_NEED_FPU_LOAD);
if (boot_cpu_has(X86_FEATURE_OSXSAVE)) {
- __entry->xfeatures = fpu->state.xsave.header.xfeatures;
- __entry->xcomp_bv = fpu->state.xsave.header.xcomp_bv;
+ __entry->xfeatures = fpu->state->xsave.header.xfeatures;
+ __entry->xcomp_bv = fpu->state->xsave.header.xcomp_bv;
}
),
TP_printk("x86/fpu: %p load: %d xfeatures: %llx xcomp_bv: %llx",
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index d8c6f284adb0..44d41251b474 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -95,13 +95,13 @@ EXPORT_SYMBOL(irq_fpu_usable);
int copy_fpregs_to_fpstate(struct fpu *fpu)
{
if (likely(use_xsave())) {
- copy_xregs_to_kernel(&fpu->state.xsave);
+ copy_xregs_to_kernel(&fpu->state->xsave);

/*
* AVX512 state is tracked here because its use is
* known to slow the max clock speed of the core.
*/
- if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+ if (fpu->state->xsave.header.xfeatures & XFEATURE_MASK_AVX512)
fpu->avx512_timestamp = jiffies;
return 1;
}
@@ -115,7 +115,7 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
* Legacy FPU register saving, FNSAVE always clears FPU registers,
* so we have to mark them inactive:
*/
- asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state.fsave));
+ asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state->fsave));

return 0;
}
@@ -209,7 +209,7 @@ void fpstate_init(struct fpu *fpu)
u64 mask;

if (likely(fpu)) {
- state = &fpu->state;
+ state = fpu->state;
/* The dynamic user states are not prepared yet. */
mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
size = get_xstate_config(XSTATE_MIN_SIZE);
@@ -248,6 +248,8 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)

WARN_ON_FPU(src_fpu != &current->thread.fpu);

+ dst_fpu->state = &dst_fpu->__default_state;
+
/*
* Don't let 'init optimized' areas of the XSAVE area
* leak into the child task:
@@ -255,7 +257,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
* The child does not inherit the dynamic states. So,
* the xstate buffer has the minimum size.
*/
- memset(&dst_fpu->state.xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
+ memset(&dst_fpu->state->xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));

/*
* If the FPU registers are not current just memcpy() the state.
@@ -267,7 +269,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
*/
fpregs_lock();
if (test_thread_flag(TIF_NEED_FPU_LOAD))
- memcpy(&dst_fpu->state, &src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));
+ memcpy(dst_fpu->state, src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));

else if (!copy_fpregs_to_fpstate(dst_fpu))
copy_kernel_to_fpregs(dst_fpu);
@@ -403,7 +405,7 @@ static void fpu__clear(struct fpu *fpu, bool user_only)
if (user_only) {
if (!fpregs_state_valid(fpu, smp_processor_id()) &&
xfeatures_mask_supervisor())
- copy_kernel_to_xregs(&fpu->state.xsave,
+ copy_kernel_to_xregs(&fpu->state->xsave,
xfeatures_mask_supervisor());
copy_init_fpstate_to_fpregs(xfeatures_mask_user());
} else {
@@ -485,11 +487,11 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
* fully reproduce the context of the exception.
*/
if (boot_cpu_has(X86_FEATURE_FXSR)) {
- cwd = fpu->state.fxsave.cwd;
- swd = fpu->state.fxsave.swd;
+ cwd = fpu->state->fxsave.cwd;
+ swd = fpu->state->fxsave.swd;
} else {
- cwd = (unsigned short)fpu->state.fsave.cwd;
- swd = (unsigned short)fpu->state.fsave.swd;
+ cwd = (unsigned short)fpu->state->fsave.cwd;
+ swd = (unsigned short)fpu->state->fsave.swd;
}

err = swd & ~cwd;
@@ -503,7 +505,7 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
unsigned short mxcsr = MXCSR_DEFAULT;

if (boot_cpu_has(X86_FEATURE_XMM))
- mxcsr = fpu->state.fxsave.mxcsr;
+ mxcsr = fpu->state->fxsave.mxcsr;

err = ~(mxcsr >> 7) & mxcsr;
}
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index f63765b7a83c..f2fcdcc979e7 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -31,10 +31,12 @@ static void fpu__init_cpu_generic(void)
cr0 |= X86_CR0_EM;
write_cr0(cr0);

+ current->thread.fpu.state = &current->thread.fpu.__default_state;
+
/* Flush out any pending x87 state: */
#ifdef CONFIG_MATH_EMULATION
if (!boot_cpu_has(X86_FEATURE_FPU))
- fpstate_init_soft(&current->thread.fpu.state.soft);
+ fpstate_init_soft(&current->thread.fpu.state->soft);
else
#endif
asm volatile ("fninit");
@@ -170,7 +172,7 @@ static void __init fpu__init_task_struct_size(void)
* you hit a compile error here, check the structure to
* see if something got added to the end.
*/
- CHECK_MEMBER_AT_END_OF(struct fpu, state);
+ CHECK_MEMBER_AT_END_OF(struct fpu, __default_state);
CHECK_MEMBER_AT_END_OF(struct thread_struct, fpu);
CHECK_MEMBER_AT_END_OF(struct task_struct, thread);

diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 6a025fa26a7e..ee27df4caed6 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -37,7 +37,7 @@ int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
fpu__prepare_read(fpu);
fpstate_sanitize_xstate(fpu);

- return membuf_write(&to, &fpu->state.fxsave, sizeof(struct fxregs_state));
+ return membuf_write(&to, &fpu->state->fxsave, sizeof(struct fxregs_state));
}

int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
@@ -54,19 +54,19 @@ int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
fpstate_sanitize_xstate(fpu);

ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
- &fpu->state.fxsave, 0, -1);
+ &fpu->state->fxsave, 0, -1);

/*
* mxcsr reserved bits must be masked to zero for security reasons.
*/
- fpu->state.fxsave.mxcsr &= mxcsr_feature_mask;
+ fpu->state->fxsave.mxcsr &= mxcsr_feature_mask;

/*
* update the header bits in the xsave header, indicating the
* presence of FP and SSE state.
*/
if (boot_cpu_has(X86_FEATURE_XSAVE))
- fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+ fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;

return ret;
}
@@ -80,7 +80,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
if (!boot_cpu_has(X86_FEATURE_XSAVE))
return -ENODEV;

- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;

fpu__prepare_read(fpu);

@@ -120,7 +120,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
if ((pos != 0) || (count < get_xstate_config(XSTATE_USER_SIZE)))
return -EFAULT;

- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;

fpu__prepare_write(fpu);

@@ -224,7 +224,7 @@ static inline u32 twd_fxsr_to_i387(struct fxregs_state *fxsave)
void
convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
{
- struct fxregs_state *fxsave = &tsk->thread.fpu.state.fxsave;
+ struct fxregs_state *fxsave = &tsk->thread.fpu.state->fxsave;
struct _fpreg *to = (struct _fpreg *) &env->st_space[0];
struct _fpxreg *from = (struct _fpxreg *) &fxsave->st_space[0];
int i;
@@ -297,7 +297,7 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
return fpregs_soft_get(target, regset, to);

if (!boot_cpu_has(X86_FEATURE_FXSR)) {
- return membuf_write(&to, &fpu->state.fsave,
+ return membuf_write(&to, &fpu->state->fsave,
sizeof(struct fregs_state));
}

@@ -328,7 +328,7 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,

if (!boot_cpu_has(X86_FEATURE_FXSR))
return user_regset_copyin(&pos, &count, &kbuf, &ubuf,
- &fpu->state.fsave, 0,
+ &fpu->state->fsave, 0,
-1);

if (pos > 0 || count < sizeof(env))
@@ -336,14 +336,14 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,

ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &env, 0, -1);
if (!ret)
- convert_to_fxsr(&target->thread.fpu.state.fxsave, &env);
+ convert_to_fxsr(&target->thread.fpu.state->fxsave, &env);

/*
* update the header bit in the xsave header, indicating the
* presence of FP.
*/
if (boot_cpu_has(X86_FEATURE_XSAVE))
- fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FP;
+ fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FP;
return ret;
}

diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 3a2d8665b9a3..9719241da034 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -58,7 +58,7 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
static inline int save_fsave_header(struct task_struct *tsk, void __user *buf)
{
if (use_fxsr()) {
- struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+ struct xregs_state *xsave = &tsk->thread.fpu.state->xsave;
struct user_i387_ia32_struct env;
struct _fpstate_32 __user *fp = buf;

@@ -216,7 +216,7 @@ sanitize_restored_user_xstate(struct fpu *fpu,
struct user_i387_ia32_struct *ia32_env,
u64 user_xfeatures, int fx_only)
{
- struct xregs_state *xsave = &fpu->state.xsave;
+ struct xregs_state *xsave = &fpu->state->xsave;
struct xstate_header *header = &xsave->header;

if (use_xsave()) {
@@ -253,7 +253,7 @@ sanitize_restored_user_xstate(struct fpu *fpu,
xsave->i387.mxcsr &= mxcsr_feature_mask;

if (ia32_env)
- convert_to_fxsr(&fpu->state.fxsave, ia32_env);
+ convert_to_fxsr(&fpu->state->fxsave, ia32_env);
}
}

@@ -366,7 +366,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
*/
if (test_thread_flag(TIF_NEED_FPU_LOAD) &&
xfeatures_mask_supervisor())
- copy_kernel_to_xregs(&fpu->state.xsave,
+ copy_kernel_to_xregs(&fpu->state->xsave,
xfeatures_mask_supervisor());
fpregs_mark_activate();
fpregs_unlock();
@@ -411,10 +411,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
if (using_compacted_format()) {
ret = copy_user_to_xstate(fpu, buf_fx);
} else {
- ret = __copy_from_user(&fpu->state.xsave, buf_fx, state_size);
+ ret = __copy_from_user(&fpu->state->xsave, buf_fx, state_size);

if (!ret && state_size > offsetof(struct xregs_state, header))
- ret = validate_user_xstate_header(&fpu->state.xsave.header);
+ ret = validate_user_xstate_header(&fpu->state->xsave.header);
}
if (ret)
goto err_out;
@@ -429,11 +429,11 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
* Restore previously saved supervisor xstates along with
* copied-in user xstates.
*/
- ret = copy_kernel_to_xregs_err(&fpu->state.xsave,
+ ret = copy_kernel_to_xregs_err(&fpu->state->xsave,
user_xfeatures | xfeatures_mask_supervisor());

} else if (use_fxsr()) {
- ret = __copy_from_user(&fpu->state.fxsave, buf_fx, state_size);
+ ret = __copy_from_user(&fpu->state->fxsave, buf_fx, state_size);
if (ret) {
ret = -EFAULT;
goto err_out;
@@ -449,14 +449,14 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
}

- ret = copy_kernel_to_fxregs_err(&fpu->state.fxsave);
+ ret = copy_kernel_to_fxregs_err(&fpu->state->fxsave);
} else {
- ret = __copy_from_user(&fpu->state.fsave, buf_fx, state_size);
+ ret = __copy_from_user(&fpu->state->fsave, buf_fx, state_size);
if (ret)
goto err_out;

fpregs_lock();
- ret = copy_kernel_to_fregs_err(&fpu->state.fsave);
+ ret = copy_kernel_to_fregs_err(&fpu->state->fsave);
}
if (!ret)
fpregs_mark_activate();
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 74e0892bd8ec..c3c7c57616eb 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -185,14 +185,14 @@ static bool xfeature_is_supervisor(int xfeature_nr)
*/
void fpstate_sanitize_xstate(struct fpu *fpu)
{
- struct fxregs_state *fx = &fpu->state.fxsave;
+ struct fxregs_state *fx = &fpu->state->fxsave;
int feature_bit;
u64 xfeatures;

if (!use_xsaveopt())
return;

- xfeatures = fpu->state.xsave.header.xfeatures;
+ xfeatures = fpu->state->xsave.header.xfeatures;

/*
* None of the feature bits are in init state. So nothing else
@@ -982,7 +982,7 @@ static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
}

if (fpu)
- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;
else
xsave = &init_fpstate.xsave;

@@ -1025,7 +1025,7 @@ void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
"get of unsupported state");

if (fpu)
- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;
else
xsave = &init_fpstate.xsave;

@@ -1173,7 +1173,7 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
unsigned last = 0;
int i;

- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;

/*
* The destination is a ptrace buffer; we put in only user xstates:
@@ -1238,7 +1238,7 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
if (validate_user_xstate_header(&hdr))
return -EINVAL;

- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;

for (i = 0; i < XFEATURE_MAX; i++) {
u64 mask = ((u64)1 << i);
@@ -1295,7 +1295,7 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
if (validate_user_xstate_header(&hdr))
return -EINVAL;

- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;

for (i = 0; i < XFEATURE_MAX; i++) {
u64 mask = ((u64)1 << i);
@@ -1354,7 +1354,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
max_bit = __fls(xfeatures_mask_supervisor());
min_bit = __ffs(xfeatures_mask_supervisor());

- xstate = &fpu->state.xsave;
+ xstate = &fpu->state->xsave;
lmask = xfeatures_mask_supervisor();
hmask = xfeatures_mask_supervisor() >> 32;
XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
@@ -1541,7 +1541,7 @@ void update_pasid(void)
* update the PASID state in the memory buffer here. The
* PASID MSR will be loaded when returning to user mode.
*/
- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;
xsave->header.xfeatures |= XFEATURE_MASK_PASID;
ppasid_state = get_xsave_addr(fpu, XFEATURE_PASID);
/*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3f78ca2600d3..5252464a27e3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -93,7 +93,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)

void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
{
- *offset = offsetof(struct thread_struct, fpu.state);
+ *offset = offsetof(struct thread_struct, fpu.__default_state);
/* The buffer embedded in thread_struct has the minimum size. */
*size = get_xstate_config(XSTATE_MIN_SIZE);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8839d4dd3fab..4323ee191504 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4567,7 +4567,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,

static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
{
- struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
+ struct xregs_state *xsave = &vcpu->arch.guest_fpu->state->xsave;
u64 xstate_bv = xsave->header.xfeatures;
u64 valid;

@@ -4609,7 +4609,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)

static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
{
- struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
+ struct xregs_state *xsave = &vcpu->arch.guest_fpu->state->xsave;
u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
u64 valid;

@@ -4660,7 +4660,7 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
fill_xsave((u8 *) guest_xsave->region, vcpu);
} else {
memcpy(guest_xsave->region,
- &vcpu->arch.guest_fpu->state.fxsave,
+ &vcpu->arch.guest_fpu->state->fxsave,
sizeof(struct fxregs_state));
*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
XFEATURE_MASK_FPSSE;
@@ -4694,7 +4694,7 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
mxcsr & ~mxcsr_feature_mask)
return -EINVAL;
- memcpy(&vcpu->arch.guest_fpu->state.fxsave,
+ memcpy(&vcpu->arch.guest_fpu->state->fxsave,
guest_xsave->region, sizeof(struct fxregs_state));
}
return 0;
@@ -9612,7 +9612,7 @@ static void kvm_save_current_fpu(struct fpu *fpu)
* always has the minimum size.
*/
if (test_thread_flag(TIF_NEED_FPU_LOAD))
- memcpy(&fpu->state, &current->thread.fpu.state,
+ memcpy(fpu->state, current->thread.fpu.state,
get_xstate_config(XSTATE_MIN_SIZE));
else
copy_fpregs_to_fpstate(fpu);
@@ -9631,7 +9631,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
*/
if (vcpu->arch.guest_fpu)
/* PKRU is separately restored in kvm_x86_ops.run. */
- __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
+ __copy_kernel_to_fpregs(vcpu->arch.guest_fpu->state,
~XFEATURE_MASK_PKRU);

fpregs_mark_activate();
@@ -10169,7 +10169,7 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)

vcpu_load(vcpu);

- fxsave = &vcpu->arch.guest_fpu->state.fxsave;
+ fxsave = &vcpu->arch.guest_fpu->state->fxsave;
memcpy(fpu->fpr, fxsave->st_space, 128);
fpu->fcw = fxsave->cwd;
fpu->fsw = fxsave->swd;
@@ -10192,7 +10192,7 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)

vcpu_load(vcpu);

- fxsave = &vcpu->arch.guest_fpu->state.fxsave;
+ fxsave = &vcpu->arch.guest_fpu->state->fxsave;

memcpy(fxsave->st_space, fpu->fpr, 128);
fxsave->cwd = fpu->fcw;
@@ -10253,7 +10253,7 @@ static void fx_init(struct kvm_vcpu *vcpu)

fpstate_init(vcpu->arch.guest_fpu);
if (boot_cpu_has(X86_FEATURE_XSAVES))
- vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
+ vcpu->arch.guest_fpu->state->xsave.header.xcomp_bv =
host_xcr0 | XSTATE_COMPACTION_ENABLED;

/*
@@ -10333,6 +10333,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
pr_err("kvm: failed to allocate userspace's fpu\n");
goto free_emulate_ctxt;
}
+ vcpu->arch.user_fpu->state = &vcpu->arch.user_fpu->__default_state;

vcpu->arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache,
GFP_KERNEL_ACCOUNT);
@@ -10340,6 +10341,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
pr_err("kvm: failed to allocate vcpu's fpu\n");
goto free_user_fpu;
}
+ vcpu->arch.guest_fpu->state = &vcpu->arch.guest_fpu->__default_state;
fx_init(vcpu);

vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
diff --git a/arch/x86/math-emu/fpu_aux.c b/arch/x86/math-emu/fpu_aux.c
index 034748459482..51432a73024c 100644
--- a/arch/x86/math-emu/fpu_aux.c
+++ b/arch/x86/math-emu/fpu_aux.c
@@ -53,7 +53,7 @@ void fpstate_init_soft(struct swregs_state *soft)

void finit(void)
{
- fpstate_init_soft(&current->thread.fpu.state.soft);
+ fpstate_init_soft(&current->thread.fpu.state->soft);
}

/*
diff --git a/arch/x86/math-emu/fpu_entry.c b/arch/x86/math-emu/fpu_entry.c
index 8679a9d6c47f..6ba56632170e 100644
--- a/arch/x86/math-emu/fpu_entry.c
+++ b/arch/x86/math-emu/fpu_entry.c
@@ -640,7 +640,7 @@ int fpregs_soft_set(struct task_struct *target,
unsigned int pos, unsigned int count,
const void *kbuf, const void __user *ubuf)
{
- struct swregs_state *s387 = &target->thread.fpu.state.soft;
+ struct swregs_state *s387 = &target->thread.fpu.state->soft;
void *space = s387->st_space;
int ret;
int offset, other, i, tags, regnr, tag, newtop;
@@ -691,7 +691,7 @@ int fpregs_soft_get(struct task_struct *target,
const struct user_regset *regset,
struct membuf to)
{
- struct swregs_state *s387 = &target->thread.fpu.state.soft;
+ struct swregs_state *s387 = &target->thread.fpu.state->soft;
const void *space = s387->st_space;
int offset = (S387->ftop & 7) * 10, other = 80 - offset;

diff --git a/arch/x86/math-emu/fpu_system.h b/arch/x86/math-emu/fpu_system.h
index 9b41391867dc..a6291ddfdda6 100644
--- a/arch/x86/math-emu/fpu_system.h
+++ b/arch/x86/math-emu/fpu_system.h
@@ -73,7 +73,7 @@ static inline bool seg_writable(struct desc_struct *d)
return (d->type & SEG_TYPE_EXECUTE_MASK) == SEG_TYPE_WRITABLE;
}

-#define I387 (&current->thread.fpu.state)
+#define I387 (current->thread.fpu.state)
#define FPU_info (I387->soft.info)

#define FPU_CS (*(unsigned short *) &(FPU_info->regs->cs))
--
2.17.1

2021-05-23 19:40:31

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 11/28] x86/fpu/xstate: Update the xstate save function to support dynamic states

Extend copy_xregs_to_kernel() to receive a mask argument of which states to
save, in preparation for dynamic user state handling.

Update KVM to set a valid fpu->state_mask, so it can continue to share the
xstate save path with the core code.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Made the code change more reviewable.

Changes from v2:
* Updated the changelog to clarify the KVM code changes.
---
arch/x86/include/asm/fpu/internal.h | 3 +--
arch/x86/kernel/fpu/core.c | 2 +-
arch/x86/kvm/x86.c | 9 +++++++--
3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index e4afc1831e29..f964f3efc92e 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -317,9 +317,8 @@ static inline void copy_kernel_to_xregs_booting(struct xregs_state *xstate)
/*
* Save processor xstate to xsave area.
*/
-static inline void copy_xregs_to_kernel(struct xregs_state *xstate)
+static inline void copy_xregs_to_kernel(struct xregs_state *xstate, u64 mask)
{
- u64 mask = xfeatures_mask_all;
u32 lmask = mask;
u32 hmask = mask >> 32;
int err;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 2584a2922fea..25c9c7dad3f9 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -98,7 +98,7 @@ EXPORT_SYMBOL(irq_fpu_usable);
int copy_fpregs_to_fpstate(struct fpu *fpu)
{
if (likely(use_xsave())) {
- copy_xregs_to_kernel(&fpu->state->xsave);
+ copy_xregs_to_kernel(&fpu->state->xsave, fpu->state_mask);

/*
* AVX512 state is tracked here because its use is
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4323ee191504..c60c92806883 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9611,11 +9611,16 @@ static void kvm_save_current_fpu(struct fpu *fpu)
* KVM does not support dynamic user states yet. Assume the buffer
* always has the minimum size.
*/
- if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
memcpy(fpu->state, current->thread.fpu.state,
get_xstate_config(XSTATE_MIN_SIZE));
- else
+ } else {
+ struct fpu *src_fpu = &current->thread.fpu;
+
+ if (fpu->state_mask != src_fpu->state_mask)
+ fpu->state_mask = src_fpu->state_mask;
copy_fpregs_to_fpstate(fpu);
+ }
}

/* Swap (qemu) user FPU context for the guest FPU context. */
--
2.17.1

2021-05-23 19:40:43

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 18/28] x86/fpu/xstate: Disable xstate support if an inconsistent state is detected

The kernel sanity-checks the xstate size by calculating it with two different
methods. In the unlikely event that they disagree, disable the use of xstate.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch. (Thomas Gleixner)
---
arch/x86/kernel/fpu/xstate.c | 33 ++++++++++++++++++++++-----------
1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 1cf5888210aa..544e35a9d777 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -730,11 +730,11 @@ static void __xstate_dump_leaves(void)
} while (0)

#define XCHECK_SZ(sz, nr, nr_macro, __struct) do { \
- if ((nr == nr_macro) && \
- WARN_ONCE(sz != sizeof(__struct), \
- "%s: struct is %zu bytes, cpu state %d bytes\n", \
- __stringify(nr_macro), sizeof(__struct), sz)) { \
+ if ((nr == nr_macro) && (sz != sizeof(__struct))) { \
+ pr_err("%s: struct is %zu bytes, cpu state %d bytes\n", \
+ __stringify(nr_macro), sizeof(__struct), sz); \
__xstate_dump_leaves(); \
+ return -EINVAL; \
} \
} while (0)

@@ -743,7 +743,7 @@ static void __xstate_dump_leaves(void)
* that our software representation matches what the CPU
* tells us about the state's size.
*/
-static void check_xstate_against_struct(int nr)
+static int check_xstate_against_struct(int nr)
{
/*
* Ask the CPU for the size of the state.
@@ -771,9 +771,12 @@ static void check_xstate_against_struct(int nr)
(nr >= XFEATURE_MAX) ||
(nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_LBR))) {
- WARN_ONCE(1, "no structure for xstate: %d\n", nr);
+ pr_err("no structure for xstate: %d\n", nr);
XSTATE_WARN_ON(1);
+ return -EINVAL;
}
+
+ return 0;
}

/**
@@ -786,9 +789,9 @@ static void check_xstate_against_struct(int nr)
*
* Dynamic supervisor XSAVE features allocate their own buffers and are not covered by these checks.
*
- * Return: nothing.
+ * Returns 0 on success, -EINVAL on state size mismatch.
*/
-static void calculate_xstate_sizes(void)
+static int calculate_xstate_sizes(void)
{
int paranoid_min_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
int paranoid_max_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
@@ -796,13 +799,17 @@ static void calculate_xstate_sizes(void)

for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
bool user_dynamic;
+ int err;

if (!xfeature_enabled(i))
continue;

user_dynamic = (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? true : false;

- check_xstate_against_struct(i);
+ err = check_xstate_against_struct(i);
+ if (err)
+ return err;
+
/*
* Supervisor state components can be managed only by
* XSAVES, which is compacted-format only.
@@ -837,6 +844,7 @@ static void calculate_xstate_sizes(void)
}
XSTATE_WARN_ON(paranoid_max_size != get_xstate_config(XSTATE_MAX_SIZE));
set_xstate_config(XSTATE_MIN_SIZE, paranoid_min_size);
+ return 0;
}


@@ -923,6 +931,7 @@ static int __init init_xstate_size(void)
/* Recompute the context size for enabled features: */
unsigned int possible_xstate_size;
unsigned int xsave_size;
+ int err;

xsave_size = get_xsave_size();

@@ -939,9 +948,11 @@ static int __init init_xstate_size(void)

/*
* Calculate and double-check the maximum size. Calculate and record
- * the minimum size.
+ * the minimum size. Return any reported error to the caller.
*/
- calculate_xstate_sizes();
+ err = calculate_xstate_sizes();
+ if (err)
+ return err;

/* Ensure the minimum size fits in the statically-allocated buffer: */
if (!is_supported_xstate_size(get_xstate_config(XSTATE_MIN_SIZE)))
--
2.17.1

2021-05-23 19:40:43

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 15/28] x86/arch_prctl: Create ARCH_GET_XSTATE/ARCH_PUT_XSTATE

N.B. This interface is currently under active discussion on LKML. This v1
proposal implements a per-task system call with GET/PUT semantics. A
per-process system call without PUT semantics may be superior.

arch_prctl(ARCH_GET_XSTATE)
Each user task requests access to the hardware features associated
with the specified xstate bits.
Return codes:
0: success (including repeated calls)
EPERM: requested feature is not a user-supported feature.
EINVAL: too many GETs (>INT_MAX) without matching PUTs (reference
counter overflow)
ENOMEM: buffer or reference counter allocation failed.

arch_prctl(ARCH_PUT_XSTATE)
User task tells the kernel that it is no longer using the features
associated with the specified xstate bits.
Return codes:
0: success
EINVAL: imbalance with preceding ARCH_GET_XSTATE calls

The kernel enforces access to the specified features using XFD hardware
support. By default, XFD is armed, and un-authorized access results in a #NM
trap. Upon successful ARCH_GET_XSTATE, XFD is dis-armed and the task is free
to access the feature.

On the last ARCH_PUT_XSTATE, the kernel re-arms XFD. However, as an
optimization, the kernel does not immediately free the xstate buffer.
A future enhancement could be made to identify and free allocated but
un-used xstate buffers under constrained memory conditions.
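
To make the intended calling convention concrete, below is a hypothetical
user-space sketch (not part of this patch). The option values come from the
uapi header added here; the XTILEDATA bit position (state component 18)
follows the enumeration later in this series:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_GET_XSTATE                 0x1021
#define ARCH_PUT_XSTATE                 0x1022
#define XFEATURE_MASK_XTILE_DATA        (1UL << 18)

static long xstate_prctl(int option, unsigned long mask)
{
        return syscall(SYS_arch_prctl, option, mask);
}

int main(void)
{
        /* Request AMX tile data state before executing any TMM instruction. */
        if (xstate_prctl(ARCH_GET_XSTATE, XFEATURE_MASK_XTILE_DATA)) {
                fprintf(stderr, "ARCH_GET_XSTATE: %s\n", strerror(errno));
                return 1;
        }

        /* ... AMX work goes here; XFD is dis-armed for this task ... */

        /* Release the reference; the last PUT re-arms XFD. */
        if (xstate_prctl(ARCH_PUT_XSTATE, XFEATURE_MASK_XTILE_DATA)) {
                fprintf(stderr, "ARCH_PUT_XSTATE: %s\n", strerror(errno));
                return 1;
        }
        return 0;
}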

Rename the third argument for do_arch_prctl_common() to reflect its generic
use.

Free up the dynamically-allocated buffer as well as the reference counters
when the task terminates.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch. (Andy Lutomirski, Thomas Gleixner, and et al)
---
arch/x86/include/asm/fpu/internal.h | 2 +
arch/x86/include/asm/fpu/types.h | 8 +++
arch/x86/include/asm/fpu/xstate.h | 2 +
arch/x86/include/asm/proto.h | 2 +-
arch/x86/include/uapi/asm/prctl.h | 3 +
arch/x86/kernel/fpu/core.c | 14 ++++
arch/x86/kernel/fpu/xstate.c | 103 ++++++++++++++++++++++++++--
arch/x86/kernel/process.c | 13 +++-
8 files changed, 140 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index c250216320df..6460d52991ba 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -636,4 +636,6 @@ static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
update_pasid();
}

+void free_fpu(struct fpu *fpu);
+
#endif /* _ASM_X86_FPU_INTERNAL_H */
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 6fc707c14350..387b96e0b643 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -343,6 +343,14 @@ struct fpu {
*/
u64 state_mask;

+ /*
+ * @refcount:
+ *
+ * Points to an array of reference counters used by arch_prctl() to track
+ * xstate access requests.
+ */
+ int *refcount;
+
/*
* @state:
*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cbb4795d2b45..30f83eb0aef4 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -116,6 +116,8 @@ unsigned int get_xstate_size(u64 mask);
int alloc_xstate_buffer(struct fpu *fpu, u64 mask);
void free_xstate_buffer(struct fpu *fpu);

+long do_arch_prctl_xstate(struct fpu *fpu, int option, unsigned long mask);
+
const void *get_xsave_field_ptr(int xfeature_nr);
int using_compacted_format(void);
int xfeature_size(int xfeature_nr);
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index 8c5d1910a848..feed36d44d04 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -40,6 +40,6 @@ void x86_report_nx(void);
extern int reboot_force;

long do_arch_prctl_common(struct task_struct *task, int option,
- unsigned long cpuid_enabled);
+ unsigned long arg2);

#endif /* _ASM_X86_PROTO_H */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..d3fce82845fd 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,9 @@
#define ARCH_GET_CPUID 0x1011
#define ARCH_SET_CPUID 0x1012

+#define ARCH_GET_XSTATE 0x1021
+#define ARCH_PUT_XSTATE 0x1022
+
#define ARCH_MAP_VDSO_X32 0x2001
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 25c9c7dad3f9..016c3adebec3 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -254,6 +254,8 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)

WARN_ON_FPU(src_fpu != &current->thread.fpu);

+ dst_fpu->refcount = NULL;
+
/*
* The child does not inherit the dynamic states. Thus, use the buffer
* embedded in struct task_struct, which has the minimum size.
@@ -541,3 +543,15 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
*/
return 0;
}
+
+/**
+ * free_fpu() - Free up memory that belongs to the FPU context.
+ * @fpu: A struct fpu * pointer
+ *
+ * Return: Nothing
+ */
+void free_fpu(struct fpu *fpu)
+{
+ kfree(fpu->refcount);
+ free_xstate_buffer(fpu);
+}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index e60a20a1b24b..126c4a509669 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -21,6 +21,7 @@
#include <asm/tlbflush.h>
#include <asm/cpufeature.h>
#include <asm/trace/fpu.h>
+#include <asm/prctl.h>

/*
* Although we spell it out in here, the Processor Trace
@@ -78,6 +79,11 @@ static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFE
* byte boundary. Otherwise, it follows the preceding component immediately.
*/
static bool xstate_aligns[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = false};
+/*
+ * Remember the index number in the reference counter array that supports
+ * access request. '-1' indicates that a state component does not support it.
+ */
+static unsigned int xstate_refcount_idx[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};

/**
* struct fpu_xstate_buffer_config - xstate per-task buffer configuration
@@ -969,8 +975,7 @@ void __init fpu__init_system_xstate(void)
{
unsigned int eax, ebx, ecx, edx;
static int on_boot_cpu __initdata = 1;
- int err;
- int i;
+ int err, i, j;

WARN_ON_FPU(!on_boot_cpu);
on_boot_cpu = 0;
@@ -1025,14 +1030,17 @@ void __init fpu__init_system_xstate(void)
xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
xfeatures_mask_user_dynamic = 0;

- for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ for (i = FIRST_EXTENDED_XFEATURE, j = 0; i < XFEATURE_MAX; i++) {
u64 feature_mask = BIT_ULL(i);

if (!(xfeatures_mask_user() & feature_mask))
continue;

- if (xfd_supported(i))
+ if (xfd_supported(i)) {
xfeatures_mask_user_dynamic |= feature_mask;
+ xstate_refcount_idx[i] = j;
+ j++;
+ }
}

/* Enable xstate instructions to be able to continue with initialization: */
@@ -1339,6 +1347,93 @@ int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
return 0;
}

+/**
+ * do_arch_prctl_xstate() - Handle xstate-related arch_prctl requests.
+ *
+ * @fpu: A struct fpu * pointer
+ * @option: A subfunction of arch_prctl()
+ * @mask: A xstate-component bitmap
+ *
+ * Return: 0 if successful; otherwise, return a relevant error code.
+ */
+long do_arch_prctl_xstate(struct fpu *fpu, int option, unsigned long mask)
+{
+ bool need_xfd_update = false;
+ int i;
+
+ switch (option) {
+ case ARCH_GET_XSTATE: {
+ int err = 0;
+
+ if (mask & ~xfeatures_mask_user())
+ return -EPERM;
+
+ if (!fpu->refcount) {
+ fpu->refcount = kcalloc(hweight64(xfeatures_mask_user_dynamic),
+ sizeof(int), GFP_KERNEL);
+ if (!fpu->refcount)
+ return -ENOMEM;
+ }
+
+ for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ unsigned int idx = xstate_refcount_idx[i];
+
+ if ((idx == -1) || !(BIT_ULL(i) & mask))
+ continue;
+
+ if (fpu->refcount[idx] == INT_MAX)
+ return -EINVAL;
+
+ fpu->refcount[idx]++;
+ }
+
+ if ((mask & fpu->state_mask) == mask)
+ return 0;
+
+ err = alloc_xstate_buffer(fpu, mask);
+ if (!err)
+ need_xfd_update = true;
+ else
+ return err;
+ break;
+ }
+ case ARCH_PUT_XSTATE: {
+ if (mask & ~xfeatures_mask_user())
+ return -EPERM;
+
+ if (!fpu->refcount)
+ return -EINVAL;
+
+ for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ int idx = xstate_refcount_idx[i];
+ u64 feature_mask = BIT_ULL(i);
+
+ if ((idx == -1) || !(feature_mask & mask))
+ continue;
+
+ if (fpu->refcount[idx] <= 0)
+ return -EINVAL;
+
+ fpu->refcount[idx]--;
+ if (!fpu->refcount[idx]) {
+ need_xfd_update = true;
+ fpu->state_mask &= ~(feature_mask);
+ }
+ }
+ break;
+ }
+ default:
+ return -EINVAL;
+ }
+
+ if (need_xfd_update) {
+ u64 fpu_xfd_mask = fpu->state_mask & xfd_capable();
+
+ xfd_write(xfd_capable() ^ fpu_xfd_mask);
+ }
+ return 0;
+}
+
static void fill_gap(struct membuf *to, unsigned *last, unsigned offset)
{
if (*last >= offset)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 5252464a27e3..c166243f64e4 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -98,6 +98,12 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
*size = get_xstate_config(XSTATE_MIN_SIZE);
}

+void arch_release_task_struct(struct task_struct *tsk)
+{
+ if (boot_cpu_has(X86_FEATURE_FPU))
+ free_fpu(&tsk->thread.fpu);
+}
+
/*
* Free thread data structures etc..
*/
@@ -990,13 +996,16 @@ unsigned long get_wchan(struct task_struct *p)
}

long do_arch_prctl_common(struct task_struct *task, int option,
- unsigned long cpuid_enabled)
+ unsigned long arg2)
{
switch (option) {
case ARCH_GET_CPUID:
return get_cpuid_mode();
case ARCH_SET_CPUID:
- return set_cpuid_mode(task, cpuid_enabled);
+ return set_cpuid_mode(task, arg2);
+ case ARCH_GET_XSTATE:
+ case ARCH_PUT_XSTATE:
+ return do_arch_prctl_xstate(&task->thread.fpu, option, arg2);
}

return -EINVAL;
--
2.17.1

2021-05-23 19:40:52

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 20/28] x86/fpu/amx: Define AMX state components and have it used for boot-time checks

Linux uses check_xstate_against_struct() to sanity-check the size of
XSTATE-enabled features. AMX is an XSAVE-enabled feature, and its size is
not hard-coded but discoverable at run-time via CPUID.

The AMX state is composed of state components 17 and 18, both of which are
user state components. Component 17 is the XTILECFG state, a 64-byte
tile-related control register. State component 18, called XTILEDATA,
contains the actual tile data, and its size varies between implementations.
The architectural maximum, as defined in CPUID(0x1d, 1): EAX[15:0], is one
byte less than 64KB. The first implementation supports 8KB.

Check the XTILEDATA state size dynamically. The feature introduces the new
TMM tile registers. Define a struct for a single register only and read the
number of registers from CPUID, then cross-check the overall size with CPUID
again.
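
For reference, the size check added below can be reproduced from user space
roughly as in this sketch (illustrative only, not kernel code; it assumes a
GCC/Clang environment providing <cpuid.h> and a CPU implementing leaf 0x1d):

#include <stdio.h>
#include <cpuid.h>

#define TILE_CPUID      0x1d    /* AMX tile information leaf */

int main(void)
{
        unsigned int eax, ebx, ecx, edx;
        unsigned int max_palid, palid;
        unsigned int bytes_per_tile = 0, max_tiles = 0;

        /* Sub-leaf 0: EAX reports the highest numbered palette. */
        if (!__get_cpuid_count(TILE_CPUID, 0, &max_palid, &ebx, &ecx, &edx))
                return 1;

        for (palid = 1; palid <= max_palid; palid++) {
                __get_cpuid_count(TILE_CPUID, palid, &eax, &ebx, &ecx, &edx);
                bytes_per_tile = eax >> 16;     /* EAX[31:16]: bytes per tile */
                if ((ebx >> 16) > max_tiles)    /* EBX[31:16]: max tile registers */
                        max_tiles = ebx >> 16;
        }

        printf("XTILEDATA: %u tiles x %u bytes = %u bytes\n",
               max_tiles, bytes_per_tile, max_tiles * bytes_per_tile);
        return 0;
}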

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Changed to return an error when tile data size mismatches. (Thomas Gleixner)
* Updated the function description and code comments.

Changes from v2:
* Updated the code comments.

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
arch/x86/include/asm/fpu/types.h | 27 +++++++++++
arch/x86/include/asm/fpu/xstate.h | 2 +
arch/x86/kernel/fpu/xstate.c | 74 +++++++++++++++++++++++++++++++
3 files changed, 103 insertions(+)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 387b96e0b643..4e620e006787 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -120,6 +120,9 @@ enum xfeature {
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
+ XFEATURE_RSRVD_COMP_16,
+ XFEATURE_XTILE_CFG,
+ XFEATURE_XTILE_DATA,

XFEATURE_MAX,
};
@@ -136,11 +139,15 @@ enum xfeature {
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
+#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
+#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)

#define XFEATURE_MASK_FPSSE (XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
#define XFEATURE_MASK_AVX512 (XFEATURE_MASK_OPMASK \
| XFEATURE_MASK_ZMM_Hi256 \
| XFEATURE_MASK_Hi16_ZMM)
+#define XFEATURE_MASK_XTILE (XFEATURE_MASK_XTILE_DATA \
+ | XFEATURE_MASK_XTILE_CFG)

#define FIRST_EXTENDED_XFEATURE XFEATURE_YMM

@@ -153,6 +160,9 @@ struct reg_256_bit {
struct reg_512_bit {
u8 regbytes[512/8];
};
+struct reg_1024_byte {
+ u8 regbytes[1024];
+};

/*
* State component 2:
@@ -255,6 +265,23 @@ struct arch_lbr_state {
u64 ler_to;
u64 ler_info;
struct lbr_entry entries[];
+};
+
+/*
+ * State component 17: 64-byte tile configuration register.
+ */
+struct xtile_cfg {
+ u64 tcfg[8];
+} __packed;
+
+/*
+ * State component 18: 1KB tile data register.
+ * Each register represents 16 64-byte rows of the matrix
+ * data. But the number of registers depends on the actual
+ * implementation.
+ */
+struct xtile_data {
+ struct reg_1024_byte tmm;
} __packed;

/*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 30f83eb0aef4..2c9156e4f799 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -13,6 +13,8 @@

#define XSTATE_CPUID 0x0000000d

+#define TILE_CPUID 0x0000001d
+
#define FXSAVE_SIZE 512

#define XSAVE_HDR_SIZE 64
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 544e35a9d777..0d10a9ec29da 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -42,6 +42,14 @@ static const char *xfeature_names[] =
"Protection Keys User registers",
"PASID state",
"unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "AMX Tile config" ,
+ "AMX Tile data" ,
+ "unknown xstate feature" ,
};

static unsigned short xsave_cpuid_features[] __initdata = {
@@ -55,6 +63,8 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_PKU,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
+ [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};

/*
@@ -479,6 +489,8 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
+ print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
+ print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}

/*
@@ -738,6 +750,63 @@ static void __xstate_dump_leaves(void)
} \
} while (0)

+/**
+ * check_xtile_data_against_struct() - Do a sanity check for tile data state size.
+ *
+ * Calculate the state size by multiplying the size of a single tile, which is recorded in a C struct,
+ * by the number of tiles that the CPU reports. Compare the provided size with the calculation.
+ *
+ * @size: The tile data state size
+ *
+ * Returns: 0 on success, -EINVAL on mismatch.
+ */
+static int check_xtile_data_against_struct(int size)
+{
+ u32 max_palid, palid, state_size;
+ u32 eax, ebx, ecx, edx;
+ u16 max_tile;
+
+ /*
+ * Check the maximum palette id:
+ * eax: the highest numbered palette subleaf.
+ */
+ cpuid_count(TILE_CPUID, 0, &max_palid, &ebx, &ecx, &edx);
+
+ /* Cross-check each tile size and find the maximum number of supported tiles. */
+ for (palid = 1, max_tile = 0; palid <= max_palid; palid++) {
+ u16 tile_size, max;
+
+ /*
+ * Check the tile size info:
+ * eax[31:16]: bytes per tile
+ * ebx[31:16]: the max names (or max number of tiles)
+ */
+ cpuid_count(TILE_CPUID, palid, &eax, &ebx, &ecx, &edx);
+ tile_size = eax >> 16;
+ max = ebx >> 16;
+
+ if (tile_size != sizeof(struct xtile_data)) {
+ pr_err("%s: struct is %zu bytes, cpu xtile %d bytes\n",
+ __stringify(XFEATURE_XTILE_DATA),
+ sizeof(struct xtile_data), tile_size);
+ __xstate_dump_leaves();
+ return -EINVAL;
+ }
+
+ if (max > max_tile)
+ max_tile = max;
+ }
+
+ state_size = sizeof(struct xtile_data) * max_tile;
+ if (size != state_size) {
+ pr_err("%s: calculated size is %u bytes, cpu state %d bytes\n",
+ __stringify(XFEATURE_XTILE_DATA), state_size, size);
+ __xstate_dump_leaves();
+ return -EINVAL;
+ }
+ return 0;
+}
+
/*
* We have a C struct for each 'xstate'. We need to ensure
* that our software representation matches what the CPU
@@ -761,6 +830,11 @@ static int check_xstate_against_struct(int nr)
XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
+ XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
+
+ /* The tile data size varies between implementations. */
+ if (nr == XFEATURE_XTILE_DATA)
+ return check_xtile_data_against_struct(sz);

/*
* Make *SURE* to add any feature numbers in below if
--
2.17.1

2021-05-23 19:40:56

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 17/28] x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in state component numbers

At compile-time xfeatures_mask_all includes all possible XCR0 features. At
run-time fpu__init_system_xstate() clears features in xfeatures_mask_all
that are not enabled in CPUID. It does this by looping through all possible
XCR0 features.

Update the code to handle the possibility that there will be gaps in the
XCR0 feature bit numbers.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Simplified the implementation. (Thomas Gleixner)
* Updated the patch title accordingly.

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
arch/x86/kernel/fpu/xstate.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 126c4a509669..1cf5888210aa 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -44,18 +44,17 @@ static const char *xfeature_names[] =
"unknown xstate feature" ,
};

-static short xsave_cpuid_features[] __initdata = {
- X86_FEATURE_FPU,
- X86_FEATURE_XMM,
- X86_FEATURE_AVX,
- X86_FEATURE_MPX,
- X86_FEATURE_MPX,
- X86_FEATURE_AVX512F,
- X86_FEATURE_AVX512F,
- X86_FEATURE_AVX512F,
- X86_FEATURE_INTEL_PT,
- X86_FEATURE_PKU,
- X86_FEATURE_ENQCMD,
+static unsigned short xsave_cpuid_features[] __initdata = {
+ [XFEATURE_SSE] = X86_FEATURE_XMM,
+ [XFEATURE_YMM] = X86_FEATURE_AVX,
+ [XFEATURE_BNDREGS] = X86_FEATURE_MPX,
+ [XFEATURE_BNDCSR] = X86_FEATURE_MPX,
+ [XFEATURE_OPMASK] = X86_FEATURE_AVX512F,
+ [XFEATURE_ZMM_Hi256] = X86_FEATURE_AVX512F,
+ [XFEATURE_Hi16_ZMM] = X86_FEATURE_AVX512F,
+ [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
+ [XFEATURE_PKRU] = X86_FEATURE_PKU,
+ [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
};

/*
@@ -1022,8 +1021,13 @@ void __init fpu__init_system_xstate(void)
/*
* Clear XSAVE features that are disabled in the normal CPUID.
*/
- for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
- if (!boot_cpu_has(xsave_cpuid_features[i]))
+
+ /* Handle the first component separately as its value is zero. */
+ if (!boot_cpu_has(X86_FEATURE_FPU))
+ xfeatures_mask_all &= ~BIT_ULL(XFEATURE_FP);
+
+ for (i = 1; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
+ if (xsave_cpuid_features[i] && !boot_cpu_has(xsave_cpuid_features[i]))
xfeatures_mask_all &= ~BIT_ULL(i);
}

--
2.17.1

2021-05-23 19:40:57

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 25/28] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state

By default, for xstate features in the INIT-state, XSAVE writes zeros to
the uncompressed destination buffer.

E.g., if you are not using AVX-512, you will still get a bunch of zeros on
the signal stack where live AVX-512 data would go.

For 'dynamic user state' (currently only XTILEDATA), explicitly skip this
data transfer. The result is that the user buffer for the AMX region will
not be touched by XSAVE.
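
The optimization keys off XGETBV with ECX=1 (the XINUSE bitmap), which reports
whether a component is out of its INIT-state. A minimal user-space sketch of
that query, assuming the CPU supports the XGETBV1 sub-function and using bit
18 for XTILEDATA as enumerated earlier in this series:

#include <stdio.h>
#include <stdint.h>

/* XGETBV with ECX=1 returns XINUSE: a bitmap of non-INIT state components. */
static uint64_t xgetbv1(void)
{
        uint32_t eax, edx;

        __asm__ volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (1));
        return ((uint64_t)edx << 32) | eax;
}

int main(void)
{
        uint64_t xinuse = xgetbv1();

        /* Bit 18 is XTILEDATA; when clear, the component is in INIT-state
         * and XSAVE would only write zeros for that region. */
        printf("XTILEDATA in use: %s\n",
               (xinuse & (1ULL << 18)) ? "yes" : "no");
        return 0;
}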

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch.
---
arch/x86/include/asm/fpu/internal.h | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 4a3436684805..131f2557fc85 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -354,11 +354,27 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
*/
static inline int copy_xregs_to_user(struct xregs_state __user *buf)
{
- u64 mask = current->thread.fpu.state_mask;
- u32 lmask = mask;
- u32 hmask = mask >> 32;
+ u64 state_mask = current->thread.fpu.state_mask;
+ u64 dynamic_state_mask;
+ u32 lmask, hmask;
int err;

+ dynamic_state_mask = state_mask & xfeatures_mask_user_dynamic;
+ if (dynamic_state_mask && boot_cpu_has(X86_FEATURE_XGETBV1)) {
+ u64 dynamic_xinuse, dynamic_init;
+ u64 xinuse = xgetbv(1);
+
+ dynamic_xinuse = xinuse & dynamic_state_mask;
+ dynamic_init = ~(xinuse) & dynamic_state_mask;
+ if (dynamic_init) {
+ state_mask &= ~xfeatures_mask_user_dynamic;
+ state_mask |= dynamic_xinuse;
+ }
+ }
+
+ lmask = state_mask;
+ hmask = state_mask >> 32;
+
/*
* Clear the xsave header first, so that reserved fields are
* initialized to zero.
--
2.17.1

2021-05-23 19:40:57

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 04/28] x86/fpu/xstate: Modify the context restore helper to handle both static and dynamic buffers

Have the function restoring xstate take a struct fpu * pointer in
preparation for dynamic state buffer support.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Reverted the change of the copy_kernel_to_xregs_err() function as not
needed.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
arch/x86/include/asm/fpu/internal.h | 6 ++++--
arch/x86/kernel/fpu/core.c | 4 ++--
arch/x86/kvm/x86.c | 2 +-
3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 0153c4d4ca77..b34d0d29e4b8 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -425,8 +425,10 @@ static inline void __copy_kernel_to_fpregs(union fpregs_state *fpstate, u64 mask
}
}

-static inline void copy_kernel_to_fpregs(union fpregs_state *fpstate)
+static inline void copy_kernel_to_fpregs(struct fpu *fpu)
{
+ union fpregs_state *fpstate = &fpu->state;
+
/*
* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
* pending. Clear the x87 state here by setting it to fixed values.
@@ -511,7 +513,7 @@ static inline void __fpregs_load_activate(void)
return;

if (!fpregs_state_valid(fpu, cpu)) {
- copy_kernel_to_fpregs(&fpu->state);
+ copy_kernel_to_fpregs(fpu);
fpregs_activate(fpu);
fpu->last_cpu = cpu;
}
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 52f886477b07..830b1bc75ed4 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -173,7 +173,7 @@ void fpu__save(struct fpu *fpu)

if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
if (!copy_fpregs_to_fpstate(fpu)) {
- copy_kernel_to_fpregs(&fpu->state);
+ copy_kernel_to_fpregs(fpu);
}
}

@@ -258,7 +258,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_size);

else if (!copy_fpregs_to_fpstate(dst_fpu))
- copy_kernel_to_fpregs(&dst_fpu->state);
+ copy_kernel_to_fpregs(dst_fpu);

fpregs_unlock();

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 01c8642cb56c..b428ff496bca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9649,7 +9649,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_fpu)
kvm_save_current_fpu(vcpu->arch.guest_fpu);

- copy_kernel_to_fpregs(&vcpu->arch.user_fpu->state);
+ copy_kernel_to_fpregs(vcpu->arch.user_fpu);

fpregs_mark_activate();
fpregs_unlock();
--
2.17.1

2021-05-23 19:40:57

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 19/28] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits

Intel's Advanced Matrix Extensions (AMX) is a new 64-bit extended feature
consisting of two-dimensional registers and an accelerator unit. The first
implementation of the latter is the tile matrix multiply unit (TMUL). TMUL
performs SIMD dot-products on groups of four bytes (INT8) or pairs of
bfloat16 floating-point (BF16) elements.

Enumerate these hardware capabilities so that they show up as 'amx_tile',
'amx_bf16', and 'amx_int8' in /proc/cpuinfo.
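
These flags map to CPUID leaf 7, sub-leaf 0, EDX bits 22, 24, and 25 (the
18*32+n positions above). A minimal user-space sketch to query them directly,
assuming GCC/Clang's <cpuid.h>:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 7, sub-leaf 0: EDX carries the AMX feature bits. */
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                return 1;

        printf("amx_bf16: %u\n", (edx >> 22) & 1);
        printf("amx_tile: %u\n", (edx >> 24) & 1);
        printf("amx_int8: %u\n", (edx >> 25) & 1);
        return 0;
}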

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Massaged the changelog a bit.
---
arch/x86/include/asm/cpufeatures.h | 3 +++
arch/x86/kernel/cpu/cpuid-deps.c | 3 +++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 67a9e5282128..a4fb3ca76929 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -385,7 +385,10 @@
#define X86_FEATURE_TSXLDTRK (18*32+16) /* TSX Suspend Load Address Tracking */
#define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
#define X86_FEATURE_ARCH_LBR (18*32+19) /* Intel ARCH LBR */
+#define X86_FEATURE_AMX_BF16 (18*32+22) /* AMX BF16 Support */
#define X86_FEATURE_AVX512_FP16 (18*32+23) /* AVX512 FP16 */
+#define X86_FEATURE_AMX_TILE (18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8 (18*32+25) /* AMX INT8 Support */
#define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
#define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 7f891d2eb52e..9a520ab259ac 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -76,6 +76,9 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_SGX1, X86_FEATURE_SGX },
{ X86_FEATURE_SGX2, X86_FEATURE_SGX1 },
{ X86_FEATURE_XFD, X86_FEATURE_XSAVE },
+ { X86_FEATURE_AMX_TILE, X86_FEATURE_XSAVE },
+ { X86_FEATURE_AMX_INT8, X86_FEATURE_AMX_TILE },
+ { X86_FEATURE_AMX_BF16, X86_FEATURE_AMX_TILE },
{}
};

--
2.17.1

2021-05-23 19:41:04

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 21/28] x86/fpu/amx: Initialize child's AMX state

Ensure that a forked child starts with its AMX registers in the INIT-state.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch. This was missing on previous versions.
---
arch/x86/kernel/fpu/core.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 016c3adebec3..cccfeafe81e5 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -285,6 +285,10 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)

fpregs_unlock();

+ /* AMX state is volatile, children do not inherit it. */
+ if (xfeatures_mask_all & XFEATURE_MASK_XTILE)
+ dst_fpu->state->xsave.header.xfeatures &= ~(XFEATURE_MASK_XTILE);
+
set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);

trace_x86_fpu_copy_src(src_fpu);
--
2.17.1

2021-05-23 19:41:09

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 13/28] x86/fpu/xstate: Update the xstate context copy function to support dynamic states

ptrace() and signal return paths use xstate context copy functions. They
allow callers to read (or write) xstate values in the target's buffer. With
dynamic user states, a component's position in the buffer may vary and the
initial value is not always stored in init_fpstate.

Change the helpers to find a component's offset accordingly.

When copying an initial value, explicitly check whether the component is
covered by init_fpstate. If it is not, zero the destination memory.
Otherwise, copy the value from init_fpstate.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Cleaned up the code change with more comments.
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
arch/x86/kernel/fpu/xstate.c | 69 ++++++++++++++++++++++++++++--------
1 file changed, 55 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9d3cd9775b76..299373669a5d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -301,7 +301,7 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
* in a special way already:
*/
feature_bit = 0x2;
- xfeatures = (xfeatures_mask_user() & ~xfeatures) >> 2;
+ xfeatures = (xfeatures_mask_user() & fpu->state_mask & ~xfeatures) >> feature_bit;

/*
* Update all the remaining memory layouts according to their
@@ -310,12 +310,19 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
*/
while (xfeatures) {
if (xfeatures & 0x1) {
- int offset = xstate_comp_offsets[feature_bit];
+ int offset = get_xstate_comp_offset(fpu->state_mask, feature_bit);
int size = xstate_sizes[feature_bit];

- memcpy((void *)fx + offset,
- (void *)&init_fpstate.xsave + offset,
- size);
+ /*
+ * init_fpstate does not include the dynamic user states;
+ * their initial values are all zeros.
+ */
+ if (xfeatures_mask_user_dynamic & BIT_ULL(feature_bit))
+ memset((void *)fx + offset, 0, size);
+ else
+ memcpy((void *)fx + offset,
+ (void *)&init_fpstate.xsave + offset,
+ size);
}

xfeatures >>= 1;
@@ -1297,15 +1304,31 @@ static void fill_gap(struct membuf *to, unsigned *last, unsigned offset)
{
if (*last >= offset)
return;
- membuf_write(to, (void *)&init_fpstate.xsave + *last, offset - *last);
+
+ /*
+ * Copy initial data.
+ *
+ * The init_fpstate buffer has the minimum size and excludes the dynamic user
+ * states, whose initial values are all zeros.
+ */
+ if (offset <= get_xstate_config(XSTATE_MIN_SIZE))
+ membuf_write(to, (void *)&init_fpstate.xsave + *last, offset - *last);
+ else
+ membuf_zero(to, offset - *last);
*last = offset;
}

+/*
+ * @from: If NULL, copy zeros.
+ */
static void copy_part(struct membuf *to, unsigned *last, unsigned offset,
unsigned size, void *from)
{
fill_gap(to, last, offset);
- membuf_write(to, from, size);
+ if (from)
+ membuf_write(to, from, size);
+ else
+ membuf_zero(to, size);
*last = offset + size;
}

@@ -1357,15 +1380,27 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
sizeof(header), &header);

for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ u64 mask = BIT_ULL(i);
+ void *src;
+
+ if (!(xfeatures_mask_user() & mask))
+ continue;
+
/*
- * Copy only in-use xstates:
+ * Copy states if used. Otherwise, copy the initial data.
*/
- if ((header.xfeatures >> i) & 1) {
- void *src = __raw_xsave_addr(fpu, i);

- copy_part(&to, &last, xstate_offsets[i],
- xstate_sizes[i], src);
- }
+ if (header.xfeatures & mask)
+ src = __raw_xsave_addr(fpu, i);
+ else
+ /*
+ * The init_fpstate buffer does not include the dynamic
+ * user states; their initial values are all zeros.
+ */
+ src = (xfeatures_mask_user_dynamic & mask) ?
+ NULL : (void *)&init_fpstate.xsave + last;
+
+ copy_part(&to, &last, xstate_offsets[i], xstate_sizes[i], src);

}
fill_gap(&to, &last, size);
@@ -1398,6 +1433,9 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
if (hdr.xfeatures & mask) {
void *dst = __raw_xsave_addr(fpu, i);

+ if (!dst)
+ continue;
+
offset = xstate_offsets[i];
size = xstate_sizes[i];

@@ -1455,6 +1493,9 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
if (hdr.xfeatures & mask) {
void *dst = __raw_xsave_addr(fpu, i);

+ if (!dst)
+ continue;
+
offset = xstate_offsets[i];
size = xstate_sizes[i];

@@ -1535,7 +1576,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
continue;

/* Move xfeature 'i' into its normal location */
- memmove(xbuf + xstate_comp_offsets[i],
+ memmove(xbuf + get_xstate_comp_offset(fpu->state_mask, i),
xbuf + xstate_supervisor_only_offsets[i],
xstate_sizes[i]);
}
--
2.17.1

2021-05-23 19:41:12

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 16/28] x86/fpu/xstate: Support ptracer-induced xstate buffer expansion

ptrace() may update xstate data before the target task has taken an XFD
fault and expanded the xstate buffer. Detect this case and allocate a
sufficient buffer to support the request. Also, disable the (now
unnecessary) associated first-use fault.
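
For context, the case being handled is a tracer injecting tile data into a
task that has never touched AMX. A hedged, tracer-side sketch (not part of
the patch; the buffer geometry is assumed to come from CPUID leaf 0xD, as in
the selftest later in this series):

#include <elf.h>		/* NT_X86_XSTATE */
#include <stdint.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

#define XSAVE_HDR_OFFSET	512	/* XSTATE_BV sits at byte 512 of an XSAVE area */
#define XFEATURE_XTILEDATA	18

static long inject_tiledata(pid_t pid, void *xsave_buf, size_t xsave_size,
                            size_t tiledata_offset, size_t tiledata_size)
{
        struct iovec iov = { .iov_base = xsave_buf, .iov_len = xsave_size };

        /* Mark XTILEDATA as present and give it some content. */
        *(uint64_t *)(xsave_buf + XSAVE_HDR_OFFSET) |= 1ULL << XFEATURE_XTILEDATA;
        memset(xsave_buf + tiledata_offset, 0xaa, tiledata_size);

        /* The kernel must expand the stopped target's xstate buffer to take this. */
        return ptrace(PTRACE_SETREGSET, pid, (void *)(uintptr_t)NT_X86_XSTATE, &iov);
}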

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Improved the condition check for the expansion.
* Simplified the XSTATE_BV retrieval.
* Updated the code comment.

Changes from v3:
* Removed 'no functional changes' in the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Updated the code comments.
---
arch/x86/kernel/fpu/regset.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index ee27df4caed6..b2fac4b5e483 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -122,6 +122,38 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,

xsave = &fpu->state->xsave;

+ /*
+ * A ptracer may write dynamic user state for which the target's buffer has
+ * not been allocated yet. In that case, expand the buffer on its behalf.
+ *
+ * Check whether the expansion is needed.
+ */
+ if (xfeatures_mask_user_dynamic &&
+ ((fpu->state_mask & xfeatures_mask_user_dynamic) != xfeatures_mask_user_dynamic)) {
+ unsigned int offset, size;
+ u64 state_mask;
+
+ offset = offsetof(struct xregs_state, header);
+ size = sizeof(u64);
+
+ /* Retrieve XSTATE_BV. */
+ if (kbuf) {
+ memcpy(&state_mask, kbuf + offset, size);
+ } else {
+ ret = __copy_from_user(&state_mask, ubuf + offset, size);
+ if (ret)
+ return ret;
+ }
+
+ /* Expand the xstate buffer based on the XSTATE_BV. */
+ state_mask &= xfeatures_mask_user_dynamic;
+ if (state_mask) {
+ ret = alloc_xstate_buffer(fpu, state_mask);
+ if (ret)
+ return ret;
+ }
+ }
+
fpu__prepare_write(fpu);

if (using_compacted_format()) {
--
2.17.1

2021-05-23 19:41:19

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 06/28] x86/fpu/xstate: Add new variables to indicate dynamic xstate buffer size

The per-task xstate buffer is being prepared to become dynamic for user
states. Introduce new size variables to indicate the minimum and maximum
size of the buffer. The values are determined at boot time.

Instead of exporting the new variables directly, introduce helper functions
to access them as well as the user buffer size.

No functional change. The sizes are all the same for now, as the buffer is
not dynamic yet.
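
For reference, the intended use of the three sizes is roughly as follows
(illustrative only, not part of the patch):

static void report_xstate_sizes(void)
{
        /* Size of the buffer embedded in task_struct (default states only): */
        pr_info("min:  %u\n", get_xstate_config(XSTATE_MIN_SIZE));
        /* Size of a fully expanded per-task buffer (all enabled states): */
        pr_info("max:  %u\n", get_xstate_config(XSTATE_MAX_SIZE));
        /* Size of the standard-format buffer for signal and ptrace frames: */
        pr_info("user: %u\n", get_xstate_config(XSTATE_USER_SIZE));
}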

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Added as a new patch to add the variables along with new helpers.
(Borislav Petkov)
---
arch/x86/include/asm/fpu/xstate.h | 9 ++++
arch/x86/include/asm/processor.h | 10 +---
arch/x86/kernel/fpu/core.c | 24 +++++++---
arch/x86/kernel/fpu/init.c | 26 ++++-------
arch/x86/kernel/fpu/regset.c | 4 +-
arch/x86/kernel/fpu/signal.c | 27 ++++++-----
arch/x86/kernel/fpu/xstate.c | 78 ++++++++++++++++++++++++-------
arch/x86/kernel/process.c | 7 +++
arch/x86/kvm/x86.c | 5 +-
9 files changed, 129 insertions(+), 61 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 6ce8350672c2..1fba2ca15874 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -102,6 +102,15 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
extern void __init update_regset_xstate_info(unsigned int size,
u64 xstate_mask);

+enum xstate_config {
+ XSTATE_MIN_SIZE,
+ XSTATE_MAX_SIZE,
+ XSTATE_USER_SIZE
+};
+
+extern unsigned int get_xstate_config(enum xstate_config cfg);
+void set_xstate_config(enum xstate_config cfg, unsigned int value);
+
void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
const void *get_xsave_field_ptr(int xfeature_nr);
int using_compacted_format(void);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 556b2b17c3e2..028337c454b5 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -459,9 +459,6 @@ DECLARE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
DECLARE_PER_CPU(struct irq_stack *, softirq_stack_ptr);
#endif /* !X86_64 */

-extern unsigned int fpu_kernel_xstate_size;
-extern unsigned int fpu_user_xstate_size;
-
struct perf_event;

struct thread_struct {
@@ -527,12 +524,7 @@ struct thread_struct {
};

/* Whitelist the FPU state from the task_struct for hardened usercopy. */
-static inline void arch_thread_struct_whitelist(unsigned long *offset,
- unsigned long *size)
-{
- *offset = offsetof(struct thread_struct, fpu.state);
- *size = fpu_kernel_xstate_size;
-}
+extern void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size);

static inline void
native_load_sp0(unsigned long sp0)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 830b1bc75ed4..d8c6f284adb0 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -205,21 +205,30 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
void fpstate_init(struct fpu *fpu)
{
union fpregs_state *state;
+ unsigned int size;
+ u64 mask;

- if (likely(fpu))
+ if (likely(fpu)) {
state = &fpu->state;
- else
+ /* The dynamic user states are not prepared yet. */
+ mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
+ size = get_xstate_config(XSTATE_MIN_SIZE);
+ } else {
state = &init_fpstate;
+ mask = xfeatures_mask_all;
+ size = get_xstate_config(XSTATE_MAX_SIZE);
+ }

if (!static_cpu_has(X86_FEATURE_FPU)) {
fpstate_init_soft(&state->soft);
return;
}

- memset(state, 0, fpu_kernel_xstate_size);
+ memset(state, 0, size);

if (static_cpu_has(X86_FEATURE_XSAVES))
- fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
+ fpstate_init_xstate(&state->xsave, mask);
+
if (static_cpu_has(X86_FEATURE_FXSR))
fpstate_init_fxstate(&state->fxsave);
else
@@ -242,8 +251,11 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
/*
* Don't let 'init optimized' areas of the XSAVE area
* leak into the child task:
+ *
+ * The child does not inherit the dynamic states. So,
+ * the xstate buffer has the minimum size.
*/
- memset(&dst_fpu->state.xsave, 0, fpu_kernel_xstate_size);
+ memset(&dst_fpu->state.xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));

/*
* If the FPU registers are not current just memcpy() the state.
@@ -255,7 +267,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
*/
fpregs_lock();
if (test_thread_flag(TIF_NEED_FPU_LOAD))
- memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_size);
+ memcpy(&dst_fpu->state, &src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));

else if (!copy_fpregs_to_fpstate(dst_fpu))
copy_kernel_to_fpregs(dst_fpu);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 74e03e3bc20f..f63765b7a83c 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -129,15 +129,6 @@ static void __init fpu__init_system_generic(void)
fpu__init_system_mxcsr();
}

-/*
- * Size of the FPU context state. All tasks in the system use the
- * same context size, regardless of what portion they use.
- * This is inherent to the XSAVE architecture which puts all state
- * components into a single, continuous memory block:
- */
-unsigned int fpu_kernel_xstate_size;
-EXPORT_SYMBOL_GPL(fpu_kernel_xstate_size);
-
/* Get alignment of the TYPE. */
#define TYPE_ALIGN(TYPE) offsetof(struct { char x; TYPE test; }, test)

@@ -167,8 +158,10 @@ static void __init fpu__init_task_struct_size(void)
/*
* Add back the dynamically-calculated register state
* size.
+ *
+ * Use the minimum size for the buffer embedded in task_struct.
*/
- task_size += fpu_kernel_xstate_size;
+ task_size += get_xstate_config(XSTATE_MIN_SIZE);

/*
* We dynamically size 'struct fpu', so we require that
@@ -193,6 +186,7 @@ static void __init fpu__init_task_struct_size(void)
static void __init fpu__init_system_xstate_size_legacy(void)
{
static int on_boot_cpu __initdata = 1;
+ unsigned int size;

WARN_ON_FPU(!on_boot_cpu);
on_boot_cpu = 0;
@@ -203,17 +197,17 @@ static void __init fpu__init_system_xstate_size_legacy(void)
*/

if (!boot_cpu_has(X86_FEATURE_FPU)) {
- fpu_kernel_xstate_size = sizeof(struct swregs_state);
+ size = sizeof(struct swregs_state);
} else {
if (boot_cpu_has(X86_FEATURE_FXSR))
- fpu_kernel_xstate_size =
- sizeof(struct fxregs_state);
+ size = sizeof(struct fxregs_state);
else
- fpu_kernel_xstate_size =
- sizeof(struct fregs_state);
+ size = sizeof(struct fregs_state);
}

- fpu_user_xstate_size = fpu_kernel_xstate_size;
+ set_xstate_config(XSTATE_MIN_SIZE, size);
+ set_xstate_config(XSTATE_MAX_SIZE, size);
+ set_xstate_config(XSTATE_USER_SIZE, size);
}

/*
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 5e13e58d11d4..6a025fa26a7e 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -99,7 +99,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
/*
* Copy the xstate memory layout.
*/
- return membuf_write(&to, xsave, fpu_user_xstate_size);
+ return membuf_write(&to, xsave, get_xstate_config(XSTATE_USER_SIZE));
}
}

@@ -117,7 +117,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
/*
* A whole standard-format XSAVE buffer is needed:
*/
- if ((pos != 0) || (count < fpu_user_xstate_size))
+ if ((pos != 0) || (count < get_xstate_config(XSTATE_USER_SIZE)))
return -EFAULT;

xsave = &fpu->state.xsave;
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 0d6deb75c507..3a2d8665b9a3 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -35,7 +35,7 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
/* Check for the first magic field and other error scenarios. */
if (fx_sw->magic1 != FP_XSTATE_MAGIC1 ||
fx_sw->xstate_size < min_xstate_size ||
- fx_sw->xstate_size > fpu_user_xstate_size ||
+ fx_sw->xstate_size > get_xstate_config(XSTATE_USER_SIZE) ||
fx_sw->xstate_size > fx_sw->extended_size)
return -1;

@@ -98,7 +98,7 @@ static inline int save_xstate_epilog(void __user *buf, int ia32_frame)
return err;

err |= __put_user(FP_XSTATE_MAGIC2,
- (__u32 __user *)(buf + fpu_user_xstate_size));
+ (__u32 __user *)(buf + get_xstate_config(XSTATE_USER_SIZE)));

/*
* Read the xfeatures which we copied (directly from the cpu or
@@ -135,7 +135,7 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
else
err = copy_fregs_to_user((struct fregs_state __user *) buf);

- if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
+ if (unlikely(err) && __clear_user(buf, get_xstate_config(XSTATE_USER_SIZE)))
err = -EFAULT;
return err;
}
@@ -196,7 +196,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
fpregs_unlock();

if (ret) {
- if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+ if (!fault_in_pages_writeable(buf_fx, get_xstate_config(XSTATE_USER_SIZE)))
goto retry;
return -EFAULT;
}
@@ -290,13 +290,13 @@ static int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_only)
static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
{
struct user_i387_ia32_struct *envp = NULL;
- int state_size = fpu_kernel_xstate_size;
int ia32_fxstate = (buf != buf_fx);
struct task_struct *tsk = current;
struct fpu *fpu = &tsk->thread.fpu;
struct user_i387_ia32_struct env;
u64 user_xfeatures = 0;
int fx_only = 0;
+ int state_size;
int ret = 0;

ia32_fxstate &= (IS_ENABLED(CONFIG_X86_32) ||
@@ -330,6 +330,9 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
state_size = fx_sw_user.xstate_size;
user_xfeatures = fx_sw_user.xfeatures;
}
+ } else {
+ /* The buffer cannot be dynamic without using XSAVE. */
+ state_size = get_xstate_config(XSTATE_MIN_SIZE);
}

if ((unsigned long)buf_fx % 64)
@@ -469,8 +472,9 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)

static inline int xstate_sigframe_size(void)
{
- return use_xsave() ? fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE :
- fpu_user_xstate_size;
+ int size = get_xstate_config(XSTATE_USER_SIZE);
+
+ return use_xsave() ? size + FP_XSTATE_MAGIC2_SIZE : size;
}

/*
@@ -514,19 +518,20 @@ fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
*/
void fpu__init_prepare_fx_sw_frame(void)
{
- int size = fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+ int xstate_size = get_xstate_config(XSTATE_USER_SIZE);
+ int ext_size = xstate_size + FP_XSTATE_MAGIC2_SIZE;

fx_sw_reserved.magic1 = FP_XSTATE_MAGIC1;
- fx_sw_reserved.extended_size = size;
+ fx_sw_reserved.extended_size = ext_size;
fx_sw_reserved.xfeatures = xfeatures_mask_user();
- fx_sw_reserved.xstate_size = fpu_user_xstate_size;
+ fx_sw_reserved.xstate_size = xstate_size;

if (IS_ENABLED(CONFIG_IA32_EMULATION) ||
IS_ENABLED(CONFIG_X86_32)) {
int fsave_header_size = sizeof(struct fregs_state);

fx_sw_reserved_ia32 = fx_sw_reserved;
- fx_sw_reserved_ia32.extended_size = size + fsave_header_size;
+ fx_sw_reserved_ia32.extended_size = ext_size + fsave_header_size;
}
}

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 34755612019c..1d6613751625 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -72,12 +72,50 @@ static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] =
static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};

-/*
- * The XSAVE area of kernel can be in standard or compacted format;
- * it is always in standard format for user mode. This is the user
- * mode standard format size used for signal and ptrace frames.
+/**
+ * struct fpu_xstate_buffer_config - xstate per-task buffer configuration
+ * @min_size, @max_size: The size of the kernel buffer, which varies with the dynamic user
+ * states. Every task has the minimum buffer by default, and it can be
+ * expanded to the maximum size. The two sizes are the same when using the
+ * standard format.
+ * @user_size: The size of the userspace buffer. The buffer is always in the
+ * standard format. It is used for signal and ptrace frames.
*/
-unsigned int fpu_user_xstate_size;
+struct fpu_xstate_buffer_config {
+ unsigned int min_size, max_size;
+ unsigned int user_size;
+};
+
+static struct fpu_xstate_buffer_config buffer_config __read_mostly;
+
+unsigned int get_xstate_config(enum xstate_config cfg)
+{
+ switch (cfg) {
+ case XSTATE_MIN_SIZE:
+ return buffer_config.min_size;
+ case XSTATE_MAX_SIZE:
+ return buffer_config.max_size;
+ case XSTATE_USER_SIZE:
+ return buffer_config.user_size;
+ default:
+ return 0;
+ }
+}
+EXPORT_SYMBOL_GPL(get_xstate_config);
+
+void set_xstate_config(enum xstate_config cfg, unsigned int value)
+{
+ switch (cfg) {
+ case XSTATE_MIN_SIZE:
+ buffer_config.min_size = value;
+ break;
+ case XSTATE_MAX_SIZE:
+ buffer_config.max_size = value;
+ break;
+ case XSTATE_USER_SIZE:
+ buffer_config.user_size = value;
+ }
+}

/*
* Return whether the system supports a given xfeature.
@@ -659,7 +697,7 @@ static void do_extra_xstate_size_checks(void)
*/
paranoid_xstate_size += xfeature_size(i);
}
- XSTATE_WARN_ON(paranoid_xstate_size != fpu_kernel_xstate_size);
+ XSTATE_WARN_ON(paranoid_xstate_size != get_xstate_config(XSTATE_MAX_SIZE));
}


@@ -754,21 +792,29 @@ static int __init init_xstate_size(void)
else
possible_xstate_size = xsave_size;

- /* Ensure we have the space to store all enabled: */
- if (!is_supported_xstate_size(possible_xstate_size))
- return -EINVAL;
-
/*
- * The size is OK, we are definitely going to use xsave,
- * make it known to the world that we need more space.
+ * The size accounts for all the states possibly reserved in the
+ * per-task buffer. Set this as the maximum size.
*/
- fpu_kernel_xstate_size = possible_xstate_size;
+ set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);
+
+ /* Perform an extra check for the maximum size. */
do_extra_xstate_size_checks();

+ /*
+ * Set the minimum to be the same as the maximum. The dynamic
+ * user states are not supported yet.
+ */
+ set_xstate_config(XSTATE_MIN_SIZE, possible_xstate_size);
+
+ /* Ensure the minimum size fits in the statically-allocated buffer: */
+ if (!is_supported_xstate_size(get_xstate_config(XSTATE_MIN_SIZE)))
+ return -EINVAL;
+
/*
* User space is always in standard format.
*/
- fpu_user_xstate_size = xsave_size;
+ set_xstate_config(XSTATE_USER_SIZE, xsave_size);
return 0;
}

@@ -859,7 +905,7 @@ void __init fpu__init_system_xstate(void)
* Update info used for ptrace frames; use standard-format size and no
* supervisor xstates:
*/
- update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask_user());
+ update_regset_xstate_info(get_xstate_config(XSTATE_USER_SIZE), xfeatures_mask_user());

fpu__init_prepare_fx_sw_frame();
setup_init_fpu_buf();
@@ -869,7 +915,7 @@ void __init fpu__init_system_xstate(void)

pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
xfeatures_mask_all,
- fpu_kernel_xstate_size,
+ get_xstate_config(XSTATE_MAX_SIZE),
boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
return;

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 5e1f38179f49..3f78ca2600d3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -91,6 +91,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
return fpu__copy(dst, src);
}

+void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
+{
+ *offset = offsetof(struct thread_struct, fpu.state);
+ /* The buffer embedded in thread_struct has the minimum size. */
+ *size = get_xstate_config(XSTATE_MIN_SIZE);
+}
+
/*
* Free thread data structures etc..
*/
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b428ff496bca..8839d4dd3fab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9607,10 +9607,13 @@ static void kvm_save_current_fpu(struct fpu *fpu)
/*
* If the target FPU state is not resident in the CPU registers, just
* memcpy() from current, else save CPU state directly to the target.
+ *
+ * KVM does not support dynamic user states yet. Assume the buffer
+ * always has the minimum size.
*/
if (test_thread_flag(TIF_NEED_FPU_LOAD))
memcpy(&fpu->state, &current->thread.fpu.state,
- fpu_kernel_xstate_size);
+ get_xstate_config(XSTATE_MIN_SIZE));
else
copy_fpregs_to_fpstate(fpu);
}
--
2.17.1

2021-05-23 19:41:33

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 23/28] selftest/x86/amx: Test cases for the AMX state management

This selftest verifies that the xstate arch_prctl works for AMX state and
that a forked task starts with its AMX state in the INIT-state.

In addition, this test verifies that the kernel correctly context-switches
unique AMX data when multiple threads use AMX.

Finally, the test verifies that ptrace() can insert data into existing
threads.

These test cases do not depend on AMX compiler support, as they use XSAVE
and XRSTOR directly from userspace to access AMX state.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added test for arch_prctl.
* Excluded tile config details to focus on testing the kernel's ability to
manage dynamic user state.
* Removed tile instructions.
* Simplified the fork() and ptrace() test routine.
* Massaged the changelog.

Changes from v2:
* Updated the test messages and the changelog as tile data is not inherited
to a child anymore.
* Removed bytecode for the instructions already supported by binutils.
* Changed to check the XSAVE availability in a reliable way.

Changes from v1:
* Removed signal testing code
---
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/amx.c | 601 +++++++++++++++++++++++++++
2 files changed, 602 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/amx.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 333980375bc7..2f7feb03867b 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -17,7 +17,7 @@ TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap
TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
-TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering
+TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering amx
# Some selftests require 32bit support enabled also on 64bit systems
TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall

diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
new file mode 100644
index 000000000000..7d177c03cdcf
--- /dev/null
+++ b/tools/testing/selftests/x86/amx.c
@@ -0,0 +1,601 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <elf.h>
+#include <pthread.h>
+#include <setjmp.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <x86intrin.h>
+
+#include <linux/futex.h>
+
+#include <sys/ptrace.h>
+#include <sys/shm.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <sys/uio.h>
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+static inline uint64_t __xgetbv(uint32_t index)
+{
+ uint32_t eax, edx;
+
+ asm volatile("xgetbv;"
+ : "=a" (eax), "=d" (edx)
+ : "c" (index));
+ return eax + ((uint64_t)edx << 32);
+}
+
+static inline void __cpuid(uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
+{
+ asm volatile("cpuid;"
+ : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
+ : "0" (*eax), "2" (*ecx));
+}
+
+static inline void __xsave(void *buffer, uint32_t lo, uint32_t hi)
+{
+ asm volatile("xsave (%%rdi)"
+ : : "D" (buffer), "a" (lo), "d" (hi)
+ : "memory");
+}
+
+static inline void __xrstor(void *buffer, uint32_t lo, uint32_t hi)
+{
+ asm volatile("xrstor (%%rdi)"
+ : : "D" (buffer), "a" (lo), "d" (hi));
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+ int flags)
+{
+ struct sigaction sa;
+
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_sigaction = handler;
+ sa.sa_flags = SA_SIGINFO | flags;
+ sigemptyset(&sa.sa_mask);
+ if (sigaction(sig, &sa, 0))
+ err(1, "sigaction");
+}
+
+static void clearhandler(int sig)
+{
+ struct sigaction sa;
+
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_handler = SIG_DFL;
+ sigemptyset(&sa.sa_mask);
+ if (sigaction(sig, &sa, 0))
+ err(1, "sigaction");
+}
+
+static jmp_buf jmpbuf;
+
+/* Hardware info check: */
+
+static bool xsave_disabled;
+
+static void handle_sigill(int sig, siginfo_t *si, void *ctx_void)
+{
+ xsave_disabled = true;
+ siglongjmp(jmpbuf, 1);
+}
+
+#define XFEATURE_XTILECFG 17
+#define XFEATURE_XTILEDATA 18
+#define XFEATURE_MASK_XTILECFG (1 << XFEATURE_XTILECFG)
+#define XFEATURE_MASK_XTILEDATA (1 << XFEATURE_XTILEDATA)
+#define XFEATURE_MASK_XTILE (XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)
+
+static inline bool check_xsave_capability(void)
+{
+ sethandler(SIGILL, handle_sigill, 0);
+
+ if ((!sigsetjmp(jmpbuf, 1)) && (__xgetbv(0) & XFEATURE_MASK_XTILEDATA)) {
+ clearhandler(SIGILL);
+ return true;
+ }
+
+ clearhandler(SIGILL);
+ return false;
+}
+
+static uint32_t xsave_size;
+
+static uint32_t xsave_xtiledata_offset;
+static uint32_t xsave_xtiledata_size;
+
+static uint32_t xsave_xtilecfg_offset;
+static uint32_t xsave_xtilecfg_size;
+
+#define XSTATE_CPUID 0xd
+#define XSTATE_USER_STATE_SUBLEAVE 0x0
+
+static void check_cpuid(void)
+{
+ uint32_t eax, ebx, ecx, edx;
+
+ eax = XSTATE_CPUID;
+ ecx = XSTATE_USER_STATE_SUBLEAVE;
+
+ __cpuid(&eax, &ebx, &ecx, &edx);
+ if (!ebx)
+ err(1, "xstate cpuid: xsave size");
+
+ xsave_size = ebx;
+
+ eax = XSTATE_CPUID;
+ ecx = XFEATURE_XTILECFG;
+
+ __cpuid(&eax, &ebx, &ecx, &edx);
+ if (!eax || !ebx)
+ err(1, "xstate cpuid: tile config state");
+
+ xsave_xtilecfg_size = eax;
+ xsave_xtilecfg_offset = ebx;
+
+ eax = XSTATE_CPUID;
+ ecx = XFEATURE_XTILEDATA;
+
+ __cpuid(&eax, &ebx, &ecx, &edx);
+ if (!eax || !ebx)
+ err(1, "xstate cpuid: tile data state");
+
+ xsave_xtiledata_size = eax;
+ xsave_xtiledata_offset = ebx;
+}
+
+/* The helpers for managing XSAVE buffer and tile states: */
+
+#define XSAVE_HDR_OFFSET 512
+
+static inline uint64_t get_xstatebv(void *xsave)
+{
+ return *(uint64_t *)(xsave + XSAVE_HDR_OFFSET);
+}
+
+static inline void set_xstatebv(void *xsave, uint64_t bv)
+{
+ *(uint64_t *)(xsave + XSAVE_HDR_OFFSET) = bv;
+}
+
+static void set_tiledata(void *tiledata)
+{
+ int *ptr = tiledata;
+ int data = rand();
+ int i;
+
+ for (i = 0; i < xsave_xtiledata_size / sizeof(int); i++, ptr++)
+ *ptr = data;
+}
+
+static void *xsave_buffer, *tiledata;
+static int nerrs, errs;
+
+static void handle_sigsegv(int sig, siginfo_t *si, void *ctx_void)
+{
+ siglongjmp(jmpbuf, 1);
+}
+
+static bool xrstor(void *buffer, uint32_t lo, uint32_t hi)
+{
+ if (!sigsetjmp(jmpbuf, 1)) {
+ __xrstor(buffer, lo, hi);
+ return true;
+ } else {
+ return false;
+ }
+}
+
+/* arch_prctl test */
+
+#define ARCH_GET_XSTATE 0x1021
+#define ARCH_PUT_XSTATE 0x1022
+
+#define ARCH_PRCTL_REPEAT 10
+
+static void test_arch_prctl(void)
+{
+ bool xfd_armed;
+ pid_t child;
+ int rc, i;
+
+ child = fork();
+ if (child < 0) {
+ err(1, "fork");
+ } else if (child > 0) {
+ int status;
+
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "arch_prctl test child");
+ return;
+ }
+
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILE);
+ set_tiledata(xsave_buffer + xsave_xtiledata_offset);
+
+ printf("[RUN]\tCheck ARCH_GET_XSTATE/ARCH_SET_XSTATE.\n");
+
+ printf("\tLoad tile data without GET:\n");
+
+ if (!xrstor(xsave_buffer, -1, -1)) {
+ printf("[OK]\tBlocked.\n");
+ } else {
+ nerrs++;
+ printf("[FAIL]\tSucceeded.\n");
+ }
+
+ printf("\tGET with invalid arg:\n");
+
+ rc = syscall(SYS_arch_prctl, ARCH_GET_XSTATE, -1);
+ if (rc == -EPERM) {
+ printf("[OK]\tEPERM was returned.\n");
+ } else {
+ nerrs++;
+ printf("[FAIL]\tNo EPERM was returned.\n");
+ }
+
+ printf("\tGET with AMX state %d-times:\n", ARCH_PRCTL_REPEAT);
+
+ for (i = 0; i < ARCH_PRCTL_REPEAT; i++) {
+ rc = syscall(SYS_arch_prctl, ARCH_GET_XSTATE, XFEATURE_MASK_XTILE);
+ if (rc != 0)
+ break;
+
+ xfd_armed = !xrstor(xsave_buffer, -1, -1);
+ if (xfd_armed)
+ break;
+ }
+
+ if (i == ARCH_PRCTL_REPEAT) {
+ printf("[OK]\tNo error and correctly disarmed XFD.\n");
+ } else {
+ nerrs++;
+ i++;
+ if (rc)
+ printf("[FAIL]\t%d-th GET returned error (rc=%d).\n", i, rc);
+ else
+ printf("[FAIL]\t%d-th GET failed to disarm XFD.\n", i);
+ }
+
+ printf("\tPUT with AMX state %d-times:\n", ARCH_PRCTL_REPEAT);
+
+ for (i = 0; i < ARCH_PRCTL_REPEAT; i++) {
+ rc = syscall(SYS_arch_prctl, ARCH_PUT_XSTATE, XFEATURE_MASK_XTILE);
+ if (rc != 0)
+ break;
+
+ xfd_armed = !xrstor(xsave_buffer, -1, -1);
+ if (xfd_armed)
+ break;
+ }
+
+ if ((i == (ARCH_PRCTL_REPEAT - 1)) && xfd_armed) {
+ printf("[OK]\tNo error and re-armed XFD at the end.\n");
+ } else {
+ nerrs++;
+ i++;
+ if (rc)
+ printf("[FAIL]\t%d-th PUT returned error (rc=%d).\n", i, rc);
+ else if (i > ARCH_PRCTL_REPEAT)
+ printf("[FAIL]\tthe final PUT failed to arm XFD.\n");
+ else
+ printf("[FAIL]\t%d-th PUT prematurely armed XFD.\n", i);
+ }
+
+ _exit(0);
+}
+
+/* Testing tile data inheritance */
+
+static void test_fork(void)
+{
+ pid_t child, grandchild;
+
+ child = fork();
+ if (child < 0) {
+ err(1, "fork");
+ } else if (child > 0) {
+ int status;
+
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "fork test child");
+ return;
+ }
+
+ printf("[RUN]\tCheck tile data inheritance.\n\tBefore fork(), load tile data -- yes:\n");
+
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILE);
+ set_tiledata(xsave_buffer + xsave_xtiledata_offset);
+ memset(xsave_buffer + xsave_xtilecfg_offset, 1, xsave_xtilecfg_size);
+ syscall(SYS_arch_prctl, ARCH_GET_XSTATE, XFEATURE_MASK_XTILE);
+ xrstor(xsave_buffer, -1, -1);
+
+ grandchild = fork();
+ if (grandchild < 0) {
+ err(1, "fork");
+ } else if (grandchild > 0) {
+ int status;
+
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "fork test child");
+ _exit(0);
+ }
+
+ if (__xgetbv(1) & XFEATURE_MASK_XTILE) {
+ nerrs++;
+ printf("[FAIL]\tIn a child, AMX state is not initialized.\n");
+ } else {
+ printf("[OK]\tIn a child, AMX state is initialized.\n");
+ }
+ _exit(0);
+}
+
+/* Context switching test */
+
+#define ITERATIONS 10
+#define NUM_THREADS 5
+
+struct futex_info {
+ int current;
+ int *futex;
+ int next;
+};
+
+static inline void command_wait(struct futex_info *info, int value)
+{
+ do {
+ sched_yield();
+ } while (syscall(SYS_futex, info->futex, FUTEX_WAIT, value, 0, 0, 0));
+}
+
+static inline void command_wake(struct futex_info *info, int value)
+{
+ do {
+ *info->futex = value;
+ while (!syscall(SYS_futex, info->futex, FUTEX_WAKE, 1, 0, 0, 0))
+ sched_yield();
+ } while (0);
+}
+
+static inline int get_iterative_value(int id)
+{
+ return ((id << 1) & ~0x1);
+}
+
+static inline int get_endpoint_value(int id)
+{
+ return ((id << 1) | 0x1);
+}
+
+static void *check_tiledata(void *info)
+{
+ struct futex_info *finfo = (struct futex_info *)info;
+ void *xsave, *tiledata;
+ int i;
+
+ xsave = aligned_alloc(64, xsave_size);
+ if (!xsave)
+ err(1, "aligned_alloc()");
+
+ tiledata = malloc(xsave_xtiledata_size);
+ if (!tiledata)
+ err(1, "malloc()");
+
+ set_xstatebv(xsave, XFEATURE_MASK_XTILEDATA);
+ set_tiledata(xsave + xsave_xtiledata_offset);
+ syscall(SYS_arch_prctl, ARCH_GET_XSTATE, XFEATURE_MASK_XTILE);
+ xrstor(xsave, -1, -1);
+ memcpy(tiledata, xsave + xsave_xtiledata_offset, xsave_xtiledata_size);
+
+ for (i = 0; i < ITERATIONS; i++) {
+ command_wait(finfo, get_iterative_value(finfo->current));
+
+ __xsave(xsave, XFEATURE_MASK_XTILEDATA, 0);
+ if (memcmp(tiledata, xsave + xsave_xtiledata_offset, xsave_xtiledata_size))
+ errs++;
+
+ set_tiledata(xsave + xsave_xtiledata_offset);
+ syscall(SYS_arch_prctl, ARCH_GET_XSTATE, XFEATURE_MASK_XTILE);
+ xrstor(xsave, -1, -1);
+ memcpy(tiledata, xsave + xsave_xtiledata_offset, xsave_xtiledata_size);
+
+ command_wake(finfo, get_iterative_value(finfo->next));
+ }
+
+ command_wait(finfo, get_endpoint_value(finfo->current));
+
+ free(xsave);
+ free(tiledata);
+ return NULL;
+}
+
+static int create_threads(int num, struct futex_info *finfo)
+{
+ const int shm_id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0666);
+ int *futex = shmat(shm_id, NULL, 0);
+ pthread_t thread;
+ int i;
+
+ for (i = 0; i < num; i++) {
+ finfo[i].futex = futex;
+ finfo[i].current = i + 1;
+ finfo[i].next = (i + 2) % (num + 1);
+
+ if (pthread_create(&thread, NULL, check_tiledata, &finfo[i]))
+ err(1, "pthread_create()");
+ }
+ return 0;
+}
+
+static void test_context_switch(void)
+{
+ struct futex_info *finfo;
+ cpu_set_t cpuset;
+ int i;
+
+ printf("[RUN]\tCheck tile data context switches.\n");
+ printf("\t# of context switches -- %u, # of threads -- %d:\n",
+ ITERATIONS * NUM_THREADS, NUM_THREADS);
+
+ errs = 0;
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(0, &cpuset);
+
+ if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+ err(1, "sched_setaffinity to CPU 0");
+
+ finfo = malloc(sizeof(*finfo) * NUM_THREADS);
+ if (!finfo)
+ err(1, "malloc()");
+
+ create_threads(NUM_THREADS, finfo);
+
+ for (i = 0; i < ITERATIONS; i++) {
+ command_wake(finfo, get_iterative_value(1));
+ command_wait(finfo, get_iterative_value(0));
+ }
+
+ for (i = 1; i <= NUM_THREADS; i++)
+ command_wake(finfo, get_endpoint_value(i));
+
+ if (errs) {
+ nerrs += errs;
+ printf("[FAIL]\tIncorrect cases were found -- (%d / %u).\n",
+ errs, ITERATIONS * NUM_THREADS);
+ free(finfo);
+ return;
+ }
+
+ free(finfo);
+ printf("[OK]\tNo incorrect case was found.\n");
+}
+
+/* Ptrace test */
+
+static bool set_tiledata_state;
+
+static int write_tiledata(pid_t child)
+{
+ struct iovec iov;
+
+ iov.iov_base = xsave_buffer;
+ iov.iov_len = xsave_size;
+
+ set_xstatebv(xsave_buffer, set_tiledata_state ? XFEATURE_MASK_XTILEDATA : 0);
+ set_tiledata(xsave_buffer + xsave_xtiledata_offset);
+ if (set_tiledata_state)
+ memcpy(tiledata, xsave_buffer + xsave_xtiledata_offset, xsave_xtiledata_size);
+ else
+ memset(tiledata, 0, xsave_xtiledata_size);
+
+ if (ptrace(PTRACE_SETREGSET, child, (uint32_t)NT_X86_XSTATE, &iov))
+ err(1, "PTRACE_SETREGSET");
+
+ if (ptrace(PTRACE_GETREGSET, child, (uint32_t)NT_X86_XSTATE, &iov))
+ err(1, "PTRACE_GETREGSET");
+
+ return memcmp(tiledata, xsave_buffer + xsave_xtiledata_offset, xsave_xtiledata_size);
+}
+
+static void test_tile_write(void)
+{
+ pid_t child;
+ int status;
+
+ child = fork();
+ if (child < 0)
+ err(1, "fork");
+
+ if (!child) {
+ printf("\tInject tile data -- %s:\n",
+ set_tiledata_state ? "yes" : "no");
+
+ if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
+ err(1, "PTRACE_TRACEME");
+
+ raise(SIGTRAP);
+ _exit(0);
+ }
+
+ do {
+ wait(&status);
+ } while (WSTOPSIG(status) != SIGTRAP);
+
+ errs = write_tiledata(child);
+ if (errs) {
+ nerrs++;
+ printf("[FAIL]\tTile data was %swritten on ptracee.\n",
+ set_tiledata_state ? "not " : "");
+ } else {
+ printf("[OK]\tTile data was %swritten on ptracee.\n",
+ set_tiledata_state ? "" : "not ");
+ }
+
+ ptrace(PTRACE_DETACH, child, NULL, NULL);
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "fork test child");
+}
+
+static void test_ptrace(void)
+{
+ printf("[RUN]\tCheck ptrace() to inject tile data.\n");
+
+ set_tiledata_state = true;
+ test_tile_write();
+
+ set_tiledata_state = false;
+ test_tile_write();
+}
+
+int main(void)
+{
+ /* Check hardware availability at first */
+
+ if (!check_xsave_capability()) {
+ if (xsave_disabled)
+ printf("XSAVE disabled.\n");
+ else
+ printf("Tile data not available.\n");
+ return 0;
+ }
+
+ check_cpuid();
+
+ xsave_buffer = aligned_alloc(64, xsave_size);
+ if (!xsave_buffer)
+ err(1, "aligned_alloc()");
+
+ tiledata = malloc(xsave_xtiledata_size);
+ if (!tiledata)
+ err(1, "malloc()");
+
+ nerrs = 0;
+
+ sethandler(SIGSEGV, handle_sigsegv, 0);
+
+ test_arch_prctl();
+ test_fork();
+ test_context_switch();
+ test_ptrace();
+
+ clearhandler(SIGSEGV);
+
+ free(xsave_buffer);
+ free(tiledata);
+ return nerrs ? 1 : 0;
+}
--
2.17.1

2021-05-23 19:41:37

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 12/28] x86/fpu/xstate: Update the xstate buffer address finder to support dynamic states

__raw_xsave_addr() returns a pointer to the requested component in an xstate
buffer by simply looking up the offset table. The offset used to be fixed,
but with dynamic user states it becomes variable.

get_xstate_size() already has a routine to find an offset at runtime. Factor
it out and use it for the address finder as well.
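
The compacted-format offset is the running sum of the preceding enabled
components' sizes, with 64-byte alignment applied where a component requires
it. A standalone sketch of that calculation, using made-up component sizes
rather than real CPUID values:

#include <stdint.h>
#include <stdio.h>

#define FXSAVE_SIZE	512
#define XSAVE_HDR_SIZE	64
#define ALIGN64(x)	(((x) + 63U) & ~63U)

/* Example per-component sizes and 64-byte alignment flags. The real values
 * come from CPUID leaf 0xD; these are made up for illustration. */
static const unsigned int comp_size[]  = { 0, 0, 256, 64, 40, 0, 0, 0, 0, 8 };
static const unsigned int comp_align[] = { 0, 0, 0,   0,  0, 0, 0, 0, 0, 1 };

/* Offset of feature_nr in a compacted buffer holding the components in mask. */
static unsigned int comp_offset(uint64_t mask, int feature_nr)
{
        unsigned int next = FXSAVE_SIZE + XSAVE_HDR_SIZE, offset = next;
        int i;

        for (i = 2; i <= feature_nr; i++) {
                if (!(mask & (1ULL << i)))
                        continue;
                offset = comp_align[i] ? ALIGN64(next) : next;
                next = offset + comp_size[i];
        }
        return offset;
}

int main(void)
{
        /* The same feature lands at different offsets for different masks: */
        printf("feature 9, mask 0x204: %u\n", comp_offset(0x204, 9));
        printf("feature 9, mask 0x21c: %u\n", comp_offset(0x21c, 9));
        return 0;
}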

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v3:
* Added the function description in the kernel-doc style. (Borislav Petkov)
* Removed 'no functional change' in the changelog. (Borislav Petkov)
---
arch/x86/kernel/fpu/xstate.c | 80 ++++++++++++++++++++++--------------
1 file changed, 50 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 773f594bd730..9d3cd9775b76 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -175,6 +175,40 @@ static bool xfeature_is_supervisor(int xfeature_nr)
return ecx & 1;
}

+/**
+ * get_xstate_comp_offset() - Find the feature's offset in the compacted format
+ * @mask: This bitmap tells which components are reserved in the format.
+ * @feature_nr: The feature number
+ *
+ * Returns: The offset value
+ */
+static unsigned int get_xstate_comp_offset(u64 mask, int feature_nr)
+{
+ u64 xmask = BIT_ULL(feature_nr + 1) - 1;
+ unsigned int next_offset, offset = 0;
+ int i;
+
+ if ((mask & xmask) == (xfeatures_mask_all & xmask))
+ return xstate_comp_offsets[feature_nr];
+
+ /*
+ * The precomputed offset does not apply to the given mask. Calculate the
+ * offset by summing up the sizes of the preceding states.
+ */
+
+ next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+
+ for (i = FIRST_EXTENDED_XFEATURE; i <= feature_nr; i++) {
+ if (!(mask & BIT_ULL(i)))
+ continue;
+
+ offset = xstate_aligns[i] ? ALIGN(next_offset, 64) : next_offset;
+ next_offset += xstate_sizes[i];
+ }
+
+ return offset;
+}
+
/**
* get_xstate_size() - calculate an xstate buffer size
* @mask: This bitmap tells which components reserved in the buffer.
@@ -186,9 +220,8 @@ static bool xfeature_is_supervisor(int xfeature_nr)
*/
unsigned int get_xstate_size(u64 mask)
{
- unsigned int size;
- u64 xmask;
- int i, nr;
+ unsigned int offset;
+ int nr;

if (!mask)
return 0;
@@ -207,24 +240,8 @@ unsigned int get_xstate_size(u64 mask)
if (!using_compacted_format())
return xstate_offsets[nr] + xstate_sizes[nr];

- xmask = BIT_ULL(nr + 1) - 1;
-
- if (mask == (xmask & xfeatures_mask_all))
- return xstate_comp_offsets[nr] + xstate_sizes[nr];
-
- /*
- * With the given mask, no relevant size is found so far. So, calculate
- * it by summing up each state size.
- */
- for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
- if (!(mask & BIT_ULL(i)))
- continue;
-
- if (xstate_aligns[i])
- size = ALIGN(size, 64);
- size += xstate_sizes[i];
- }
- return size;
+ offset = get_xstate_comp_offset(mask, nr);
+ return offset + xstate_sizes[nr];
}

/*
@@ -1048,17 +1065,20 @@ static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
{
void *xsave;

- if (!xfeature_enabled(xfeature_nr)) {
- WARN_ON_FPU(1);
- return NULL;
- }
-
- if (fpu)
- xsave = &fpu->state->xsave;
- else
+ if (!xfeature_enabled(xfeature_nr))
+ goto not_found;
+ else if (!fpu)
xsave = &init_fpstate.xsave;
+ else if (!(fpu->state_mask & BIT_ULL(xfeature_nr)))
+ goto not_found;
+ else
+ xsave = &fpu->state->xsave;
+
+ return xsave + get_xstate_comp_offset(fpu->state_mask, xfeature_nr);

- return xsave + xstate_comp_offsets[xfeature_nr];
+not_found:
+ WARN_ON_FPU(1);
+ return NULL;
}
/*
* Given the xsave area and a state inside, this function returns the
--
2.17.1

2021-05-23 19:41:49

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 24/28] x86/fpu/xstate: Use per-task xstate mask for saving xstate in signal frame

When entering a signal handler, the kernel saves xstate in the signal frame.
Dynamic user state should be saved only when the task has used it;
fpu->state_mask can be used to exclude the unused states.

When returning from the signal handler, XRSTOR(S) re-initializes the
excluded state components.
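
This works because XSAVE takes a requested-feature bitmap in EDX:EAX and
stores only the components whose bits are set. A minimal, userspace-style
wrapper for illustration (assumes a sufficiently large, 64-byte-aligned
buffer; not kernel code):

#include <stdint.h>

static inline void xsave_masked(void *buf, uint64_t mask)
{
        uint32_t lo = mask, hi = mask >> 32;

        /* Components whose bits are clear in the mask are not written to 'buf'. */
        asm volatile("xsave (%%rdi)" : : "D" (buf), "a" (lo), "d" (hi) : "memory");
}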

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Updated the title and the changelog.
* Excluded the test case addition; separated out as a new patch.

Changes from v3:
* Removed 'no functional changes' in the changelog. (Borislav Petkov)

Changes from v1:
* Made it revertable (moved close to the end of the series).
* Included the test case.
---
arch/x86/include/asm/fpu/internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 6460d52991ba..4a3436684805 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -354,7 +354,7 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
*/
static inline int copy_xregs_to_user(struct xregs_state __user *buf)
{
- u64 mask = xfeatures_mask_user();
+ u64 mask = current->thread.fpu.state_mask;
u32 lmask = mask;
u32 hmask = mask >> 32;
int err;
--
2.17.1

2021-05-23 19:42:00

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 26/28] selftest/x86/amx: Test case for AMX state copy optimization in signal delivery

Add a test case that checks that unused states are excluded from the signal
frame: leave a known pattern on the signal stack and verify that it is still
intact after taking a subsequent signal.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Separated out as a new patch.
* Improved the test routine by explicitly checking sigframe write.
---
tools/testing/selftests/x86/amx.c | 137 ++++++++++++++++++++++++++++++
1 file changed, 137 insertions(+)

diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
index 7d177c03cdcf..c5a5582e2b6f 100644
--- a/tools/testing/selftests/x86/amx.c
+++ b/tools/testing/selftests/x86/amx.c
@@ -562,6 +562,142 @@ static void test_ptrace(void)
test_tile_write();
}

+/* Signal handling test */
+
+static bool init_tiledata_state_before_signal;
+static bool load_tiledata_at_first;
+static bool sigalarmed, sigused;
+
+#define SIGFRAME_TILEDATA_SIGNATURE 0xFF
+
+static void handle_sigusr1(int sig, siginfo_t *info, void *ctx_void)
+{
+ void *xsave = ((ucontext_t *)ctx_void)->uc_mcontext.fpregs;
+
+ memset(xsave + xsave_xtiledata_offset, SIGFRAME_TILEDATA_SIGNATURE, xsave_xtiledata_size);
+
+ sigused = true;
+}
+
+static void handle_sigalrm(int sig, siginfo_t *info, void *ctx_void)
+{
+ void *xsave = ((ucontext_t *)ctx_void)->uc_mcontext.fpregs;
+ char d = SIGFRAME_TILEDATA_SIGNATURE;
+ bool written = false;
+ int i;
+
+ for (i = 0; i < xsave_xtiledata_size; i++) {
+ written = memcmp(xsave + xsave_xtiledata_offset + i, &d, 1);
+ if (written)
+ break;
+ }
+
+ if (__xgetbv(1) & XFEATURE_MASK_XTILEDATA)
+ err(1, "tile data state at signal delivery");
+
+ if (init_tiledata_state_before_signal && written) {
+ errs++;
+ printf("[FAIL]\tTile data was %swritten on sigframe.\n", !written ? "not " : "");
+ }
+
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+ set_tiledata(xsave_buffer + xsave_xtiledata_offset);
+ xrstor(xsave_buffer, -1, -1);
+ sigalarmed = true;
+}
+
+static void test_signal_handling(void)
+{
+ pid_t child;
+
+ sigalarmed = false;
+ sigused = false;
+
+ child = fork();
+ if (child < 0) {
+ err(1, "fork");
+ } else if (child > 0) {
+ do {
+ int status;
+
+ wait(&status);
+ if (WIFSTOPPED(status))
+ kill(child, SIGCONT);
+ else if (WIFEXITED(status) && !WEXITSTATUS(status))
+ break;
+ else
+ err(1, "signal test child");
+ } while (1);
+ return;
+ }
+
+ printf("\tBefore signal, load tile data -- %s, re-initialized -- %s:\n",
+ load_tiledata_at_first ? "yes" : "no",
+ init_tiledata_state_before_signal ? "yes" : "no");
+
+ syscall(SYS_arch_prctl, ARCH_GET_XSTATE, XFEATURE_MASK_XTILE);
+
+ raise(SIGUSR1);
+ if (!sigused)
+ err(1, "SIGUSR1");
+
+ if (load_tiledata_at_first) {
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+ set_tiledata(xsave_buffer + xsave_xtiledata_offset);
+ xrstor(xsave_buffer, -1, -1);
+ memcpy(tiledata, xsave_buffer + xsave_xtiledata_offset, xsave_xtiledata_size);
+ }
+
+ if (init_tiledata_state_before_signal) {
+ set_xstatebv(xsave_buffer, 0);
+ xrstor(xsave_buffer, -1, -1);
+ memset(tiledata, 0, xsave_xtiledata_size);
+ }
+
+ raise(SIGALRM);
+ if (!sigalarmed)
+ err(1, "SIGALRM");
+
+ __xsave(xsave_buffer, XFEATURE_MASK_XTILEDATA, 0);
+ if (memcmp(tiledata, xsave_buffer + xsave_xtiledata_offset, xsave_xtiledata_size)) {
+ errs++;
+ printf("[FAIL]\tTile data was not restored at sigreturn\n");
+ }
+
+ if (errs)
+ nerrs++;
+ else
+ printf("[OK]\tTile data was %swritten on sigframe and restored at sigreturn\n",
+ init_tiledata_state_before_signal ? "not " : "");
+ _exit(0);
+}
+
+static void test_signal(void)
+{
+ printf("[RUN]\tCheck tile data state in signal path:\n");
+
+ sethandler(SIGALRM, handle_sigalrm, 0);
+ sethandler(SIGUSR1, handle_sigusr1, 0);
+
+ load_tiledata_at_first = false;
+ init_tiledata_state_before_signal = true;
+ errs = 0;
+ test_signal_handling();
+
+ load_tiledata_at_first = true;
+ init_tiledata_state_before_signal = false;
+ errs = 0;
+ test_signal_handling();
+
+ load_tiledata_at_first = true;
+ init_tiledata_state_before_signal = true;
+ errs = 0;
+ test_signal_handling();
+
+ clearhandler(SIGALRM);
+ clearhandler(SIGUSR1);
+}
+
int main(void)
{
/* Check hardware availability at first */
@@ -592,6 +728,7 @@ int main(void)
test_fork();
test_context_switch();
test_ptrace();
+ test_signal();

clearhandler(SIGSEGV);

--
2.17.1

2021-05-23 19:42:00

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 27/28] x86/insn/amx: Add TILERELEASE instruction to the opcode map

Add the opcode of TILERELEASE, which returns all of the AMX state to
INIT-state, to the x86 opcode map.
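
For reference, TILERELEASE encodes as VEX.128.NP.0F38.W0 49 C0, i.e. the
byte sequence c4 e2 78 49 c0, matching the Grp22 entry added below. A hedged
sketch of emitting it before assembler support (the macro and helper names
here are made up):

/* Sketch: emit TILERELEASE by its raw encoding until assemblers know the mnemonic. */
#define TILERELEASE	".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0"

static inline void tile_release(void)
{
        /* Returns all AMX tile state to INIT. */
        asm volatile(TILERELEASE);
}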

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch as preparatory to use the instruction in the kernel.
---
arch/x86/lib/x86-opcode-map.txt | 8 +++++++-
tools/arch/x86/lib/x86-opcode-map.txt | 8 +++++++-
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index ec31f5b60323..dbc5078ccafe 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -690,7 +690,9 @@ AVXcode: 2
45: vpsrlvd/q Vx,Hx,Wx (66),(v)
46: vpsravd Vx,Hx,Wx (66),(v) | vpsravd/q Vx,Hx,Wx (66),(evo)
47: vpsllvd/q Vx,Hx,Wx (66),(v)
-# Skip 0x48-0x4b
+# Skip 0x48
+49: Grp22 (1A)
+# Skip 0x4a-0x4b
4c: vrcp14ps/d Vpd,Wpd (66),(ev)
4d: vrcp14ss/d Vsd,Hpd,Wsd (66),(ev)
4e: vrsqrt14ps/d Vpd,Wpd (66),(ev)
@@ -1082,6 +1084,10 @@ GrpTable: Grp21
7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B)
EndTable

+GrpTable: Grp22
+0: TILERELEASE (!F3),(v1),(11B)
+EndTable
+
# AMD's Prefetch Group
GrpTable: GrpP
0: PREFETCH
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index ec31f5b60323..dbc5078ccafe 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -690,7 +690,9 @@ AVXcode: 2
45: vpsrlvd/q Vx,Hx,Wx (66),(v)
46: vpsravd Vx,Hx,Wx (66),(v) | vpsravd/q Vx,Hx,Wx (66),(evo)
47: vpsllvd/q Vx,Hx,Wx (66),(v)
-# Skip 0x48-0x4b
+# Skip 0x48
+49: Grp22 (1A)
+# Skip 0x4a-0x4b
4c: vrcp14ps/d Vpd,Wpd (66),(ev)
4d: vrcp14ss/d Vsd,Hpd,Wsd (66),(ev)
4e: vrsqrt14ps/d Vpd,Wpd (66),(ev)
@@ -1082,6 +1084,10 @@ GrpTable: Grp21
7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B)
EndTable

+GrpTable: Grp22
+0: TILERELEASE (!F3),(v1),(11B)
+EndTable
+
# AMD's Prefetch Group
GrpTable: GrpP
0: PREFETCH
--
2.17.1

2021-05-23 19:42:00

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can control access to that state.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
xstate-component bitmap format, used by XCR0.

Use XFD to detect unauthorized xstate access and raise a SIGSEGV.

Introduce helper functions:
xfd_write() - write IA32_XFD MSR
xfd_read() - read IA32_XFD MSR
xfd_switch() - switch IA32_XFD MSR
xfd_capable() - indicate XFD-capable features
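
Conceptually (a sketch using the helpers above, not the patch code itself),
the IA32_XFD value a task runs with is the set of XFD-capable features it
has not been granted:

static inline u64 task_xfd_value(struct fpu *fpu)
{
        /*
         * A set bit in IA32_XFD arms the #NM fault for that component, so every
         * XFD-capable feature not yet granted to the task (i.e. not in
         * fpu->state_mask) stays armed.
         */
        return xfd_capable() ^ (fpu->state_mask & xfd_capable());
}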

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Changed to use XFD to support the access request policy. Updated #NM
handler to raise a signal instead of buffer allocation.
* Updated helper functions.
* Updated function descriptions in a proper format.
* Updated some code comments.

Changes from v3:
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Changed to enable XFD only when the compacted format is used.
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Inlined the XFD-induced #NM handling code (Andy Lutomirski)
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/fpu/internal.h | 45 ++++++++++++++++++++++++++++-
arch/x86/include/asm/msr-index.h | 2 ++
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/fpu/xstate.c | 45 +++++++++++++++++++++++++++--
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kernel/traps.c | 12 ++++++++
8 files changed, 104 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ac37830ae941..67a9e5282128 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -277,6 +277,7 @@
#define X86_FEATURE_XSAVEC (10*32+ 1) /* XSAVEC instruction */
#define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 instruction */
#define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD (10*32+ 4) /* eXtended Feature Disabling */

/*
* Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index f964f3efc92e..c250216320df 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -557,11 +557,52 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
* Misc helper functions:
*/

+/* The Extended Feature Disable (XFD) helpers: */
+
+static inline void xfd_write(u64 value)
+{
+ wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xfd_read(void)
+{
+ u64 value;
+
+ rdmsrl_safe(MSR_IA32_XFD, &value);
+ return value;
+}
+
+static inline u64 xfd_capable(void)
+{
+ return xfeatures_mask_user_dynamic;
+}
+
+/**
+ * xfd_switch() - Switches the MSR IA32_XFD context if needed.
+ * @prev: The previous task's struct fpu pointer
+ * @next: The next task's struct fpu pointer
+ *
+ * Returns: Nothing
+ */
+static inline void xfd_switch(struct fpu *prev, struct fpu *next)
+{
+ u64 prev_xfd_mask, next_xfd_mask;
+
+ if (!static_cpu_has(X86_FEATURE_XFD) || !xfd_capable())
+ return;
+
+ prev_xfd_mask = prev->state_mask & xfd_capable();
+ next_xfd_mask = next->state_mask & xfd_capable();
+
+ if (unlikely(prev_xfd_mask != next_xfd_mask))
+ xfd_write(xfd_capable() ^ next_xfd_mask);
+}
+
/*
* Load PKRU from the FPU context if available. Delay loading of the
* complete FPU state until the return to userland.
*/
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
{
u32 pkru_val = init_pkru_value;
struct pkru_state *pk;
@@ -571,6 +612,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)

set_thread_flag(TIF_NEED_FPU_LOAD);

+ xfd_switch(old_fpu, new_fpu);
+
if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
return;

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 211ba3375ee9..f9f92ab71fc1 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -626,6 +626,8 @@
#define MSR_IA32_BNDCFGS_RSVD 0x00000ffc

#define MSR_IA32_XSS 0x00000da0
+#define MSR_IA32_XFD 0x000001c4
+#define MSR_IA32_XFD_ERR 0x000001c5

#define MSR_IA32_APICBASE 0x0000001b
#define MSR_IA32_APICBASE_BSP (1<<8)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index defda61f372d..7f891d2eb52e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -75,6 +75,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_SGX_LC, X86_FEATURE_SGX },
{ X86_FEATURE_SGX1, X86_FEATURE_SGX },
{ X86_FEATURE_SGX2, X86_FEATURE_SGX1 },
+ { X86_FEATURE_XFD, X86_FEATURE_XSAVE },
{}
};

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 299373669a5d..e60a20a1b24b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -175,6 +175,26 @@ static bool xfeature_is_supervisor(int xfeature_nr)
return ecx & 1;
}

+/**
+ * xfd_supported() - Check if the feature supports Extended Feature Disable (XFD).
+ * @feature_nr: The feature number.
+ *
+ * Returns: True if the state component supports XFD; otherwise, false.
+ */
+static bool xfd_supported(int feature_nr)
+{
+ u32 eax, ebx, ecx, edx;
+
+ if (!boot_cpu_has(X86_FEATURE_XFD))
+ return false;
+
+ /*
+ * If the state component supports XFD, ECX[2] returns 1; otherwise, 0.
+ */
+ cpuid_count(XSTATE_CPUID, feature_nr, &eax, &ebx, &ecx, &edx);
+ return ecx & 4;
+}
+
/**
* get_xstate_comp_offset() - Find the feature's offset in the compacted format
 * @mask: This bitmap tells which components are reserved in the format.
@@ -366,6 +386,9 @@ void fpu__init_cpu_xstate(void)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
xfeatures_mask_supervisor_dynamic());
}
+
+ if (boot_cpu_has(X86_FEATURE_XFD))
+ xfd_write(xfd_capable());
}

static bool xfeature_enabled(enum xfeature xfeature)
@@ -565,8 +588,9 @@ static void __init print_xstate_offset_size(void)
for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
if (!xfeature_enabled(i))
continue;
- pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
- i, xstate_comp_offsets[i], i, xstate_sizes[i]);
+ pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d (%s)\n",
+ i, xstate_comp_offsets[i], i, xstate_sizes[i],
+ (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "dynamic" : "default");
}
}

@@ -999,9 +1023,18 @@ void __init fpu__init_system_xstate(void)
}

xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
- /* Do not support the dynamically allocated buffer yet. */
xfeatures_mask_user_dynamic = 0;

+ for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ u64 feature_mask = BIT_ULL(i);
+
+ if (!(xfeatures_mask_user() & feature_mask))
+ continue;
+
+ if (xfd_supported(i))
+ xfeatures_mask_user_dynamic |= feature_mask;
+ }
+
/* Enable xstate instructions to be able to continue with initialization: */
fpu__init_cpu_xstate();
err = init_xstate_size();
@@ -1053,6 +1086,12 @@ void fpu__resume_cpu(void)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
xfeatures_mask_supervisor_dynamic());
}
+
+ if (boot_cpu_has(X86_FEATURE_XFD)) {
+ u64 fpu_xfd_mask = current->thread.fpu.state_mask & xfd_capable();
+
+ xfd_write(xfd_capable() ^ fpu_xfd_mask);
+ }
}

/**
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..7bd5d08eeb41 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

this_cpu_write(current_task, next_p);

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(prev_fpu, next_fpu);

/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d08307df69ad..5375a869f3f3 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -594,7 +594,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
this_cpu_write(current_task, next_p);
this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(prev_fpu, next_fpu);

/* Reload sp0. */
update_task_stack(next_p);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 853ea7a80806..7482448fcdca 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1111,6 +1111,18 @@ DEFINE_IDTENTRY(exc_device_not_available)
{
unsigned long cr0 = read_cr0();

+ if (boot_cpu_has(X86_FEATURE_XFD)) {
+ u64 event_mask;
+
+ rdmsrl_safe(MSR_IA32_XFD_ERR, &event_mask);
+ wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
+
+ if (event_mask) {
+ force_sig(SIGSEGV);
+ return;
+ }
+ }
+
#ifdef CONFIG_MATH_EMULATION
if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
struct math_emu_info info = { };
--
2.17.1

2021-05-23 19:42:00

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5 22/28] x86/fpu/amx: Enable the AMX feature in 64-bit mode

In 64-bit mode, include the AMX state components in
XFEATURE_MASK_USER_SUPPORTED.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Removed the irrelevant line from the changelog. (Thomas Gleixner)
---
arch/x86/include/asm/fpu/xstate.h | 3 ++-
arch/x86/kernel/fpu/init.c | 8 ++++++--
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 2c9156e4f799..a4751fbb7acd 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -34,7 +34,8 @@
XFEATURE_MASK_Hi16_ZMM | \
XFEATURE_MASK_PKRU | \
XFEATURE_MASK_BNDREGS | \
- XFEATURE_MASK_BNDCSR)
+ XFEATURE_MASK_BNDCSR | \
+ XFEATURE_MASK_XTILE)

/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index f2fcdcc979e7..046889f31037 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -219,8 +219,12 @@ static void __init fpu__init_system_xstate_size_legacy(void)
*/
u64 __init fpu__get_supported_xfeatures_mask(void)
{
- return XFEATURE_MASK_USER_SUPPORTED |
- XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+ u64 mask = XFEATURE_MASK_USER_SUPPORTED | XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+
+ if (!IS_ENABLED(CONFIG_X86_64))
+ mask &= ~(XFEATURE_MASK_XTILE);
+
+ return mask;
}

/* Legacy code to initialize eager fpu mode. */
--
2.17.1

2021-05-24 03:10:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 21/28] x86/fpu/amx: Initialize child's AMX state

On 5/23/21 12:32 PM, Chang S. Bae wrote:
> Assure that a forked child starts AMX registers in the INIT-state.
>
> Signed-off-by: Chang S. Bae <[email protected]>
> Reviewed-by: Len Brown <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> Changes from v4:
> * Added as a new patch. This was missing on previous versions.
> ---
> arch/x86/kernel/fpu/core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 016c3adebec3..cccfeafe81e5 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -285,6 +285,10 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
>
> fpregs_unlock();
>
> + /* AMX state is volatile, children do not inherit it. */
> + if (xfeatures_mask_all & XFEATURE_MASK_XTILE)
> + dst_fpu->state->xsave.header.xfeatures &= ~(XFEATURE_MASK_XTILE);
> +
> set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);
>
> trace_x86_fpu_copy_src(src_fpu);
>

If we're going to start having different states with different
behaviors, let's make them real defines.

#define XFEATURE_MASK_CLEARED_ON_CLONE ...

--Andy
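
For illustration, a minimal sketch of such a define and its use in
fpu__copy(), reusing the series' XFEATURE_MASK_XTILE; the
CLEARED_ON_CLONE name is only the suggestion above, not something the
posted patches define:

/* xstate components whose register contents are not inherited on clone */
#define XFEATURE_MASK_CLEARED_ON_CLONE	XFEATURE_MASK_XTILE

        /* in fpu__copy(), replacing the open-coded XTILE test: */
        if (xfeatures_mask_all & XFEATURE_MASK_CLEARED_ON_CLONE)
                dst_fpu->state->xsave.header.xfeatures &=
                                ~XFEATURE_MASK_CLEARED_ON_CLONE;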

2021-05-24 03:15:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On 5/23/21 12:32 PM, Chang S. Bae wrote:
> When AMX is enabled, and an AMX-task is saved, explicitly initialize the
> AMX state after the XSAVE.
>
> This assures that the kernel will only request idle states with clean AMX
> state. In the case of the C6 idle state, this allows the hardware to get to
> a deeper power saving condition.
>
> Signed-off-by: Chang S. Bae <[email protected]>
> Reviewed-by: Len Brown <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> Changes from v4:
> * Added as a new patch. (Thomas Gleixner)
> ---
> arch/x86/include/asm/special_insns.h | 6 ++++++
> arch/x86/kernel/fpu/core.c | 8 ++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 2acd6cb62328..f0ed063035eb 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -306,6 +306,12 @@ static inline int enqcmds(void __iomem *dst, const void *src)
> return 0;
> }
>
> +static inline void tile_release(void)
> +{
> + /* Instruction opcode for TILERELEASE; supported in binutils >= 2.36. */
> + asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
> +}
> +
> #endif /* __KERNEL__ */
>
> #endif /* _ASM_X86_SPECIAL_INSNS_H */
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index cccfeafe81e5..53a5869078b8 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -106,6 +106,14 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
> */
> if (fpu->state->xsave.header.xfeatures & XFEATURE_MASK_AVX512)
> fpu->avx512_timestamp = jiffies;
> +
> + /*
> + * Since the current task's state is safely in the XSAVE buffer, TILERELEASE
> + * the TILE registers to guarantee that dirty state will not interfere with the
> + * hardware's ability to enter the core C6 idle state.
> + */
> + if (fpu->state_mask & XFEATURE_MASK_XTILE_DATA)
> + tile_release();
> return 1;
> }
>
>

This looks wrong -- you should also invalidate the state. And doing it
in the save path seems inefficient.

Can we do this just when going idle?

--Andy

2021-05-24 03:17:49

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 24/28] x86/fpu/xstate: Use per-task xstate mask for saving xstate in signal frame

On 5/23/21 12:32 PM, Chang S. Bae wrote:
> Entering a signal handler, the kernel saves xstate in signal frame. The
> dynamic user state is better to be saved only when used. fpu->state_mask
> can help to exclude unused states.
>
> Returning from a signal handler, XRSTOR(S) re-initializes the excluded
> state components.

If I'm reading this right, it means that tasks that have ever used AMX
get one format and tasks that haven't get another one. Do we really
want that? We could instead look at the actual XINUSE mask.

BTW, the current FPU signal restore code is horrific. I'm not convinced
we want to make any changes to the signal handling until that code gets
cleaned up.

--Andy

2021-05-24 03:26:23

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 25/28] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state

On 5/23/21 12:32 PM, Chang S. Bae wrote:
> By default, for xstate features in the INIT-state, XSAVE writes zeros to
> the uncompressed destination buffer.
>
> E.g., if you are not using AVX-512, you will still get a bunch of zeros on
> the signal stack where live AVX-512 data would go.
>
> For 'dynamic user state' (currently only XTILEDATA), explicitly skip this
> data transfer. The result is that the user buffer for the AMX region will
> not be touched by XSAVE.

Why?

>
> Signed-off-by: Chang S. Bae <[email protected]>
> Reviewed-by: Len Brown <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> Changes from v4:
> * Added as a new patch.
> ---
> arch/x86/include/asm/fpu/internal.h | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 4a3436684805..131f2557fc85 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -354,11 +354,27 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
> */
> static inline int copy_xregs_to_user(struct xregs_state __user *buf)
> {
> - u64 mask = current->thread.fpu.state_mask;
> - u32 lmask = mask;
> - u32 hmask = mask >> 32;
> + u64 state_mask = current->thread.fpu.state_mask;
> + u64 dynamic_state_mask;
> + u32 lmask, hmask;
> int err;
>
> + dynamic_state_mask = state_mask & xfeatures_mask_user_dynamic;
> + if (dynamic_state_mask && boot_cpu_has(X86_FEATURE_XGETBV1)) {
> + u64 dynamic_xinuse, dynamic_init;
> + u64 xinuse = xgetbv(1);
> +
> + dynamic_xinuse = xinuse & dynamic_state_mask;
> + dynamic_init = ~(xinuse) & dynamic_state_mask;
> + if (dynamic_init) {
> + state_mask &= ~xfeatures_mask_user_dynamic;
> + state_mask |= dynamic_xinuse;

That's a long-winded way to say:

state_mask &= ~dynamic_init;

But what happens if we don't have the XGETBV1 feature? Are we making
AMX support depend on XGETBV1?

How does this patch interact with "[PATCH v5 24/28] x86/fpu/xstate: Use
per-task xstate mask for saving xstate in signal frame"? They seem to
be try to do something similar but not quite the same, and they seem to
be patching the same function. The result seems odd.

Finally, isn't part of the point that we need to avoid even *allocating*
space for non-AMX-using tasks? That would require writing out the
compacted format and/or fiddling with XCR0.

2021-05-24 14:08:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On 5/23/21 12:32 PM, Chang S. Bae wrote:
> + /*
> + * Since the current task's state is safely in the XSAVE buffer, TILERELEASE
> + * the TILE registers to guarantee that dirty state will not interfere with the
> + * hardware's ability to enter the core C6 idle state.
> + */

Is there an architectural reason to be so specific about this one idle
state? Are AMX and "C6 idle state" so intertwined that this comment
will age well?

I'm a bit worried that "C6 idle state" won't be meaningful to folks who
read this.

Could we maybe say:

/*
* Leaving state in the TILE registers may prevent the
* processor from entering low-power idle states. Use
* TILERELEASE to initialize the state. Destroying
* fpregs state is safe after the fpstate update.
*/

Also, referencing fpregs/fpstate is really nice because the code
doesn't actually say "XSAVE" anywhere.

> + if (fpu->state_mask & XFEATURE_MASK_XTILE_DATA)
> + tile_release();

Doesn't this tile_release() need a fpregs_deactivate()? Otherwise, the
next XRSTOR might get optimized away because it thinks there's still
good data in the fpregs.

Will this unnecessarily thwart the modified optimization in cases where
we go and run this task again without ever going out to userspace? Will
this impact context-switch latency for *EVERY* context switch in order
to go to a lower idle state in a few minutes, hours, or never?

2021-05-24 14:11:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On 5/23/21 8:13 PM, Andy Lutomirski wrote:
>> + /*
>> + * Since the current task's state is safely in the XSAVE buffer, TILERELEASE
>> + * the TILE registers to guarantee that dirty state will not interfere with the
>> + * hardware's ability to enter the core C6 idle state.
>> + */
>> + if (fpu->state_mask & XFEATURE_MASK_XTILE_DATA)
>> + tile_release();
>> return 1;
>> }
>>
>>
> This looks wrong -- you should also invalidate the state. And doing it
> in the save path seems inefficient.
>
> Can we do this just when going idle?

Chang, you might also want to talk with folks that do scheduler
performance work (I've cc'd Tim). I know we're always fighting to trim
down the idle and wakeup paths. There might be no other alternative,
but unconditionally forcing an AMX XRSTOR on return from idle might be
considered nasty.

2021-05-24 17:34:42

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On Mon, May 24, 2021 at 10:10 AM Dave Hansen <[email protected]> wrote:

> On 5/23/21 8:13 PM, Andy Lutomirski wrote:
> >
> > Can we do this just when going idle?
>
> Chang, you might also want to talk with folks that do scheduler
> performance work (I've cc'd Tim). I know we're always fighting to trim
> down the idle and wakeup paths. There might be no other alternative,
> but unconditionally forcing an AMX XRSTOR on return from idle might be
> considered nasty.

I'm not excited about burdening the generic idle path with a CPU
feature specific check
that would need to be checked on every idle entry.

thanks,
Len Brown, Intel Open Source Technology Center

2021-05-24 17:38:20

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On Mon, May 24, 2021 at 10:07 AM Dave Hansen <[email protected]> wrote:

> Could we maybe say:
>
> /*
> * Leaving state in the TILE registers may prevent the
> * processor from entering low-power idle states. Use
> * TILERELEASE to initialize the state. Destroying
> * fpregs state is safe after the fpstate update.
> */

Ack

> Also, referencing fpregs/fpstate is really nice because the code
> doesn't actually say "XSAVE" anywhere.
>
> > + if (fpu->state_mask & XFEATURE_MASK_XTILE_DATA)
> > + tile_release();
>
> Doesn't this tile_release() need a fpregs_deactivate()? Otherwise, the
> next XRSTOR might get optimized away because it thinks there's still
> good data in the fpregs.
>
> Will this unnecessarily thwart the modified optimization in cases where
> we go and run this task again without ever going out to userspace? Will
> this impact context-switch latency for *EVERY* context switch in order
> to go to a lower idle state in a few minutes, hours, or never?

yeah, seems we missed that.


thanks!
Len Brown, Intel Open Source Technology Center

2021-05-24 17:40:50

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 21/28] x86/fpu/amx: Initialize child's AMX state

On Sun, May 23, 2021 at 11:09 PM Andy Lutomirski <[email protected]> wrote:

> If we're going to start having different states with different
> behaviors, let's make them real defines.
>
> #define XFEATURE_MASK_CLEARED_ON_CLONE ...

Probably a good idea.

However, "cleared on clone" isn't the property.
The property is that the state is treated as VOLATILE.

Len Brown, Intel Open Source Technology Center

2021-05-24 17:41:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On 5/24/21 10:32 AM, Len Brown wrote:
> On Mon, May 24, 2021 at 10:10 AM Dave Hansen <[email protected]> wrote:
>> On 5/23/21 8:13 PM, Andy Lutomirski wrote:
>>> Can we do this just when going idle?
>> Chang, you might also want to talk with folks that do scheduler
>> performance work (I've cc'd Tim). I know we're always fighting to trim
>> down the idle and wakeup paths. There might be no other alternative,
>> but unconditionally forcing an AMX XRSTOR on return from idle might be
>> considered nasty.
> I'm not excited about burdening the generic idle path with a CPU
> feature specific check that would need to be checked on every idle
> entry.

Me neither.

But, the check itself should be cheap. A cpu_feature_enabled(AMX) check
will eliminate even the cost of a branch on systems without AMX. You
could probably even get fancy and also use a static branch that doesn't
get enabled until the first AMX user shows up.
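
A rough sketch of that shape, assuming a hypothetical static key and
feature-flag name (X86_FEATURE_AMX_TILE); where exactly the hook would
sit in the idle path is left open:

DEFINE_STATIC_KEY_FALSE(amx_ever_used);

/* flipped once, e.g. from the #NM handler, when the first AMX user appears */
static void note_first_amx_user(void)
{
        static_branch_enable(&amx_ever_used);
}

/* on the idle entry path */
static inline void idle_amx_prepare(void)
{
        /* patched out entirely on hardware without AMX */
        if (!cpu_feature_enabled(X86_FEATURE_AMX_TILE))
                return;

        if (static_branch_unlikely(&amx_ever_used))
                tile_release();         /* plus fpregs invalidation */
}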

2021-05-24 18:10:57

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 24/28] x86/fpu/xstate: Use per-task xstate mask for saving xstate in signal frame

On Sun, May 23, 2021 at 11:15 PM Andy Lutomirski <[email protected]> wrote:
>
> If I'm reading this right, it means that tasks that have ever used AMX
> get one format and tasks that haven't get another one.

No. The format of the XSTATE on the signal stack is uncompressed XSAVE
format for both AMX and non-AMX tasks, both before and after this patch.
That is because XSAVE gets the format from XCR0. It gets the fields
to write from the run-time parameter.

So the change here allows a non-AMX task to skip writing data (zeros)
to the AMX region of its XSTATE buffer.

The subsequent patch adds the further optimization of (manually) checking
for INIT state for an AMX-task and also skip writing data (zeros) in that case.

We should have done this optimization for AVX-512, but instead we
guaranteed writing zeros, which I think is a waste of both transfer time
and cache footprint.

Len Brown, Intel Open Source Technology Center

2021-05-24 18:17:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 21/28] x86/fpu/amx: Initialize child's AMX state



On Mon, May 24, 2021, at 10:37 AM, Len Brown wrote:
> On Sun, May 23, 2021 at 11:09 PM Andy Lutomirski <[email protected]> wrote:
>
> > If we're going to start having different states with different
> > behaviors, let's make them real defines.
> >
> > #define XFEATURE_MASK_CLEARED_ON_CLONE ...
>
> Probably a good idea.
>
> However, "cleared on clone" isn't the property.
> The property is that the state is treated as VOLATILE.
>

What does VOLATILE mean in this context?

2021-05-24 18:17:55

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 25/28] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state

On Sun, May 23, 2021 at 11:28 PM Andy Lutomirski <[email protected]> wrote:

> But what happens if we don't have the XGETBV1 feature? Are we making
> AMX support depend on XGETBV1?

Yes, AMX systems always have XGETBV.

> How does this patch interact with "[PATCH v5 24/28] x86/fpu/xstate: Use
> per-task xstate mask for saving xstate in signal frame"? They seem to
> be try to do something similar but not quite the same, and they seem to
> be patching the same function. The result seems odd.

The previous patch allowed non-AMX tasks to skip writing 8KB of zeros
to their signal stack. This patch builds on that to allow an AMX-task
in INIT to also skip writing 8KB of zeros to its signal stack.

> Finally, isn't part of the point that we need to avoid even *allocating*
> space for non-AMX-using tasks? That would require writing out the
> compacted format and/or fiddling with XCR0.

The current signal stack ABI is that the user receives all architectural
state in uncompressed XSTATE format on the signal stack. They can XRSTOR
it and the right thing happens.

We can optimize the data transfer (and we have), but we can't optimize the space
without changing that ABI.

Again, I'm happy to investigate a new opt-in fast-signal ABI that doesn't
have all this state. But so far, nobody has asked for it, or even identified
an application that cares about minimum-overhead signals.
Can you recommend one?

thanks,
Len Brown, Intel Open Source Technology Center

2021-05-24 18:22:04

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 21/28] x86/fpu/amx: Initialize child's AMX state

On Mon, May 24, 2021 at 2:14 PM Andy Lutomirski <[email protected]> wrote:

> What does VOLATILE mean in this context?

Volatile means caller-saved.

Volatile registers can not be used for globals, static, or for
parameter passing.

ie. By the time the callee is running, they must be assumed to be invalid.

This means that any routine, including the target of a system call,
such as fork/clone, can't assume that any data exists in these
registers.

--
Len Brown, Intel Open Source Technology Center

2021-05-24 18:25:53

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On Mon, May 24, 2021 at 1:39 PM Dave Hansen <[email protected]> wrote:

> >> might be considered nasty.

> > I'm not excited about burdening the generic idle path with a CPU
> > feature specific check that would need to be checked on every idle
> > entry.
>
> Me neither.
>
> But, the check itself should be cheap. A cpu_feature_enabled(AMX) check
> will eliminate even the cost of a branch on systems without AMX. You
> could probably even get fancy and also use a static branch that doesn't
> get enabled until the first AMX user shows up.

It isn't just the hardware run-time cost.
It is the source code complexity.
That code is 100% generic.

Len Brown, Intel Open Source Technology Center

2021-05-24 18:31:13

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 25/28] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state

On 5/24/21 11:15 AM, Len Brown wrote:
> On Sun, May 23, 2021 at 11:28 PM Andy Lutomirski <[email protected]> wrote:
>> But what happens if we don't have the XGETBV1 feature? Are we making
>> AMX support depend on XGETBV1?
>
> Yes, AMX systems always have XGETBV.

XGETBV1 support is actually separate from the XGETBV instruction itself.
XGETBV1 even has its own CPUID bit.

This means that a silly hypervisor or lazy future hardware
implementation could decide to enumerate support for AMX and neglect
XGETBV1.

But, it's perfectly fine to have our AMX support depend on XGETBV1. We
even have a little dependency checker table which will clear the AMX
feature in the presence of one of those silly hypervisors.

2021-05-24 21:40:10

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v5-fix 28/28] x86/fpu/amx: Clear the AMX state when appropriate

When AMX is enabled, and an AMX-task is saved, explicitly initialize the
AMX state after the XSAVE.

This assures that the kernel will only request idle states with clean AMX
state. In the case of the C6 idle state, this allows the hardware to get to
a deeper power saving condition.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Fixed to deactivate fpregs. (Andy Lutomirski and Dave Hansen)
* Updated the code comment. (Dave Hansen)

Changes from v4:
* Added as a new patch. (Thomas Gleixner)
---
arch/x86/include/asm/special_insns.h | 6 ++++++
arch/x86/kernel/fpu/core.c | 11 +++++++++++
2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2acd6cb62328..f0ed063035eb 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -306,6 +306,12 @@ static inline int enqcmds(void __iomem *dst, const void *src)
return 0;
}

+static inline void tile_release(void)
+{
+ /* Instruction opcode for TILERELEASE; supported in binutils >= 2.36. */
+ asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
+}
+
#endif /* __KERNEL__ */

#endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index cccfeafe81e5..14c8216d9a39 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -106,6 +106,17 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
*/
if (fpu->state->xsave.header.xfeatures & XFEATURE_MASK_AVX512)
fpu->avx512_timestamp = jiffies;
+
+ /*
+ * Leaving state in the TILE registers may prevent the
+ * processor from entering low-power idle states. Use
+ * TILERELEASE to initialize the state. Destroying
+ * fpregs state is safe after the fpstate update.
+ */
+ if (fpu->state_mask & XFEATURE_MASK_XTILE_DATA) {
+ tile_release();
+ fpregs_deactivate(fpu);
+ }
return 1;
}

--
2.17.1

2021-05-24 23:13:01

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 15/28] x86/arch_prctl: Create ARCH_GET_XSTATE/ARCH_PUT_XSTATE

On Sun, May 23, 2021 at 3:39 PM Chang S. Bae <[email protected]> wrote:
>
> N.B. This interface is currently under active discussion on LKML. This [v4]]
> proposal implements a per-task system call with GET/PUT semantics. A
> per-process system call without PUT semantics may be superior.

There is a better way...

In the v5 strawman here, every library routine that uses AMX
must precede that use with a system call
requesting permission from the kernel,
and then a system call yielding permission.
These calls must be always present because
this API requires that it be invoked per task,
but the library has no idea if the current thread has called the API or not.

The two concepts at play are "fine grained permission" and "buffer management".
We can improve on this by treating them as dependent, rather than equivalent.

The reality is that process-wide, rather than task-specific permission
is sufficient.
For if there exists a use-case where an administrator would want to grant
AMX access to one task in a process, but deny it from another, I have
yet to imagine it,
and I pity the application/library programmer who would have to face it.

Further, I submit that granting permission to an application,
but then denying permission to a subsequent request is unrealistic.
Applications will have already initialized their threads
and data structures based on the capabilities of the hardware.
Asking them to cope with such a change at run time is not reasonable.

The reality is that if they actually look at a failed return code,
they will exit.
If they ignore the return code and proceed, they will trap and crash.
Not likely will programmers be excited, or willing, to write code
to actually handle dynamic loss of permission.

The down-side of permission (whether it be per-process as proposed here,
or per-task, as in the v5 patch) is that to check if AMX is enabled, user-space
must check three things instead of two:

1. CPUID has AMX
2. XCR0 has AMX
3. Linux permission has been requested and granted to this process

If we accept that fine-grained permission is required, I don't see a practical
or efficient way to do it without #3. (No, our hardware can not trap CPUID
reads, or avoid VMEXIT on XCR0 changes)

And so I propose:
1. Per-process permission (not per-task).
2. Permission, once granted, remains valid for the lifetime of that process.

And so any access to AMX before this process-wide permission is
granted will fail,
and any access to AMX after process-wide permission is granted will succeed.

Period.

Which brings us to context switch buffer management.

After per-process permission is granted, we have two options on how
to handle context switch buffer management, both have merit:

1. On-demand. Any task in a process that has AMX permission can touch AMX.
When it does, it takes a #NM, the kernel allocates the 8KB buffer, disarms XFD
and returns. This is how v4 of this patch series worked.

The first benefit of on-demand is that user-space isn't mandated to do any
more Linux-specific system calls after the per-process permission is granted.

The second benefit of on-demand is that if a process has 1,000 threads
and only 8 of them in a pool actually touch AMX, then 8 buffers will be
allocated, not 1,000.

The dis-advantage of on-demand is that there is no buffer release mechanism --
the buffer lives as long as the task lives. Though, per previous conversation,
a future kernel could easily implement a buffer re-claim mechanism
behind the scenes
since the kernel is empowered to re-arm XFD for whatever reason it wants...

2. Synchronous allocation. Any task in the process that has AMX permission can
make a 2nd system call to request that the kernel synchronously allocate the
8KB buffer for that task. *

* This could also be implemented to mean "allocate for this task
and upon the creation of all future threads in this process".
Though doing so could result in cases where the kernel allocates
many more buffers than are actually necessary.

The benefit of synchronous allocation is that after an application has created
all of its threads, it knows that it would have got a synchronous error code
from this special system call if any of them failed to allocate.

Note that detection of allocation failure upon thread creation
could also be implemented using dynamic allocation by simply touching
AMX state --
except the failure would be delivered through a signal,
rather than an error code from a special system call.

Although it is possible, and it is implemented in v5, I don't advocate that
synchronous allocation have a matching synchronous de-allocation.
I don't think programmers will use it; and I don't see a case for
complicating the kernel code with reference counters that are unused.

So the value proposition for synchronous allocation is thin, but it exists.

My recommendation is to implement both #1 and #2, and to document
that #1 may fail you with a signal, while #2 under the same scenario
would fail you with a return code. If you are a programmer that prefers
that error code rather than writing a signal handler, then use the
new system call.

-Len Brown
Intel Open Source Technology Center

2021-05-25 04:02:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 21/28] x86/fpu/amx: Initialize child's AMX state



On Mon, May 24, 2021, at 11:21 AM, Len Brown wrote:
> On Mon, May 24, 2021 at 2:14 PM Andy Lutomirski <[email protected]> wrote:
>
> > What does VOLATILE mean in this context?
>
> Volatile means caller-saved.

Just like every other extended math register except some XMMs on Windows. (Thanks you so, so much, Microsoft, for screwing this up, and thank you Intel for indulging Microsoft.)

>
> Volatile registers can not be used for globals, static, or for
> parameter passing.
>
> ie. By the time the callee is running, they must be assumed to be invalid.

Callees can’t assume anything about any of the registers unless explicitly specified. TILE is no different from RBP or XMM in this regard.

>
> This means that any routine, including the target of a system call,
> such as fork/clone, can't assume that any data exists in these
> registers.
>

If we actually believe this, then we should clear xmm, ymm, zmm, etc on every system call. Barring that, let’s come up with reasonable semantics, document it, and implement it.

2021-05-25 04:49:32

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 25/28] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state



On Mon, May 24, 2021, at 11:15 AM, Len Brown wrote:
> On Sun, May 23, 2021 at 11:28 PM Andy Lutomirski <[email protected]> wrote:
>
> > But what happens if we don't have the XGETBV1 feature? Are we making
> > AMX support depend on XGETBV1?
>
> Yes, AMX systems always have XGETBV.
>
> > How does this patch interact with "[PATCH v5 24/28] x86/fpu/xstate: Use
> > per-task xstate mask for saving xstate in signal frame"? They seem to
> > be try to do something similar but not quite the same, and they seem to
> > be patching the same function. The result seems odd.
>
> The previous patch allowed non-AMX tasks to skip writing 8KB of zeros
> to their signal stack. This patch builds on that to allow an AMX-task
> in INIT to also skip writing 8KB of zeros to its signal stack.

If we ever have a task that is “non-AMX” but has non-INIT AMX, then either we or the hardware has seriously screwed up.

So the logic boils down to: if (a) optimize; if (b) optimize; where a implies b. I maintain my objection — I think this is redundant.

>
> > Finally, isn't part of the point that we need to avoid even *allocating*
> > space for non-AMX-using tasks? That would require writing out the
> > compacted format and/or fiddling with XCR0.
>
> The current signal stack ABI is that the user receives all
> architectural state in
> uncompressed XSTATE format on the signal stack. They can XRSTOR it
> and the right thing happens.
>
> We can optimize the data transfer (and we have), but we can't optimize the space
> without changing that ABI.

We are between a rock and a hard place here. On the one hand, our ABI (supposedly — no one seems to have really confirmed this) has the signal frame in uncompacted form. On the other hand, our ABI most definitely has signal frames under 2kB by a decent amount.

I think we should apply a simple patch to Linux to add a boot parameter x86_extra_signal_space=8192 that will pad the signal frame by 8192 bytes. And then we ask people to seriously test their workloads with that parameter set. If things break, then AMX in its current proposed form may be off the table.

At least my dynamic XCR0 proposal has the property that legacy programs and AMX can reliably coexist on the same system.
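
A minimal sketch of such a parameter; the name follows the suggestion
above, and the hook-up into the sigframe size calculation is left as a
comment rather than spelled out:

static unsigned long x86_extra_signal_space __ro_after_init;

static int __init setup_extra_signal_space(char *str)
{
        if (!str)
                return -EINVAL;
        return kstrtoul(str, 0, &x86_extra_signal_space);
}
early_param("x86_extra_signal_space", setup_extra_signal_space);

/*
 * ... and add x86_extra_signal_space wherever the xstate signal frame
 * size is computed (e.g. fpu__init_prepare_fx_sw_frame()).
 */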

2021-05-25 04:52:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 24/28] x86/fpu/xstate: Use per-task xstate mask for saving xstate in signal frame



On Mon, May 24, 2021, at 11:06 AM, Len Brown wrote:
> On Sun, May 23, 2021 at 11:15 PM Andy Lutomirski <[email protected]> wrote:
> >
> > If I'm reading this right, it means that tasks that have ever used AMX
> > get one format and tasks that haven't get another one.
>
> No. The format of the XSTATE on the signal stack is uncompressed XSAVE
> format for both AMX and non-AMX tasks, both before and after this patch.
> That is because XSAVE gets the format from XCR0. It gets the fields
> to write from the run-time parameter.
>
> So the change here allows a non-AMX task to skip writing data (zeros)
> to the AMX region of its XSTATE buffer.

I misread the patch. I still think this patch is useless.

>
> The subsequent patch adds the further optimization of (manually) checking
> for INIT state for an AMX-task and also skip writing data (zeros) in that case.
>
> We should have done this optimization for AVX-512, but instead we
> guaranteed writing zeros, which I think is a waste of both transfer time
> and cache footprint.

If no one depends on it, it’s not ABI.

2021-05-25 14:06:59

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 24/28] x86/fpu/xstate: Use per-task xstate mask for saving xstate in signal frame

On Tue, May 25, 2021 at 12:48 AM Andy Lutomirski <[email protected]> wrote:
>
>
>
> On Mon, May 24, 2021, at 11:06 AM, Len Brown wrote:
> > On Sun, May 23, 2021 at 11:15 PM Andy Lutomirski <[email protected]> wrote:
> > >
> > > If I'm reading this right, it means that tasks that have ever used AMX
> > > get one format and tasks that haven't get another one.
> >
> > No. The format of the XSTATE on the signal stack is uncompressed XSAVE
> > format for both AMX and non-AMX tasks, both before and after this patch.
> > That is because XSAVE gets the format from XCR0. It gets the fields
> > to write from the run-time parameter.
> >
> > So the change here allows a non-AMX task to skip writing data (zeros)
> > to the AMX region of its XSTATE buffer.
>
> I misread the patch. I still think this patch is useless.

This patch allows XSAVE to skip writing 8KB of zeros, rather than
writing them unconditionally.
This reduces both the cycle count and the cache impact of a context switch.
Some might consider that useful, rather than useless.

> > The subsequent patch adds the further optimization of (manually) checking
> > for INIT state for an AMX-task and also skip writing data (zeros) in that case.
> >
> > We should have done this optimization for AVX-512, but instead we
> > guaranteed writing zeros, which I think is a waste of both transfer time
> > and cache footprint.
>
> If no one depends on it, it’s not ABI.

Agreed.
Perhaps in the future we can see if reducing AVX-512 cache footprint
this same way is beneficial.

--
Len Brown, Intel Open Source Technology Center

2021-05-25 18:11:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v5 15/28] x86/arch_prctl: Create ARCH_GET_XSTATE/ARCH_PUT_XSTATE

On Mon, May 24, 2021 at 07:10:57PM -0400, Len Brown wrote:
> 1. CPUID has AMX
> 2. XCR0 has AMX
> 3. Linux permission has been requested and granted to this process

Actually, you want *only* 3 as 1 is a bad idea - we're in this mess
because userspace does feature detection on its own even when kernel
support is needed.

When Linux grants the permission, 1 and 2 should be implicitly given.

> The dis-advantage of on-demand is that there is no buffer release mechanism --
> the buffer lives as long as the task lives. Though, per previous conversation,
> a future kernel could easily implement a buffer re-claim mechanism
> behind the scenes
> since the kernel is empowered to re-arm XFD for whatever reason it wants...

Why is buffer release even needed? 1) sounds like a simple and clean thing to
do.

> 2. Synchronous allocation. Any task in the process that has AMX permission can
> make a 2nd system call to request that the kernel synchronously allocate the
> 8KB buffer for that task. *

That doesn't sound as clean. More like unneeded work on the side of
userspace programmer which she/he can save her-/himself from.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-25 18:16:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 15/28] x86/arch_prctl: Create ARCH_GET_XSTATE/ARCH_PUT_XSTATE

On 5/25/21 10:27 AM, Borislav Petkov wrote:
> On Mon, May 24, 2021 at 07:10:57PM -0400, Len Brown wrote:
>> 1. CPUID has AMX
>> 2. XCR0 has AMX
>> 3. Linux permission has been requested and granted to this process
> Actually, you want *only* 3 as 1 is a bad idea - we're in this mess
> because userspace does feature detection on its own even when kernel
> support is needed.
>
> When Linux grants the permission, 1 and 2 should be implicitly given.

We did this for pkeys. Essentially, we said, "don't bother with
CPUID.OSPKE and XCRO[PKRU] checking". Just call alloc_pkey() and we'll
do the checking for you. One stop shopping.

So far, it's worked well.
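
From user space, that "one stop shopping" would look roughly like the
sketch below; the ARCH_GET_XSTATE name comes from the in-flight proposal,
and its numeric value here is only a placeholder:

#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#ifndef ARCH_GET_XSTATE
#define ARCH_GET_XSTATE         0x1021          /* placeholder value */
#endif
#define XFEATURE_MASK_XTILE     (0x3ULL << 17)  /* XTILECFG | XTILEDATA */

int main(void)
{
        /* one call checks kernel support, XCR0 and per-process permission */
        if (syscall(SYS_arch_prctl, ARCH_GET_XSTATE, XFEATURE_MASK_XTILE)) {
                fprintf(stderr, "AMX not available or not permitted\n");
                return 1;
        }

        /* ... TILE/TMUL instructions are safe to use from here on ... */
        return 0;
}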

2021-05-25 19:54:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 15/28] x86/arch_prctl: Create ARCH_GET_XSTATE/ARCH_PUT_XSTATE

> The kernel enforces access to the specified state using XFD hardware support. By
> default, XFD is armed and results in #NM traps on un-authorized access.
> Upon successful ARCH_GET_XSTATE, XFD traps are dis-armed and the user is
> free to access the feature.

Does this really need to talk about XFD?

I also don't really like this talking about being "authorized" or not.
Isn't this interface simply to give userspace the opportunity to
deterministically avoid being killed by signals from ENOMEM during a #NM?

I'd also define the behavior a bit more generically. Maybe:

After a successful ARCH_GET_XSTATE, the kernel guarantees that no #NM
exception will be generated for access to any of the specified XSAVE
features. This guarantee will persist until at least the point where a
ARCH_PUT_XSTATE operation occurs, possibly longer.

The kernel may choose to return an error for any ARCH_GET_XSTATE request
at any time, even if a prior one succeeds. This might be as a result of
a memory allocation failure, resource exhaustion, or exceeding the
implementation's limits for "outstanding" ARCH_GET_XSTATE operations.

The kernel may return errors if the number of ARCH_PUT_XSTATE operations
for a given XSAVE feature exceeds the number of ARCH_GET_XSTATE operations.

--

Note that there's no discussion of XFD.

> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 25c9c7dad3f9..016c3adebec3 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -254,6 +254,8 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
>
> WARN_ON_FPU(src_fpu != &current->thread.fpu);
>
> + dst_fpu->refcount = NULL;

For the future, don't forget to call out the fork/exec() behavior.

> /*
> * The child does not inherit the dynamic states. Thus, use the buffer
> * embedded in struct task_struct, which has the minimum size.
> @@ -541,3 +543,15 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
> */
> return 0;
> }
> +
> +/**
> + * free_fpu() - Free up memory that belongs to the FPU context.
> + * @fpu: A struct fpu * pointer
> + *
> + * Return: Nothing
> + */
> +void free_fpu(struct fpu *fpu)
> +{
> + kfree(fpu->refcount);
> + free_xstate_buffer(fpu);
> +}

FWIW, I don't think that needs a formal kdoc comment.

> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index e60a20a1b24b..126c4a509669 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -21,6 +21,7 @@
> #include <asm/tlbflush.h>
> #include <asm/cpufeature.h>
> #include <asm/trace/fpu.h>
> +#include <asm/prctl.h>
>
> /*
> * Although we spell it out in here, the Processor Trace
> @@ -78,6 +79,11 @@ static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFE
> * byte boundary. Otherwise, it follows the preceding component immediately.
> */
> static bool xstate_aligns[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = false};
> +/*
> + * Remember the index number in the reference counter array that supports
> + * access request. '-1' indicates that a state component does not support it.
> + */
> +static unsigned int xstate_refcount_idx[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};

For now when we have a single feature, isn't this overkill? Also, even
if you decide to keep this, there are only 63 possible XSAVE features.
We don't need 'unsigned int' for storing a maximum value of 63.

> /**
> * struct fpu_xstate_buffer_config - xstate per-task buffer configuration
> @@ -969,8 +975,7 @@ void __init fpu__init_system_xstate(void)
> {
> unsigned int eax, ebx, ecx, edx;
> static int on_boot_cpu __initdata = 1;
> - int err;
> - int i;
> + int err, i, j;
>
> WARN_ON_FPU(!on_boot_cpu);
> on_boot_cpu = 0;
> @@ -1025,14 +1030,17 @@ void __init fpu__init_system_xstate(void)
> xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
> xfeatures_mask_user_dynamic = 0;
>
> - for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
> + for (i = FIRST_EXTENDED_XFEATURE, j = 0; i < XFEATURE_MAX; i++) {
> u64 feature_mask = BIT_ULL(i);
>
> if (!(xfeatures_mask_user() & feature_mask))
> continue;
>
> - if (xfd_supported(i))
> + if (xfd_supported(i)) {
> xfeatures_mask_user_dynamic |= feature_mask;
> + xstate_refcount_idx[i] = j;
> + j++;
> + }
> }
>
> /* Enable xstate instructions to be able to continue with initialization: */
> @@ -1339,6 +1347,93 @@ int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
> return 0;
> }
>
> +/**
> + * do_arch_prctl_xstate() - Handle xstate-related arch_prctl requests.

Not the most helpful patch description.

> + * @fpu: A struct fpu * pointer
> + * @option: A subfunction of arch_prctl()
> + * @mask: A xstate-component bitmap
> + *
> + * Return: 0 if successful; otherwise, return a relevant error code.
> + */
> +long do_arch_prctl_xstate(struct fpu *fpu, int option, unsigned long mask)
> +{
> + bool need_xfd_update = false;
> + int i;
> +
> + switch (option) {
> + case ARCH_GET_XSTATE: {
> + int err = 0;
> +
> + if (mask & ~xfeatures_mask_user())
> + return -EPERM;

This would also return -EPERM for unknown features. That's a bit odd.

How about just -EINVAL, to cover all cases: supervisor or unknown?

> + if (!fpu->refcount) {
> + fpu->refcount = kcalloc(hweight64(xfeatures_mask_user_dynamic),
> + sizeof(int), GFP_KERNEL);
> + if (!fpu->refcount)
> + return -ENOMEM;
> + }

If someone calls this on a non-XFD system, this kcalloc() will fail.
It's a bit odd that if I say "get XSTATE_FP", it returns -ENOMEM.

Maybe you could check 'mask" against xfeatures_mask_user_dynamic up front.

IIRC, this dynamic allocation costs 32 bytes of kmalloc() space for a
single integer, plus the pointer. This would be simpler, faster and
smaller if just a single XFD feature was supported for now.

> + for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
> + unsigned int idx = xstate_refcount_idx[i];
> +
> + if ((idx == -1) || !(BIT_ULL(i) & mask))
> + continue;
> +
> + if (fpu->refcount[idx] == INT_MAX)
> + return -EINVAL;
> +
> + fpu->refcount[idx]++;
> + }

Let's say you have 5 xfeatures that support XFD. The first 4 have their
fpu->refcount[]++ and the fifth hits the limit. This will bump those 4
refcounts and then return -EINVAL. How could the user ever recover from
that?

Also, a few comments in here would really help. It's bare and fairly
hard to grok at the moment. I *think* it's guaranteed by this point
that at least *ONE* refcount has to be bumped (or the kcalloc() would
fail), but it took me a while to convince myself that it works.

> + if ((mask & fpu->state_mask) == mask)
> + return 0;
> +
> + err = alloc_xstate_buffer(fpu, mask);
> + if (!err)
> + need_xfd_update = true;
> + else
> + return err;

'return' without dropping the refcounts?

> + break;
> + }
> + case ARCH_PUT_XSTATE: {
> + if (mask & ~xfeatures_mask_user())
> + return -EPERM;
> +
> + if (!fpu->refcount)
> + return -EINVAL;

This needs a comment:

/* No successful GET_XSTATE was ever performed */

> + for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
> + int idx = xstate_refcount_idx[i];
> + u64 feature_mask = BIT_ULL(i);
> +
> + if ((idx == -1) || !(feature_mask & mask))
> + continue;
> +
> + if (fpu->refcount[idx] <= 0)
> + return -EINVAL;

This has the same bug as the upper loop.

> + fpu->refcount[idx]--;
> + if (!fpu->refcount[idx]) {
> + need_xfd_update = true;
> + fpu->state_mask &= ~(feature_mask);
> + }

Because of that bug, it's possible to return from this without getting
to the 'need_xfd_update' below.

> + }
> + break;
> + }
> + default:
> + return -EINVAL;
> + }
> +
> + if (need_xfd_update) {
> + u64 fpu_xfd_mask = fpu->state_mask & xfd_capable();
> +
> + xfd_write(xfd_capable() ^ fpu_xfd_mask);
> + }
> + return 0;
> +}
> +
> static void fill_gap(struct membuf *to, unsigned *last, unsigned offset)
> {
> if (*last >= offset)
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 5252464a27e3..c166243f64e4 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -98,6 +98,12 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
> *size = get_xstate_config(XSTATE_MIN_SIZE);
> }
>
> +void arch_release_task_struct(struct task_struct *tsk)
> +{
> + if (boot_cpu_has(X86_FEATURE_FPU))
> + free_fpu(&tsk->thread.fpu);
> +}
> +
> /*
> * Free thread data structures etc..
> */
> @@ -990,13 +996,16 @@ unsigned long get_wchan(struct task_struct *p)
> }
>
> long do_arch_prctl_common(struct task_struct *task, int option,
> - unsigned long cpuid_enabled)
> + unsigned long arg2)
> {
> switch (option) {
> case ARCH_GET_CPUID:
> return get_cpuid_mode();
> case ARCH_SET_CPUID:
> - return set_cpuid_mode(task, cpuid_enabled);
> + return set_cpuid_mode(task, arg2);
> + case ARCH_GET_XSTATE:
> + case ARCH_PUT_XSTATE:
> + return do_arch_prctl_xstate(&task->thread.fpu, option, arg2);
> }
>
> return -EINVAL;
> --
> 2.17.1
>

2021-05-26 02:44:03

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 15/28] x86/arch_prctl: Create ARCH_GET_XSTATE/ARCH_PUT_XSTATE

After today's discussion, I believe we are close to consensus on this plan:

1. Kernel sets XCR0.AMX=1 at boot, and leaves it set, always.

2. Kernel arms XFD for all tasks, by default.

3. New prctl() system call allows any task in a process to ask for AMX
permission for all tasks in that process. Permission is granted for
the lifetime of that process, and there is no interface for a process
to "un-request" permission.

4. If a task touches AMX without permission, #NM/signal kills the process

5. If a task touches AMX with permission, #NM handler will
transparently allocate a context switch buffer, and disarm XFD for
that task. (MSR_XFD is context switched per task)

6. If the #NM handler can not allocate the 8KB buffer, the task will
receive a signal at the instruction that took the #NM fault, likely
resulting in program exit.

7. In addition, a 2nd system call to request that buffers be
pre-allocated is available. This is a per task system call. This
synchronous allocate system call will return an error code if it
fails, which will also likely result in program exit.

8. NEW TODAY: Linux will exclude the AMX 8KB region from the XSTATE on
the signal stack for tasks in processes that do not have AMX permission.

9. For tasks in processes that have requested and received AMX
permission, Linux will XSAVE/XRSTOR directly to/from the signal
stack, and the stack will always include the 8KB space for AMX. (we do
have a manual optimization in place to skip writing zeros to the
stack frame if AMX=INIT)

10. Linux reserves the right to plumb the new permission syscall into
cgroup administrative interface in the future.

Comments:

Legacy software will not see signal stack growth on AMX hardware.

New AMX software will see AMX state on the signal stack.

If new AMX software uses an alternative signal stack, it should be
built using the signal.h ABI in glibc 2.34 or later, so that it can
calculate the appropriate size for the current hardware. Note that
non-AMX software that is newly built will get the same answer from the
ABI, which would handle the case if it does use AMX.

Today it is possible for an application to calculate the uncompressed
XSTATE size from XCR0 and CPUID, allocate buffers of that size, and
use XSAVE and XRSTOR on those buffers in user-space. Applications
can also XRSTOR from (and XSAVE back to) the signal stack, if they
choose. Now, this capability can break for non-AMX programs, because
their XSAVE will be 8KB larger than the buffer that they XRSTOR.
Andy L questions whether such applications actually exist, and Thomas
states that even if they do, that is a much smaller problem than 8KB
signal stack growth would be for legacy applications.
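
For reference, the user-space size calculation described here boils down
to one CPUID query; a sketch (error handling omitted):

#include <cpuid.h>

/*
 * Bytes needed for an uncompacted XSAVE area covering every feature
 * currently enabled in XCR0. Once XCR0.AMX is set, this result grows
 * by roughly 8KB, which is where the mismatch described above comes from.
 */
static unsigned int xsave_size_for_current_xcr0(void)
{
        unsigned int eax, ebx, ecx, edx;

        __get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx);
        return ebx;
}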

Unclear if we have consensus on the need for a synchronous allocation
system call (#7 above). Observe that this system call does not
improve the likelihood of failure or the timing of failure. An
#NM-based allocation can be done at exactly the same spot by simply
touching a TMM register. The benefit of this system call is that it
returns an error code to the caller, versus the program being
delivered a SIGSEGV at the offending instruction pointer. Both will
likely result in the program exiting, and at the same point in
execution.

A future mechanism to lazily harvest not-recently-used context switch
buffers has been discussed. E.g., the kernel under low memory could
re-arm XFD for all AMX tasks, and if their buffers are clean, free
them. If that mechanism is implemented, and we also implement the
synchronous allocation system call, that mechanism must respect the
guarantee made by that system call and not harvest
system-call-allocated buffers.

Len Brown, Intel Open Source Technology Center
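
Putting steps 4-6 of the plan above together, the #NM path might look
roughly like the sketch below; alloc_xstate_buffer(), xfd_write() and
xfd_capable() are names from the posted series, while amx_permitted() is
purely a placeholder for whatever per-process permission record the new
prctl() ends up keeping.

        /* inside exc_device_not_available(), after reading/clearing MSR_IA32_XFD_ERR */
        if (event_mask) {
                struct fpu *fpu = &current->thread.fpu;

                /* 4. first AMX touch without process-wide permission: fatal signal */
                if (!amx_permitted(current->group_leader)) {    /* hypothetical check */
                        force_sig(SIGSEGV);
                        return;
                }

                /* 5. permitted: allocate the 8KB buffer, then disarm XFD for this task */
                if (alloc_xstate_buffer(fpu, event_mask)) {
                        /* 6. allocation failed: fatal signal at the faulting instruction */
                        force_sig(SIGSEGV);
                        return;
                }

                /* assuming alloc_xstate_buffer() records event_mask in fpu->state_mask */
                xfd_write(xfd_capable() ^ (fpu->state_mask & xfd_capable()));
                return;
        }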

2021-05-27 11:19:30

by Borislav Petkov

[permalink] [raw]
Subject: second, sync-alloc syscall

On Tue, May 25, 2021 at 08:38:16PM -0400, Len Brown wrote:
> 7. In addition, a 2nd system call to request that buffers be
> pre-allocated is available. This is a per task system call. This
> synchronous allocate system call will return an error code if it
> fails, which will also likely result in program exit.

...

> Unclear if we have consensus on the need for a synchronous allocation
> system call (#7 above). Observe that this system call does not
> improve the likelihood of failure or the timing of failure.

Just when I was thinking that the use case for this is for application
writers to run this upfront and prealloc everything and *then* start
computations. I.e., to not risk doing some work and then get killed
later on the AMX buffer alloc failure and thus lose that work.

> An #NM-based allocation can be done at exactly the same spot by
> simply touching a TMM register. The benefit of this system call is
> that it returns an error code to the caller, versus the program
> being delivered a SIGSEGV at the offending instruction pointer. Both
> will likely result in the program exiting, and at the same point in
> execution.

So if this second syscall doesn't sound really great, I'd say we stick
to the #NM-based allocation and keep this one in the bag for now and
take it out only if it turns out that it makes sense as a use case.

As tglx said: it is easy to add stuff later. It is a lot harder - even
impossible - to remove already present machinery.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-27 16:27:01

by Len Brown

[permalink] [raw]
Subject: Re: second, sync-alloc syscall

On Thu, May 27, 2021 at 7:14 AM Borislav Petkov <[email protected]> wrote:

> So if this second syscall doesn't sound really great, I'd say we stick
> to the #NM-based allocation and keep this one in the bag for now and
> take it out only if it turns out that it makes sense as a use case.

I agree. Simple to add later, if something requires it --
though given its modest incremental value, it is currently hard to justify.

> As tglx said: it is easy to add stuff later. It is a lot harder - even
> impossible - to remove already present machinery.

Indeed.

thanks!
Len Brown, Intel Open Source Technology Center

2021-05-27 16:27:03

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On Thu, May 27, 2021 at 7:56 AM Peter Zijlstra <[email protected]> wrote:

> Why can't this live in intel_idle.c ?

You are right, and that looks reasonable.

I was thinking we'd have to hack mwait_idle_with_hints() used by everybody,
but we could clone intel_idle() above and plug in a tweaked version
for AMX hardware.

Len Brown, Intel Open Source Technology Center

2021-05-27 19:40:16

by Andy Lutomirski

[permalink] [raw]
Subject: Re: second, sync-alloc syscall



On Thu, May 27, 2021, at 6:59 AM, Len Brown wrote:
> On Thu, May 27, 2021 at 7:14 AM Borislav Petkov <[email protected]> wrote:
>
> > So if this second syscall doesn't sound really great, I'd say we stick
> > to the #NM-based allocation and keep this one in the bag for now and
> > take it out only if it turns out that it makes sense as a use case.
>
> I agree. Simple to add if later, if something requires it --
> though given it's modest incremental value, currently hard to justify.
>
> > As tglx said: it is easy to add stuff later. It is a lot harder - even
> > impossible - to remove already present machinery.

Also, in case you haven’t been watching the other thread, this whole thing needs to wait until the existing code gets cleaned up. The current signal code is a disaster.

>
> Indeed.
>
> thanks!
> Len Brown, Intel Open Source Technology Center
>

2021-05-27 20:48:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 28/28] x86/fpu/amx: Clear the AMX state when appropriate

On Mon, May 24, 2021 at 02:24:54PM -0400, Len Brown wrote:
> On Mon, May 24, 2021 at 1:39 PM Dave Hansen <[email protected]> wrote:
>
> > >> might be considered nasty.
>
> > > I'm not excited about burdening the generic idle path with a CPU
> > > feature specific check that would need to be checked on every idle
> > > entry.
> >
> > Me neither.
> >
> > But, the check itself should be cheap. A cpu_feature_enabled(AMX) check
> > will eliminate even the cost of a branch on systems without AMX. You
> > could probably even get fancy and also use a static branch that doesn't
> > get enabled until the first AMX user shows up.
>
> It isn't just the hardware run-time cost.
> It is the source code complexity.
> That code is 100% generic.

Why can't this live in intel_idle.c ? We had to pull out
CPUIDLE_FLAG_TLB_FLUSH because leave_mm() requires RCU, but afaict both
TILERELEASE and XRESTOR do not and could be added to intel_idle_tile(),
which can be used in XXX_cstates[] for the relevant chips instead of
intel_idle() for C6 and up.

intel_idle_tile(args)
{
	bool has_tile = // something XINUSE

	if (has_tile)
		TILERELEASE

	intel_idle(args...);

	if (has_tile)
		// something XRESTOR
}

Hmm?
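A more fleshed-out version of that sketch might look something like the
following -- purely illustrative, assuming a tile_release() wrapper for the
TILERELEASE instruction, the existing xgetbv() helper for reading XINUSE,
and an XFEATURE_MASK_XTILE mask from the AMX series; none of these are
settled interfaces:

static __cpuidle int intel_idle_tile(struct cpuidle_device *dev,
				     struct cpuidle_driver *drv, int index)
{
	/* XINUSE bit set => tile state is not in its init state */
	bool has_tile = xgetbv(1) & XFEATURE_MASK_XTILE;
	int ret;

	/* Drop the large tile state before entering C6 and deeper */
	if (has_tile)
		tile_release();		/* hypothetical TILERELEASE wrapper */

	ret = intel_idle(dev, drv, index);

	/*
	 * If tile contents had to survive idle, the "something XRSTOR"
	 * from the sketch above would go here.
	 */
	return ret;
}

Such a callback could then be wired into the XXX_cstates[] tables for the
relevant chips in place of intel_idle() for C6 and up, as described above.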

2021-06-16 23:51:47

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On 5/23/21 12:32 PM, Chang S. Bae wrote:
> Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
> architecture. XFD allows the kernel to enable a feature state in XCR0 and
> to receive a #NM trap when a task uses instructions accessing that state.
> In this way, Linux can control the access to that state.

Please change the subject of this patch. It's a crime to not have it
say XFD, "feature" or "disable".

I'd suggest:

x86/fpu/xstate: use feature disable (XFD) to protect dynamic
user state

2021-06-16 23:52:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On 5/23/21 12:32 PM, Chang S. Bae wrote:
> @@ -571,6 +612,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>
> set_thread_flag(TIF_NEED_FPU_LOAD);
>
> + xfd_switch(old_fpu, new_fpu);

This seems fragile.

XSAVE* will decline to write out state for feature i when XFD[i]=1 and
will instead write out the init state. That means that, at this point,
we switch XFD and yet leave state for feature i in place.

That means this *MUST* come before any copy_fpregs_to_fpstate() occurs.

Could we please add some FPU_WARN_ON() checks to ensure that no XSAVE*
is ever performed with XINUSE=1 *and* XFD armed?

2021-06-17 01:46:37

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On Jun 16, 2021, at 12:01, Hansen, Dave <[email protected]> wrote:
> On 6/16/21 11:47 AM, Bae, Chang Seok wrote:
>> Reading XINUSE via XGETBV is cheap but not free. I don't know spending a
>> hundred cycles for this WARN is big deal but this is one of the most
>> performance-critical paths.
> Is XGETBV(1) really a hundred cycles? That seems absurdly high for a
> non-serializing register read.

This was measured to demonstrate the benefit intended by PATCH25 --
https://lore.kernel.org/lkml/[email protected]/

> If you're really worried, let's put it under X86_FEATURE_DEBUG_FPU or
> something.

Yes. I will also make sure to include comments for this.

Thanks,
Chang

2021-06-17 01:52:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On 6/16/21 12:23 PM, Bae, Chang Seok wrote:
> On Jun 16, 2021, at 12:01, Hansen, Dave <[email protected]> wrote:
>> On 6/16/21 11:47 AM, Bae, Chang Seok wrote:
>>> Reading XINUSE via XGETBV is cheap but not free. I don't know spending a
>>> hundred cycles for this WARN is big deal but this is one of the most
>>> performance-critical paths.
>> Is XGETBV(1) really a hundred cycles? That seems absurdly high for a
>> non-serializing register read.
> This was checked to convince the benefit intended by PATCH25 --
> https://lore.kernel.org/lkml/[email protected]/

That's odd. How is it possible that the performance of XGETBV(1)
informed the design of that patch without there being any mention of
XGETBV in the comments or changelog?

2021-06-17 01:54:28

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On Jun 16, 2021, at 12:28, Hansen, Dave <[email protected]> wrote:
> That's odd. How is it possible that the performance of XGETBV(1)
> informed the design of that patch without there being any mention of
> XGETBV in the comments or changelog?

Yes, I admit it was wrong for the text there to highlight the benefit
without also mentioning the cost. I will make sure v6 covers both.

Thanks,
Chang

2021-06-17 03:11:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On 6/16/21 9:27 AM, Dave Hansen wrote:
> On 5/23/21 12:32 PM, Chang S. Bae wrote:
>> @@ -571,6 +612,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>>
>> set_thread_flag(TIF_NEED_FPU_LOAD);
>>
>> + xfd_switch(old_fpu, new_fpu);
>
> This seems fragile.
>
> XSAVE* will decline to write out state for feature i when XFD[i]=1 and
> will instead write out the init state. That means that, at this point,
> we switch XFD and yet leave state for feature i in place.
>
> That means this *MUST* come before any copy_fpregs_to_fpstate() occurs.
>
> Could we please add some FPU_WARN_ON() checks to ensure that no XSAVE*
> is ever performed with XINUSE=1 *and* XFD armed?
>

Wait, really? I somehow thought that XSAVE* would write out the actual
state even if XFD=1.

This seems like it's asking for some kind of bug.

2021-06-17 03:12:52

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On Jun 16, 2021, at 11:12, Andy Lutomirski <[email protected]> wrote:
> On 6/16/21 9:27 AM, Dave Hansen wrote:
>> On 5/23/21 12:32 PM, Chang S. Bae wrote:
>>> @@ -571,6 +612,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>>>
>>> set_thread_flag(TIF_NEED_FPU_LOAD);
>>>
>>> + xfd_switch(old_fpu, new_fpu);
>>
>> This seems fragile.
>>
>> XSAVE* will decline to write out state for feature i when XFD[i]=1 and
>> will instead write out the init state. That means that, at this point,
>> we switch XFD and yet leave state for feature i in place.
>>
>> That means this *MUST* come before any copy_fpregs_to_fpstate() occurs.
>>
>> Could we please add some FPU_WARN_ON() checks to ensure that no XSAVE*
>> is ever performed with XINUSE=1 *and* XFD armed?

So, xfd_switch() is after copy_fpregs_to_fpstate():

__switch_to()
--> switch_fpu_prepare() --> copy_fpregs_to_fpstate()
--> switch_fpu_finish() --> xfd_switch()

I don't see anyone intentionally flipping this order.

Reading XINUSE via XGETBV is cheap but not free. I don't know whether
spending a hundred cycles on this WARN is a big deal, but this is one of
the most performance-critical paths.

> Wait, really? I somehow thought that XSAVE* would write out the actual
> state even if XFD=1.

No, in section 3.2.6 Extended Feature Disable (XFD) of [1]:

"If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
instead, it saves bit i of XSTATE_BV field of the XSAVE header as 0 (
indicating that the state component is in its initialized state). With
the exception of XSAVE, no data is saved for the state component (XSAVE
saves the initial value of the state component; for XTILEDATA, this is
all zeroes)."

> This seems like it's asking for some kind of bug.

[1]: Intel Architecture Instruction Set Extensions Programming Reference, https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

2021-06-17 03:13:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On 6/16/21 11:47 AM, Bae, Chang Seok wrote:
> On Jun 16, 2021, at 11:12, Andy Lutomirski <[email protected]> wrote:
>> On 6/16/21 9:27 AM, Dave Hansen wrote:
>>> On 5/23/21 12:32 PM, Chang S. Bae wrote:
>>>> @@ -571,6 +612,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>>>>
>>>> set_thread_flag(TIF_NEED_FPU_LOAD);
>>>>
>>>> + xfd_switch(old_fpu, new_fpu);
>>> This seems fragile.
>>>
>>> XSAVE* will decline to write out state for feature i when XFD[i]=1 and
>>> will instead write out the init state. That means that, at this point,
>>> we switch XFD and yet leave state for feature i in place.
>>>
>>> That means this *MUST* come before any copy_fpregs_to_fpstate() occurs.
>>>
>>> Could we please add some FPU_WARN_ON() checks to ensure that no XSAVE*
>>> is ever performed with XINUSE=1 *and* XFD armed?
> So, xfd_switch() is after copy_fpregs_to_fpstate():
>
> __switch_to()
> --> switch_fpu_prepare() --> copy_fpregs_to_fpstate()
> --> switch_fpu_finish() --> xfd_switch()
>
> I don't see someone intentionally flip this order.

I'm more worried about how it might interact with KVM, or how future
code might try to optimize the whole thing.

Either way: the point stands: XSAVE with XINIT[i]=0 and XFD[i]=1 kills
puppies. I ideally want a warning if the puppy slaughter ever happens.
If that's not feasible, I'll also settle for a big fat comment about it.

But, something is needed. The XSAVE code is littered with assumptions
that demonstrate a lack of imagination about how future hacking might
kill puppies.

> Reading XINUSE via XGETBV is cheap but not free. I don't know spending a
> hundred cycles for this WARN is big deal but this is one of the most
> performance-critical paths.
Is XGETBV(1) really a hundred cycles? That seems absurdly high for a
non-serializing register read.

If you're really worried, let's put it under X86_FEATURE_DEBUG_FPU or
something.

2021-06-28 10:12:53

by Liu, Jing2

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state



On 6/17/2021 3:28 AM, Dave Hansen wrote:
> On 6/16/21 12:23 PM, Bae, Chang Seok wrote:
>> On Jun 16, 2021, at 12:01, Hansen, Dave <[email protected]> wrote:
>>> On 6/16/21 11:47 AM, Bae, Chang Seok wrote:
>>>> Reading XINUSE via XGETBV is cheap but not free. I don't know spending a
>>>> hundred cycles for this WARN is big deal but this is one of the most
>>>> performance-critical paths.
>>> Is XGETBV(1) really a hundred cycles? That seems absurdly high for a
>>> non-serializing register read.
>> This was checked to convince the benefit intended by PATCH25 --
>> https://lore.kernel.org/lkml/[email protected]/
> That's odd. How is it possible that the performance of XGETBV(1)
> informed the design of that patch without there being any mention of
> XGETBV in the comments or changelog?
Hi Chang,

I noticed the XGETBV(1) cycle count you measured; however, I calculated
only ~16 cycles on the corresponding machine.

BRs,
Jing

2021-06-29 18:08:49

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On 6/29/21 10:43 AM, Bae, Chang Seok wrote:
> On Jun 16, 2021, at 12:01, Dave Hansen <[email protected]> wrote:
>> On 6/16/21 11:47 AM, Bae, Chang Seok wrote:
>>> Reading XINUSE via XGETBV is cheap but not free. I don't know spending a
>>> hundred cycles for this WARN is big deal but this is one of the most
>>> performance-critical paths.
>> Is XGETBV(1) really a hundred cycles? That seems absurdly high for a
>> non-serializing register read.
> My apologies -- I neglect not to check the log immediately. XGETBV(1) took
> about 26 cycles on my test.

How about we pay this 26-cycle cost, but only when XFD is in use? We
could either look at the shadowed value of the XFD MSR, or flip a global
variable the first time XFD gets used.
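
The "global variable" flavor could be a static key that gets enabled the
first time a task arms up for XFD -- a rough sketch only; xfd_in_use and
xfd_armed_mask() below are made-up names, not anything in this series:

/* Enabled once, when the first XFD user shows up */
static DEFINE_STATIC_KEY_FALSE(xfd_in_use);

static inline void xfd_sanity_check_xsave(struct fpu *fpu)
{
	/* No XFD user anywhere yet: skip the XGETBV(1) read entirely */
	if (!static_branch_unlikely(&xfd_in_use))
		return;

	/*
	 * XSAVE* will not save state for an XFD-armed feature, so no
	 * feature may be both in use (XINUSE=1) and XFD-armed here.
	 */
	WARN_ON_FPU(xgetbv(1) & xfd_armed_mask(fpu));
}

static_branch_enable(&xfd_in_use) would be called from wherever the first
dynamic-state buffer gets allocated, so systems and workloads that never
touch AMX pay only for a patched-out branch.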

2021-06-29 18:39:59

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On Jun 29, 2021, at 10:54, Hansen, Dave <[email protected]> wrote:
> How about we pay this 26-cycle cost, but only when XFD is in use? We
> could either look at the shadowed value of the XFD MSR, or flip a global
> variable the first time XFD gets used.

How about something like this:

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 777cccab0b47..9752ebe6be79 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -99,6 +99,23 @@ EXPORT_SYMBOL(irq_fpu_usable);
void save_fpregs_to_fpstate(struct fpu *fpu)
{
if (likely(use_xsave())) {
+ /*
+ * MSR IA32_XFD write follows after this XSAVE(S). So if a
+ * state component is in use, XFD should not be armed for
+ * current. But, for potential changes in the future,
+ * cross-check XINUSE and XFD values. If a XINUSE state
+ * is XFD-armed, the following XSAVE(S) does not save the
+ * state.
+ *
+ * Reference the shadow XFD value instead of reading the
+ * MSR.
+ */
+ if (xfd_capable() && boot_cpu_has(X86_FEATURE_XGETBV1)) {
+ u64 current_xfd = (fpu->state_mask & xfd_capable()) ^ xfd_capable();
+
+ WARN_ON_FPU(xgetbv(1) & current_xfd);
+ }
+
os_xsave(&fpu->state->xsave, fpu->state_mask);

Thanks,
Chang

2021-06-29 19:19:55

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On Jun 29, 2021, at 11:50, Hansen, Dave <[email protected]> wrote:
> On 6/29/21 11:35 AM, Bae, Chang Seok wrote:
>> if (likely(use_xsave())) {
>> + /*
>> + * MSR IA32_XFD write follows after this XSAVE(S). So if a
>> + * state component is in use, XFD should not be armed for
>> + * current. But, for potential changes in the future,
>> + * cross-check XINUSE and XFD values. If a XINUSE state
>> + * is XFD-armed, the following XSAVE(S) does not save the
>> + * state.
>> + *
>> + * Reference the shadow XFD value instead of reading the
>> + * MSR.
>> + */
>> + if (xfd_capable() && boot_cpu_has(X86_FEATURE_XGETBV1)) {
>> + u64 current_xfd = (fpu->state_mask & xfd_capable()) ^ xfd_capable();
>> +
>> + WARN_ON_FPU(xgetbv(1) & current_xfd);
>> + }
>
> The code looks fine. But, as usual, I hate the comment. Maybe:
>
> /*
> * If XFD is armed for an xfeature, XSAVE* will not save
> * its state. Ensure XFD is clear for all features that
> * are in use before XSAVE*.
> */
>
> BTW, the ->state_mask calculation is a little confusing to me. I
> understand that fpu->state_mask shouldn't have any bits set that are
> unset in xgetbv(1).
>
> This code seems to be asking the question: Are any dynamic features in
> their init state *and* XFD-armed?
>
> Is it actually important to make sure that they are dynamic features?
> Is there *any* case where a feature (dynamic or not) can have XFD armed
> and be out of its init state?

In this AMX series, XFD is only used for xstate buffer management. The
code is written in such a way that XFD and dynamic states are somewhat coupled.

But I think XFD can be extended for other usages in the future. Then, yes.
(This warning is also for future code changes.)

So, reading the MSR directly would be simple and clean here, but it consumes
cycles. Alternatively, a task could carry a field holding its XFD value per
se, unless this conversion is acceptable.

Thanks,
Chang

2021-06-29 19:24:45

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On Jun 16, 2021, at 12:01, Dave Hansen <[email protected]> wrote:
> On 6/16/21 11:47 AM, Bae, Chang Seok wrote:
>> Reading XINUSE via XGETBV is cheap but not free. I don't know spending a
>> hundred cycles for this WARN is big deal but this is one of the most
>> performance-critical paths.
> Is XGETBV(1) really a hundred cycles? That seems absurdly high for a
> non-serializing register read.

My apologies -- I neglected to check the log right away. XGETBV(1) took
about 26 cycles in my test.

Thanks,
Chang

2021-06-29 19:30:25

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On 6/29/21 11:35 AM, Bae, Chang Seok wrote:
> if (likely(use_xsave())) {
> + /*
> + * MSR IA32_XFD write follows after this XSAVE(S). So if a
> + * state component is in use, XFD should not be armed for
> + * current. But, for potential changes in the future,
> + * cross-check XINUSE and XFD values. If a XINUSE state
> + * is XFD-armed, the following XSAVE(S) does not save the
> + * state.
> + *
> + * Reference the shadow XFD value instead of reading the
> + * MSR.
> + */
> + if (xfd_capable() && boot_cpu_has(X86_FEATURE_XGETBV1)) {
> + u64 current_xfd = (fpu->state_mask & xfd_capable()) ^ xfd_capable();
> +
> + WARN_ON_FPU(xgetbv(1) & current_xfd);
> + }

The code looks fine. But, as usual, I hate the comment. Maybe:

/*
* If XFD is armed for an xfeature, XSAVE* will not save
* its state. Ensure XFD is clear for all features that
* are in use before XSAVE*.
*/

BTW, the ->state_mask calculation is a little confusing to me. I
understand that fpu->state_mask shouldn't have any bits set that are
unset in xgetbv(1).

This code seems to be asking the question: Are any dynamic features in
their init state *and* XFD-armed?

Is it actually important to make sure that they are dynamic features?
Is there *any* case where a feature (dynamic or not) can have XFD armed
and be out of its init state?

2021-06-29 19:33:47

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v5 14/28] x86/fpu/xstate: Prevent unauthorised use of dynamic user state

On 6/29/21 12:13 PM, Bae, Chang Seok wrote:
>>
>> Is it actually important to make sure that they are dynamic features?
>> Is there *any* case where a feature (dynamic or not) can have XFD armed
>> and be out of its init state?
> In this AMX series, XFD is only used for the xstate buffer management. The
> code is made in a such way that XFD and dynamic states are a bit coupled.
>
> But I think XFD can be extended for other usages in the future. Then, yes.
> (This warning is also for future code changes.)
>
> So, reading the MSR is just simple and clean here, but it consumes cycles. Or,
> a task may have a field for XFD value per se unless this conversion is
> acceptable.

I'm not following.

All that I see here is that you made what could be a very generic check:

Thou shalt never XSAVE a xfeature which is XFD-armed

and made it more specific:

Thou shalt never XSAVE a *dynamic* xfeature which is XFD-armed

I'm just saying that I don't see the value in a less-broad check. Let's
make the sanity check as broad as possible. That actually makes the
code simpler.
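
In code, the broad version of the check boils down to something like the
following -- illustrative only, and it assumes the task's full XFD value is
handy, e.g. in a per-task shadow field (fpu->xfd below is a made-up name):

	/*
	 * No feature may be in use (XINUSE[i] = 1) and XFD-armed
	 * (XFD[i] = 1) when XSAVE* runs -- dynamic or not.
	 */
	WARN_ON_FPU(xgetbv(1) & fpu->xfd);

The dynamic-only variant first masks the value down to the dynamic features,
so it would silently miss any future non-dynamic user of XFD.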