2021-07-30 15:09:04

by Chang S. Bae

Subject: [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions

Intel Advanced Matrix Extensions (AMX)[1][2] will be shipping on servers
soon. AMX consists of configurable TMM "TILE" registers plus new CPU
instructions that operate on them. TMUL (Tile matrix MULtiply) is the
first operator to take advantage of the new registers, and we anticipate
additional instructions in the future.

Neither AMX state nor TMUL instructions depend on AVX. However, AMX and
AVX do share common challenges. The TMM registers are 8KB today, and can
architecturally grow as large as 64KB, a size that merits updates to
hardware and software state management.

Further, both technologies run faster when they are not simultaneously
running on SMT siblings, and both technologies' use of power and bandwidth
impacts the power and performance available to neighboring cores. (This
impact has measurably improved in recent hardware.)

If the existing kernel approach for managing XSAVE state were employed to
handle AMX, 8KB would be added to every task's buffer even though it might
rarely be used. Thus, Linux implements on-demand expansion of per-task
context switch buffers using an XSAVE feature: eXtended Feature Disabling
(XFD). The kernel arms XFD to provide an #NM exception upon a task's first
access to TILE state. The kernel exception handler allocates and installs
the appropriate XSAVE context switch buffer. User space is unaware of the
kernel's context switch buffer optimization.

AMX is accessible only to applications that invoke a new system call to
request access. By invoking this system call, an application agrees that if
it uses an alternate signal stack, it will provide one of sufficient size.
The simplest way to do that is to use the updated ABI in glibc 2.34 or
later [8][9], though applications could also use their own calculation or
ask the kernel directly [3].

The patches are built on top of the recent upstream x86 FPU changes [13].

This series has three parts:
* Patch 01-15: Foundation to support dynamic user state management
* Patch 16-21: AMX enablement, including some preparation
* Patch 22-26: Optimizations, DEBUG sanity check, and self test

Note that the per-process system call in PATCH14 reflects the latest
discussion on LKML, [10][12].

The following points summarize the latest discussion, and this
implementation:

1. Kernel sets XCR0.AMX=1 at boot, and leaves it set, always.

XCR0 is NOT context switched by Linux.
(If it were, every change would provoke VMEXIT if in VM.)

(KVM context switches XCR0. If KVM exports XFD for use by a guest OS,
it must also context switch XFD. KVM can not use XFD for its own
purposes.)

2. Kernel arms XFD for all tasks.

XFD is context switched per Linux task.

3. Apps invoke new system call to request feature access (AMX).

Implemented as a flag to arch_prctl(2); permission granted to any task
is granted to all tasks in the process.

It is sufficient to invoke this syscall at process or library
init-time.

There is no concept of removing or revoking permission, once granted to
a process. (Permission is cleared upon exec of a new process.)

There is a companion system call to return the current permission.

Analogous to AVX-512 and other stateful features, applications probe
for AMX support by checking CPUID for the instructions and checking
XGETBV(XCR0) for the OS support.

However, stateful features from AMX onward also require the system call
above to be invoked before tasks in that process may use the feature.
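
As an illustration only (not part of this series), a minimal user-space
sketch of this probe-then-request sequence could look as follows. The
ARCH_SET_STATE_ENABLE value is taken from PATCH14 and is not yet in
installed uapi headers, and the CPUID bit checked is CPUID.(7,0):EDX[24]
(AMX-TILE) per PATCH18:

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_STATE_ENABLE  0x1021          /* from PATCH14 (assumed) */
#define XFEATURE_MASK_XTILE    (0x3ULL << 17)  /* XTILECFG | XTILEDATA */

static uint64_t xgetbv(uint32_t index)
{
        uint32_t eax, edx;

        asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (index));
        return eax | ((uint64_t)edx << 32);
}

/* Returns 0 when AMX is enumerated, enabled in XCR0, and permitted. */
static int request_amx(void)
{
        uint32_t eax, ebx, ecx, edx;

        /* CPUID.(EAX=7,ECX=0):EDX[24] enumerates AMX-TILE. */
        asm volatile("cpuid"
                     : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                     : "a" (7), "c" (0));
        if (!(edx & (1u << 24)))
                return -1;

        /* XGETBV(0) reports the XCR0 bits that the OS has enabled. */
        if ((xgetbv(0) & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE)
                return -1;

        /* Finally, ask for per-process permission to use the AMX state. */
        return syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE,
                       XFEATURE_MASK_XTILE);
}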

4. An application touching AMX without permission results in process exit.

Armed XFD results in #NM, which the kernel converts to SIGILL with si_code
ILL_ILLOPC, typically resulting in process exit.

5. Applications touching AMX with permission allocate context switch buffer
on-demand.

Armed XFD results in #NM.
Kernel allocates the large context switch buffer.
Kernel disarms XFD for that task.

6. #NM handler allocation failure results in process exit.

If the #NM handler cannot allocate the 8KB buffer, the task will
receive a SIGILL with si_code ILL_ILLOPC at the instruction that took
the #NM fault, typically resulting in process exit.
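
To make such exits easier to diagnose, an application could install a
SIGILL handler; below is a sketch, assuming only the ILL_ILLOPC si_code
described above:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void sigill_handler(int sig, siginfo_t *si, void *ctx)
{
        /* Covers both the no-permission case and #NM allocation failure. */
        if (si->si_code == ILL_ILLOPC)
                fprintf(stderr, "SIGILL at %p: AMX touched without permission,"
                        " or xstate buffer allocation failed\n", si->si_addr);
        _exit(1);
}

static void install_sigill_handler(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigill_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGILL, &sa, NULL);
}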

7. Legacy app signal stack XSTATE support includes AVX-512, and stops
before AMX.

Legacy apps are those which do not request AMX (or subsequent feature)
access. The signal stack ABI continues to be uncompacted XSTATE for both
legacy and new apps.

Existing code to find offsets in XSTATE still works.
Existing code doing XRSTOR/XSAVE on the signal stack buffer will still
work.*

* XSTATE size calculation using CPUID will include
AMX and other supported features, even if the process did not invoke
the new system call. However, the kernel will not XSAVE AMX or later
features onto the signal stack of a legacy process.**

** User-space XSAVE/XRSTOR code should size its buffers according to CPUID
if it includes the bits of xgetbv(XCR0) in RFBM, because XSAVE will
write data (including zeros for features in the INIT state) for all
features included in RFBM.
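
For reference, a sketch (not part of this series) of that sizing rule:
CPUID.(EAX=0xD,ECX=0):EBX reports the number of bytes XSAVE needs for the
feature set currently enabled in XCR0, which includes the AMX area once
XCR0.AMX=1, regardless of whether the process opted in:

#include <stdint.h>
#include <stdlib.h>

static uint32_t xsave_buffer_size(void)
{
        uint32_t eax, ebx, ecx, edx;

        /* EBX: bytes required for all features currently enabled in XCR0. */
        asm volatile("cpuid"
                     : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                     : "a" (0xd), "c" (0));
        return ebx;
}

/* XSAVE/XRSTOR require a 64-byte aligned buffer. */
static void *alloc_xsave_buffer(void)
{
        return aligned_alloc(64, (xsave_buffer_size() + 63) & ~63u);
}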

8. New opt-in apps must agree to provide a large enough sigaltstack

1. must invoke the permission system call before touching AMX TMM
2. must guarantee, if using sigaltstack(2), that they have
allocated a signal stack of sufficient size, e.g., by using
signal.h from glibc 2.34 or later.

(glibc 2.34 changed MINSIGSTKSZ and SIGSTKSZ from 2KB/8KB constants
into run-time routines. [8])

Linux will continue to XSAVE/XRSTOR directly to/from the signal stack,
and the stack will always include the 8KB *space* for AMX TMM and
subsequent features.

Linux has an optimization for XFD-supported features that are in the INIT
state, so that XSAVE will skip writing zeros for them.
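
A hedged sketch of an opt-in app sizing its alternate signal stack
(assumptions: the AT_MINSIGSTKSZ auxv entry described in [3], and the
run-time MINSIGSTKSZ/SIGSTKSZ from glibc 2.34 [8]):

#include <signal.h>
#include <stdlib.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ  51      /* x86 auxv entry, see [3] */
#endif

static int setup_sigaltstack(void)
{
        /* Minimum frame size the kernel needs, including the AMX area. */
        unsigned long min = getauxval(AT_MINSIGSTKSZ);
        stack_t ss;

        /*
         * AT_MINSIGSTKSZ covers only signal delivery; add SIGSTKSZ as
         * headroom for the handler itself. With glibc 2.34+, SIGSTKSZ is
         * already a run-time value [8].
         */
        ss.ss_size = (min ? min : MINSIGSTKSZ) + SIGSTKSZ;
        ss.ss_sp = malloc(ss.ss_size);
        ss.ss_flags = 0;
        if (!ss.ss_sp)
                return -1;

        return sigaltstack(&ss, NULL);
}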

9. intel_idle for SPR will clear AMX TMM state

This guarantees that AMX use will not prevent the CPU from entering the
idle C6 state, which can be beneficial for power savings, and thus
turbo frequency.

Reviewed-by: Len Brown <[email protected]>

Changes from v8 [16]:
* Update the arch_prctl prototype for consistency with other arch_prctl's. It
now takes the address of a return bitmask as a parameter (Patch14). Update
self-tests to reflect this (Patch23).
* bugfix: Fix off-by-one error in the check_xstate_against_struct() feature
number argument (Patch19).

Changes from v7 [15]:
* Update #NM handler to raise SIGILL rather than SIGSEGV (Patch 12).
(Thiago Macieira)
* Rename the syscalls (Patch 14). (Thiago Macieira and Dave Hansen)
* If XSAVE is disabled, assure that syscall correctly indicates legacy
states (Patch14). (Thiago Macieira and Dave Hansen)
* Update existing self-test to expect SIGILL (Patch23).

Changes from v6 [14]:
* Add state bitmap param to proposed syscall. (Thiago Macieira)
* Add companion syscall to return the current permission bitmap.
* Update the ptrace path to return EFAULT when no permission to write
XTILEDATA.
* Simplify xstate size calculation code. (Dave Hansen)
* Update comments for TILERELEASE code. (Rafael J. Wysocki)

Changes from v5 [11]:
* Updated to require per-process permission for dynamic states (v5 was
per-task).
* Support both legacy and expanded sigframe xstate buffer sizes.
* Moved the TILERELEASE code to intel_idle driver. (Peter Zijlstra)
* Fixed to deactivate fpregs with TILERELEASE. (Andy Lutomirski and Dave
Hansen)
* Rebased on Thomas Gleixner's recent x86 FPU code changes.
* Added XFD sanity check. (Dave Hansen)
* Future proofed __raw_xsave_addr().
* Tighten up task size calculation (previously, it could over-calculate).
* Cleaned invocation memset() for init_fpstate (no functional change).
* Updated selftest to handle latest syscall semantics, plus minor updates.
* Dropped the change for XSTATE restore helper.

Changes from v4 [7]:
* Changed the buffer expansion policy to the access-request based approach
from the transparent #NM-based approach. (Andy Lutomirski, Thomas
Gleixner, et al.)
* Removed the boot parameter patch. (Thomas Gleixner)
* Included code to explicitly initialize AMX state during a context switch.
(Thomas Gleixner)
* Added a new arch_prctl to pre-allocate a buffer for dynamic state. (Andy
Lutomirski)
* Updated the fork() path to initialize all the AMX state.
* Improved ptracer's dynamic user state injection path.
* Add optimization to skip tile data in sigframe when an AMX thread
initialized the state.
* Updated to treat the mismatched state size as an error. (Thomas Gleixner)
* Simplified the xstate feature check routine. (Thomas Gleixner)
* Simplified and updated the selftest.
* Updated some changelog. (Thomas Gleixner)
* Updated a function description. (Borislav Petkov)

Changes from v3 [6]:
* Updated some commit messages and code comments. (Borislav Petkov)
* Added and removed some helpers. (Borislav Petkov)
* Revised the buffer allocation function. (Borislav Petkov)
* Simplified buffer access. (Borislav Petkov)
* Re-organized some code changes to be more reviewable. (PATCH9/10)
* Reverted unnecessary changes. (PATCH4)
* Fixed typo in the documentation. (Randy Dunlap)

Changes from v2 [5]:
* Removed the patch for the tile data inheritance. Also, updated the
selftest patch. (Andy Lutomirski)
* Changed to taint the kernel when any unknown state is enabled. (Andy
Lutomirski)
* Changed to use the XFD feature only when the compacted format is in use.
* Improved the test code.
* Simplified the cmdline handling.
* Removed 'task->fpu' in changelogs. (Boris Petkov)
* Updated the variable name / comments / changelogs for clarification.

Changes from v1 [4]:
* Added vmalloc() error tracing (Dave Hansen, PeterZ, and Andy Lutomirski)
* Inlined the #NM handling code (Andy Lutomirski)
* Made signal handling optimization revertible
* Revised the new parameter handling code (Andy Lutomirski and Dave Hansen)
* Rebased on the upstream kernel

[1]: Intel Architecture Instruction Set Extension Programming Reference
May 2021, https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
[2]: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html
[3]: https://lore.kernel.org/lkml/[email protected]/
[4]: https://lore.kernel.org/lkml/[email protected]/
[5]: https://lore.kernel.org/lkml/[email protected]/
[6]: https://lore.kernel.org/lkml/[email protected]/
[7]: https://lore.kernel.org/lkml/[email protected]/
[8]: https://sourceware.org/git/?p=glibc.git;a=commit;h=6c57d320484988e87e446e2e60ce42816bf51d53
[9]: https://sourceware.org/git/?p=glibc.git;a=blob;f=NEWS;h=aa0f10a891f8f9b4e6f0f6d25b6a307898c07d82;hb=HEAD#l12
[10]: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/[email protected]/
[12]: https://lore.kernel.org/lkml/CAJvTdKmzN0VMyH8VU_fdzn2UZqmR=_aNrJW01a65BhyLm6YRPg@mail.gmail.com/
[13]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1423e2660cf134a8f21f2451865a04792013e49e
[14]: https://lore.kernel.org/lkml/[email protected]/
[15]: https://lore.kernel.org/lkml/[email protected]/
[16]: https://lore.kernel.org/lkml/[email protected]/

Chang S. Bae (26):
x86/fpu/xstate: Modify the initialization helper to handle both static
and dynamic buffers
x86/fpu/xstate: Modify state copy helpers to handle both static and
dynamic buffers
x86/fpu/xstate: Modify address finders to handle both static and
dynamic buffers
x86/fpu/xstate: Add a new variable to indicate dynamic user states
x86/fpu/xstate: Add new variables to indicate dynamic XSTATE buffer
size
x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes
x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer
x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer
dynamically
x86/fpu/xstate: Update the XSTATE save function to support dynamic
states
x86/fpu/xstate: Update the XSTATE buffer address finder to support
dynamic states
x86/fpu/xstate: Update the XSTATE context copy function to support
dynamic states
x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user
state
x86/fpu/xstate: Support ptracer-induced XSTATE buffer expansion
x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
x86/fpu/xstate: Support both legacy and expanded signal XSTATE size
x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in
state component numbers
x86/fpu/xstate: Disable XSTATE support if an inconsistent state is
detected
x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature
bits
x86/fpu/amx: Define AMX state components and have it used for
boot-time checks
x86/fpu/amx: Initialize child's AMX state
x86/fpu/amx: Enable the AMX feature in 64-bit mode
x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user
states if in INIT-state
selftest/x86/amx: Test cases for the AMX state management
x86/insn/amx: Add TILERELEASE instruction to the opcode map
intel_idle/amx: Add SPR support with XTILEDATA capability
x86/fpu/xstate: Add a sanity check for XFD state when saving XSTATE

arch/x86/include/asm/cpufeatures.h | 4 +
arch/x86/include/asm/fpu/internal.h | 117 +++-
arch/x86/include/asm/fpu/types.h | 72 +-
arch/x86/include/asm/fpu/xstate.h | 34 +-
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/include/asm/processor.h | 10 +-
arch/x86/include/asm/proto.h | 2 +-
arch/x86/include/asm/special_insns.h | 6 +
arch/x86/include/asm/trace/fpu.h | 9 +-
arch/x86/include/uapi/asm/prctl.h | 3 +
arch/x86/kernel/cpu/cpuid-deps.c | 4 +
arch/x86/kernel/fpu/core.c | 94 ++-
arch/x86/kernel/fpu/init.c | 37 +-
arch/x86/kernel/fpu/regset.c | 57 +-
arch/x86/kernel/fpu/signal.c | 99 ++-
arch/x86/kernel/fpu/xstate.c | 668 ++++++++++++++++--
arch/x86/kernel/process.c | 21 +-
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 8 +-
arch/x86/kernel/traps.c | 41 ++
arch/x86/kvm/x86.c | 48 +-
arch/x86/lib/x86-opcode-map.txt | 8 +-
arch/x86/math-emu/fpu_aux.c | 2 +-
arch/x86/math-emu/fpu_entry.c | 4 +-
arch/x86/math-emu/fpu_system.h | 2 +-
drivers/idle/intel_idle.c | 79 +++
tools/arch/x86/lib/x86-opcode-map.txt | 8 +-
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/amx.c | 968 ++++++++++++++++++++++++++
29 files changed, 2174 insertions(+), 237 deletions(-)
create mode 100644 tools/testing/selftests/x86/amx.c


base-commit: ff1176468d368232b684f75e82563369208bc371
--
2.17.1



2021-07-30 15:09:05

by Chang S. Bae

Subject: [PATCH v9 15/26] x86/fpu/xstate: Support both legacy and expanded signal XSTATE size

Prepare to support two XSTATE sizes on the signal stack -- legacy and
expanded. Legacy programs have not requested access to AMX (or later
features), and the XSTATE on their signal stack can include up through
AVX-512.

Programs that request access to AMX (and/or later features) will have an
uncompacted XSTATE that includes those features. If such programs also use
sigaltstack, they must assure that their sigaltstack is large enough to
handle that full XSTATE format. (This is most easily done by using signal.h
from glibc 2.34 or later.)

Introduce a new XSTATE size variable for the legacy stack and some helpers.
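
For context, user space can tell which size the kernel used on a given
frame from the software-reserved record written into the legacy fxsave
area (bytes 464..511). A sketch, with a local mirror of the uapi
struct _fpx_sw_bytes layout rather than the real header (hypothetical
helper, not part of this patch):

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <ucontext.h>

#define FP_XSTATE_MAGIC1        0x46505853U  /* value from uapi asm/sigcontext.h */

/* Mirrors struct _fpx_sw_bytes from uapi asm/sigcontext.h. */
struct fpx_sw_bytes {
        uint32_t magic1;
        uint32_t extended_size;
        uint64_t xfeatures;
        uint32_t xstate_size;
        uint32_t padding[7];
};

/* Register with SA_SIGINFO; prints the xstate size recorded on the frame. */
static void handler(int sig, siginfo_t *si, void *ctx)
{
        ucontext_t *uc = ctx;
        struct fpx_sw_bytes *sw;

        if (!uc->uc_mcontext.fpregs)
                return;

        /* Bytes 464..511 of the fxsave area are reserved for software use. */
        sw = (void *)((char *)uc->uc_mcontext.fpregs + 464);
        if (sw->magic1 == FP_XSTATE_MAGIC1)
                printf("signal frame xstate_size: %u bytes\n", sw->xstate_size);
}

A legacy (non-opt-in) task would see the size stop before AMX; an opt-in
task would see the expanded size.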

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v6:
* Massage the code comments.

Changes from v5:
* Added as a new patch.
---
arch/x86/include/asm/fpu/internal.h | 23 +++++++++--
arch/x86/include/asm/fpu/xstate.h | 3 +-
arch/x86/kernel/fpu/init.c | 1 +
arch/x86/kernel/fpu/signal.c | 63 ++++++++++++++++++++---------
arch/x86/kernel/fpu/xstate.c | 25 +++++++++++-
5 files changed, 89 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index e3590cf55325..3b52cfb62ab5 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -337,15 +337,30 @@ static inline void os_xrstor(struct xregs_state *xstate, u64 mask)
*/
static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
{
+ u32 lmask, hmask;
+ u64 mask;
+ int err;
+
/*
* Include the features which are not xsaved/rstored by the kernel
* internally, e.g. PKRU. That's user space ABI and also required
* to allow the signal handler to modify PKRU.
*/
- u64 mask = xfeatures_mask_uabi();
- u32 lmask = mask;
- u32 hmask = mask >> 32;
- int err;
+ mask = xfeatures_mask_uabi();
+
+ /*
+ * Exclude dynamic user states for non-opt-in threads.
+ */
+ if (xfeatures_mask_user_dynamic) {
+ struct fpu *fpu = &current->thread.fpu;
+
+ mask &= fpu->dynamic_state_perm ?
+ fpu->state_mask :
+ ~xfeatures_mask_user_dynamic;
+ }
+
+ lmask = mask;
+ hmask = mask >> 32;

/*
* Clear the xsave header first, so that reserved fields are
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 9fb6308aaf07..c39ea8bac68f 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -139,7 +139,8 @@ extern void __init update_regset_xstate_info(unsigned int size,
enum xstate_config {
XSTATE_MIN_SIZE,
XSTATE_MAX_SIZE,
- XSTATE_USER_SIZE
+ XSTATE_USER_SIZE,
+ XSTATE_USER_MINSIG_SIZE,
};

extern unsigned int get_xstate_config(enum xstate_config cfg);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 3e4e14ca723b..acbd3da0e022 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -210,6 +210,7 @@ static void __init fpu__init_system_xstate_size_legacy(void)
set_xstate_config(XSTATE_MIN_SIZE, xstate_size);
set_xstate_config(XSTATE_MAX_SIZE, xstate_size);
set_xstate_config(XSTATE_USER_SIZE, xstate_size);
+ set_xstate_config(XSTATE_USER_MINSIG_SIZE, xstate_size);
}

/* Legacy code to initialize eager fpu mode. */
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index f70f84d53442..78696b412b56 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -15,9 +15,26 @@
#include <asm/sigframe.h>
#include <asm/trace/fpu.h>

+/*
+ * Record the signal xstate size and feature bits. Exclude dynamic user
+ * states. See fpu__init_prepare_fx_sw_frame(). The opt-in tasks will
+ * dynamically adjust the data.
+ */
static struct _fpx_sw_bytes fx_sw_reserved __ro_after_init;
static struct _fpx_sw_bytes fx_sw_reserved_ia32 __ro_after_init;

+static unsigned int current_sig_xstate_size(void)
+{
+ return current->thread.fpu.dynamic_state_perm ?
+ get_xstate_config(XSTATE_USER_SIZE) :
+ get_xstate_config(XSTATE_USER_MINSIG_SIZE);
+}
+
+static inline int extend_sig_xstate_size(unsigned int size)
+{
+ return use_xsave() ? size + FP_XSTATE_MAGIC2_SIZE : size;
+}
+
/*
* Check for the presence of extended state information in the
* user fpstate pointer in the sigcontext.
@@ -36,7 +53,7 @@ static inline int check_xstate_in_sigframe(struct fxregs_state __user *fxbuf,
/* Check for the first magic field and other error scenarios. */
if (fx_sw->magic1 != FP_XSTATE_MAGIC1 ||
fx_sw->xstate_size < min_xstate_size ||
- fx_sw->xstate_size > get_xstate_config(XSTATE_USER_SIZE) ||
+ fx_sw->xstate_size > current_sig_xstate_size() ||
fx_sw->xstate_size > fx_sw->extended_size)
goto setfx;

@@ -94,20 +111,32 @@ static inline int save_fsave_header(struct task_struct *tsk, void __user *buf)

static inline int save_xstate_epilog(void __user *buf, int ia32_frame)
{
+ unsigned int current_xstate_size = current_sig_xstate_size();
struct xregs_state __user *x = buf;
- struct _fpx_sw_bytes *sw_bytes;
+ struct _fpx_sw_bytes sw_bytes;
u32 xfeatures;
int err;

- /* Setup the bytes not touched by the [f]xsave and reserved for SW. */
- sw_bytes = ia32_frame ? &fx_sw_reserved_ia32 : &fx_sw_reserved;
- err = __copy_to_user(&x->i387.sw_reserved, sw_bytes, sizeof(*sw_bytes));
+ /*
+ * Setup the bytes not touched by the [f]xsave and reserved for SW.
+ *
+ * Use the recorded values if it matches with the current task. Otherwise,
+ * adjust it.
+ */
+ sw_bytes = ia32_frame ? fx_sw_reserved_ia32 : fx_sw_reserved;
+ if (sw_bytes.xstate_size != current_xstate_size) {
+ unsigned int default_xstate_size = sw_bytes.xstate_size;
+
+ sw_bytes.xfeatures = xfeatures_mask_uabi();
+ sw_bytes.xstate_size = current_xstate_size;
+ sw_bytes.extended_size += (current_xstate_size - default_xstate_size);
+ }
+ err = __copy_to_user(&x->i387.sw_reserved, &sw_bytes, sizeof(sw_bytes));

if (!use_xsave())
return err;

- err |= __put_user(FP_XSTATE_MAGIC2,
- (__u32 __user *)(buf + get_xstate_config(XSTATE_USER_SIZE)));
+ err |= __put_user(FP_XSTATE_MAGIC2, (__u32 __user *)(buf + current_xstate_size));

/*
* Read the xfeatures which we copied (directly from the cpu or
@@ -144,7 +173,7 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
else
err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);

- if (unlikely(err) && __clear_user(buf, get_xstate_config(XSTATE_USER_SIZE)))
+ if (unlikely(err) && __clear_user(buf, current_sig_xstate_size()))
err = -EFAULT;
return err;
}
@@ -205,7 +234,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
fpregs_unlock();

if (ret) {
- if (!fault_in_pages_writeable(buf_fx, get_xstate_config(XSTATE_USER_SIZE)))
+ if (!fault_in_pages_writeable(buf_fx, current_sig_xstate_size()))
goto retry;
return -EFAULT;
}
@@ -418,19 +447,13 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
fpregs_unlock();
return ret;
}
-static inline int xstate_sigframe_size(void)
-{
- int xstate_size = get_xstate_config(XSTATE_USER_SIZE);
-
- return use_xsave() ? xstate_size + FP_XSTATE_MAGIC2_SIZE : xstate_size;
-}

/*
* Restore FPU state from a sigframe:
*/
int fpu__restore_sig(void __user *buf, int ia32_frame)
{
- unsigned int size = xstate_sigframe_size();
+ unsigned int size = extend_sig_xstate_size(current_sig_xstate_size());
struct fpu *fpu = &current->thread.fpu;
void __user *buf_fx = buf;
bool ia32_fxstate = false;
@@ -477,7 +500,7 @@ unsigned long
fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
unsigned long *buf_fx, unsigned long *size)
{
- unsigned long frame_size = xstate_sigframe_size();
+ unsigned long frame_size = extend_sig_xstate_size(current_sig_xstate_size());

*buf_fx = sp = round_down(sp - frame_size, 64);
if (ia32_frame && use_fxsr()) {
@@ -492,7 +515,7 @@ fpu__alloc_mathframe(unsigned long sp, int ia32_frame,

unsigned long fpu__get_fpstate_size(void)
{
- unsigned long ret = xstate_sigframe_size();
+ unsigned long ret = extend_sig_xstate_size(get_xstate_config(XSTATE_USER_SIZE));

/*
* This space is needed on (most) 32-bit kernels, or when a 32-bit
@@ -517,12 +540,12 @@ unsigned long fpu__get_fpstate_size(void)
*/
void fpu__init_prepare_fx_sw_frame(void)
{
- int xstate_size = get_xstate_config(XSTATE_USER_SIZE);
+ int xstate_size = get_xstate_config(XSTATE_USER_MINSIG_SIZE);
int ext_size = xstate_size + FP_XSTATE_MAGIC2_SIZE;

fx_sw_reserved.magic1 = FP_XSTATE_MAGIC1;
fx_sw_reserved.extended_size = ext_size;
- fx_sw_reserved.xfeatures = xfeatures_mask_uabi();
+ fx_sw_reserved.xfeatures = xfeatures_mask_uabi() & ~xfeatures_mask_user_dynamic;
fx_sw_reserved.xstate_size = xstate_size;

if (IS_ENABLED(CONFIG_IA32_EMULATION) ||
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 84dda445386e..c539e02965a6 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -94,10 +94,13 @@ static bool xstate_aligns[XFEATURE_MAX] __ro_after_init =
* contains all the enabled state components.
* @user_size: The size of user-space buffer for signal and
* ptrace frames, in the non-compacted format.
+ * @user_minsig_size: The non-compacted legacy xstate size for signal.
+ * Legacy programs do not request to access dynamic
+ * states.
*/
struct fpu_xstate_buffer_config {
unsigned int min_size, max_size;
- unsigned int user_size;
+ unsigned int user_size, user_minsig_size;
};

static struct fpu_xstate_buffer_config buffer_config __ro_after_init;
@@ -111,6 +114,8 @@ unsigned int get_xstate_config(enum xstate_config cfg)
return buffer_config.max_size;
case XSTATE_USER_SIZE:
return buffer_config.user_size;
+ case XSTATE_USER_MINSIG_SIZE:
+ return buffer_config.user_minsig_size;
default:
return 0;
}
@@ -128,6 +133,9 @@ void set_xstate_config(enum xstate_config cfg, unsigned int value)
break;
case XSTATE_USER_SIZE:
buffer_config.user_size = value;
+ break;
+ case XSTATE_USER_MINSIG_SIZE:
+ buffer_config.user_minsig_size = value;
}
}

@@ -859,6 +867,21 @@ static int __init init_xstate_size(void)
* User space is always in standard format.
*/
set_xstate_config(XSTATE_USER_SIZE, xsave_size);
+
+ /*
+ * The minimum signal xstate size is for non-opt-in user threads
+ * that do not access dynamic states.
+ */
+ if (xfeatures_mask_user_dynamic) {
+ int nr = fls64(xfeatures_mask_uabi() & ~xfeatures_mask_user_dynamic) - 1;
+ unsigned int size, offset, ecx, edx;
+
+ cpuid_count(XSTATE_CPUID, nr, &size, &offset, &ecx, &edx);
+ set_xstate_config(XSTATE_USER_MINSIG_SIZE, offset + size);
+ } else {
+ set_xstate_config(XSTATE_USER_MINSIG_SIZE, xsave_size);
+ }
+
return 0;
}

--
2.17.1


2021-07-30 15:09:17

by Chang S. Bae

Subject: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

arch_prctl(ARCH_SET_STATE_ENABLE, u64 bitmask)
Some XSTATE features, such as AMX, are unavailable to applications
until that process explicitly requests them via this call. Requests can
be made for any number of valid user XSTATEs in a single call. This
call is intended to be invoked very early in process initialization. A
forked child inherits access, but permission is reset upon exec. There
is no concept of un-requesting XSTATE access.
Return codes:
0: success (including repeated calls)
EINVAL: no hardware feature for the request
EBUSY: error in updating all threads in the process

arch_prctl(ARCH_GET_STATE_ENABLE, u64 *bitmask)
Return the bitmask of permitted user XSTATE features. If XSAVE
is disabled, the bitmask indicates only legacy states.
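
A user-space usage sketch (illustrative only; the prctl values below are
the ones this patch adds to uapi/asm/prctl.h, and bits 17/18 are the AMX
state components defined later in this series):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_STATE_ENABLE   0x1021
#define ARCH_GET_STATE_ENABLE   0x1022
#define XFEATURE_MASK_XTILE     (0x3UL << 17)   /* XTILECFG | XTILEDATA */

int main(void)
{
        unsigned long perm = 0;

        if (syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE, XFEATURE_MASK_XTILE))
                perror("ARCH_SET_STATE_ENABLE");

        /* The GET variant writes the permission bitmap through the pointer. */
        if (syscall(SYS_arch_prctl, ARCH_GET_STATE_ENABLE, &perm))
                perror("ARCH_GET_STATE_ENABLE");

        printf("permitted user xstate bitmap: %#lx\n", perm);
        return 0;
}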

The permission is checked at every XSTATE buffer expansion: e.g. an
XFD-induced #NM event, or a ptracer's XSTATE injection. When no permission
is found, inform userspace via SIGILL (for the #NM path) or an error code
(for ptrace).

The notion of granted permission is broadcast to all threads in a process.
(This approach follows the PR_SET_FP_MODE prctl(2) implementation.)

Detect a fork race by aborting and returning -EBUSY if the number of
threads at the end of the call has changed.

[ An alternative implementation would not save the permission bitmap in
every task, but would instead extend the per-process signal data, which
would not be subject to this race. ]

Rename the third argument for do_arch_prctl_common() to reflect its generic
use.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v8:
* Update arch_prctl prototype for consistency with other arch_prctl's. It
now takes the address of a return bitmask as a parameter.
* Optimize the reset function.

Changes from v7:
* Rename the syscalls. (Thiago Macieira and Dave Hansen)
* If XSAVE is disabled, assure that syscall correctly indicates legacy
states. (Thiago Macieira and Dave Hansen)

Changes from v6:
* Add state bitmap param to proposed syscall. (Thiago Macieira)
* Add companion syscall to return the current permission bitmap.
* Update the ptrace path to return EFAULT when no permission to write
XTILEDATA.
* Update do_arch_prctl_common().

Changes from v5:
* Switched to per-process permission. (Based on the discussion on LKML)
---
arch/x86/include/asm/fpu/types.h | 8 +++
arch/x86/include/asm/fpu/xstate.h | 5 ++
arch/x86/include/asm/proto.h | 2 +-
arch/x86/include/uapi/asm/prctl.h | 3 +
arch/x86/kernel/fpu/regset.c | 17 ++++--
arch/x86/kernel/fpu/xstate.c | 96 +++++++++++++++++++++++++++++++
arch/x86/kernel/process.c | 8 ++-
arch/x86/kernel/process_64.c | 6 ++
arch/x86/kernel/traps.c | 8 ++-
9 files changed, 141 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c0192e16cadb..03160a1a79ad 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -336,6 +336,14 @@ struct fpu {
*/
unsigned long avx512_timestamp;

+ /*
+ * @state_perm:
+ *
+ * The bitmap indicates the permission of using some state
+ * components which are dynamically stored in the per-task buffer.
+ */
+ u64 dynamic_state_perm;
+
/*
* @state_mask:
*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 45735441fbe8..9fb6308aaf07 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -149,6 +149,11 @@ void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
unsigned int get_xstate_size(u64 mask);
int alloc_xstate_buffer(struct fpu *fpu, u64 mask);
void free_xstate_buffer(struct fpu *fpu);
+
+long set_process_xstate_perm(struct task_struct *tsk, u64 state_perm);
+void reset_task_xstate_perm(struct task_struct *tsk);
+unsigned long get_task_state_perm(struct task_struct *tsk);
+
int xfeature_size(int xfeature_nr);
int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
int copy_sigframe_from_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index 8c5d1910a848..feed36d44d04 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -40,6 +40,6 @@ void x86_report_nx(void);
extern int reboot_force;

long do_arch_prctl_common(struct task_struct *task, int option,
- unsigned long cpuid_enabled);
+ unsigned long arg2);

#endif /* _ASM_X86_PROTO_H */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..c73e141ce90a 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,9 @@
#define ARCH_GET_CPUID 0x1011
#define ARCH_SET_CPUID 0x1012

+#define ARCH_SET_STATE_ENABLE 0x1021
+#define ARCH_GET_STATE_ENABLE 0x1022
+
#define ARCH_MAP_VDSO_X32 0x2001
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 244e672c3e3d..ee71ffd7c221 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -166,22 +166,27 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
/*
* When a ptracer attempts to write any dynamic user state in the
* target buffer but not sufficiently allocated, it dynamically
- * expands the buffer.
+ * expands the buffer if permitted.
*
* Check if the expansion is possibly needed.
*/
if (xfeatures_mask_user_dynamic &&
((fpu->state_mask & xfeatures_mask_user_dynamic) != xfeatures_mask_user_dynamic)) {
- u64 state_mask;
+ u64 state_mask, dynstate_mask;

/* Retrieve XSTATE_BV. */
memcpy(&state_mask, (kbuf ?: tmpbuf) + offsetof(struct xregs_state, header),
sizeof(u64));

- /* Expand the xstate buffer based on the XSTATE_BV. */
- state_mask &= xfeatures_mask_user_dynamic;
- if (state_mask) {
- ret = alloc_xstate_buffer(fpu, state_mask);
+ /* Check the permission and expand the xstate buffer. */
+ dynstate_mask = state_mask & xfeatures_mask_user_dynamic;
+ if (dynstate_mask) {
+ if ((dynstate_mask & fpu->dynamic_state_perm) != dynstate_mask) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = alloc_xstate_buffer(fpu, dynstate_mask);
if (ret)
goto out;
}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c6ff0575d87d..84dda445386e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -961,6 +961,7 @@ void __init fpu__init_system_xstate(void)
goto out_disable;

/* Make sure init_task does not include the dynamic user states. */
+ current->thread.fpu.dynamic_state_perm = 0;
current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);

/*
@@ -1233,6 +1234,101 @@ int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
return 0;
}

+/**
+ * set_process_xstate_perm - Set a per-process permission to use dynamic
+ * user xstates.
+ * @tsk: A struct task_struct * pointer
+ * @state_perm: A bitmap to indicate which state's permission to be set.
+ * Return: 0 if successful; otherwise, error code.
+ */
+long set_process_xstate_perm(struct task_struct *tsk, u64 state_perm)
+{
+ u64 req_dynstate_perm, old_dynstate_perm;
+ struct task_struct *t;
+ int nr_threads = 0;
+ u64 features_mask;
+
+ if (!boot_cpu_has(X86_FEATURE_FPU))
+ features_mask = 0;
+ else if (use_xsave())
+ features_mask = xfeatures_mask_uabi();
+ else if (use_fxsr())
+ features_mask = XFEATURE_MASK_FPSSE;
+ else
+ features_mask = XFEATURE_MASK_FP;
+
+ if (state_perm & ~features_mask)
+ return -EINVAL;
+
+ req_dynstate_perm = state_perm & xfeatures_mask_user_dynamic;
+ if (!req_dynstate_perm)
+ return 0;
+
+ old_dynstate_perm = tsk->thread.fpu.dynamic_state_perm;
+
+ for_each_thread(tsk, t) {
+ t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
+ nr_threads++;
+ }
+
+ if (nr_threads != tsk->signal->nr_threads) {
+ for_each_thread(tsk, t)
+ t->thread.fpu.dynamic_state_perm = old_dynstate_perm;
+ pr_err("x86/fpu: ARCH_XSTATE_PERM failed as thread number mismatched.\n");
+ return -EBUSY;
+ }
+ return 0;
+}
+
+/**
+ * reset_task_xstate_perm - Reset a task's permission to use dynamic user
+ * xstates.
+ *
+ * It is expected to call at exec in which one task runs in a process.
+ *
+ * @task: A struct task_struct * pointer
+ */
+void reset_task_xstate_perm(struct task_struct *tsk)
+{
+ struct fpu *fpu = &tsk->thread.fpu;
+
+ if (!xfeatures_mask_user_dynamic ||
+ !(fpu->state_mask & xfeatures_mask_user_dynamic))
+ return;
+
+ WARN_ON(tsk->signal->nr_threads > 1);
+
+ fpu->state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+ free_xstate_buffer(fpu);
+ fpu->state = &fpu->__default_state;
+ if (boot_cpu_has(X86_FEATURE_XSAVES))
+ fpstate_init_xstate(&fpu->state->xsave, fpu->state_mask);
+
+ xfd_write(xfd_capable() ^ (fpu->state_mask & xfd_capable()));
+
+ fpu->dynamic_state_perm = 0;
+}
+
+/**
+ * get_task_state_perm - get the state permission bitmap
+ * @tsk: A struct task_struct * pointer
+ * Return: A bitmap to indicate which state's permission is set.
+ */
+unsigned long get_task_state_perm(struct task_struct *tsk)
+{
+ if (!boot_cpu_has(X86_FEATURE_FPU))
+ return 0;
+
+ if (use_xsave())
+ return (xfeatures_mask_uabi() & ~xfeatures_mask_user_dynamic) |
+ tsk->thread.fpu.dynamic_state_perm;
+
+ if (use_fxsr())
+ return XFEATURE_MASK_FPSSE;
+
+ return XFEATURE_MASK_FP;
+}
+
static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
void *init_xstate, unsigned int size)
{
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b85fa499f195..5b4f9b82aea1 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -1012,13 +1012,17 @@ unsigned long get_wchan(struct task_struct *p)
}

long do_arch_prctl_common(struct task_struct *task, int option,
- unsigned long cpuid_enabled)
+ unsigned long arg2)
{
switch (option) {
case ARCH_GET_CPUID:
return get_cpuid_mode();
case ARCH_SET_CPUID:
- return set_cpuid_mode(task, cpuid_enabled);
+ return set_cpuid_mode(task, arg2);
+ case ARCH_SET_STATE_ENABLE:
+ return set_process_xstate_perm(task, arg2);
+ case ARCH_GET_STATE_ENABLE:
+ return put_user(get_task_state_perm(task), (unsigned long __user *)arg2);
}

return -EINVAL;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 41c9855158d6..065ea28328b9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -678,6 +678,9 @@ void set_personality_64bit(void)
so it's not too bad. The main problem is just that
32bit children are affected again. */
current->personality &= ~READ_IMPLIES_EXEC;
+
+ /* Make sure to reset the dynamic state permission. */
+ reset_task_xstate_perm(current);
}

static void __set_personality_x32(void)
@@ -723,6 +726,9 @@ void set_personality_ia32(bool x32)
/* Make sure to be in 32bit mode */
set_thread_flag(TIF_ADDR32);

+ /* Make sure to reset the dynamic state permission. */
+ reset_task_xstate_perm(current);
+
if (x32)
__set_personality_x32();
else
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index dd66d528afd8..c94f3b76c126 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1132,10 +1132,12 @@ DEFINE_IDTENTRY(exc_device_not_available)
int err = -1;

/*
- * Make sure not in interrupt context as handling a
- * trap from userspace.
+ * Make sure that dynamic buffer expansion is permitted
+ * and not in interrupt context as handling a trap from
+ * userspace.
*/
- if (!WARN_ON(in_interrupt())) {
+ if (((xfd_event & fpu->dynamic_state_perm) == xfd_event) &&
+ !WARN_ON(in_interrupt())) {
err = alloc_xstate_buffer(fpu, xfd_event);
if (!err)
xfd_write((fpu->state_mask & xfd_capable()) ^
--
2.17.1


2021-07-30 15:09:30

by Chang S. Bae

Subject: [PATCH v9 19/26] x86/fpu/amx: Define AMX state components and have it used for boot-time checks

Linux uses check_xstate_against_struct() to sanity-check the size of
XSTATE-enabled features. AMX is an XSAVE-enabled feature, and its size is
not hard-coded but discoverable at run-time via CPUID.

The AMX state is composed of state components 17 and 18, both of which are
user state components. The first component is the XTILECFG state of a
64-byte tile-related control register. State component 18, called
XTILEDATA, contains the actual tile data, and its size varies between
implementations. The architectural maximum, as defined in CPUID(0x1d, 1):
EAX[15:0], is a byte less than 64KB. The first implementation supports
8KB.

Check the XTILEDATA state size dynamically. The feature introduces the new
tile register, TMM. Define one register struct only and read the number of
registers from CPUID. Cross-check the overall size with CPUID again.
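
For reference, a user-space sketch (not part of the patch) of the same size
discovery: CPUID.(EAX=0xD,ECX=18) reports the XTILEDATA component's size
and its offset in the standard (non-compacted) XSAVE layout, using state
component number 18 as defined here:

#include <stdint.h>
#include <stdio.h>

static void cpuid_count(uint32_t leaf, uint32_t subleaf, uint32_t *a,
                        uint32_t *b, uint32_t *c, uint32_t *d)
{
        asm volatile("cpuid"
                     : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                     : "a" (leaf), "c" (subleaf));
}

int main(void)
{
        uint32_t size, offset, ecx, edx;

        /*
         * EAX: XTILEDATA size in bytes (8192 on the first implementation),
         * EBX: its offset in the non-compacted XSAVE format.
         */
        cpuid_count(0xd, 18, &size, &offset, &ecx, &edx);
        printf("XTILEDATA: %u bytes at offset %u\n", size, offset);
        return 0;
}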

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v8:
* bugfix: Fix off-by-one error in the check_xstate_against_struct() feature
number argument.

Changes from v4:
* Changed to return an error when tile data size mismatches. (Thomas Gleixner)
* Updated the function description and code comments.

Changes from v2:
* Updated the code comments.

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
arch/x86/include/asm/fpu/types.h | 27 +++++++++++
arch/x86/include/asm/fpu/xstate.h | 2 +
arch/x86/kernel/fpu/xstate.c | 80 ++++++++++++++++++++++++++++++-
3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 03160a1a79ad..f24b58b606dc 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -120,6 +120,9 @@ enum xfeature {
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
+ XFEATURE_RSRVD_COMP_16,
+ XFEATURE_XTILE_CFG,
+ XFEATURE_XTILE_DATA,

XFEATURE_MAX,
};
@@ -136,11 +139,15 @@ enum xfeature {
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
+#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
+#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)

#define XFEATURE_MASK_FPSSE (XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
#define XFEATURE_MASK_AVX512 (XFEATURE_MASK_OPMASK \
| XFEATURE_MASK_ZMM_Hi256 \
| XFEATURE_MASK_Hi16_ZMM)
+#define XFEATURE_MASK_XTILE (XFEATURE_MASK_XTILE_DATA \
+ | XFEATURE_MASK_XTILE_CFG)

#define FIRST_EXTENDED_XFEATURE XFEATURE_YMM

@@ -153,6 +160,9 @@ struct reg_256_bit {
struct reg_512_bit {
u8 regbytes[512/8];
};
+struct reg_1024_byte {
+ u8 regbytes[1024];
+};

/*
* State component 2:
@@ -255,6 +265,23 @@ struct arch_lbr_state {
u64 ler_to;
u64 ler_info;
struct lbr_entry entries[];
+};
+
+/*
+ * State component 17: 64-byte tile configuration register.
+ */
+struct xtile_cfg {
+ u64 tcfg[8];
+} __packed;
+
+/*
+ * State component 18: 1KB tile data register.
+ * Each register represents 16 64-byte rows of the matrix
+ * data. But the number of registers depends on the actual
+ * implementation.
+ */
+struct xtile_data {
+ struct reg_1024_byte tmm;
} __packed;

/*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index c39ea8bac68f..e7c6396261ca 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -14,6 +14,8 @@

#define XSTATE_CPUID 0x0000000d

+#define TILE_CPUID 0x0000001d
+
#define FXSAVE_SIZE 512

#define XSAVE_HDR_SIZE 64
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a36e24028ca7..dac01e4d7654 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -41,6 +41,14 @@ static const char *xfeature_names[] =
"Protection Keys User registers",
"PASID state",
"unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "unknown xstate feature" ,
+ "AMX Tile config" ,
+ "AMX Tile data" ,
+ "unknown xstate feature" ,
};

static unsigned short xsave_cpuid_features[] __initdata = {
@@ -54,6 +62,8 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_PKU,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
+ [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};

/*
@@ -389,6 +399,8 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
+ print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
+ print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}

/*
@@ -662,6 +674,67 @@ static void __xstate_dump_leaves(void)
} \
} while (0)

+/**
+ * check_xtile_data_against_struct - Check tile data state size.
+ *
+ * Calculate the state size by multiplying the single tile size which is
+ * recorded in a C struct, and the number of tiles that the CPU informs.
+ * Compare the provided size with the calculation.
+ *
+ * @size: The tile data state size
+ *
+ * Returns: 0 on success, -EINVAL on mismatch.
+ */
+static int check_xtile_data_against_struct(int size)
+{
+ u32 max_palid, palid, state_size;
+ u32 eax, ebx, ecx, edx;
+ u16 max_tile;
+
+ /*
+ * Check the maximum palette id:
+ * eax: the highest numbered palette subleaf.
+ */
+ cpuid_count(TILE_CPUID, 0, &max_palid, &ebx, &ecx, &edx);
+
+ /*
+ * Cross-check each tile size and find the maximum number of
+ * supported tiles.
+ */
+ for (palid = 1, max_tile = 0; palid <= max_palid; palid++) {
+ u16 tile_size, max;
+
+ /*
+ * Check the tile size info:
+ * eax[31:16]: bytes per tile
+ * ebx[31:16]: the max names (or max number of tiles)
+ */
+ cpuid_count(TILE_CPUID, palid, &eax, &ebx, &edx, &edx);
+ tile_size = eax >> 16;
+ max = ebx >> 16;
+
+ if (tile_size != sizeof(struct xtile_data)) {
+ pr_err("%s: struct is %zu bytes, cpu xtile %d bytes\n",
+ __stringify(XFEATURE_XTILE_DATA),
+ sizeof(struct xtile_data), tile_size);
+ __xstate_dump_leaves();
+ return -EINVAL;
+ }
+
+ if (max > max_tile)
+ max_tile = max;
+ }
+
+ state_size = sizeof(struct xtile_data) * max_tile;
+ if (size != state_size) {
+ pr_err("%s: calculated size is %u bytes, cpu state %d bytes\n",
+ __stringify(XFEATURE_XTILE_DATA), state_size, size);
+ __xstate_dump_leaves();
+ return -EINVAL;
+ }
+ return 0;
+}
+
/*
* We have a C struct for each 'xstate'. We need to ensure
* that our software representation matches what the CPU
@@ -685,6 +758,11 @@ static int check_xstate_against_struct(int nr)
XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
+ XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
+
+ /* The tile data size varies between implementations. */
+ if (nr == XFEATURE_XTILE_DATA)
+ check_xtile_data_against_struct(sz);

/*
* Make *SURE* to add any feature numbers in below if
@@ -694,7 +772,7 @@ static int check_xstate_against_struct(int nr)
if ((nr < XFEATURE_YMM) ||
(nr >= XFEATURE_MAX) ||
(nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
- ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_LBR))) {
+ ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
pr_err("no structure for xstate: %d\n", nr);
XSTATE_WARN_ON(1);
return -EINVAL;
--
2.17.1


2021-07-30 15:09:42

by Chang S. Bae

Subject: [PATCH v9 18/26] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits

Intel's Advanced Matrix Extension (AMX) is a new 64-bit extended feature
consisting of two-dimensional registers and an accelerator unit. The first
implementation of the latter is the tile matrix multiply unit (TMUL). TMUL
performs SIMD dot-products on four bytes (INT8) or two bfloat16
floating-point (BF16) elements.

Here enumerate this hardware capability to be shown as 'amx_tile',
'amx_bf16', and 'amx_int8' in /proc/cpuinfo.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Massaged the changelog a bit.
---
arch/x86/include/asm/cpufeatures.h | 3 +++
arch/x86/kernel/cpu/cpuid-deps.c | 3 +++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 37150b7a8e44..9e9763ec7713 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -386,7 +386,10 @@
#define X86_FEATURE_TSXLDTRK (18*32+16) /* TSX Suspend Load Address Tracking */
#define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
#define X86_FEATURE_ARCH_LBR (18*32+19) /* Intel ARCH LBR */
+#define X86_FEATURE_AMX_BF16 (18*32+22) /* AMX BF16 Support */
#define X86_FEATURE_AVX512_FP16 (18*32+23) /* AVX512 FP16 */
+#define X86_FEATURE_AMX_TILE (18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8 (18*32+25) /* AMX INT8 Support */
#define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
#define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 7f891d2eb52e..9a520ab259ac 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -76,6 +76,9 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_SGX1, X86_FEATURE_SGX },
{ X86_FEATURE_SGX2, X86_FEATURE_SGX1 },
{ X86_FEATURE_XFD, X86_FEATURE_XSAVE },
+ { X86_FEATURE_AMX_TILE, X86_FEATURE_XSAVE },
+ { X86_FEATURE_AMX_INT8, X86_FEATURE_AMX_TILE },
+ { X86_FEATURE_AMX_BF16, X86_FEATURE_AMX_TILE },
{}
};

--
2.17.1


2021-07-30 15:09:52

by Chang S. Bae

Subject: [PATCH v9 20/26] x86/fpu/amx: Initialize child's AMX state

Assure that a forked child starts with its AMX registers in the INIT-state.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Introduced a new define. (Andy Lutomirski)

Changes from v4:
* Added as a new patch. This was missing on previous versions.
---
arch/x86/include/asm/fpu/xstate.h | 3 +++
arch/x86/kernel/fpu/core.c | 3 +++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index e7c6396261ca..912b420cb148 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -80,6 +80,9 @@
XFEATURE_MASK_INDEPENDENT | \
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED)

+/* Volatile states that a child does not inherit. */
+#define XFEATURE_MASK_CLEARED_ON_CLONE XFEATURE_MASK_XTILE
+
#ifdef CONFIG_X86_64
#define REX_PREFIX "0x48, "
#else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 541628bfc8c0..387118127f93 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -299,6 +299,9 @@ int fpu_clone(struct task_struct *dst)
save_fpregs_to_fpstate(dst_fpu);
fpregs_unlock();

+ if (xfeatures_mask_all & XFEATURE_MASK_CLEARED_ON_CLONE)
+ dst_fpu->state->xsave.header.xfeatures &= ~XFEATURE_MASK_CLEARED_ON_CLONE;
+
set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);

trace_x86_fpu_copy_src(src_fpu);
--
2.17.1


2021-07-30 15:09:56

by Chang S. Bae

Subject: [PATCH v9 01/26] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers

Have the function initializing the XSTATE buffer take a struct fpu *
pointer in preparation for dynamic state buffer support.

init_fpstate is a special case, which is indicated by a null pointer
parameter to fpstate_init().

Also, fpstate_init_xstate() now accepts the state component bitmap to
customize the compacted format.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Moved fpstate_init_xstate() back to the header (again).
* Massaged the changelog.

Changes from v4:
* Added a proper function description. (Borislav Petkov)
* Added the likely() statement as a null pointer is a special case.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the function comment to use kernel-doc style. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
arch/x86/include/asm/fpu/internal.h | 11 ++++++++++-
arch/x86/kernel/fpu/core.c | 28 +++++++++++++++++-----------
arch/x86/kernel/fpu/init.c | 2 +-
arch/x86/kernel/fpu/xstate.c | 3 +--
arch/x86/kvm/x86.c | 2 +-
5 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 5a18694a89b2..c7a64e2806a9 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -80,7 +80,7 @@ static __always_inline __pure bool use_fxsr(void)

extern union fpregs_state init_fpstate;

-extern void fpstate_init(union fpregs_state *state);
+extern void fpstate_init(struct fpu *fpu);
#ifdef CONFIG_MATH_EMULATION
extern void fpstate_init_soft(struct swregs_state *soft);
#else
@@ -88,6 +88,15 @@ static inline void fpstate_init_soft(struct swregs_state *soft) {}
#endif
extern void save_fpregs_to_fpstate(struct fpu *fpu);

+static inline void fpstate_init_xstate(struct xregs_state *xsave, u64 mask)
+{
+ /*
+ * XRSTORS requires these bits set in xcomp_bv, or it will
+ * trigger #GP:
+ */
+ xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | mask;
+}
+
/* Returns 0 or the negated trap number, which results in -EFAULT for #PF */
#define user_insn(insn, output, input...) \
({ \
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 7ada7bd03a32..c0098f8422de 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -203,15 +203,6 @@ void fpu_sync_fpstate(struct fpu *fpu)
fpregs_unlock();
}

-static inline void fpstate_init_xstate(struct xregs_state *xsave)
-{
- /*
- * XRSTORS requires these bits set in xcomp_bv, or it will
- * trigger #GP:
- */
- xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xfeatures_mask_all;
-}
-
static inline void fpstate_init_fxstate(struct fxregs_state *fx)
{
fx->cwd = 0x37f;
@@ -229,8 +220,23 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
fp->fos = 0xffff0000u;
}

-void fpstate_init(union fpregs_state *state)
+/**
+ *
+ * fpstate_init - initialize the xstate buffer
+ *
+ * If @fpu is NULL, initialize init_fpstate.
+ *
+ * @fpu: A struct fpu * pointer
+ */
+void fpstate_init(struct fpu *fpu)
{
+ union fpregs_state *state;
+
+ if (likely(fpu))
+ state = &fpu->state;
+ else
+ state = &init_fpstate;
+
if (!static_cpu_has(X86_FEATURE_FPU)) {
fpstate_init_soft(&state->soft);
return;
@@ -239,7 +245,7 @@ void fpstate_init(union fpregs_state *state)
memset(state, 0, fpu_kernel_xstate_size);

if (static_cpu_has(X86_FEATURE_XSAVES))
- fpstate_init_xstate(&state->xsave);
+ fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
if (static_cpu_has(X86_FEATURE_FXSR))
fpstate_init_fxstate(&state->fxsave);
else
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 64e29927cc32..e14c72bc8706 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -124,7 +124,7 @@ static void __init fpu__init_system_generic(void)
* Set up the legacy init FPU context. (xstate init might overwrite this
* with a more modern format, if the CPU supports it.)
*/
- fpstate_init(&init_fpstate);
+ fpstate_init(NULL);

fpu__init_system_mxcsr();
}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8def1b7f8fb..d4fdceb9a309 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -395,8 +395,7 @@ static void __init setup_init_fpu_buf(void)
print_xstate_features();

if (boot_cpu_has(X86_FEATURE_XSAVES))
- init_fpstate.xsave.header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
- xfeatures_mask_all;
+ fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);

/*
* Init all the features state with header.xfeatures being 0x0
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4fd10604f72..76a4e5e274d8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10599,7 +10599,7 @@ static void fx_init(struct kvm_vcpu *vcpu)
if (!vcpu->arch.guest_fpu)
return;

- fpstate_init(&vcpu->arch.guest_fpu->state);
+ fpstate_init(vcpu->arch.guest_fpu);
if (boot_cpu_has(X86_FEATURE_XSAVES))
vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
host_xcr0 | XSTATE_COMPACTION_ENABLED;
--
2.17.1


2021-07-30 15:09:58

by Chang S. Bae

Subject: [PATCH v9 21/26] x86/fpu/amx: Enable the AMX feature in 64-bit mode

In 64-bit mode, include the AMX state components in
XFEATURE_MASK_USER_SUPPORTED.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Adjusted macro changes and moved the disabling code for non-64-bit mode
for the new base changes.

Changes from v4:
* Removed the irrelevant line from the changelog. (Thomas Gleixner)
---
arch/x86/include/asm/fpu/xstate.h | 3 ++-
arch/x86/kernel/fpu/xstate.c | 6 +++++-
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 912b420cb148..f934ce88c048 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -35,7 +35,8 @@
XFEATURE_MASK_Hi16_ZMM | \
XFEATURE_MASK_PKRU | \
XFEATURE_MASK_BNDREGS | \
- XFEATURE_MASK_BNDCSR)
+ XFEATURE_MASK_BNDCSR | \
+ XFEATURE_MASK_XTILE)

/*
* Features which are restored when returning to user space.
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index dac01e4d7654..96056f49bcff 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -538,7 +538,8 @@ static void __init print_xstate_offset_size(void)
XFEATURE_MASK_PKRU | \
XFEATURE_MASK_BNDREGS | \
XFEATURE_MASK_BNDCSR | \
- XFEATURE_MASK_PASID)
+ XFEATURE_MASK_PASID | \
+ XFEATURE_MASK_XTILE)

/*
* setup the xstate image representing the init state
@@ -1054,6 +1055,9 @@ void __init fpu__init_system_xstate(void)
xfeatures_mask_all &= XFEATURE_MASK_USER_SUPPORTED |
XFEATURE_MASK_SUPERVISOR_SUPPORTED;

+ if (!IS_ENABLED(CONFIG_X86_64))
+ xfeatures_mask_all &= ~XFEATURE_MASK_XTILE;
+
/* Store it for paranoia check at the end */
xfeatures = xfeatures_mask_all;
/* Do not support the dynamically allocated buffer yet. */
--
2.17.1


2021-07-30 15:10:04

by Chang S. Bae

Subject: [PATCH v9 23/26] selftest/x86/amx: Test cases for the AMX state management

This selftest verifies that the XSTATE arch_prctl works for AMX state and
that a forked task has the AMX state in the INIT-state.

In addition, this test verifies that the kernel correctly context switches
unique AMX data, when multiple threads are using AMX. The test also
verifies that ptrace() can insert data into existing threads.

Finally, add a test case to verify that unused states are excluded, by
leaving a known pattern on the signal stack and verifying that it is still
intact after taking a subsequent signal.

These test cases do not depend on AMX compiler support, as they employ
userspace-XSAVE directly to access AMX state.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v8:
* Adjust for the arch_prctl change.
* Assure XTILECFG is recovered upon sigreturn.

Changes from v7:
* Adjust for SIGILL.
* Test XTILECFG for legacy signal delivery.

Changes from v6:
* Adjust for the syscall and ptrace path changes.

Changes from v5:
* Adjusted arch_prctl for the updated ABI.
* Added test for the dynamic signal xstate buffer.
* Fixed XSAVE buffer's header data.

Changes from v4:
* Added test for arch_prctl.
* Excluded tile config details to focus on testing the kernel's ability to
manage dynamic user state.
* Removed tile instructions.
* Simplified the fork() and ptrace() test routine.
* Massaged the changelog.

Changes from v2:
* Updated the test messages and the changelog, as tile data is no longer
 inherited by a child.
* Removed bytecode for the instructions already supported by binutils.
* Changed to check the XSAVE availability in a reliable way.

Changes from v1:
* Removed signal testing code
---
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/amx.c | 968 +++++++++++++++++++++++++++
2 files changed, 969 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/amx.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index b4142cd1c5c2..8a1f62ab3c8e 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- corrupt_xstate_header
+ corrupt_xstate_header amx
# Some selftests require 32bit support enabled also on 64bit systems
TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall

diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
new file mode 100644
index 000000000000..afd8c66ca206
--- /dev/null
+++ b/tools/testing/selftests/x86/amx.c
@@ -0,0 +1,968 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <elf.h>
+#include <pthread.h>
+#include <setjmp.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <x86intrin.h>
+
+#include <linux/futex.h>
+
+#include <sys/ptrace.h>
+#include <sys/shm.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <sys/uio.h>
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+static inline uint64_t xgetbv(uint32_t index)
+{
+ uint32_t eax, edx;
+
+ asm volatile("xgetbv;"
+ : "=a" (eax), "=d" (edx)
+ : "c" (index));
+ return eax + ((uint64_t)edx << 32);
+}
+
+static inline void cpuid(uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
+{
+ asm volatile("cpuid;"
+ : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
+ : "0" (*eax), "2" (*ecx));
+}
+
+static inline void xsave(void *xbuf, uint32_t lo, uint32_t hi)
+{
+ asm volatile("xsave (%%rdi)"
+ : : "D" (xbuf), "a" (lo), "d" (hi)
+ : "memory");
+}
+
+static inline void xrstor(void *xbuf, uint32_t lo, uint32_t hi)
+{
+ asm volatile("xrstor (%%rdi)"
+ : : "D" (xbuf), "a" (lo), "d" (hi));
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+ int flags)
+{
+ struct sigaction sa;
+
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_sigaction = handler;
+ sa.sa_flags = SA_SIGINFO | flags;
+ sigemptyset(&sa.sa_mask);
+ if (sigaction(sig, &sa, 0))
+ err(1, "sigaction");
+}
+
+static void clearhandler(int sig)
+{
+ struct sigaction sa;
+
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_handler = SIG_DFL;
+ sigemptyset(&sa.sa_mask);
+ if (sigaction(sig, &sa, 0))
+ err(1, "sigaction");
+}
+
+static jmp_buf jmpbuf;
+
+/* Hardware info check: */
+
+static bool noxsave;
+
+static void handle_noxsave(int sig, siginfo_t *si, void *ctx_void)
+{
+ noxsave = true;
+ siglongjmp(jmpbuf, 1);
+}
+
+#define XFEATURE_XTILECFG 17
+#define XFEATURE_XTILEDATA 18
+#define XFEATURE_MASK_XTILECFG (1 << XFEATURE_XTILECFG)
+#define XFEATURE_MASK_XTILEDATA (1 << XFEATURE_XTILEDATA)
+#define XFEATURE_MASK_XTILE (XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)
+
+static inline bool check_xtile(void)
+{
+ bool xtile_enable;
+
+ sethandler(SIGILL, handle_noxsave, 0);
+
+ if ((!sigsetjmp(jmpbuf, 1)) && (xgetbv(0) & XFEATURE_MASK_XTILE)) {
+ xtile_enable = true;
+ goto out;
+ }
+ xtile_enable = false;
+out:
+ clearhandler(SIGILL);
+ return xtile_enable;
+}
+
+static uint32_t xsave_size;
+static uint32_t xsave_xtiledata_offset, xsave_xtilecfg_offset;
+static uint32_t xtiledata_size, xtilecfg_size;
+
+static struct _tile_spec {
+ uint16_t bytes_per_row;
+ uint16_t max_names;
+ uint16_t max_rows;
+} tile_spec;
+
+#define XSTATE_CPUID 0xd
+#define XSTATE_USER_STATE_SUBLEAVE 0x0
+#define TILE_CPUID 0x1d
+#define TILE_PALETTE_ID 0x1
+
+static void check_cpuid(void)
+{
+ uint32_t eax, ebx, ecx, edx;
+
+ eax = XSTATE_CPUID;
+ ecx = XSTATE_USER_STATE_SUBLEAVE;
+
+ cpuid(&eax, &ebx, &ecx, &edx);
+ if (!ebx)
+ err(1, "xstate cpuid: xsave size");
+
+ xsave_size = ebx;
+
+ eax = XSTATE_CPUID;
+ ecx = XFEATURE_XTILECFG;
+
+ cpuid(&eax, &ebx, &ecx, &edx);
+ if (!eax || !ebx)
+ err(1, "xstate cpuid: tile config state");
+
+ xtilecfg_size = eax;
+ xsave_xtilecfg_offset = ebx;
+
+ eax = XSTATE_CPUID;
+ ecx = XFEATURE_XTILEDATA;
+
+ cpuid(&eax, &ebx, &ecx, &edx);
+ if (!eax || !ebx)
+ err(1, "xstate cpuid: tile data state");
+
+ xtiledata_size = eax;
+ xsave_xtiledata_offset = ebx;
+
+ eax = TILE_CPUID;
+ ecx = TILE_PALETTE_ID;
+
+ cpuid(&eax, &ebx, &ecx, &edx);
+ if (!eax || !ebx || !ecx)
+ err(1, "tile cpuid: palette 1");
+
+ tile_spec.max_names = ebx >> 16;
+ tile_spec.bytes_per_row = ebx;
+ tile_spec.max_rows = ecx;
+}
+
+/* The helpers for managing XSAVE buffer and tile states: */
+
+void *alloc_xsave_buffer(void)
+{
+ void *xbuf;
+
+ /* XSAVE buffer should be 64B-aligned. */
+ xbuf = aligned_alloc(64, xsave_size);
+ if (!xbuf)
+ err(1, "aligned_alloc()");
+ return xbuf;
+}
+
+#define XSAVE_HDR_OFFSET 512
+#define XSAVE_HDR_SIZE 64
+
+static inline void clear_xstate_header(void *buffer)
+{
+ memset(buffer + XSAVE_HDR_OFFSET, 0, XSAVE_HDR_SIZE);
+}
+
+static inline uint64_t get_xstatebv(void *buffer)
+{
+ return *(uint64_t *)(buffer + XSAVE_HDR_OFFSET);
+}
+
+static inline void set_xstatebv(void *buffer, uint64_t bv)
+{
+ *(uint64_t *)(buffer + XSAVE_HDR_OFFSET) = bv;
+}
+
+static void set_rand_tiledata(void *tiledata)
+{
+ int *ptr = tiledata;
+ int data = rand();
+ int i;
+
+ for (i = 0; i < xtiledata_size / sizeof(int); i++, ptr++)
+ *ptr = data;
+}
+
+#define MAX_TILES 16
+#define RESERVED_BYTES 14
+
+struct tile_config {
+ uint8_t palette_id;
+ uint8_t start_row;
+ uint8_t reserved[RESERVED_BYTES];
+ uint16_t colsb[MAX_TILES];
+ uint8_t rows[MAX_TILES];
+};
+
+static void set_tilecfg(void *tilecfg)
+{
+ struct tile_config *cfg = tilecfg;
+ int i;
+
+ memset(cfg, 0, sizeof(*cfg));
+ cfg->palette_id = TILE_PALETTE_ID;
+ for (i = 0; i < tile_spec.max_names; i++) {
+ cfg->colsb[i] = tile_spec.bytes_per_row;
+ cfg->rows[i] = tile_spec.max_rows;
+ }
+}
+
+static void *xsave_buffer, *tiledata, *tilecfg;
+static int nerrs, errs;
+
+/* See 'struct _fpx_sw_bytes' at sigcontext.h */
+#define SW_BYTES_OFFSET 464
+/* N.B. The struct's field name varies so read from the offset. */
+#define SW_BYTES_BV_OFFSET (SW_BYTES_OFFSET + 8)
+
+static inline struct _fpx_sw_bytes *get_fpx_sw_bytes(void *buffer)
+{
+ return (struct _fpx_sw_bytes *)(buffer + SW_BYTES_OFFSET);
+}
+
+static inline uint64_t get_fpx_sw_bytes_xstatebv(void *buffer)
+{
+ return *(uint64_t *)(buffer + SW_BYTES_BV_OFFSET);
+}
+
+static volatile bool noperm;
+static bool check_tilecfg_sigframe;
+
+static void handle_noperm(int sig, siginfo_t *si, void *ctx_void)
+{
+ ucontext_t *ctx = (ucontext_t *)ctx_void;
+ void *xbuf = ctx->uc_mcontext.fpregs;
+ struct _fpx_sw_bytes *sw_bytes;
+
+ printf("\tAt SIGILL handler,\n");
+
+ if (si->si_code != ILL_ILLOPC) {
+ errs++;
+ printf("[FAIL]\tInvalid signal code (%x).\n", si->si_code);
+ } else {
+ printf("[OK]\tValid signal code (ILL_ILLOPC).\n");
+ }
+
+ sw_bytes = get_fpx_sw_bytes(xbuf);
+ if (!(sw_bytes->xstate_size < xsave_xtiledata_offset) &&
+ !(get_fpx_sw_bytes_xstatebv(xbuf) & XFEATURE_MASK_XTILEDATA)) {
+ printf("[OK]\tValid xstate size and mask in the SW data of xstate buffer.\n");
+ } else {
+ errs++;
+ printf("[FAIL]\tInvalid xstate size and/or mask in the SW data of xstate buf.\n");
+ }
+
+ if (check_tilecfg_sigframe) {
+ if (memcmp(tilecfg, xbuf + xsave_xtilecfg_offset, xtilecfg_size)) {
+ errs++;
+ printf("[FAIL]\tTILECFG is corrupted.\n");
+ } else {
+ printf("[OK]\tTILECFG is successfully delivered.\n");
+ }
+ }
+
+ noperm = true;
+ ctx->uc_mcontext.gregs[REG_RIP] += 3; /* Skip the faulting XRSTOR */
+}
+
+/* Return true if XRSTOR is successful; otherwise, false. */
+static inline bool xrstor_safe(void *buffer, uint32_t lo, uint32_t hi)
+{
+ noperm = false;
+ xrstor(buffer, lo, hi);
+ return !noperm;
+}
+
+/* arch_prctl test */
+
+#define ARCH_SET_STATE_ENABLE 0x1021
+#define ARCH_GET_STATE_ENABLE 0x1022
+
+static void enable_tiledata(void)
+{
+ unsigned long bitmask;
+ long rc;
+
+ rc = syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE, XFEATURE_MASK_XTILEDATA);
+ if (rc)
+ goto fail;
+
+ rc = syscall(SYS_arch_prctl, ARCH_GET_STATE_ENABLE, &bitmask);
+ if (rc)
+ err(1, "ARCH_GET_STATE_ENABLE");
+ else if (bitmask & XFEATURE_MASK_XTILEDATA)
+ return;
+
+fail:
+ err(1, "ARCH_SET_STATE_ENABLE");
+}
+
+#define TEST_EXECV_ARG "nested"
+
+static void test_arch_prctl(int argc, char **argv)
+{
+ pid_t parent, child, grandchild;
+
+ parent = fork();
+ if (parent < 0) {
+ err(1, "fork");
+ } else if (parent > 0) {
+ int status;
+
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "arch_prctl test parent exit");
+ return;
+ }
+
+ printf("[RUN]\tCheck ARCH_SET_STATE_ENABLE around process fork().\n");
+
+ printf("\tFork a child.\n");
+ child = fork();
+ if (child < 0) {
+ err(1, "fork");
+ } else if (child > 0) {
+ int status;
+
+ enable_tiledata();
+ printf("\tDo ARCH_SET_STATE_ENABLE at parent\n");
+
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "arch_prctl test child exit");
+ _exit(0);
+ }
+
+ clear_xstate_header(xsave_buffer);
+
+ /* XTILECFG is permitted by default. */
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILECFG);
+ set_tilecfg(xsave_buffer + xsave_xtilecfg_offset);
+ xrstor(xsave_buffer, -1, -1);
+ memcpy(tilecfg, xsave_buffer + xsave_xtilecfg_offset, xtilecfg_size);
+
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+ set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+
+ printf("\tLoad tile data without ARCH_SET_STATE_ENABLE at child.\n");
+ /*
+ * Test XTILECFG state delivery via signal, when XTILEDATA is not
+ * permitted.
+ */
+ check_tilecfg_sigframe = true;
+ if (xrstor_safe(xsave_buffer, -1, -1)) {
+ nerrs++;
+ printf("[FAIL]\tSucceeded at child.\n");
+ } else {
+ printf("[OK]\tBlocked at child.\n");
+
+ /* Assure XTILECFG state recovery at sigreturn. */
+ printf("\tReturn from signal handler,\n");
+ xsave(xsave_buffer, XFEATURE_MASK_XTILECFG, 0);
+ if (memcmp(tilecfg, xsave_buffer + xsave_xtilecfg_offset, xtilecfg_size)) {
+ nerrs++;
+ printf("[FAIL]\tTilecfg is not restored.\n");
+ } else {
+ printf("[OK]\tTilecfg is restored.\n");
+ }
+ }
+
+ printf("\tDo ARCH_SET_STATE_ENABLE at child.\n");
+ enable_tiledata();
+
+ printf("\tLoad tile data with ARCH_SET_STATE_ENABLE at child:\n");
+ check_tilecfg_sigframe = false;
+ if (xrstor_safe(xsave_buffer, -1, -1)) {
+ printf("[OK]\tSucceeded at child.\n");
+ } else {
+ nerrs++;
+ printf("[FAIL]\tBlocked at child.\n");
+ }
+
+ printf("\tFork a grandchild.\n");
+ grandchild = fork();
+ if (grandchild < 0) {
+ err(1, "fork");
+ } else if (!grandchild) {
+ char *args[] = {argv[0], TEST_EXECV_ARG, NULL};
+
+ if (xrstor_safe(xsave_buffer, -1, -1)) {
+ printf("[OK]\tSucceeded at grandchild.\n");
+ } else {
+ nerrs++;
+ printf("[FAIL]\tBlocked at grandchild.\n");
+ }
+ nerrs += execv(args[0], args);
+ } else {
+ int status;
+
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "fork test grandchild");
+ }
+ _exit(0);
+}
+
+/* Testing tile data inheritance */
+
+static void test_fork(void)
+{
+ pid_t child, grandchild;
+
+ child = fork();
+ if (child < 0) {
+ err(1, "fork");
+ } else if (child > 0) {
+ int status;
+
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "fork test child");
+ return;
+ }
+
+ printf("[RUN]\tCheck tile data inheritance.\n\tBefore fork(), load tile data -- yes:\n");
+
+ clear_xstate_header(xsave_buffer);
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILE);
+ set_tilecfg(xsave_buffer + xsave_xtilecfg_offset);
+ set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+ xrstor_safe(xsave_buffer, -1, -1);
+
+ grandchild = fork();
+ if (grandchild < 0) {
+ err(1, "fork");
+ } else if (grandchild > 0) {
+ int status;
+
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "fork test grand child");
+ _exit(0);
+ }
+
+ if (xgetbv(1) & XFEATURE_MASK_XTILE) {
+ nerrs++;
+ printf("[FAIL]\tIn a child, AMX state is not initialized.\n");
+ } else {
+ printf("[OK]\tIn a child, AMX state is initialized.\n");
+ }
+ _exit(0);
+}
+
+/* Context switching test */
+
+#define ITERATIONS 10
+#define NUM_THREADS 5
+
+struct futex_info {
+ int current;
+ int *futex;
+ int next;
+};
+
+static inline void command_wait(struct futex_info *info, int value)
+{
+ do {
+ sched_yield();
+ } while (syscall(SYS_futex, info->futex, FUTEX_WAIT, value, 0, 0, 0));
+}
+
+static inline void command_wake(struct futex_info *info, int value)
+{
+ do {
+ *info->futex = value;
+ while (!syscall(SYS_futex, info->futex, FUTEX_WAKE, 1, 0, 0, 0))
+ sched_yield();
+ } while (0);
+}
+
+static inline int get_iterative_value(int id)
+{
+ return ((id << 1) & ~0x1);
+}
+
+static inline int get_endpoint_value(int id)
+{
+ return ((id << 1) | 0x1);
+}
+
+static void *check_tiledata(void *info)
+{
+ struct futex_info *finfo = (struct futex_info *)info;
+ void *xbuf, *tdata;
+ int i;
+
+ xbuf = alloc_xsave_buffer();
+ tdata = malloc(xtiledata_size);
+ if (!tdata)
+ err(1, "malloc()");
+
+ set_xstatebv(xbuf, XFEATURE_MASK_XTILEDATA);
+ set_rand_tiledata(xbuf + xsave_xtiledata_offset);
+ xrstor_safe(xbuf, -1, -1);
+ memcpy(tdata, xbuf + xsave_xtiledata_offset, xtiledata_size);
+
+ for (i = 0; i < ITERATIONS; i++) {
+ command_wait(finfo, get_iterative_value(finfo->current));
+
+ xsave(xbuf, XFEATURE_MASK_XTILEDATA, 0);
+ if (memcmp(tdata, xbuf + xsave_xtiledata_offset, xtiledata_size))
+ errs++;
+
+ set_rand_tiledata(xbuf + xsave_xtiledata_offset);
+ xrstor_safe(xbuf, -1, -1);
+ memcpy(tdata, xbuf + xsave_xtiledata_offset, xtiledata_size);
+
+ command_wake(finfo, get_iterative_value(finfo->next));
+ }
+
+ command_wait(finfo, get_endpoint_value(finfo->current));
+
+ free(xbuf);
+ free(tdata);
+ return NULL;
+}
+
+static int create_threads(int num, struct futex_info *finfo)
+{
+ const int shm_id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0666);
+ int *futex = shmat(shm_id, NULL, 0);
+ pthread_t thread;
+ int i;
+
+ for (i = 0; i < num; i++) {
+ finfo[i].futex = futex;
+ finfo[i].current = i + 1;
+ finfo[i].next = (i + 2) % (num + 1);
+
+ if (pthread_create(&thread, NULL, check_tiledata, &finfo[i]))
+ err(1, "pthread_create()");
+ }
+ return 0;
+}
+
+static void test_context_switch(void)
+{
+ struct futex_info *finfo;
+ int i;
+
+ printf("[RUN]\tCheck tile data context switches.\n");
+ printf("\t# of context switches -- %u, # of threads -- %d:\n",
+ ITERATIONS * NUM_THREADS, NUM_THREADS);
+
+ errs = 0;
+
+ finfo = malloc(sizeof(*finfo) * NUM_THREADS);
+ if (!finfo)
+ err(1, "malloc()");
+
+ create_threads(NUM_THREADS, finfo);
+
+ for (i = 0; i < ITERATIONS; i++) {
+ command_wake(finfo, get_iterative_value(1));
+ command_wait(finfo, get_iterative_value(0));
+ }
+
+ for (i = 1; i <= NUM_THREADS; i++)
+ command_wake(finfo, get_endpoint_value(i));
+
+ if (errs) {
+ nerrs += errs;
+ printf("[FAIL]\tIncorrect cases were found -- (%d / %u).\n",
+ errs, ITERATIONS * NUM_THREADS);
+ } else {
+ printf("[OK]\tNo incorrect case was found.\n");
+ }
+
+ free(finfo);
+}
+
+/* Ptrace test */
+
+static bool ptracee_state_perm;
+
+static int inject_tiledata(pid_t target)
+{
+ struct iovec iov;
+
+ iov.iov_base = xsave_buffer;
+ iov.iov_len = xsave_size;
+
+ clear_xstate_header(xsave_buffer);
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+ set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+ memcpy(tiledata, xsave_buffer + xsave_xtiledata_offset, xtiledata_size);
+
+ if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) {
+ if (errno != EFAULT)
+ err(1, "PTRACE_SETREGSET");
+ else
+ return errno;
+ }
+
+ if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
+ err(1, "PTRACE_GETREGSET");
+
+ if (!memcmp(tiledata, xsave_buffer + xsave_xtiledata_offset, xtiledata_size))
+ return 0;
+ else
+ return -1;
+}
+
+static void test_tile_write(void)
+{
+ int status, rc;
+ pid_t child;
+ bool pass;
+
+ child = fork();
+ if (child < 0) {
+ err(1, "fork");
+ } else if (!child) {
+ if (ptracee_state_perm)
+ enable_tiledata();
+
+ if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
+ err(1, "PTRACE_TRACEME");
+
+ raise(SIGTRAP);
+ _exit(0);
+ }
+
+ do {
+ wait(&status);
+ } while (WSTOPSIG(status) != SIGTRAP);
+
+ printf("\tInject tile data %s ARCH_SET_STATE_ENABLE\n",
+ ptracee_state_perm ? "with" : "without");
+
+ rc = inject_tiledata(child);
+ pass = (rc == EFAULT && !ptracee_state_perm) ||
+ (!rc && ptracee_state_perm);
+ if (!pass)
+ nerrs++;
+ printf("[%s]\tTile data was %swritten on ptracee.\n",
+ pass ? "OK" : "FAIL", errs ? "not " : "");
+
+ ptrace(PTRACE_DETACH, child, NULL, NULL);
+ wait(&status);
+ if (!WIFEXITED(status) || WEXITSTATUS(status))
+ err(1, "ptrace test");
+}
+
+static void test_ptrace(void)
+{
+ printf("[RUN]\tCheck ptrace() to inject tile data.\n");
+
+ ptracee_state_perm = false;
+ test_tile_write();
+
+ ptracee_state_perm = true;
+ test_tile_write();
+}
+
+/* Signal handling test */
+
+static bool init_tiledata, load_tiledata;
+static volatile bool signaled, sigstk_prefill;
+
+#define SIGFRAME_TILEDATA_SIGNATURE 0xEE
+
+static void handle_sigstk_prefill(int sig, siginfo_t *info, void *ctx_void)
+{
+ void *xbuf = ((ucontext_t *)ctx_void)->uc_mcontext.fpregs;
+ struct _fpx_sw_bytes *sw_bytes = get_fpx_sw_bytes(xbuf);
+
+ if (sw_bytes->xstate_size >= (xsave_xtiledata_offset + xtiledata_size)) {
+ memset(xbuf + xsave_xtiledata_offset, SIGFRAME_TILEDATA_SIGNATURE,
+ xtiledata_size);
+ }
+
+ sigstk_prefill = true;
+}
+
+static void handle_signal(int sig, siginfo_t *info, void *ctx_void)
+{
+ bool tiledata_area, tiledata_bit, tiledata_inuse;
+ void *xbuf = ((ucontext_t *)ctx_void)->uc_mcontext.fpregs;
+ struct _fpx_sw_bytes *sw_bytes = get_fpx_sw_bytes(xbuf);
+ char d = SIGFRAME_TILEDATA_SIGNATURE;
+ int i;
+
+ printf("\tAt signal delivery,\n");
+
+ /* Check SW reserved data in the buffer: */
+ if ((sw_bytes->xstate_size >= (xsave_xtiledata_offset + xtiledata_size)) &&
+ (get_fpx_sw_bytes_xstatebv(xbuf) & XFEATURE_MASK_XTILEDATA)) {
+ printf("[OK]\tValid xstate size and mask in the SW data of xstate buffer\n");
+ } else {
+ errs++;
+ printf("[FAIL]\tInvalid xstate size and/or mask in the SW data of xstate buffer\n");
+ }
+
+ /* Check XSAVE buffer header: */
+ tiledata_inuse = (load_tiledata && !init_tiledata);
+ tiledata_bit = get_xstatebv(xbuf) & XFEATURE_MASK_XTILEDATA;
+
+ if (tiledata_bit == tiledata_inuse) {
+ printf("[OK]\tTiledata bit is %sset in XSTATE_BV of xstate buffer.\n",
+ tiledata_bit ? "" : "not ");
+ } else {
+ errs++;
+ printf("[FAIL]\tTiledata bit is %sset in XSTATE_BV of xstate buffer.\n",
+ tiledata_bit ? "" : "not ");
+ }
+
+ /*
+ * Check the sigframe data:
+ */
+
+ tiledata_inuse = (load_tiledata && !init_tiledata);
+ tiledata_area = false;
+ if (sw_bytes->xstate_size >= (xsave_xtiledata_offset + xtiledata_size)) {
+ for (i = 0; i < xtiledata_size; i++) {
+ if (memcmp(xbuf + xsave_xtiledata_offset + i, &d, 1)) {
+ tiledata_area = true;
+ break;
+ }
+ }
+ }
+
+ if (tiledata_area == tiledata_inuse) {
+ printf("[OK]\tTiledata is %ssaved in signal buffer.\n",
+ tiledata_area ? "" : "not ");
+ } else {
+ errs++;
+ printf("[FAIL]\tTiledata is %ssaved in signal buffer.\n",
+ tiledata_area ? "" : "not ");
+ }
+
+ /* Load random tiledata to test sigreturn: */
+ clear_xstate_header(xsave_buffer);
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+ set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+ xrstor_safe(xsave_buffer, -1, -1);
+ signaled = true;
+}
+
+static void test_signal_handling(void)
+{
+ pid_t child;
+
+ signaled = false;
+ sigstk_prefill = false;
+
+ child = fork();
+ if (child < 0) {
+ err(1, "fork");
+ } else if (child > 0) {
+ do {
+ int status;
+
+ wait(&status);
+ if (WIFSTOPPED(status))
+ kill(child, SIGCONT);
+ else if (WIFEXITED(status) && !WEXITSTATUS(status))
+ break;
+ else
+ err(1, "signal test child");
+ } while (1);
+ return;
+ }
+
+ printf("\tBefore signal, load tile data -- %s", load_tiledata ? "yes, " : "no:\n");
+ if (load_tiledata)
+ printf("re-initialized -- %s:\n", init_tiledata ? "yes" : "no");
+
+ /*
+ * Raise SIGUSR1 to pre-fill sig stack. Also, load tiledata to size the pre-fill.
+ */
+
+ if (load_tiledata) {
+ clear_xstate_header(xsave_buffer);
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+ xrstor_safe(xsave_buffer, -1, -1);
+ }
+
+ raise(SIGUSR1);
+ if (!sigstk_prefill)
+ err(1, "SIGUSR1");
+
+ /*
+ * Raise SIGALRM to test AMX state handling in signal delivery. Set up the state and
+ * data before the test.
+ */
+
+ if (load_tiledata) {
+ clear_xstate_header(xsave_buffer);
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+ set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+ xrstor_safe(xsave_buffer, -1, -1);
+
+ if (init_tiledata) {
+ clear_xstate_header(xsave_buffer);
+ set_xstatebv(xsave_buffer, 0);
+ xrstor_safe(xsave_buffer, -1, -1);
+ memset(tiledata, 0, xtiledata_size);
+ } else {
+ memcpy(tiledata, xsave_buffer + xsave_xtiledata_offset,
+ xtiledata_size);
+ }
+ } else {
+ memset(tiledata, 0, xtiledata_size);
+ }
+
+ raise(SIGALRM);
+ if (!signaled)
+ err(1, "SIGALRM");
+
+ printf("\tReturn from signal handler,\n");
+ xsave(xsave_buffer, XFEATURE_MASK_XTILEDATA, 0);
+ if (memcmp(tiledata, xsave_buffer + xsave_xtiledata_offset, xtiledata_size)) {
+ errs++;
+ printf("[FAIL]\tTiledata is not restored.\n");
+ } else {
+ printf("[OK]\tTiledata is restored.\n");
+ }
+
+ if (errs)
+ nerrs++;
+ _exit(0);
+}
+
+static void test_signal(void)
+{
+ printf("[RUN]\tCheck tile data state in signal path:\n");
+
+ sethandler(SIGALRM, handle_signal, 0);
+ sethandler(SIGUSR1, handle_sigstk_prefill, 0);
+
+ load_tiledata = false;
+ init_tiledata = false;
+ errs = 0;
+ test_signal_handling();
+
+ load_tiledata = true;
+ init_tiledata = false;
+ errs = 0;
+ test_signal_handling();
+
+ load_tiledata = true;
+ init_tiledata = true;
+ errs = 0;
+ test_signal_handling();
+
+ clearhandler(SIGALRM);
+ clearhandler(SIGUSR1);
+}
+
+int main(int argc, char **argv)
+{
+ cpu_set_t cpuset;
+
+ if (argc == 2) {
+ int ret;
+
+ if (strcmp(argv[1], TEST_EXECV_ARG))
+ return 0;
+
+ printf("\tRun after execv().\n");
+
+ xsave_buffer = alloc_xsave_buffer();
+ clear_xstate_header(xsave_buffer);
+
+ set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILE);
+ set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+
+ sethandler(SIGILL, handle_noperm, 0);
+
+ if (xrstor_safe(xsave_buffer, -1, -1)) {
+ printf("[FAIL]\tSucceeded after execv().\n");
+ ret = 1;
+ } else {
+ printf("[OK]\tBlocked after execv().\n");
+ ret = 0;
+ }
+
+ clearhandler(SIGILL);
+ free(xsave_buffer);
+ _exit(ret);
+ }
+
+ /* Check hardware availability at first */
+
+ if (!check_xtile()) {
+ printf("%s is disabled.\n", noxsave ? "XSAVE" : "AMX");
+ return 0;
+ }
+
+ check_cpuid();
+
+ xsave_buffer = alloc_xsave_buffer();
+ clear_xstate_header(xsave_buffer);
+
+ tiledata = malloc(xtiledata_size);
+ if (!tiledata)
+ err(1, "malloc()");
+
+ tilecfg = malloc(xtilecfg_size);
+ if (!tilecfg)
+ err(1, "malloc()");
+ set_tilecfg(tilecfg);
+
+ nerrs = 0;
+
+ sethandler(SIGILL, handle_noperm, 0);
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(0, &cpuset);
+
+ if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+ err(1, "sched_setaffinity to CPU 0");
+
+ test_arch_prctl(argc, argv);
+ test_ptrace();
+
+ enable_tiledata();
+ test_context_switch();
+ test_fork();
+ test_signal();
+
+ clearhandler(SIGILL);
+
+ free(tilecfg);
+ free(tiledata);
+ free(xsave_buffer);
+ return nerrs ? 1 : 0;
+}
--
2.17.1


2021-07-30 15:10:17

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability

Add a custom Sapphire Rapids (SPR) C-state table to intel_idle driver. The
parameters in this table are preferred over those supplied by ACPI.

SPR supports AMX, and so this custom table uses idle entry points that know
how to initialize AMX TMM state, if necessary.

This guarantees that AMX TMM state will never be the cause of hardware
C-state demotion from C6 to C1E. Under some conditions this may result in
improved power savings, and thus higher available turbo frequency budget.

[ Based on patch by Artem Bityutskiy <[email protected]>. ]

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v6:
* Update the changelog and function description. (Rafael J. Wysocki)

Changes from v5:
* Moved the code to intel_idle. (Peter Zijlstra)
* Fixed to deactivate fpregs. (Andy Lutomirski and Dave Hansen)
* Updated the code comment. (Dave Hansen)

Changes from v4:
* Added as a new patch. (Thomas Gleixner)
---
arch/x86/include/asm/special_insns.h | 6 +++
drivers/idle/intel_idle.c | 79 ++++++++++++++++++++++++++++
2 files changed, 85 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index f3fbb84ff8a7..fada1bb82c7b 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -294,6 +294,12 @@ static inline int enqcmds(void __iomem *dst, const void *src)
return 0;
}

+static inline void tile_release(void)
+{
+ /* Instruction opcode for TILERELEASE; supported in binutils >= 2.36. */
+ asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
+}
+
#endif /* __KERNEL__ */

#endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index e6c543b5ee1d..fe1ba26cc797 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -54,6 +54,8 @@
#include <asm/intel-family.h>
#include <asm/mwait.h>
#include <asm/msr.h>
+#include <asm/fpu/internal.h>
+#include <asm/special_insns.h>

#define INTEL_IDLE_VERSION "0.5.1"

@@ -155,6 +157,55 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
return 0;
}

+/**
+ * idle_tile - Initialize TILE registers in INIT-state
+ *
+ * Leaving state in the dirty TILE registers may prevent the processor from
+ * entering lower-power idle states. Use TILERELEASE to initialize the
+ * state. Destroying fpregs state is safe after the fpstate update.
+ */
+static inline void idle_tile(void)
+{
+ if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
+ tile_release();
+ fpregs_deactivate(&current->thread.fpu);
+ }
+}
+
+/**
+ * intel_idle_tile - Ask the processor to enter the given idle state.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Ensure TILE registers in INIT-state before using intel_idle() to
+ * enter the idle state.
+ */
+static __cpuidle int intel_idle_tile(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
+{
+ idle_tile();
+
+ return intel_idle(dev, drv, index);
+}
+
+/**
+ * intel_idle_s2idle_tile - Ask the processor to enter the given idle state.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Ensure TILE registers in INIT-state before using intel_idle_s2idle() to
+ * enter the idle state.
+ */
+static __cpuidle int intel_idle_s2idle_tile(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
+{
+ idle_tile();
+
+ return intel_idle_s2idle(dev, drv, index);
+}
+
/*
* States are indexed by the cstate number,
* which is also the index into the MWAIT hint array.
@@ -752,6 +803,27 @@ static struct cpuidle_state icx_cstates[] __initdata = {
.enter = NULL }
};

+static struct cpuidle_state spr_cstates[] __initdata = {
+ {
+ .name = "C1",
+ .desc = "MWAIT 0x00",
+ .flags = MWAIT2flg(0x00),
+ .exit_latency = 1,
+ .target_residency = 1,
+ .enter = &intel_idle,
+ .enter_s2idle = intel_idle_s2idle, },
+ {
+ .name = "C6",
+ .desc = "MWAIT 0x20",
+ .flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
+ .exit_latency = 128,
+ .target_residency = 384,
+ .enter = &intel_idle_tile,
+ .enter_s2idle = intel_idle_s2idle_tile, },
+ {
+ .enter = NULL }
+};
+
static struct cpuidle_state atom_cstates[] __initdata = {
{
.name = "C1E",
@@ -1095,6 +1167,12 @@ static const struct idle_cpu idle_cpu_icx __initconst = {
.use_acpi = true,
};

+static const struct idle_cpu idle_cpu_spr __initconst = {
+ .state_table = spr_cstates,
+ .disable_promotion_to_c1e = true,
+ .use_acpi = true,
+};
+
static const struct idle_cpu idle_cpu_avn __initconst = {
.state_table = avn_cstates,
.disable_promotion_to_c1e = true,
@@ -1157,6 +1235,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &idle_cpu_skx),
X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, &idle_cpu_icx),
X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, &idle_cpu_icx),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, &idle_cpu_spr),
X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL, &idle_cpu_knl),
X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM, &idle_cpu_knl),
X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT, &idle_cpu_bxt),
--
2.17.1


2021-07-30 15:10:21

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can defer allocating the large XSAVE buffer until tasks
need it.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
xstate-component bitmap format as XCR0.

Use this hardware capability to detect when a task first needs the larger
XSTATE buffer. The #NM handler then performs the buffer expansion.
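
Roughly, the flow added by this patch is (simplified sketch of the code
below, not additional logic):

	/* At boot, arm XFD for all dynamically-enabled user features: */
	xfd_write(xfd_capable());

	/* In the #NM handler, simplified: */
	rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
	wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
	if (xfd_err & xfd_capable()) {
		if (!alloc_xstate_buffer(fpu, xfd_err & xfd_capable()))
			/* Disarm XFD for the newly-allocated features. */
			xfd_write(xfd_capable() ^ (fpu->state_mask & xfd_capable()));
		else
			force_sig_fault(SIGILL, ILL_ILLOPC, error_get_trap_addr(regs));
	}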

Introduce helper functions:
xfd_write() - write IA32_XFD MSR
xfd_read() - read IA32_XFD MSR
xfd_switch() - switch IA32_XFD MSR
xfd_capable() - indicate XFD-capable xfeatures

In the event of vzalloc() failure, send SIGILL with si_code ILL_ILLOPC.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v7:
* Update #NM handler to raise SIGILL rather than SIGSEGV. (Thiago
Macieira)

Changes from v6:
* Update the #NM handler a little bit.
* Clean up the code comment.

Changes from v5:
* Excluded the access request check here and included the buffer allocation
again in the #NM handler. The access request will be dealt with in the next patch.
* Updated the title. (Dave Hansen)
* Updated the code comment.

Changes from v4:
* Changed to use XFD to support the access request policy. Updated #NM
handler to raise a signal instead of buffer allocation.
* Decoupled XFD from the use of XSAVE compacted format.
* Updated helper functions.
* Updated function descriptions in a proper format.
* Updated some code comments.

Changes from v3:
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Changed to enable XFD only when the compacted format is used.
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Inlined the XFD-induced #NM handling code (Andy Lutomirski)
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/fpu/internal.h | 45 +++++++++++++++++++++++++++--
arch/x86/include/asm/msr-index.h | 2 ++
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/fpu/xstate.c | 44 ++++++++++++++++++++++++++--
arch/x86/kernel/process.c | 6 ++++
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kernel/traps.c | 39 +++++++++++++++++++++++++
9 files changed, 136 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..37150b7a8e44 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -277,6 +277,7 @@
#define X86_FEATURE_XSAVEC (10*32+ 1) /* XSAVEC instruction */
#define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 instruction */
#define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD (10*32+ 4) /* eXtended Feature Disabling */

/*
* Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 263e349ff85a..e3590cf55325 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -535,14 +535,55 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
* Misc helper functions:
*/

+/* The Extended Feature Disable (XFD) helpers: */
+
+static inline void xfd_write(u64 value)
+{
+ wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xfd_read(void)
+{
+ u64 value;
+
+ rdmsrl_safe(MSR_IA32_XFD, &value);
+ return value;
+}
+
+static inline u64 xfd_capable(void)
+{
+ return xfeatures_mask_user_dynamic;
+}
+
+/**
+ * xfd_switch - Switches the MSR IA32_XFD context if needed.
+ * @prev: The previous task's struct fpu pointer
+ * @next: The next task's struct fpu pointer
+ */
+static inline void xfd_switch(struct fpu *prev, struct fpu *next)
+{
+ u64 prev_xfd_mask, next_xfd_mask;
+
+ if (!static_cpu_has(X86_FEATURE_XFD) || !xfd_capable())
+ return;
+
+ prev_xfd_mask = prev->state_mask & xfd_capable();
+ next_xfd_mask = next->state_mask & xfd_capable();
+
+ if (unlikely(prev_xfd_mask != next_xfd_mask))
+ xfd_write(xfd_capable() ^ next_xfd_mask);
+}
+
/*
* Delay loading of the complete FPU state until the return to userland.
* PKRU is handled separately.
*/
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
{
- if (cpu_feature_enabled(X86_FEATURE_FPU))
+ if (cpu_feature_enabled(X86_FEATURE_FPU)) {
set_thread_flag(TIF_NEED_FPU_LOAD);
+ xfd_switch(old_fpu, new_fpu);
+ }
}

#endif /* _ASM_X86_FPU_INTERNAL_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..eac0cfd9210b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -626,6 +626,8 @@
#define MSR_IA32_BNDCFGS_RSVD 0x00000ffc

#define MSR_IA32_XSS 0x00000da0
+#define MSR_IA32_XFD 0x000001c4
+#define MSR_IA32_XFD_ERR 0x000001c5

#define MSR_IA32_APICBASE 0x0000001b
#define MSR_IA32_APICBASE_BSP (1<<8)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index defda61f372d..7f891d2eb52e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -75,6 +75,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_SGX_LC, X86_FEATURE_SGX },
{ X86_FEATURE_SGX1, X86_FEATURE_SGX },
{ X86_FEATURE_SGX2, X86_FEATURE_SGX1 },
+ { X86_FEATURE_XFD, X86_FEATURE_XSAVE },
{}
};

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 3b56e7612c45..c6ff0575d87d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -182,6 +182,26 @@ static bool xfeature_is_supervisor(int xfeature_nr)
return ecx & 1;
}

+/**
+ * xfd_supported - Check if the feature supports Extended Feature Disable (XFD).
+ * @feature_nr: The feature number.
+ *
+ * Returns: True if supported; otherwise, false.
+ */
+static bool xfd_supported(int feature_nr)
+{
+ u32 eax, ebx, ecx, edx;
+
+ if (!boot_cpu_has(X86_FEATURE_XFD))
+ return false;
+
+ /*
+ * If the state component supports XFD, ECX[2] returns 1; otherwise, 0.
+ */
+ cpuid_count(XSTATE_CPUID, feature_nr, &eax, &ebx, &ecx, &edx);
+ return ecx & 4;
+}
+
/**
* get_xstate_comp_offset - Find the feature's offset in the compacted
* format.
@@ -274,6 +294,9 @@ void fpu__init_cpu_xstate(void)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
xfeatures_mask_independent());
}
+
+ if (boot_cpu_has(X86_FEATURE_XFD))
+ xfd_write(xfd_capable());
}

static bool xfeature_enabled(enum xfeature xfeature)
@@ -473,8 +496,9 @@ static void __init print_xstate_offset_size(void)
for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
if (!xfeature_enabled(i))
continue;
- pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
- i, xstate_comp_offsets[i], i, xstate_sizes[i]);
+ pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d (%s)\n",
+ i, xstate_comp_offsets[i], i, xstate_sizes[i],
+ (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "dynamic" : "default");
}
}

@@ -920,6 +944,16 @@ void __init fpu__init_system_xstate(void)
/* Do not support the dynamically allocated buffer yet. */
xfeatures_mask_user_dynamic = 0;

+ for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ u64 feature_mask = BIT_ULL(i);
+
+ if (!(xfeatures_mask_uabi() & feature_mask))
+ continue;
+
+ if (xfd_supported(i))
+ xfeatures_mask_user_dynamic |= feature_mask;
+ }
+
/* Enable xstate instructions to be able to continue with initialization: */
fpu__init_cpu_xstate();
err = init_xstate_size();
@@ -981,6 +1015,12 @@ void fpu__resume_cpu(void)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
xfeatures_mask_independent());
}
+
+ if (boot_cpu_has(X86_FEATURE_XFD)) {
+ u64 fpu_xfd_mask = current->thread.fpu.state_mask & xfd_capable();
+
+ xfd_write(xfd_capable() ^ fpu_xfd_mask);
+ }
}

/**
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 534b9fb7e7ee..b85fa499f195 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -97,6 +97,12 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
*size = get_xstate_config(XSTATE_MIN_SIZE);
}

+void arch_release_task_struct(struct task_struct *task)
+{
+ if (cpu_feature_enabled(X86_FEATURE_FPU))
+ free_xstate_buffer(&task->thread.fpu);
+}
+
/*
* Free thread data structures etc..
*/
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..7bd5d08eeb41 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

this_cpu_write(current_task, next_p);

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(prev_fpu, next_fpu);

/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ec0d836a13b1..41c9855158d6 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -620,7 +620,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
this_cpu_write(current_task, next_p);
this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(prev_fpu, next_fpu);

/* Reload sp0. */
update_task_stack(next_p);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..dd66d528afd8 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1112,6 +1112,45 @@ DEFINE_IDTENTRY(exc_device_not_available)
{
unsigned long cr0 = read_cr0();

+ if (boot_cpu_has(X86_FEATURE_XFD)) {
+ u64 xfd_err;
+
+ rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
+ wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
+
+ if (xfd_err) {
+ u64 xfd_event = xfd_err & xfd_capable();
+
+ if (WARN_ON(!xfd_event)) {
+ /*
+ * An unexpected event was raised. Update the XFD state anyway
+ * to unblock the task.
+ */
+ xfd_write(xfd_read() & ~xfd_err);
+ } else {
+ struct fpu *fpu = &current->thread.fpu;
+ int err = -1;
+
+ /*
+ * Make sure this is not running in interrupt context, as it
+ * handles a trap from userspace.
+ */
+ if (!WARN_ON(in_interrupt())) {
+ err = alloc_xstate_buffer(fpu, xfd_event);
+ if (!err)
+ xfd_write((fpu->state_mask & xfd_capable()) ^
+ xfd_capable());
+ }
+
+ /* Raise a signal when it failed to handle. */
+ if (err)
+ force_sig_fault(SIGILL, ILL_ILLOPC,
+ error_get_trap_addr(regs));
+ }
+ return;
+ }
+ }
+
#ifdef CONFIG_MATH_EMULATION
if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
struct math_emu_info info = { };
--
2.17.1


2021-07-30 15:10:30

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v9 16/26] x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in state component numbers

At compile-time xfeatures_mask_all includes all possible XCR0 features. At
run-time fpu__init_system_xstate() clears features in xfeatures_mask_all
that are not enabled in CPUID. It does this by looping through all possible
XCR0 features.

Update the code to handle the possibility that there will be gaps in the
XCR0 feature bit numbers.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Folded a few lines.

Changes from v4:
* Simplified the implementation. (Thomas Gleixner)
* Updated the patch title accordingly.

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
arch/x86/kernel/fpu/xstate.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c539e02965a6..930e72f55e75 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -43,18 +43,17 @@ static const char *xfeature_names[] =
"unknown xstate feature" ,
};

-static short xsave_cpuid_features[] __initdata = {
- X86_FEATURE_FPU,
- X86_FEATURE_XMM,
- X86_FEATURE_AVX,
- X86_FEATURE_MPX,
- X86_FEATURE_MPX,
- X86_FEATURE_AVX512F,
- X86_FEATURE_AVX512F,
- X86_FEATURE_AVX512F,
- X86_FEATURE_INTEL_PT,
- X86_FEATURE_PKU,
- X86_FEATURE_ENQCMD,
+static unsigned short xsave_cpuid_features[] __initdata = {
+ [XFEATURE_SSE] = X86_FEATURE_XMM,
+ [XFEATURE_YMM] = X86_FEATURE_AVX,
+ [XFEATURE_BNDREGS] = X86_FEATURE_MPX,
+ [XFEATURE_BNDCSR] = X86_FEATURE_MPX,
+ [XFEATURE_OPMASK] = X86_FEATURE_AVX512F,
+ [XFEATURE_ZMM_Hi256] = X86_FEATURE_AVX512F,
+ [XFEATURE_Hi16_ZMM] = X86_FEATURE_AVX512F,
+ [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
+ [XFEATURE_PKRU] = X86_FEATURE_PKU,
+ [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
};

/*
@@ -955,7 +954,8 @@ void __init fpu__init_system_xstate(void)
* Clear XSAVE features that are disabled in the normal CPUID.
*/
for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
- if (!boot_cpu_has(xsave_cpuid_features[i]))
+ if (((i == 0) || xsave_cpuid_features[i]) &&
+ !boot_cpu_has(xsave_cpuid_features[i]))
xfeatures_mask_all &= ~BIT_ULL(i);
}

--
2.17.1


2021-07-30 15:10:30

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v9 17/26] x86/fpu/xstate: Disable XSTATE support if an inconsistent state is detected

The kernel has a sanity check between two methods to calculate XSTATE size.
In the unlikely event that they disagree, disable the use of XSTATE.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch. (Thomas Gleixner)
---
arch/x86/kernel/fpu/xstate.c | 40 ++++++++++++++++++++++++------------
1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 930e72f55e75..a36e24028ca7 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -654,11 +654,11 @@ static void __xstate_dump_leaves(void)
} while (0)

#define XCHECK_SZ(sz, nr, nr_macro, __struct) do { \
- if ((nr == nr_macro) && \
- WARN_ONCE(sz != sizeof(__struct), \
- "%s: struct is %zu bytes, cpu state %d bytes\n", \
- __stringify(nr_macro), sizeof(__struct), sz)) { \
+ if ((nr == nr_macro) && (sz != sizeof(__struct))) { \
+ pr_err("%s: struct is %zu bytes, cpu state %d bytes\n", \
+ __stringify(nr_macro), sizeof(__struct), sz); \
__xstate_dump_leaves(); \
+ return -EINVAL; \
} \
} while (0)

@@ -667,7 +667,7 @@ static void __xstate_dump_leaves(void)
* that our software representation matches what the CPU
* tells us about the state's size.
*/
-static void check_xstate_against_struct(int nr)
+static int check_xstate_against_struct(int nr)
{
/*
* Ask the CPU for the size of the state.
@@ -695,9 +695,12 @@ static void check_xstate_against_struct(int nr)
(nr >= XFEATURE_MAX) ||
(nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_LBR))) {
- WARN_ONCE(1, "no structure for xstate: %d\n", nr);
+ pr_err("no structure for xstate: %d\n", nr);
XSTATE_WARN_ON(1);
+ return -EINVAL;
}
+
+ return 0;
}

/**
@@ -707,13 +710,14 @@ static void check_xstate_against_struct(int nr)
* excluded. Only the size of the buffer for task->fpu is checked here.
*
* @include_dynamic_states: A knob to include dynamic states or not.
+ * @size: A pointer to record the size.
*
- * Return: The calculated xstate size.
+ * Return: 0 if successful; otherwise, error code.
*/
-static unsigned int calculate_xstate_size(bool include_dynamic_states)
+static int calculate_xstate_size(bool include_dynamic_states, unsigned int *size)
{
unsigned int xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
- int i;
+ int i, err;

for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
if (!xfeature_enabled(i))
@@ -722,7 +726,10 @@ static unsigned int calculate_xstate_size(bool include_dynamic_states)
if (!include_dynamic_states && (xfeatures_mask_user_dynamic & BIT_ULL(i)))
continue;

- check_xstate_against_struct(i);
+ err = check_xstate_against_struct(i);
+ if (err)
+ return err;
+
/*
* Supervisor state components can be managed only by
* XSAVES.
@@ -748,7 +755,9 @@ static unsigned int calculate_xstate_size(bool include_dynamic_states)
xstate_size += xfeature_size(i);
}

- return xstate_size;
+ if (size)
+ *size = xstate_size;
+ return 0;
}


@@ -835,6 +844,7 @@ static int __init init_xstate_size(void)
/* Recompute the context size for enabled features: */
unsigned int possible_xstate_size, xstate_size;
unsigned int xsave_size;
+ int err;

xsave_size = get_xsave_size();

@@ -848,7 +858,9 @@ static int __init init_xstate_size(void)
* 'true' to include dynamic states. Cross-check with the CPUID-
* provided size and record it.
*/
- xstate_size = calculate_xstate_size(true);
+ err = calculate_xstate_size(true, &xstate_size);
+ if (err)
+ return err;
XSTATE_WARN_ON(possible_xstate_size != xstate_size);
set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);

@@ -857,7 +869,9 @@ static int __init init_xstate_size(void)
* 'false' to exclude dynamic states. Ensure the size fits in
* the statically-allocated buffer and record it.
*/
- xstate_size = calculate_xstate_size(false);
+ err = calculate_xstate_size(false, &xstate_size);
+ if (err)
+ return err;
if (!is_supported_xstate_size(xstate_size))
return -EINVAL;
set_xstate_config(XSTATE_MIN_SIZE, xstate_size);
--
2.17.1


2021-07-30 15:10:34

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v9 26/26] x86/fpu/xstate: Add a sanity check for XFD state when saving XSTATE

Add a DEBUG sanity check that XFD state matches the XINUSE state.

Instead of reading MSR IA32_XFD directly, read a per-cpu value that is
recorded at every MSR write.

Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Added as a new patch. (Dave Hansen)
---
arch/x86/include/asm/fpu/internal.h | 15 +++++++++++++++
arch/x86/kernel/fpu/core.c | 13 +++++++++++++
2 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 04021f0b7dd7..dd845829ac15 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -570,10 +570,25 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)

/* The Extended Feature Disable (XFD) helpers: */

+#ifdef CONFIG_X86_DEBUG_FPU
+DECLARE_PER_CPU(u64, xfd_shadow);
+static inline u64 xfd_debug_shadow(void)
+{
+ return this_cpu_read(xfd_shadow);
+}
+
+static inline void xfd_write(u64 value)
+{
+ wrmsrl_safe(MSR_IA32_XFD, value);
+ this_cpu_write(xfd_shadow, value);
+}
+#else
+#define xfd_debug_shadow() 0
static inline void xfd_write(u64 value)
{
wrmsrl_safe(MSR_IA32_XFD, value);
}
+#endif

static inline u64 xfd_read(void)
{
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 387118127f93..650c2d3cc45d 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -82,6 +82,10 @@ bool irq_fpu_usable(void)
}
EXPORT_SYMBOL(irq_fpu_usable);

+#ifdef CONFIG_X86_DEBUG_FPU
+DEFINE_PER_CPU(u64, xfd_shadow);
+#endif
+
/*
* Save the FPU register state in fpu->state. The register state is
* preserved.
@@ -99,6 +103,15 @@ EXPORT_SYMBOL(irq_fpu_usable);
void save_fpregs_to_fpstate(struct fpu *fpu)
{
if (likely(use_xsave())) {
+ /*
+ * If XFD is armed for an xfeature, XSAVE* will not save
+ * its state. Verify XFD is clear for all features that
+ * are in use before XSAVE*.
+ */
+ if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) && xfd_capable() &&
+ boot_cpu_has(X86_FEATURE_XGETBV1))
+ WARN_ON_FPU(xgetbv(1) & xfd_debug_shadow());
+
os_xsave(&fpu->state->xsave, fpu->state_mask);

/*
--
2.17.1


2021-07-30 15:11:35

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v9 13/26] x86/fpu/xstate: Support ptracer-induced XSTATE buffer expansion

ptrace() may update XSTATE data before the target task has taken an XFD
fault and expanded the XSTATE buffer. Detect this case and allocate a
sufficient buffer to support the request. Also, disable the (now
unnecessary) associated first-use fault.
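
From the tracer's side, the path in question is a PTRACE_SETREGSET of the
xstate regset with tile data marked in XSTATE_BV -- for example (userspace
sketch mirroring the selftest in this series; xsave_buffer and xsave_size are
the tracer's hand-built XSAVE image and its CPUID-reported size):

	struct iovec iov = {
		.iov_base = xsave_buffer,
		.iov_len  = xsave_size,
	};

	/* XSTATE_BV in the image has XFEATURE_MASK_XTILEDATA set. */
	if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
		err(1, "PTRACE_SETREGSET");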

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Adjusted to use 'tmpbuf' for the new base code.

Changes from v4:
* Improved the condition check for the expansion.
* Simplified the XSTATE_BV retrieval.
* Updated the code comment.

Changes from v3:
* Removed 'no functional changes' in the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Updated the code comments.
---
arch/x86/kernel/fpu/regset.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 73d7d7b489fe..244e672c3e3d 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -163,6 +163,30 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
}
}

+ /*
+ * When a ptracer attempts to write dynamic user state for which the
+ * target's buffer has not yet been allocated, expand the buffer
+ * dynamically.
+ *
+ * Check whether the expansion is needed.
+ */
+ if (xfeatures_mask_user_dynamic &&
+ ((fpu->state_mask & xfeatures_mask_user_dynamic) != xfeatures_mask_user_dynamic)) {
+ u64 state_mask;
+
+ /* Retrieve XSTATE_BV. */
+ memcpy(&state_mask, (kbuf ?: tmpbuf) + offsetof(struct xregs_state, header),
+ sizeof(u64));
+
+ /* Expand the xstate buffer based on the XSTATE_BV. */
+ state_mask &= xfeatures_mask_user_dynamic;
+ if (state_mask) {
+ ret = alloc_xstate_buffer(fpu, state_mask);
+ if (ret)
+ goto out;
+ }
+ }
+
fpu_force_restore(fpu);
ret = copy_uabi_from_kernel_to_xstate(fpu, kbuf ?: tmpbuf);

--
2.17.1


2021-07-30 15:12:14

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v9 22/26] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state

By default, for XSTATE features in the INIT-state, XSAVE writes zeros to
the uncompressed destination buffer.

E.g., if you are not using AVX-512, you will still get a bunch of zeros on
the signal stack where live AVX-512 data would go.

For 'dynamic user state' (currently only XTILEDATA), explicitly skip this
data transfer. The result is that the user buffer for the AMX region will
not be touched by XSAVE.

[ Reading XINUSE takes about 20-30 cycles, but writing zeros consumes about
5 times as many cycles or more, e.g., for XTILEDATA. ]
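
In effect (a sketch of the check added below, not extra code), the mask used
for the signal frame is trimmed based on XINUSE:

	u64 xinuse = xgetbv(1);	/* XINUSE: components not in INIT-state */

	if (!(xinuse & XFEATURE_MASK_XTILEDATA))
		/* XTILEDATA is in INIT-state: skip writing 8KB of zeros. */
		state_mask &= ~XFEATURE_MASK_XTILEDATA;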

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Mentioned the optimization trade-offs in the changelog. (Dave Hansen)
* Added code comment.

Changes from v4:
* Added as a new patch.
---
arch/x86/include/asm/fpu/internal.h | 38 +++++++++++++++++++++--------
1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 3b52cfb62ab5..04021f0b7dd7 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -337,8 +337,9 @@ static inline void os_xrstor(struct xregs_state *xstate, u64 mask)
*/
static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
{
+ struct fpu *fpu = &current->thread.fpu;
u32 lmask, hmask;
- u64 mask;
+ u64 state_mask;
int err;

/*
@@ -346,21 +347,38 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
* internally, e.g. PKRU. That's user space ABI and also required
* to allow the signal handler to modify PKRU.
*/
- mask = xfeatures_mask_uabi();
+ state_mask = xfeatures_mask_uabi();
+
+ if (!xfeatures_mask_user_dynamic)
+ goto mask_ready;

/*
* Exclude dynamic user states for non-opt-in threads.
*/
- if (xfeatures_mask_user_dynamic) {
- struct fpu *fpu = &current->thread.fpu;
-
- mask &= fpu->dynamic_state_perm ?
- fpu->state_mask :
- ~xfeatures_mask_user_dynamic;
+ if (!fpu->dynamic_state_perm) {
+ state_mask &= ~xfeatures_mask_user_dynamic;
+ } else {
+ u64 dynamic_state_mask;
+
+ state_mask &= fpu->state_mask;
+
+ dynamic_state_mask = state_mask & xfeatures_mask_user_dynamic;
+ if (dynamic_state_mask && boot_cpu_has(X86_FEATURE_XGETBV1)) {
+ u64 dynamic_xinuse, dynamic_init;
+ u64 xinuse = xgetbv(1);
+
+ dynamic_xinuse = xinuse & dynamic_state_mask;
+ dynamic_init = ~xinuse & dynamic_state_mask;
+ if (dynamic_init) {
+ state_mask &= ~xfeatures_mask_user_dynamic;
+ state_mask |= dynamic_xinuse;
+ }
+ }
}

- lmask = mask;
- hmask = mask >> 32;
+mask_ready:
+ lmask = state_mask;
+ hmask = state_mask >> 32;

/*
* Clear the xsave header first, so that reserved fields are
--
2.17.1


2021-07-30 15:12:16

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v9 24/26] x86/insn/amx: Add TILERELEASE instruction to the opcode map

Add the opcode for TILERELEASE, which returns all AMX state to the
INIT-state.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch as preparatory to use the instruction in the kernel.
---
arch/x86/lib/x86-opcode-map.txt | 8 +++++++-
tools/arch/x86/lib/x86-opcode-map.txt | 8 +++++++-
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index ec31f5b60323..dbc5078ccafe 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -690,7 +690,9 @@ AVXcode: 2
45: vpsrlvd/q Vx,Hx,Wx (66),(v)
46: vpsravd Vx,Hx,Wx (66),(v) | vpsravd/q Vx,Hx,Wx (66),(evo)
47: vpsllvd/q Vx,Hx,Wx (66),(v)
-# Skip 0x48-0x4b
+# Skip 0x48
+49: Grp22 (1A)
+# Skip 0x4a-0x4b
4c: vrcp14ps/d Vpd,Wpd (66),(ev)
4d: vrcp14ss/d Vsd,Hpd,Wsd (66),(ev)
4e: vrsqrt14ps/d Vpd,Wpd (66),(ev)
@@ -1082,6 +1084,10 @@ GrpTable: Grp21
7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B)
EndTable

+GrpTable: Grp22
+0: TILERELEASE (!F3),(v1),(11B)
+EndTable
+
# AMD's Prefetch Group
GrpTable: GrpP
0: PREFETCH
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index ec31f5b60323..dbc5078ccafe 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -690,7 +690,9 @@ AVXcode: 2
45: vpsrlvd/q Vx,Hx,Wx (66),(v)
46: vpsravd Vx,Hx,Wx (66),(v) | vpsravd/q Vx,Hx,Wx (66),(evo)
47: vpsllvd/q Vx,Hx,Wx (66),(v)
-# Skip 0x48-0x4b
+# Skip 0x48
+49: Grp22 (1A)
+# Skip 0x4a-0x4b
4c: vrcp14ps/d Vpd,Wpd (66),(ev)
4d: vrcp14ss/d Vsd,Hpd,Wsd (66),(ev)
4e: vrsqrt14ps/d Vpd,Wpd (66),(ev)
@@ -1082,6 +1084,10 @@ GrpTable: Grp21
7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B)
EndTable

+GrpTable: Grp22
+0: TILERELEASE (!F3),(v1),(11B)
+EndTable
+
# AMD's Prefetch Group
GrpTable: GrpP
0: PREFETCH
--
2.17.1


2021-07-30 18:43:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability

> #endif /* _ASM_X86_SPECIAL_INSNS_H */
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index e6c543b5ee1d..fe1ba26cc797 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -54,6 +54,8 @@
> #include <asm/intel-family.h>
> #include <asm/mwait.h>
> #include <asm/msr.h>
> +#include <asm/fpu/internal.h>
> +#include <asm/special_insns.h>
>
> #define INTEL_IDLE_VERSION "0.5.1"
>
> @@ -155,6 +157,55 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
> return 0;
> }
>
> +/**
> + * idle_tile - Initialize TILE registers in INIT-state
> + *
> + * Leaving state in the dirty TILE registers may prevent the processor from
> + * entering lower-power idle states. Use TILERELEASE to initialize the
> + * state. Destroying fpregs state is safe after the fpstate update.
> + */
> +static inline void idle_tile(void)
> +{
> + if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
> + tile_release();
> + fpregs_deactivate(&current->thread.fpu);
> + }
> +}

This isn't obviously safe code. There's a window in there when we have
bogus, destroyed FPU register state but where we might be rescheduled.

I would assume that preempt is off *somewhere* in this, but it would be
nice to make sure of that, or at least mention the requirement for it to
be off before this code is safe.

I'm also not sure TILERELEASE is *technically* what you want here.
xgetbv(1) tells you whether a feature is being tracked by the processor
as being in its init state. tile_release() gets a feature into its init
state, but does not guarantee that the processor will *track* it as
being in the init state.

TILERELEASE is not documented to have an effect on XINUSE (init tracking).

XRSTOR, on the other hand, is at least documented to affect XINUSE.

It sounds like we either need a documentation update, or a clear
explanation why TILERELEASE is being used over XRSTOR.
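For reference, the XRSTOR-based alternative would look roughly like the sketch
below. It leans on os_xrstor() and init_fpstate from the current FPU core, and
it is only an illustration of the idea being discussed, not a proposed patch:

/*
 * Sketch of the XRSTOR alternative: restoring the TILE components from
 * init_fpstate initializes the registers and, per the documented XRSTOR
 * behavior, also clears their XINUSE tracking. Assumes os_xrstor() and
 * init_fpstate as found after the recent FPU rework.
 */
static inline void idle_tile_via_xrstor(void)
{
	if (boot_cpu_has(X86_FEATURE_XGETBV1) &&
	    (xgetbv(1) & XFEATURE_MASK_XTILE)) {
		os_xrstor(&init_fpstate.xsave, XFEATURE_MASK_XTILE);
		fpregs_deactivate(&current->thread.fpu);
	}
}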



2021-07-30 20:20:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability

On 7/30/21 7:59 AM, Chang S. Bae wrote:
> SPR supports AMX, and so this custom table uses idle entry points that know
> how to initialize AMX TMM state, if necessary.

That's pretty direct with where this is showing up.

But, the cover letter is quite a bit more cagey:

> Intel Advanced Matrix Extensions (AMX)[1][2] will be shipping on servers
> soon.

If you do another version of these, it might be handy to make sure those
are as consistent as possible.

2021-08-03 21:47:53

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability

On Jul 30, 2021, at 11:41, Hansen, Dave <[email protected]> wrote:
>> +/**
>> + * idle_tile - Initialize TILE registers in INIT-state
>> + *
>> + * Leaving state in the dirty TILE registers may prevent the processor from
>> + * entering lower-power idle states. Use TILERELEASE to initialize the
>> + * state. Destroying fpregs state is safe after the fpstate update.
>> + */
>> +static inline void idle_tile(void)
>> +{
>> + if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
>> + tile_release();
>> + fpregs_deactivate(&current->thread.fpu);
>> + }
>> +}
>
> This isn't obviously safe code. There's a window in there when we have
> bogus, destroyed FPU register state but where we might be rescheduled.
>
> I would assume that preempt is off *somewhere* in this, but it would be
> nice to make sure of that, or at least mention the requirement for it to
> be off before this code is safe.

I can see preempt_disable() in this path:

$kernel/sched/idle.c::play_idle_precise()
--> preempt_disable()
...
--> do_idle()
--> cpuidle_idle_call()
--> call_cpuidle()
--> $drivers/cpuidle/cpuidle.c::cpuidle_enter()
--> cpuidle_enter_state()
--> target_state->enter()
--> $drivers/idle/intel_idle.c::intel_idle_tile()
--> idle_tile()
...
--> preempt_enable()

> I'm also not sure TILERELEASE is *technically* what you want here.
> xgetbv(1) tells you whether a feature is being tracked by the processor
> as being in its init state. tile_release() gets a feature into its init
> state, but does not guarantee that the processor will *track* it as
> being in the init state.
>
> TILERELEASE is not documented to have an effect on XINUSE (init tracking).
>
> XRSTOR, on the other hand, is at least documented to affect XINUSE.
>
> It sounds like we either need a documentation update, or a clear
> explanation why TILERELEASE is being used over XRSTOR.

TILERELEASE impacts INIT tracking at least on the first AMX implementation. I
agree that the documentation needs some update.

Thanks,
Chang

2021-08-03 21:48:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability

On 8/3/21 2:32 PM, Bae, Chang Seok wrote:
>>> +static inline void idle_tile(void)
>>> +{
>>> + if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
>>> + tile_release();
>>> + fpregs_deactivate(&current->thread.fpu);
>>> + }
>>> +}
>> This isn't obviously safe code. There's a window in there when we have
>> bogus, destroyed FPU register state but where we might be rescheduled.
>>
>> I would assume that preempt is off *somewhere* in this, but it would be
>> nice to make sure of that, or at least mention the requirement for it to
>> be off before this code is safe.
> I can see preempt_disable() in this path:
>
> $kernel/sched/idle.c::play_idle_precise()
> --> preempt_disable()
> ...
> --> do_idle()
> --> cpuidle_idle_call()
> --> call_cpuidle()
> --> $drivers/cpuidle/cpuidle.c::cpuidle_enter()
> --> cpuidle_enter_state()
> --> target_state->enter()
> --> $drivers/idle/intel_idle.c::intel_idle_tile()
> --> idle_tile()
> ...
> --> preempt_enable()

OK, that's good. Can we comment about the preempt requirement
somewhere? Or, maybe add a !in_atomic() warning?

Also, should this have something like a fpregs_state_valid() check? If
the registers are invalid, should this be calling tile_release()?
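Something along the lines of the sketch below would cover the first point; the
WARN_ON_ONCE(preemptible()) check is only one assumed way to encode the
requirement, not code from the series:

static inline void idle_tile(void)
{
	/* The cpuidle core must call this with preemption disabled. */
	WARN_ON_ONCE(preemptible());

	if (boot_cpu_has(X86_FEATURE_XGETBV1) &&
	    (xgetbv(1) & XFEATURE_MASK_XTILE)) {
		tile_release();
		fpregs_deactivate(&current->thread.fpu);
	}
}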

2021-08-03 21:50:53

by Brown, Len

[permalink] [raw]
Subject: RE: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability

> Also, should this have something like a fpregs_state_valid() check? If the registers are invalid, should this be calling tile_release()?

From a correctness point of view, it is valid to always call tile_release() here.
From a performance point of view, tile_release() is very fast.



2021-08-06 23:54:05

by Thiago Macieira

[permalink] [raw]
Subject: Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

On Friday, 30 July 2021 07:59:45 PDT Chang S. Bae wrote:
> + for_each_thread(tsk, t) {
> + t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
> + nr_threads++;
> + }
> +
> + if (nr_threads != tsk->signal->nr_threads) {
> + for_each_thread(tsk, t)
> + t->thread.fpu.dynamic_state_perm = old_dynstate_perm;
> + pr_err("x86/fpu: ARCH_XSTATE_PERM failed as thread number mismatched.\n");
> + return -EBUSY;
> + }
> + return 0;
> +}

Hello all

As I was trying to write the matching userspace code, I think the solution
above has two problems.

First the simpler one: that EBUSY. It must go and you can do that with a lock.
Library code cannot ensure that it is running in single-threaded state and
that no other threads are started or exit while they make the system call.
There's nothing the library in question can do if it got an EBUSY. Do you want
me to try again? What if it fails again? What's the state of the dynamically
permitted states after an EBUSY? It's probably inconsistent. Moreover, there's
an ABA problem there: what happens if a thread starts and another exits while
this system call is running? And what happens if two threads are making this
system call?
(also, shouldn't tsk->signal->nr_threads be an atomic read?)

The second and bigger problem is the consequence of not issuing the
ARCH_SET_STATE_ENABLE call: a SIGILL. Up until now, this hasn't happened, so I
expect this to be a surprise to people, in the worst possible way. The Intel
Software Developer Manual and every single tutorial out there says that the
sequence of actions is:
1) check that OSXSAVE is enabled
2) check that the AVX, AVX512 or AMX instructions are supported with CPUID
3) execute XGETBV EAX=0
4) disable any instructions whose matching state is not enabled by the OS

This is what software developers will write for AMX and any new future state,
until they learn better. This is also all that other OSes will require to run.
Moreover, until developers can actually run their software on CPUs with AMX
support, they will not notice any missed system calls (the Software
Development Emulator tool will execute the instructions whether you've issued
the syscall or not).
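For illustration, the textbook check looks roughly like the sketch below
(GCC/Clang on x86-64; the helper names and the local XFEATURE_MASK_XTILE
define exist only in this example). Note that it never issues the new
arch_prctl(), which is exactly the problem being described:

#include <cpuid.h>
#include <stdbool.h>
#include <stdint.h>

#define XFEATURE_MASK_XTILE	(3ULL << 17)	/* XTILECFG | XTILEDATA */

static uint64_t xgetbv0(void)
{
	uint32_t eax, edx;

	__asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
	return ((uint64_t)edx << 32) | eax;
}

static bool amx_usable(void)
{
	uint32_t eax, ebx, ecx, edx;

	/* 1+2: OSXSAVE (CPUID.1:ECX[27]) and AMX-TILE (CPUID.7.0:EDX[24]) */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
		return false;
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 24)))
		return false;

	/* 3+4: XGETBV(0) says the OS manages the TILE state */
	return (xgetbv0() & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE;
}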

As a consequence, there's a large chance that a test escape like that will
cause software to start crashing when run on AMX-capable CPUs when those start
showing up and get enabled in public clouds.

So I have to insist that the XGETBV instruction's result match exactly what is
permitted to run. That means we either enable AMX unconditionally with no need
for system calls (with or without XFD trapping to dynamically allocate more
state), or that the XCR0 register be set without the AMX bits by default,
until the system call is issued.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering



2021-08-10 03:48:53

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

On Aug 6, 2021, at 09:46, Macieira, Thiago <[email protected]> wrote:
> On Friday, 30 July 2021 07:59:45 PDT Chang S. Bae wrote:
>> + for_each_thread(tsk, t) {
>> + t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
>> + nr_threads++;
>> + }
>> +
>> + if (nr_threads != tsk->signal->nr_threads) {
>> + for_each_thread(tsk, t)
>> + t->thread.fpu.dynamic_state_perm = old_dynstate_perm;
>> + pr_err("x86/fpu: ARCH_XSTATE_PERM failed as thread number mismatched.\n");
>> + return -EBUSY;
>> + }
>> + return 0;
>> +}

<snip>

> First the simpler one: that EBUSY. It must go and you can do that with a lock.
> Library code cannot ensure that it is running in single-threaded state and
> that no other threads are started or exit while they make the system call.
> There's nothing the library in question can do if it got an EBUSY. Do you want
> me to try again? What if it fails again? What's the state of the dynamically
> permitted states after an EBUSY? It's probably inconsistent. Moreover, there's
> an ABA problem there: what happens if a thread starts and another exits while
> this system call is running? And what happens if two threads are making this
> system call?
> (also, shouldn't tsk->signal->nr_threads be an atomic read?)

I suspect the EBUSY situation is somewhat hypothetical. In theory, it might
arise if one thread calls this syscall at the point when a new task is being
created -- after task data duplication [1] and before it is enlisted [2].

As stated in the changelog, the alternative is possible:
> An alternative implementation would not save the permission bitmap in
> every task. But instead would extend the per-process signal data, and
> that would not be subject to this race.
But it involves quite a bit of code complexity, and this is pretty much a
backend detail. I think it is possible to follow up and update this if the
case ever turns out to be real. At least, I'm not aware of any report
against the PR_SET_FP_MODE prctl(2) [3], which took the same approach --
walk and update the task list.

Perhaps, the hunk above can be improved to be atomic.

<snip>

> So I have to insist that the XGETBV instruction's result match exactly what is
> permitted to run. That means we either enable AMX unconditionally with no need
> for system calls (with or without XFD trapping to dynamically allocate more
> state), or that the XCR0 register be set without the AMX bits by default,
> until the system call is issued.

Switching XCR0 provokes a VMEXIT, which will impact performance heavily. At
least the opt-in model is a consensus out of the long debate [4]. That leaves
aside the question of how well to advertise this new syscall, though.

Thanks,
Chang

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n2128
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n2320
[3]: https://man7.org/linux/man-pages/man2/prctl.2.html
[4]: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com/


2021-08-10 04:20:25

by Thiago Macieira

[permalink] [raw]
Subject: Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

On Monday, 9 August 2021 15:08:19 PDT Bae, Chang Seok wrote:
> I suspect the EBUSY situation is somewhat hypothetical. In theory, it might
> arise if one thread calls this syscall at the point when a new task is being
> created -- after task data duplication [1] and before it is enlisted [2].
>
> As stated in the changelog, the alternative is possible:
> > An alternative implementation would not save the permission bitmap in
> > every task. But instead would extend the per-process signal data, and
> > that would not be subject to this race.
>
> But it involves quite a bit of code complexity, and this is pretty much a
> backend detail. I think it is possible to follow up and update this if the
> case ever turns out to be real. At least, I'm not aware of any report
> against the PR_SET_FP_MODE prctl(2) [3], which took the same approach --
> walk and update the task list.
>
> Perhaps, the hunk above can be improved to be atomic.
>
> <snip>

Hello Chang

Thanks for taking a look at this. I agree that this is a very, very tiny
corner case and the issue can be treated as a bugfix later. The API between
userspace and kernel is fine, which is the big issue right now.

But explaining what the issue I see is: a userspace library cannot enforce
that other threads in the same process aren't either making this same system
call or starting/exiting threads. So I see two scenarios where the corner case
can be reached:

1) two simultaneous ARCH_SET_STATE_ENABLE calls
Imagine two threads, each setting a different bit (say bits 18 and 19). Since
they race with each other and this line:
t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
is not using an atomic, the compiler won't emit LOCK OR, so it's possible the
two calls will step over each other and partially undo the other's work. The
result after the two calls is completely indeterminate, yet both functions
returned success.

Since right now there's only one bit that can be set, we know that the two
calls are doing the same thing, so they're not effectively racing each other.
So this case is not an issue *right* *now*. There's only duplicate work.

2) one thread calls ARCH_SET_STATE_ENABLE while another thread exits/starts
In this case, the nr_threads != tsk->signal->nr_threads test will fail,
resulting in the dynamic state being rolled back and the EBUSY condition. I'd
like a recommendation from the kernel on how to deal with that error: should
I retry? For now, I'm going to treat EBUSY like EINVAL and assume I cannot use
the feature.

1+2) both situations at the same time
This means the corruption can get worse since the rollback code can undo or
partially undo the progression of the other ARCH_SET_STATE_ENABLE.

> > So I have to insist that the XGETBV instruction's result match exactly
> > what is permitted to run. That means we either enable AMX unconditionally
> > with no need for system calls (with or without XFD trapping to
> > dynamically allocate more state), or that the XCR0 register be set
> > without the AMX bits by default, until the system call is issued.
>
> Switching XCR0 provokes a VMEXIT, which will impact performance heavily. At
> least the opt-in model is a consensus out of the long debate [4]. That
> leaves aside the question of how well to advertise this new syscall, though.

I understand.

I am pointing out that this will cause crashes because of improperly /
insufficiently-tested software. That is, software that violates the contract
of the new API because we inserted a new requirement that didn't exist for old
features. Yes, said software is buggy.

The problem is that the crashes can be surprising and will only show up after
the software gets run on an AMX-capable machine. That may happen, for example,
if a cloud provider "upgrades" the instance of a VM from a previous generation
of processor to a new one, or if a batch job does include the new instance
type. That is, the crashes will not happen for the developer of the software
in question, but instead for the user.

However, given the requirements that:
a) XCR0 not be context-switched
b) a new API call be required to allow the new instructions

Then there's no alternative.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering



2021-08-10 05:24:36

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

On Aug 9, 2021, at 16:42, Macieira, Thiago <[email protected]> wrote:
>
> This means the corruption can get worse since the rollback code can undo or
> partially undo the progression of the other ARCH_SET_STATE_ENABLE.

Maybe something like this can help here to ensure a valid rollback.

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 96056f49bcff..3468bc0ee654 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1353,6 +1353,8 @@ int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
return 0;
}

+static DEFINE_SPINLOCK(set_xstate_perm_lock);
+
/**
* set_process_xstate_perm - Set a per-process permission to use dynamic
* user xstates.
@@ -1383,6 +1385,8 @@ long set_process_xstate_perm(struct task_struct *tsk, u64 state_perm)
if (!req_dynstate_perm)
return 0;

+ spin_lock(&set_xstate_perm_lock);
+
old_dynstate_perm = tsk->thread.fpu.dynamic_state_perm;

for_each_thread(tsk, t) {
@@ -1396,6 +1400,8 @@ long set_process_xstate_perm(struct task_struct *tsk, u64 state_perm)
pr_err("x86/fpu: ARCH_XSTATE_PERM failed as thread number mismatched.\n");
return -EBUSY;
}
+
+ spin_unlock(&set_xstate_perm_lock);
return 0;
}

Thanks,
Chang

2021-08-13 19:46:55

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

On Aug 9, 2021, at 17:57, Bae, Chang Seok <[email protected]> wrote:
> On Aug 9, 2021, at 16:42, Macieira, Thiago <[email protected]> wrote:
>>
>> This means the corruption can get worse since the rollback code can undo or
>> partially undo the progression of the other ARCH_SET_STATE_ENABLE.
>
> Maybe something like this can help here to ensure a valid rollback.

After reconsidering this, I think the group_leader task's permission value is
reliable. Perhaps reference the group_leader's value everywhere, instead of
each task's. I think that resolves the corner case in a simple way.
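A minimal sketch of that idea, reusing the field names from this series (the
helper itself is invented only for illustration):

/*
 * Illustration only: keep one authoritative copy of the dynamic-state
 * permission on the thread-group leader and always read it from there.
 */
static inline u64 dynamic_state_perm(struct task_struct *tsk)
{
	return tsk->group_leader->thread.fpu.dynamic_state_perm;
}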

Thanks,
Chang

2021-08-18 16:28:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Fri, Jul 30, 2021 at 07:59:43AM -0700, Chang S. Bae wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index d0ce5cfd3ac1..37150b7a8e44 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -277,6 +277,7 @@
> #define X86_FEATURE_XSAVEC (10*32+ 1) /* XSAVEC instruction */
> #define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 instruction */
> #define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS instructions */
> +#define X86_FEATURE_XFD (10*32+ 4) /* eXtended Feature Disabling */
^
|

Add "" at the marker above - it doesn't look like we wanna show "xfd" in
/proc/cpuinfo.

> * Extended auxiliary flags: Linux defined - for features scattered in various
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 263e349ff85a..e3590cf55325 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -535,14 +535,55 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
> * Misc helper functions:
> */
>
> +/* The Extended Feature Disable (XFD) helpers: */
> +
> +static inline void xfd_write(u64 value)
> +{
> + wrmsrl_safe(MSR_IA32_XFD, value);
> +}
> +
> +static inline u64 xfd_read(void)
> +{
> + u64 value;
> +
> + rdmsrl_safe(MSR_IA32_XFD, &value);
> + return value;
> +}

Those look useless. Imagine we had to add wrappers around *every* MSR we
touch in the kernel...

> +
> +static inline u64 xfd_capable(void)
> +{
> + return xfeatures_mask_user_dynamic;
> +}

A small helper which returns an u64 but is used in boolean context?

Also, this name is wrong: XFD capable is a system which has
X86_FEATURE_XFD set. You should simply use xfeatures_mask_user_dynamic
everywhere since it is __ro_after_init.

> +/**
> + * xfd_switch - Switches the MSR IA32_XFD context if needed.
> + * @prev: The previous task's struct fpu pointer
> + * @next: The next task's struct fpu pointer
> + */
> +static inline void xfd_switch(struct fpu *prev, struct fpu *next)
> +{
> + u64 prev_xfd_mask, next_xfd_mask;
> +
> + if (!static_cpu_has(X86_FEATURE_XFD) || !xfd_capable())

cpu_feature_enabled(). Use that everywhere in your patchset. But you
know already...

> + return;
> +
> + prev_xfd_mask = prev->state_mask & xfd_capable();
> + next_xfd_mask = next->state_mask & xfd_capable();

This is just plain misleading:

You're *AND*ing a mask with xfd_capable?!?

Just use xfeatures_mask_user_dynamic directly instead, as already
mentioned.

> + if (unlikely(prev_xfd_mask != next_xfd_mask))
> + xfd_write(xfd_capable() ^ next_xfd_mask);
> +}

Here too.

Also, I must be missing something. Let's play with some imaginary masks:

prev->state_mask = 110b
next->state_mask = 111b
dyn = 101b

("dyn" is short for xfeatures_mask_user_dynamic)

prev_xfd_mask = 100b
next_xfd_mask = 101b

if (unlikely(100b != 101b))
xfd_write(101b ^ 101b) == xfd_write(0)

so next has bits 2 and 0 set but the xfd write zaps them so next won't
get any more #NMs for those states.

Why?

> /*
> * Delay loading of the complete FPU state until the return to userland.
> * PKRU is handled separately.
> */
> -static inline void switch_fpu_finish(struct fpu *new_fpu)
> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
> {
> - if (cpu_feature_enabled(X86_FEATURE_FPU))
> + if (cpu_feature_enabled(X86_FEATURE_FPU)) {
> set_thread_flag(TIF_NEED_FPU_LOAD);
> + xfd_switch(old_fpu, new_fpu);
> + }
> }
>
> #endif /* _ASM_X86_FPU_INTERNAL_H */
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index a7c413432b33..eac0cfd9210b 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -626,6 +626,8 @@
> #define MSR_IA32_BNDCFGS_RSVD 0x00000ffc
>
> #define MSR_IA32_XSS 0x00000da0
> +#define MSR_IA32_XFD 0x000001c4
> +#define MSR_IA32_XFD_ERR 0x000001c5

At least try to keep those numerically sorted, at least among the
architectural MSR_IA32_ ones. That is, provided those XFD things are
architectural...

> #define MSR_IA32_APICBASE 0x0000001b
> #define MSR_IA32_APICBASE_BSP (1<<8)
> diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
> index defda61f372d..7f891d2eb52e 100644
> --- a/arch/x86/kernel/cpu/cpuid-deps.c
> +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> @@ -75,6 +75,7 @@ static const struct cpuid_dep cpuid_deps[] = {
> { X86_FEATURE_SGX_LC, X86_FEATURE_SGX },
> { X86_FEATURE_SGX1, X86_FEATURE_SGX },
> { X86_FEATURE_SGX2, X86_FEATURE_SGX1 },
> + { X86_FEATURE_XFD, X86_FEATURE_XSAVE },
> {}
> };
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 3b56e7612c45..c6ff0575d87d 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -182,6 +182,26 @@ static bool xfeature_is_supervisor(int xfeature_nr)
> return ecx & 1;
> }
>
> +/**
> + * xfd_supported - Check if the feature supports Extended Feature Disable (XFD).
> + * @feature_nr: The feature number.
> + *
> + * Returns: True if supported; otherwise, false.
> + */
> +static bool xfd_supported(int feature_nr)

xfeature_supports_xfd()

> +{
> + u32 eax, ebx, ecx, edx;
> +
> + if (!boot_cpu_has(X86_FEATURE_XFD))
> + return false;
> +
> + /*
> + * If state component 'i' supports it, ECX[2] return 1; otherwise, 0.
> + */
> + cpuid_count(XSTATE_CPUID, feature_nr, &eax, &ebx, &ecx, &edx);
> + return ecx & 4;
> +}
> +
> /**
> * get_xstate_comp_offset - Find the feature's offset in the compacted
> * format.
> @@ -274,6 +294,9 @@ void fpu__init_cpu_xstate(void)
> wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
> xfeatures_mask_independent());
> }
> +
> + if (boot_cpu_has(X86_FEATURE_XFD))
> + xfd_write(xfd_capable());
> }
>
> static bool xfeature_enabled(enum xfeature xfeature)
> @@ -473,8 +496,9 @@ static void __init print_xstate_offset_size(void)
> for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
> if (!xfeature_enabled(i))
> continue;
> - pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
> - i, xstate_comp_offsets[i], i, xstate_sizes[i]);
> + pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d (%s)\n",
> + i, xstate_comp_offsets[i], i, xstate_sizes[i],
> + (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "dynamic" : "default");

Make that

(xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "(dynamic)" : "");

> @@ -920,6 +944,16 @@ void __init fpu__init_system_xstate(void)
> /* Do not support the dynamically allocated buffer yet. */
> xfeatures_mask_user_dynamic = 0;
>
> + for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
> + u64 feature_mask = BIT_ULL(i);
> +
> + if (!(xfeatures_mask_uabi() & feature_mask))
> + continue;
> +
> + if (xfd_supported(i))
> + xfeatures_mask_user_dynamic |= feature_mask;
> + }
> +
> /* Enable xstate instructions to be able to continue with initialization: */
> fpu__init_cpu_xstate();
> err = init_xstate_size();
> @@ -981,6 +1015,12 @@ void fpu__resume_cpu(void)
> wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
> xfeatures_mask_independent());
> }
> +
> + if (boot_cpu_has(X86_FEATURE_XFD)) {
> + u64 fpu_xfd_mask = current->thread.fpu.state_mask & xfd_capable();
> +
> + xfd_write(xfd_capable() ^ fpu_xfd_mask);
> + }
> }
>
> /**
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 534b9fb7e7ee..b85fa499f195 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -97,6 +97,12 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
> *size = get_xstate_config(XSTATE_MIN_SIZE);
> }
>
> +void arch_release_task_struct(struct task_struct *task)
> +{
> + if (cpu_feature_enabled(X86_FEATURE_FPU))
> + free_xstate_buffer(&task->thread.fpu);
> +}
> +
> /*
> * Free thread data structures etc..
> */
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 4f2f54e1281c..7bd5d08eeb41 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>
> this_cpu_write(current_task, next_p);
>
> - switch_fpu_finish(next_fpu);
> + switch_fpu_finish(prev_fpu, next_fpu);
>
> /* Load the Intel cache allocation PQR MSR. */
> resctrl_sched_in();
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index ec0d836a13b1..41c9855158d6 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -620,7 +620,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> this_cpu_write(current_task, next_p);
> this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
>
> - switch_fpu_finish(next_fpu);
> + switch_fpu_finish(prev_fpu, next_fpu);
>
> /* Reload sp0. */
> update_task_stack(next_p);
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..dd66d528afd8 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -1112,6 +1112,45 @@ DEFINE_IDTENTRY(exc_device_not_available)
> {
> unsigned long cr0 = read_cr0();
>
> + if (boot_cpu_has(X86_FEATURE_XFD)) {

This whole thing wants to be in a separate function. Even the
indentation level is begging for it.

> + u64 xfd_err;
> +
> + rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
> + wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
> +
> + if (xfd_err) {
> + u64 xfd_event = xfd_err & xfd_capable();
> +
> + if (WARN_ON(!xfd_event)) {
> + /*
> + * Unexpected event is raised. But update XFD state to
> + * unblock the task.
> + */
> + xfd_write(xfd_read() & ~xfd_err);

So AFAIU, xfd_err points to some other feature which caused this
exception.

So if that feature bit is set in XFD, you're clearing it here. Why?

So that it doesn't raise that #NM for it anymore?

This looks weird.

> + } else {
> + struct fpu *fpu = &current->thread.fpu;
> + int err = -1;
> +
> + /*
> + * Make sure not in interrupt context as handling a
> + * trap from userspace.
> + */
> + if (!WARN_ON(in_interrupt())) {

I'm guessing that's supposed to stop people from using AMX and other
dynamic states in the kernel?

> + err = alloc_xstate_buffer(fpu, xfd_event);
> + if (!err)
> + xfd_write((fpu->state_mask & xfd_capable()) ^
> + xfd_capable());
> + }
> +
> + /* Raise a signal when it failed to handle. */
> + if (err)
> + force_sig_fault(SIGILL, ILL_ILLOPC,
> + error_get_trap_addr(regs));

Where is it documented that that configuration of SIG* types means,
failure to allocate the dynamic buffer?

To the general picture: why is this thing even allocating a buffer in #NM?

Why isn't the buffer pre-allocated for the process after latter having
done prctl() so that when an #NM happens, no allocation happens at all?

And with those buffers preallocated, all that XFD muck is not needed
either.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-18 17:24:17

by Thiago Macieira

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wednesday, 18 August 2021 09:24:51 PDT Borislav Petkov wrote:
> > +#define X86_FEATURE_XFD (10*32+ 4) /* eXtended
> > Feature Disabling */
>
> Add "" at the marker above - it doesn't look like we wanna show "xfd" in
> /proc/cpuinfo.

Why not?

It could help diagnosing why this code has a failure if XFD is somehow
missing. That can happen with hypervisors or future CPUs.

> > + /* Raise a signal when it failed to handle. */
> > + if (err)
> > + force_sig_fault(SIGILL, ILL_ILLOPC,
> > + error_get_trap_addr(regs));
>
> Where is it documented that that configuration of SIG* types means,
> failure to allocate the dynamic buffer?

This wasn't part of the memory failure, but now that you've pointed it out, yes,
we are getting a SIGILL in case the kernel failed to allocate memory too.

This is the same code path we get if the task executes an AMX instruction
without first requesting support for it via the system call. At my request,
Chang changed it from SIGSEGV to SIGILL, because that's the behaviour one
would see if the kernel did not support AMX at all, hadn't enabled it in XCR0
or the CPU didn't support the instructions.

I don't know how to best handle killing the application if the kernel is OOM
(see below, though). Maybe it should SIGKILL instead. The problem with sending
a SIGSEGV is debuggability: if I get a core dump of this crash, which is
likely going to happen in a load instruction, I'll spend a lot of time trying to
understand why the pointer in that instruction wasn't correct. Very few people
will ever consider that it may have another reason.

> To the general picture: why is this thing even allocating a buffer in #NM?
>
> Why isn't the buffer pre-allocated for the process after latter having
> done prctl() so that when an #NM happens, no allocation happens at all?

That's a good question, but I thought it had been discussed and agreed that we
didn't want to extend the buffers at the moment the application requested the
bits, because it may never need them. This was probably a defence against
applications requesting all bits without knowing whether they'll need them at
all.

The way the API to userspace is implemented, the only way to find out if the
kernel supports a given state is to enable it. It's not currently possible to
ask "do you support AMX tile data?" and then go about the application's merry
way until it determines it really wants to do matrix multiplications. In the
case of applications with plugins, they need to have that answer before they
load the plugin, which usually happens at application start.

I was going to suggest a new API to return the supported bits, but hadn't yet
because it wasn't required for this patchset to work. So long as that API
landed at or before the time a new bit was added, userspace would be able to
cope. But if the kernel is going to allocate the bits at the moment of the
system call *and* we wish for userspace not to request more than it really
needs, then we'll need this extra API right now.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering



2021-08-18 17:48:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 18, 2021 at 10:20:58AM -0700, Thiago Macieira wrote:
> Why not?
>
> It could help diagnosing why this code has a failure if XFD is somehow
> missing. That can happen with hypervisors or future CPUs.

You're new to this...

tools/arch/x86/kcpuid/kcpuid.c should be used for all CPUID querying
needs.

<snipping the SIGILL question for now because it might become moot>

> That's a good question, but I thought it had been discussed and agreed that we
> didn't want to extend the buffers at the moment the application requested the
> bits, because it may never need them.

Huh? An application doing prctl(GIVE_ME_AMX) and then it might never
need it? That's only that application's fault then.

> This was probably a defence against applications requesting all bits
> without knowing whether they'll need them at all.

That sounds like a badly programmed application.

> The way the API to userspace is implemented, the only way to find
> out if the kernel supports a given state is to enable it. It's not
> currently possible to ask "do you support AMX tile data?"

Then our API needs improving. An app should be able to ask the kernel
"Do you support AMX?" get a proper answer and act accordingly.

> and then go about the application's merry way until it determines it
> really wants to do matrix multiplications. In the case of applications
> with plugins, they need to have that answer before the load the
> plugin, which usually happens at application start.

I don't see a problem with the app doing at load time:

A: Heey, kernel, do you support AMX?
K: Yes
A: Allocate a dynamic FPU buffer for me then pls.

> I was going to suggest a new API to return the supported bits, but
> hadn't yet because it wasn't required for this patchset to work.

I think you should. The important part is having the API good and
complete.

> So long as that API landed at or before the time a new bit was added,
> userspace would be able to cope. But if the kernel is going to
> allocate the bits at the moment of the system call *and* we wish for
> userspace not to request more than it really needs, then we'll need
> this extra API right now.

No no, once the API hits upstream, it is cast in stone. So it better
be done in full with the patchset, in one go. No later significant API
additions or changes, none especially after apps start using it.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-18 18:04:20

by Thiago Macieira

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wednesday, 18 August 2021 10:46:09 PDT Borislav Petkov wrote:
> You're new to this...
>
> tools/arch/x86/kcpuid/kcpuid.c should be used for all CPUID querying
> needs.

That tells me what the CPU supports, not what the kernel does. By omitting the
"xfd" entry in /proc/cpuinfo, we are assuming that all kernels with "amxtile"
also implicitly support xfd. That is a valid assumption.

> I don't see a problem with the app doing at load time:
>
> A: Heey, kernel, do you support AMX?
> K: Yes
> A: Allocate a dynamic FPU buffer for me then pls.

Many applications need to determine which plugins and code paths to enable
before getting the data that will tell them what to do. It's entirely possible
for them to never need to run the AMX instructions, so they may wish to defer
the request to allocate the XSAVE state until they have read their input data.

It's indeed possible that the allocation then fails and the application be
unable to continue. But OOM conditions are unlikely, so it may be an
acceptable price to pay. In fact, by *not* allocating the extra state for
every thread in the current process, it may avoid the OOM.

> > I was going to suggest a new API to return the supported bits, but
> > hadn't yet because it wasn't required for this patchset to work.
>
> I think you should. The important part is having the API good and
> complete.
>
> > So long as that API landed at or before the time a new bit was added,
> > userspace would be able to cope. But if the kernel is going to
> > allocate the bits at the moment of the system call *and* we wish for
> > userspace not to request more than it really needs, then we'll need
> > this extra API right now.
>
> No no, once the API hits upstream, it is cast in stone. So it better
> be done in full with the patchset, in one go. No later significant API
> additions or changes, none especially after apps start using it.

Sorry, that's not what I meant. I was going to request an extra API, a third
call. We'd have:
- get current state
- set new state
- get available bits to set

The first two are in Chang's patch set, the third one is not. Right now,
there's a single bit that can be set, so there's no need to have the third
one. Any future software that wants to request a new bit will know if the
kernel supports it by the very presence of the API. That is, if they ask and
the API fails with -EINVAL, then this new bit isn't supported.

I didn't make the request because, as I said, it didn't seem required.
Therefore, I didn't want to add further work before the minimum functionality
got merged.

Now, if we are going to have this API any way, it might be a good idea to
combine the two getters in one by adding a second pointer parameter.
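For concreteness, a hedged userspace sketch of how the three operations might
compose. ARCH_SET_STATE_ENABLE and ARCH_GET_STATE_ENABLE are the names from
this series; ARCH_GET_STATE_SUPPORT, the numeric values, and the calling
convention shown (mask by value for SET, by pointer for the getters) are all
assumptions made purely for illustration:

#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_STATE_ENABLE	0x1021	/* illustrative values only */
#define ARCH_GET_STATE_ENABLE	0x1022
#define ARCH_GET_STATE_SUPPORT	0x1023	/* the proposed third call */

#define XFEATURE_MASK_XTILEDATA	(1ULL << 18)

static long xstate_prctl(int op, unsigned long arg)
{
	return syscall(SYS_arch_prctl, op, arg);
}

int main(void)
{
	uint64_t supported = 0, enabled = 0;

	/* 1) ask which dynamic states may be requested at all */
	if (xstate_prctl(ARCH_GET_STATE_SUPPORT, (unsigned long)&supported))
		return 1;

	/* 2) request AMX tile data only if the kernel offers it */
	if (supported & XFEATURE_MASK_XTILEDATA)
		xstate_prctl(ARCH_SET_STATE_ENABLE, XFEATURE_MASK_XTILEDATA);

	/* 3) read back what is now permitted */
	xstate_prctl(ARCH_GET_STATE_ENABLE, (unsigned long)&enabled);
	printf("enabled: %#llx\n", (unsigned long long)enabled);
	return 0;
}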

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering



2021-08-18 18:13:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 18, 2021 at 10:58:42AM -0700, Thiago Macieira wrote:
> That tells me what the CPU supports, not what the kernel does. By
> omitting the "xfd" entry in /proc/cpuinfo, we are assuming that all
> kernels with "amxtile" also implicitly support xfd. That is a valid
> assumption.

What relevance does the fact have for userspace whether the kernel
supports XFD or not?

IOW, userspace cares about AMX and the other features which are supposed
to use XFD - not how those features are implemented: whether with
faulting or with pre-allocation or whatever.

> Many applications need to determine which plugins and code paths to
> enable before getting the data that will tell them what to do. It's
> entirely possible for them to never need to run the AMX instructions,
> so they may wish to defer the request to allocate the XSAVE state
> until they have read their input data.
>
> It's indeed possible that the allocation then fails and the
> application be unable to continue. But OOM conditions are unlikely, so
> it may be an acceptable price to pay. In fact, by *not* allocating the
> extra state for every thread in the current process, it may avoid the
> OOM.

And?

That doesn't conflict with my suggestion. It goes and asks the kernel
what it supports and then requests the buffers.

> Sorry, that's not what I meant. I was going to request an extra API, a third
> call. We'd have:
> - get current state
> - set new state
> - get available bits to set

Yes, this should have been the API from the very beginning. Of course
you need to be able to query what bits can be set at all.

> ...
> Now, if we are going to have this API any way, it might be a good
> idea to combine the two getters in one by adding a second pointer
> parameter.

Yeah, I'll get to that patch in the coming days and have a look. So far,
it only makes sense to have a querying API too so that we can provide
support for more "fat" features.

Unless Intel folks decide to stop using XSAVE for that - it was a bad
idea in the first place anyway, TBH - but it's not like hw people listen
to sw folk so...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-18 19:49:20

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

[ Cut out the API comments and other obvious ones here. ]

On Aug 18, 2021, at 09:24, Borislav Petkov <[email protected]> wrote:
> On Fri, Jul 30, 2021 at 07:59:43AM -0700, Chang S. Bae wrote:
>
>> +/**
>> + * xfd_switch - Switches the MSR IA32_XFD context if needed.
>> + * @prev: The previous task's struct fpu pointer
>> + * @next: The next task's struct fpu pointer
>> + */
>> +static inline void xfd_switch(struct fpu *prev, struct fpu *next)
>> +{
>> + u64 prev_xfd_mask, next_xfd_mask;
>> +
>> + if (!static_cpu_has(X86_FEATURE_XFD) || !xfd_capable())
>
> cpu_feature_enabled(). Use that everywhere in your patchset. But you
> know already...

Yes, I did on my local.

>> + return;
>> +
>> + prev_xfd_mask = prev->state_mask & xfd_capable();
>> + next_xfd_mask = next->state_mask & xfd_capable();
>
> This is just plain misleading:
>
> You're *AND*ing a mask with xfd_capable?!?
>
> Just use xfeatures_mask_user_dynamic directly instead, as already
> mentioned.

Okay.

>> + if (unlikely(prev_xfd_mask != next_xfd_mask))
>> + xfd_write(xfd_capable() ^ next_xfd_mask);
>> +}
>
> Here too.
>
> Also, I must be missing something. Let's play with some imaginary masks:
>
> prev->state_mask = 110b
> next->state_mask = 111b
> dyn = 101b
>
> ("dyn" is short for xfeatures_mask_user_dynamic)
>
> prev_xfd_mask = 100b
> next_xfd_mask = 101b
>
> if (unlikely(100b != 101b))
> xfd_write(101b ^ 101b) == xfd_write(0)
>
> so next has bits 2 and 0 set but the xfd write zaps them so next won't
> get any more #NMs for those states.
>
> Why?

Because the next task has already fully expanded its buffer -- its state_mask
covers all of xfeatures_mask_user_dynamic.

No more XFD event is needed for the task.
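Reusing Boris's example masks, a tiny standalone sketch of the arithmetic (all
names are local to this illustration):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t dyn  = 0x5;	/* xfeatures_mask_user_dynamic: 101b */
	uint64_t next = 0x7;	/* next->state_mask: 111b, buffer fully expanded */

	/* xfd_switch() arms XFD only for dynamic states not yet allocated */
	uint64_t xfd = dyn ^ (next & dyn);

	printf("XFD = %#llx\n", (unsigned long long)xfd);	/* prints 0 */
	return 0;
}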

>>
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index a58800973aed..dd66d528afd8 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -1112,6 +1112,45 @@ DEFINE_IDTENTRY(exc_device_not_available)
>> {
>> unsigned long cr0 = read_cr0();
>>
>> + if (boot_cpu_has(X86_FEATURE_XFD)) {
>
> This whole thing wants to be in a separate function. Even the
> indentation level is begging for it.

Ah, it was in a separate function until v4. After being trimmed down quite a
bit in v5, it has grown back from there.

Let me fix this.

>> + u64 xfd_err;
>> +
>> + rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
>> + wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
>> +
>> + if (xfd_err) {
>> + u64 xfd_event = xfd_err & xfd_capable();
>> +
>> + if (WARN_ON(!xfd_event)) {
>> + /*
>> + * Unexpected event is raised. But update XFD state to
>> + * unblock the task.
>> + */
>> + xfd_write(xfd_read() & ~xfd_err);
>
> So AFAIU, xfd_err points to some other feature which caused this
> exception.
>
> So if that feature bit is set in XFD, you're clearing it here. Why?
>
> So that it doesn't raise that #NM for it anymore?
>
> This looks weird.

If this ever happens, something might be wrong with the hardware.

If the bit is not reset, the task will get stuck with repeatedly unhandled
#NMs, which I think is even worse.

>> + } else {
>> + struct fpu *fpu = &current->thread.fpu;
>> + int err = -1;
>> +
>> + /*
>> + * Make sure not in interrupt context as handling a
>> + * trap from userspace.
>> + */
>> + if (!WARN_ON(in_interrupt())) {
>
> I'm guessing that's supposed to stop people from using AMX and other
> dynamic states in the kernel?

But the kernel can handle this without XFD?

Thanks,
Chang


2021-08-18 20:45:50

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Aug 18, 2021, at 10:46, Borislav Petkov <[email protected]> wrote:
> On Wed, Aug 18, 2021 at 10:20:58AM -0700, Thiago Macieira wrote
>> The way the API to userspace is implemented, the only way to find
>> out if the kernel supports a given state is to enable it. It's not
>> currently possible to ask "do you support AMX tile data?"
>
> Then our API needs improving. An app should be able to ask the kernel
> "Do you support AMX?" get a proper answer and act accordingly.

Maybe I’m missing something, but I wonder what’s the difference from reading
XCR0.

Thanks,
Chang

2021-08-18 21:06:22

by Thiago Macieira

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wednesday, 18 August 2021 13:43:50 PDT Bae, Chang Seok wrote:
> > Then our API needs improving. An app should be able to ask the kernel
> > "Do you support AMX?" get a proper answer and act accordingly.
>
> Maybe I’m missing something, but I wonder what’s the difference from
> reading XCR0.

That assumes the kernel will always enable the bits in XCR0, like it is doing
today and with your patch, because modifying it is a VM exit.

But it's not the only possible solution. A future kernel could decide to leave
some bits off and only enable upon request. That's how macOS/Darwin does its
AVX512 support.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering



2021-08-18 21:18:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 18, 2021 at 08:43:50PM +0000, Bae, Chang Seok wrote:
> Maybe I’m missing something, but I wonder what’s the difference
> from reading XCR0.

Why, because adding another prctl() is too damn hard?

What if this modus operandi -- features userspace can use with kernel
assistance, but which need an explicit request and are off otherwise -- gets
extended beyond XSAVE-managed features?

You would wish then that you had defined a

prctl(GET_FEATURES_WITH_KERNEL_ASSISTANCE);

at the time...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-18 21:18:33

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Aug 18, 2021, at 14:04, Thiago Macieira <[email protected]> wrote:
> But it's not the only possible solution. A future kernel could decide to leave
> some bits off and only enable upon request. That's how macOS/Darwin does its
> AVX512 support.

Even if XCR0 is ever switched, doesn’t XGETBV(0) return it for the *current*
task?

Thanks,
Chang

2021-08-18 21:39:42

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Aug 18, 2021, at 14:17, Borislav Petkov <[email protected]> wrote:
> On Wed, Aug 18, 2021 at 08:43:50PM +0000, Bae, Chang Seok wrote:
>> Maybe I’m missing something, but I wonder what’s the difference
>> from reading XCR0.
>
> Why, because adding another prctl() is too damn hard?

Well, IIUC, it would merely be XGETBV(0) done in the kernel instead of from userspace.

> What if this modus operandi of features userspace can use with kernel
> assistance but need an explicit request and are off otherwise, gets
> extended beyond XSAVE-managed features?

What if it never happens? It will be just the same as XGETBV(0). I think on
the flip side there is also a benefit of maintaining as simple an API as possible.

Thanks,
Chang

2021-08-18 22:29:29

by Thiago Macieira

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wednesday, 18 August 2021 14:12:06 PDT Bae, Chang Seok wrote:
> On Aug 18, 2021, at 14:04, Thiago Macieira <[email protected]>
> wrote:
> > But it's not the only possible solution. A future kernel could decide to
> > leave some bits off and only enable upon request. That's how
> > macOS/Darwin does its AVX512 support.
>
>
> Even if XCR0 is ever switched, doesn’t XGETBV(0) return it for the
> *current* task?

That's the point. If the kernel decides that feature bit 19 will be left off
in XCR0, how shall userspace know the kernel supports the feature through the
arch_prctl syscall you added?

Not that I am advising we adopt this strategy. We don't need more
fragmentation on how we enable the features. But having this syscall gives us
flexibility in case we do need it in the future.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering



2021-08-19 01:23:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state



On Wed, Aug 18, 2021, at 2:04 PM, Thiago Macieira wrote:
> On Wednesday, 18 August 2021 13:43:50 PDT Bae, Chang Seok wrote:
> > > Then our API needs improving. An app should be able to ask the kernel
> > > "Do you support AMX?" get a proper answer and act accordingly.
> >
> > Maybe I’m missing something, but I wonder what’s the difference from
> > reading XCR0.
>
> That assumes the kernel will always enable the bits in XCR0, like it is doing
> today and with your patch, because modifying it is a VM exit.
>
> But it's not the only possible solution. A future kernel could decide to leave
> some bits off and only enable upon request. That's how macOS/Darwin does its
> AVX512 support.

The fact that Darwin does this strongly suggests that real programs can handle it, which increases my inclination for Linux to do the same thing.

>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
> Software Architect - Intel DPG Cloud Engineering
>
>
>
>

2021-08-19 08:05:51

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 18, 2021 at 09:37:58PM +0000, Bae, Chang Seok wrote:
> What if it never happens? It will be just the same as XGETBV(0). I
> think on the flip side there is also a benefit of maintaining as simple an
> API as possible.

Dude, why are you still pointlessly harping on this?

How is adding another trivial prctl, which will simply be forwarding XCR0 for
now, making the API more complex?

If you don't wanna do it just say so - someone else will.

Geez.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-19 15:29:53

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Aug 19, 2021, at 01:00, Borislav Petkov <[email protected]> wrote:
> If you don't wanna do it just say so - someone else will.

Okay, looks like you’re so sure about it.

Thanks,
Chang

2021-08-19 16:10:11

by Thiago Macieira

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wednesday, 18 August 2021 18:21:11 PDT Andy Lutomirski wrote:
> But it's not the only possible solution. A future kernel could decide to
> leave
> > some bits off and only enable upon request. That's how macOS/Darwin does
> > its AVX512 support.
>
> The fact that Darwin does this strongly suggests that real programs can
> handle it, which increases my inclination for Linux to do the same thing.

Yes and no... yes, programs could be made to handle this. I've reached out to
the Intel team responsible for the manual's instructions on how to detect
AVX512 and AMX, so that the content gets extended to say there's an OS-specific
part that software developers need to be aware of. But until then, it's not
very discoverable. As a result, there's plenty of software that could enable
AVX512 on the Xeon-based Mac Pros but never does, because the developers didn't
know that there was more to it than what the manual said. But the worst case
that can happen there is that the software gracefully falls back to AVX2 or an
earlier instruction set (unlike the Linux solution).

No, because XSETBV causes a VM exit, so we don't want to execute that on a
context switch, for performance reasons. That's probably not been a concern
for Apple developers, but is for Linux.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering



2021-08-24 22:25:35

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 18, 2021 at 12:24 PM Borislav Petkov <[email protected]> wrote:

> > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > index a7c413432b33..eac0cfd9210b 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -626,6 +626,8 @@
> > #define MSR_IA32_BNDCFGS_RSVD 0x00000ffc
> >
> > #define MSR_IA32_XSS 0x00000da0
> > +#define MSR_IA32_XFD 0x000001c4
> > +#define MSR_IA32_XFD_ERR 0x000001c5
>
> At least try to keep those numerically sorted, at least among the
> architectural MSR_IA32_ ones.

agreed

> That is, provided those XFD things are architectural...

Yes.
MSR_IA32_XFD and MSR_IA32_XFD_ERR are architectural.

(which is why they follow the convention of having an "IA32" in their name)

https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Len Brown, Intel Open Source Technology Center

2021-08-24 22:56:49

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 18, 2021 at 2:09 PM Borislav Petkov <[email protected]> wrote:

> What relevance does the fact have for userspace whether the kernel
> supports XFD or not?

Right.
If user space needs to know that XFD exists, then we have done
something very wrong.

--
Len Brown, Intel Open Source Technology Center

2021-08-24 23:19:44

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 18, 2021 at 12:24 PM Borislav Petkov <[email protected]> wrote:

> Why isn't the buffer pre-allocated for the process after latter having
> done prctl() so that when an #NM happens, no allocation happens at all?

The problem with a system call to pre-allocate an AMX context switch buffer
is that it doesn't actually deliver on the goal of guaranteeing no subsequent
run-time failures due to OOM.

Even if your AMX thread pool threads were to invoke this system call
as soon as possible...
What is to say that the thread pool is created only at a time when memory
is available? A thread could be created 24 hours into program execution
under OOM conditions and this system call will return ENOMEM, and your program
will in all likelihood throw up its hands and exit at the exact same place
it would exit for transparently allocated buffers.

What if you don't care about 24-hours in, and you do care about
allocating at program start?
The program can equally cause the kernel to allocate an AMX context switch
buffer by simply touching a TMM register -- and this can be done at exactly
the same place in the program as calling a pre-allocation system call.
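As an illustration of "touching a TMM register", a userspace sketch (GCC/Clang
with -mamx-tile; it assumes the permission arch_prctl from this series has
already succeeded, otherwise the tile instruction raises SIGILL):

#include <immintrin.h>
#include <string.h>

static void amx_warm_up(void)
{
	/* Minimal 64-byte tile configuration: one 16-row x 64-byte tile. */
	struct {
		unsigned char  palette_id;
		unsigned char  start_row;
		unsigned char  reserved[14];
		unsigned short colsb[16];
		unsigned char  rows[16];
	} __attribute__((packed, aligned(64))) cfg;

	memset(&cfg, 0, sizeof(cfg));
	cfg.palette_id = 1;
	cfg.colsb[0] = 64;
	cfg.rows[0] = 16;

	_tile_loadconfig(&cfg);	/* XTILECFG leaves its init state */
	_tile_zero(0);		/* first XTILEDATA touch: #NM, kernel allocates */
	_tile_release();	/* back to init; the buffer stays allocated */
}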

The difference between these two methods is that the system call returns a
synchronous ENOMEM, while touching a TMM register sends you a signal at
that location.
In theory, a program may have a thoughtfully implemented and thoroughly tested
*else* clause for ENOMEM -- but you and I know that is a fantasy --
they will exit anyway.

The advantage of the #NM over the syscall is that the programmer doesn't
actually have to do anything. Also, transparently allocated buffers offer
a theoretical benefit that a program may have many threads, but only a few
may actually touch AMX, and so there is savings to be had by allocating buffers
only for the threads that actually use the buffers.

Finally, the XFD/NM mechanism opens the door in the future for the
kernel to actually
harvest allocated, but unused buffers -- though we didn't bother
implementing that for AMX.

> And with those buffers preallocated, all that XFD muck is not needed either.

Independent of context switch buffer allocation...

XFD is used to *enforce* that AMX is not used without permission.
Were we to not use the XFD feature, users would be able to stash
data in TMM registers and even use TMUL without the kernel
being able to prevent them from doing so. Then when they
context switch or take a signal, the data in their TMM registers
would mysteriously vanish...

Much better to be able to tell them immediately that they are doing it wrong...

--
Len Brown, Intel Open Source Technology Center

2021-08-24 23:24:09

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 18, 2021 at 5:16 PM Borislav Petkov <[email protected]> wrote:
>
> On Wed, Aug 18, 2021 at 08:43:50PM +0000, Bae, Chang Seok wrote:
> > Maybe I’m missing something, but I wonder what’s the difference
> > from reading XCR0.
>
> Wny, because adding another prctl() is too damn hard?

Adding complexity is easy. Removing it is the hard part. ;-)

Programmers today know what CPUID and xgetbv(XCR0) mean:
1. feature exists in the HW
2. OS has ability to handle state

This is true for all features.

We are forced to complicate their life for AMX (and subsequent features)
because of the legacy Linux signal ABI.
We require that new apps invoke a system call to tell us that they are not,
in fact, a legacy program, but a program that understands that, if it uses an
alt-sig-stack, that stack must be big enough to handle whatever the current
hardware requires.

The secondary motivation for the system call is the desire to give the
kernel a hook
so that it can refuse to give permission for some apps to use AMX,
should the need arise.

Programmers don't like this, but nobody has figured out a more
programmer-friendly way to meet these requirements.
And so if they want to use AMX on Linux, they *must* use this new SET syscall.
Since Linux enforces that they use it, they will use it if they want
AMX (or subsequent features).
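For concreteness, here is a hedged sketch of the sequence an application would
follow: probe the hardware and OS support the usual way, then make the new
permission request. The CPUID and XCR0 bits are architectural; the arch_prctl
command number and feature-bit value below are only illustrative of the ABI
discussed in this thread and may not match the final names:

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <cpuid.h>
#include <immintrin.h>		/* _xgetbv(); build with -mxsave */

#define ARCH_REQ_XCOMP_PERM	0x1023	/* illustrative command number */
#define XFEATURE_XTILEDATA	18	/* XCR0/XSTATE bit for TMM data */

static int request_amx(void)
{
	unsigned int eax, ebx, ecx, edx;
	uint64_t xcr0;

	/* CPUID.(EAX=7,ECX=0):EDX[24] == AMX-TILE: feature exists in HW */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
	    !(edx & (1u << 24)))
		return -1;

	/* XGETBV(XCR0): bits 17 (XTILECFG) and 18 (XTILEDATA) == OS support */
	xcr0 = _xgetbv(0);
	if ((xcr0 & (3ULL << 17)) != (3ULL << 17))
		return -1;

	/* The new per-process permission request discussed in this thread */
	if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
		return -1;	/* kernel refused, or e.g. ENOMEM */

	return 0;
}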

> prctl(GET_FEATURES_WITH_KERNEL_ASSISTANCE);

The problem is that it adds zero value over the currently used xgetbv(XCR0).
As it adds no value, programmers will not use it.

Sure, if the hardware is re-designed, and Linux is re-designed, and XCR0
can then change at run-time during the lifetime of a program, we have
additional challenges (such as legacy code that doesn't expect XCR0 to
change at run-time).
I don't think that this additional system call even begins to address that
theoretical new world.

But this discussion is moot. If it has no use, it will not get used.
--
Len Brown, Intel Open Source Technology Center

2021-08-30 17:33:07

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Tue, Aug 24, 2021 at 07:22:18PM -0400, Len Brown wrote:
> We are forced to complicate their life for AMX (and subsequent features)
> because of the legacy Linux signal ABI.

No, we need to design this interface properly because you folks went and
put this AMX thing in xstates. Where it doesn't belong at all.

> We require that new apps invoke a system call to tell us that they
> are not indeed a legacy program, but that they are a program that
> understands if they use an alt-sig-stack that it must be big enough to
> handle whatever current hardware requires.

Yes, because of the reason I gave above. If that additional 8K of fat weren't
an xstate, we wouldn't be having this conversation.

> The secondary motivation for the system call is the desire to give the
> kernel a hook so that it can refuse to give permission for some apps
> to use AMX, should the need arise.

Yes.

> > prctl(GET_FEATURES_WITH_KERNEL_ASSISTANCE);
>
> The problem is that it adds zero value over the currently used xgetbv(XCR0).
> As it adds no value, programmers will not use it.

Bullsh*t.

First of all, it is a new interface we're introducing, and if it is
there from the get-go, along with examples of how to use it and proper
documentation, people will use it.

Secondly, from a previous email of mine: "What if this modus operandi of
features userspace can use with kernel assistance but need an explicit
request and are off otherwise, gets extended beyond XSAVE-managed
features?"

In that case you can xgetbv() all you want but the new fat feature is
not even in XCR0. So *then* you *have* to introduce a new prctl() to
query supported features. And right then and there you wish you would've
done that from the very beginning!

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-30 17:43:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Tue, Aug 24, 2021 at 06:21:23PM -0400, Len Brown wrote:
> MSR_IA32_XFD and MSR_IA32_XFD_ERR are architectural.
>
> (which is why they follow the convention of having an "IA32" in their name)

Where is that official statement I can refer to that says that MSRs with
"IA32" in the name are architectural?

Perhaps that section of the SDM:

"2.1 ARCHITECTURAL MSRS"

?

In any case, those MSRs are not there yet, maybe they need to trickle
from the ISA to the SDM docs at some point first.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-30 17:53:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Tue, Aug 24, 2021 at 07:17:49PM -0400, Len Brown wrote:
> The problem with a system call to pre-allocate an AMX context switch
> buffer is that it doesn't actually deliver on the goal of guaranteeing
> no subsequent run-time failures due to OOM.

You mean you pre-allocate *everything*, i.e., you won't do any more
allocations but then you still might OOM?

Yeah, right.

Maybe from something else but not from your AMX-using process which has
prepared everything already.

> Even if your AMX thread pool threads were to invoke this system call
> as soon as possible... What is to say that the thread pool is created
> only at a time when memory is available? A thread could be created
> 24 hours into program execution under OOM conditions and this system
> call will return ENOMEM, and your program will in all likelihood
> throw up its arms and exit at the exact same place it would exit for
> transparently allocated buffers.

Well, if you preallocate everything you won't have to run for 24 hours,
encounter -ENOMEM and *lose* 24 hours worth of AMX computation. And then
the kernel won't have to do all kinds of crazy dance with dynamically
resizing buffers just because some small percentage of luserspace apps
decided to do AMX stuff.

> The program can equally cause the kernel to allocate an AMX context
> switch buffer by simply touching a TMM register -- and this can
> be done at exactly the same place in the program where a
> pre-allocate system call would go.

If the program touches a TMM register and it hasn't requested AMX
support upfront, it'll get killed.

> The advantage of the #NM over the syscall is that the programmer
> doesn't actually have to do anything. Also, transparently allocated
> buffers offer a theoretical benefit that a program may have many
> threads, but only a few may actually touch AMX, and so there is
> savings to be had by allocating buffers only for the threads that
> actually use the buffers.

The program already asked the kernel whether it can use AMX - it can
allocate the buffers for the threads too.

> XFD is used to *enforce* that AMX is not used without permission.
> Were we to not use the XFD feature, users would be able to stash
> data in TMM registers and even use TMUL without the kernel
> being able to prevent them from doing so. Then when they
> context switch or take a signal, the data in their TMM registers
> would mysteriously vanish...
>
> Much better to be able to tell them immediately that they are doing it
> wrong...

Ok, killing the program in the #NM handler if it hasn't requested AMX
prior makes sense.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-30 18:07:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On 8/24/21 4:17 PM, Len Brown wrote:
> Even if your AMX thread pool threads were to invoke this system call
> as soon as possible...
> What is to say that the thread pool is created only at a time when memory
> is available? A thread could be created 24 hours into program execution
> under OOM conditions and this system call will return ENOMEM, and your program
> will in all likelihood throw up its arms and exit at the exact same place
> it would exit for transparently allocated buffers.

I tried this exact line of reasoning with Thomas: it doesn't matter
where we run out of memory, we still need the same memory and we're
screwed either way.

However, Thomas expressed a clear preference for ABIs which return
memory failures explicitly at syscalls versus implicit failures which
can happen on random instructions.

One might say that the odds of checking for and handling a NULL value
(or ENOMEM) are the same as installing a signal handler. *But*, it's
infinitely easier to unroll state and recover from a NULL than it is to
handle it from within a signal handler. In other words, the explicit
ones *encourage* better programming.

I'd prefer removing the demand-driven allocation at this point.

2021-08-31 21:51:56

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Mon, Aug 30, 2021 at 1:41 PM Borislav Petkov <[email protected]> wrote:
>
> On Tue, Aug 24, 2021 at 06:21:23PM -0400, Len Brown wrote:
> > MSR_IA32_XFD and MSR_IA32_XFD_ERR are architectural.
> >
> > (which is why they follow the convention of having an "IA32" in their name)
>
> Where is that official statement I can refer to that says that MSRs with
> "IA32" in the name are architectural?
>
> Perhaps that section of the SDM:
>
> "2.1 ARCHITECTURAL MSRS"

Yes.

> In any case, those MSRs are not there yet, maybe they need to trickle
> from the ISA to the SDM docs at some point first.

Right.
These new MSRs are already named IA32... even though the info from
the ISA Extensions Manual hasn't yet trickled into the SDM, because
they are defined to be architectural from the get-go.

Len Brown, Intel Open Source Technology Center

2021-08-31 22:09:47

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Mon, Aug 30, 2021 at 1:52 PM Borislav Petkov <[email protected]> wrote:

> Well, if you preallocate everything...

Nothing prevents, say, a pthread_create() or anything
else where the kernel consumes memory on behalf of a process
from failing at run-time... AMX does not add a unique OOM risk here.

> > The advantage of the #NM over the syscall is that the programmer
> > doesn't actually have to do anything. Also, transparently allocated
> > buffers offer a theoretical benefit that a program may have many
> > threads, but only a few may actually touch AMX, and so there is
> > savings to be had by allocating buffers only for the threads that
> > actually use the buffers.
>
> The program already asked the kernel whether it can use AMX - it can
> allocate the buffers for the threads too.

The result is that if one thread in a 1,000-task process requests
and touches AMX, the kernel would allocate 8MB, instead of 8KB
of context switch buffers for that process, no?

Len Brown, Intel Open Source Technology Center

2021-08-31 22:20:34

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Tue, Aug 31, 2021 at 6:15 PM Len Brown <[email protected]> wrote:
>
> On Mon, Aug 30, 2021 at 2:04 PM Dave Hansen <[email protected]> wrote:
> >
> > On 8/24/21 4:17 PM, Len Brown wrote:
> > > Even if your AMX thread pool threads were to invoke this system call
> > > as soon as possible...
> > > What is to say that the thread pool is created only at a time when memory
> > > is available? A thread could be created 24 hours into program execution
> > > under OOM conditions and this system call will return ENOMEM, and your program
> > > will in all likelihood throw up its arms and exit at the exact same place
> > > it would exit for transparently allocated buffers.
> >
> > I tried this exact line of reasoning with Thomas: it doesn't matter
> > where we run out of memory, we still need the same memory and we're
> > screwed either way.
> >
> > However, Thomas expressed a clear preference for ABIs which return
> > memory failures explicitly at syscalls versus implicit failures which
> > can happen on random instructions.
> >
> > One might say that the odds of checking for and handling a NULL value
> > (or ENOMEM) are the same as installing a signal handler. *But*, it's
> > infinitely easier to unroll state and recover from a NULL than it is to
> > handle it from within a signal handler. In other words, the explicit
> > ones *encourage* better programming.
>
> I agree.
> Indeed, I believe that there is universal agreement that a synchronous
> return code
> from a system call is a far superior programming model than decoding
> the location of a failure in a system call. (no, the IP isn't random -- it is

decoding the location of the failure in a *signal handler*

> always the 1st instruction in that thread to touch a TMM register).
>
> > I'd prefer removing the demand-driven allocation at this point.
>
> Adding a pre-allocate system call that can gracefully fail
> (even though it never will) is independent of removing
> demand-driven allocation. I would leave this to application
> developers. Honestly, the kernel shouldn't care.
>
> --
> Len Brown, Intel Open Source Technology Center



--
Len Brown, Intel Open Source Technology Center

2021-08-31 22:48:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On 8/31/21 3:07 PM, Len Brown wrote:
> On Mon, Aug 30, 2021 at 1:52 PM Borislav Petkov <[email protected]> wrote:
>
>> Well, if you preallocate everything...
> Nothing prevents, say, a pthread_create() or anything
> else where the kernel consumes memory on behalf of a process
> from failing at run-time... AMX does not add a unique OOM risk here.
>
>>> The advantage of the #NM over the syscall is that the programmer
>>> doesn't actually have to do anything. Also, transparently allocated
>>> buffers offer a theoretical benefit that a program may have many
>>> threads, but only a few may actually touch AMX, and so there is
>>> savings to be had by allocating buffers only for the threads that
>>> actually use the buffers.
>> The program already asked the kernel whether it can use AMX - it can
>> allocate the buffers for the threads too.
> The result is that if one thread in a 1,000 task process requests
> and touches AMX, the kernel would allocate 8MB, instead of 8KB
> of context switch buffers for that process, no?

Yes, but that's a pretty natural consequence of the process-level ABI
which was chosen. A per-thread permission scheme would not have had
this particular trade-off.

If you have a big process (lots of threads) and you use a process-level
ABI, there are going to be big implications. I don't think we can get away
from this.

2021-08-31 22:49:08

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Mon, Aug 30, 2021 at 2:04 PM Dave Hansen <[email protected]> wrote:
>
> On 8/24/21 4:17 PM, Len Brown wrote:
> > Even if your AMX thread pool threads were to invoke this system call
> > as soon as possible...
> > What is to say that the thread pool is created only at a time when memory
> > is available? A thread could be created 24 hours into program execution
> > under OOM conditions and this system call will return ENOMEM, and your program
> > will in all likelihood throw up its arms and exit at the exact same place
> > it would exit for transparently allocated buffers.
>
> I tried this exact line of reasoning with Thomas: it doesn't matter
> where we run out of memory, we still need the same memory and we're
> screwed either way.
>
> However, Thomas expressed a clear preference for ABIs which return
> memory failures explicitly at syscalls versus implicit failures which
> can happen on random instructions.
>
> One might say that the odds of checking for and handling a NULL value
> (or ENOMEM) are the same as installing a signal handler. *But*, it's
> infinitely easier to unroll state and recover from a NULL than it is to
> handle it from within a signal handler. In other words, the explicit
> ones *encourage* better programming.

I agree.
Indeed, I believe that there is universal agreement that a synchronous
return code
from a system call is a far superior programming model than decoding
the location of a failure in a system call. (no, the IP isn't random -- it is
always the 1st instruction in that thread to touch a TMM register).

> I'd prefer removing the demand-driven allocation at this point.

Adding a pre-allocate system call that can gracefully fail
(even though it never will) is independent of removing
demand-driven allocation. I would leave this to application
developers. Honestly, the kernel shouldn't care.

--
Len Brown, Intel Open Source Technology Center

2021-08-31 22:50:49

by Thiago Macieira

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Tuesday, 31 August 2021 15:15:55 PDT Len Brown wrote:
> Indeed, I believe that there is universal agreement that a synchronous
> return code
> from a system call is a far superior programming model than decoding
> the location of a failure in a system call. (no, the IP isn't random -- it
> is always the 1st instruction in that thread to touch a TMM register).

That instruction is actually likely going to be a memory load, probably an
LDTILECFG. So the developer will see a crashing instruction with a pointer and
will spend time trying to figure out why that pointer was wrong, when there
was nothing wrong with it.

That's why I suggested (and Chang implemented) a SIGILL for when #NM is
received and the arch_prctl() wasn't previously done. The OOM condition, if
the extra state is dynamically allocated, was meant to stay a SIGSEGV, but
maybe should change to SIGKILL.

On the other hand, if it's allocated at the syscall, then the kernel can
return -ENOMEM for it (which would allow for graceful degradation) or for a
later clone() syscall starting a new thread (which I don't expect to ever
gracefully degrade).

> decoding the location of the failure in a *signal hander*

That's a separate problem.

We can't be sure that the portion of the userspace doing the alt-stack crash
handler is aware of the portion using AMX. There's no way to enforce this. The
prctl() is a good indication, but I have no clue how high the correlation will
be.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel DPG Cloud Engineering



2021-08-31 22:51:07

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Tue, Aug 31, 2021 at 6:39 PM Thiago Macieira
<[email protected]> wrote:
>
> On Tuesday, 31 August 2021 15:15:55 PDT Len Brown wrote:
> > Indeed, I believe that there is universal agreement that a synchronous
> > return code
> > from a system call is a far superior programming model than decoding
> > the location of a failure in a system call. (no, the IP isn't random -- it
> > is always the 1st instruction in that thread to touch a TMM register).
>
> That instruction is actually likely going to be a memory load, probably an
> LDTILECFG.

There is no fault on LDTILECFG; the fault will occur on the tile data load.
But yes, still a memory load (with a TMM destination)

Len Brown, Intel Open Source Technology Center

2021-09-17 16:01:44

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Mon, Aug 30, 2021 at 1:31 PM Borislav Petkov <[email protected]> wrote:
>
> On Tue, Aug 24, 2021 at 07:22:18PM -0400, Len Brown wrote:
> > We are forced to complicate their life for AMX (and subsequent features)
> > because of the legacy Linux signal ABI.
>
> No, we need to design this interface properly because you folks went and
> put this AMX thing in xstates. Where it doesn't belong at all.

Years ago, somebody, other than you or I, decided to use uncompacted
XSTATE on the Linux signal stack.

Years ago, somebody else, also other than you or I, decided that AMX should
be implemented using XSTATE.

Today, we are all working together to deal with this collision, in as
graceful a manner as possible. Yes?

> > We require that new apps invoke a system call to tell us that they
> > are not indeed a legacy program, but that they are a program that
> > understands if they use an alt-sig-stack that it must be big enough to
> > handle whatever current hardware requires.
>
> Yes, because of the reason I gave above. If that additional 8K of fat weren't
> an xstate, we wouldn't be having this conversation.

While not as huge, AVX-512 has the same XSTATE bloat issue as AMX --
including the demonstrated ability to overflow the signal stack and kill apps.

The silver lining is that, due to the AMX enabling effort, we updated the
glibc ABI to comprehend variable sigstacksize. So glibc 2.34, which was
released on Aug 1st, comprehends whatever the current hardware supports.
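A minimal sketch of what that looks like on the application side, assuming
glibc 2.34+ where the signal stack limits are queried at run time instead of
being compile-time constants (the function name is made up):

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sketch: size the alternate signal stack from the run-time value rather
 * than the historical SIGSTKSZ constant, so it is large enough for the
 * AVX-512/AMX signal frames of the current hardware.
 */
static int setup_altstack(void)
{
	stack_t ss;
	long sz = sysconf(_SC_SIGSTKSZ);	/* glibc 2.34+ */

	if (sz < 0)
		sz = SIGSTKSZ;			/* older-glibc fallback */

	ss.ss_sp = malloc(sz);
	ss.ss_size = sz;
	ss.ss_flags = 0;
	if (!ss.ss_sp)
		return -1;

	return sigaltstack(&ss, NULL);
}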

> > The secondary motivation for the system call is the desire to give the
> > kernel a hook so that it can refuse to give permission for some apps
> > to use AMX, should the need arise.
>
> Yes.
>
> > > prctl(GET_FEATURES_WITH_KERNEL_ASSISTANCE);
> >
> > The problem is that it adds zero value over the currently used xgetbv(XCR0).
> > As it adds no value, programmers will not use it.

[expletive deleted]

> First of all, it is a new interface we're introducing and if it is
> there from the get-go along with examples how to use it and proper
> documentation, people will.

The application people I talk to are not asking for more system calls.
They would prefer zero system calls (which was our initial proposal).

> Secondly, from a previous email of mine: "What if this modus operandi of
> features userspace can use with kernel assistance but need an explicit
> request and are off otherwise, gets extended beyond XSAVE-managed
> features?"
>
> In that case you can xgetbv() all you want but the new fat feature is
> not even in XCR0. So *then* you *have* to introduce a new prctl() to
> query supported features. And right then and there you wish you would've
> done that from the very beginning!

Sorry, I don't recall seeing that previous note -- maybe it flew past
when I was out.

I have no problem with the quest to develop a universal ABI
to layer over or otherwise replace CPUID and XCR0 and allow kernel override etc.

My point is simply that I haven't seen a case where somebody wanting to use AMX
would need it, and so I don't think developing such an ABI should gate
AMX support.

thanks,
Len Brown, Intel Open Source Technology Center