2021-08-25 19:06:47

by Chang S. Bae

Subject: [PATCH v10 00/28] x86: Support Intel Advanced Matrix Extensions

Intel Advanced Matrix Extensions (AMX)[1][2] will be shipping on servers
soon (Intel Sapphire Rapids). AMX consists of configurable TMM "TILE"
registers plus new CPU instructions that operate on them. TMUL (Tile
matrix MULtiply) is the first operator to take advantage of the new
registers, and we anticipate additional instructions in the future.

Neither AMX state nor TMUL instructions depend on AVX. However, AMX and
AVX do share common challenges. The TMM registers hold 8KB of state today,
and can architecturally grow as large as 64KB, which merits updates to
hardware and software state management.

Further, both technologies run faster when they are not simultaneously
running on SMT siblings, and both technologies' use of power and bandwidth
impacts the power and performance available to neighboring cores. (This
impact has measurably improved in recent hardware.)

If the existing kernel approach for managing XSAVE state were employed to
handle AMX, 8KB of space would be added to every task, yet possibly rarely
used. Thus, Linux implements on-demand expansion of per-task context
switch buffers using an XSAVE feature: eXtended Feature Disabling (XFD).
The kernel arms XFD to deliver an #NM exception upon a task's first access
to TILE state. The kernel exception handler allocates and installs the
appropriate XSAVE context switch buffer. User space is unaware of the
kernel's context switch buffer optimization.

AMX is accessible only to applications that invoke a new system call to
request access. By invoking this system call, an application agrees that,
if it uses an alternate signal stack, it will provide a signal stack of
sufficient size. The simplest way to do that is to use the updated ABI in
glibc 2.34 or later [8][9], though applications could also use their own
calculation or ask the kernel directly [3], as sketched below.
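
For example, a minimal sketch of asking the kernel directly, assuming the
x86 AT_MINSIGSTKSZ aux vector entry discussed in [3] (the fallback
definition below is only for older uapi headers and is illustrative):

#include <stdio.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51        /* not present in older uapi headers */
#endif

int main(void)
{
        /* 0 means the kernel did not supply this aux vector entry. */
        unsigned long min_sigstksz = getauxval(AT_MINSIGSTKSZ);

        printf("kernel-reported minimum signal stack: %lu bytes\n",
               min_sigstksz);
        return 0;
}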

The patches are built on top of the recent upstream x86 FPU changes [13].

This series has three parts:
* Patch 01-16: Foundation to support dynamic user states
* Patch 17-22: AMX enablement
* Patch 23-28: Supplementary changes for optimization, testing, and
debugging

Note that the per-process system call in PATCH14 reflects the latest
discussion on LKML [10][12].

The following points summarize the latest discussion, and this
implementation:

1. Kernel sets XCR0.AMX=1 at boot, and leaves it set, always.

XCR0 is NOT context switched by Linux.
(If it were, every change would provoke a VMEXIT when running in a VM.)

(KVM context switches XCR0. If KVM exports XFD for use by a guest OS,
it must also context switch XFD. KVM cannot use XFD for its own
purposes.)

2. Kernel arms XFD for all tasks.

XFD is context switched per Linux task.

3. Apps invoke new system call to request feature access (AMX).

Implemented as a flag to arch_prctl(2); permission granted to any task
grants that permission to all tasks in the process.

It is sufficient to invoke this syscall at process or library
init-time.

There is no concept of removing or revoking permission, once granted to
a process. (Permission is cleared upon exec of a new process.)

There is a companion system call to return the current permission.

Analogous to AVX-512 and other stateful features, applications probe
for AMX support by checking CPUID for the instructions and checking
XGETBV(XCR0) for the OS support.

However, stateful features from AMX onward also require the system call
above to be invoked before tasks in that process may use the feature.
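
As a rough sketch, the probe-then-request sequence could look like the
code below. The ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE sub-commands
and their argument semantics are those proposed in this series (PATCH15),
so the include assumes headers installed from this series; the CPUID and
XGETBV checks use the architectural AMX-TILE (CPUID.7.0:EDX[24]) and XTILE
(XCR0 bits 17-18) definitions. This is illustrative, not a reference
implementation:

#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <asm/prctl.h>  /* ARCH_*_STATE_ENABLE, assuming this series' headers */

#define XFEATURE_MASK_XTILE     (3ULL << 17)    /* XTILECFG (17) | XTILEDATA (18) */

static uint64_t xgetbv(uint32_t index)
{
        uint32_t eax, edx;

        asm volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(index));
        return eax | ((uint64_t)edx << 32);
}

static int amx_usable(void)
{
        uint32_t eax, ebx, ecx, edx;

        /* CPUID.(EAX=7,ECX=0):EDX[24] enumerates AMX-TILE. */
        asm volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(7), "c"(0));
        if (!(edx & (1U << 24)))
                return 0;

        /* The OS must have set both XTILE bits in XCR0. */
        return (xgetbv(0) & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE;
}

int main(void)
{
        unsigned long enabled = 0;

        if (!amx_usable())
                return 1;

        /* Request permission for the dynamic TILE state, then read it back. */
        if (syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE, XFEATURE_MASK_XTILE))
                return 1;
        if (syscall(SYS_arch_prctl, ARCH_GET_STATE_ENABLE, &enabled))
                return 1;

        printf("enabled dynamic state bitmask: %#lx\n", enabled);
        return 0;
}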

4. Applications touching AMX without permission result in process exit.

Armed XFD results in #NM, which results in SIGILL with si_code ILL_ILLOPC,
typically resulting in process exit.

5. Applications touching AMX with permission allocate the context switch
buffer on demand.

Armed XFD results in #NM.
Kernel allocates the large context switch buffer.
Kernel disarms XFD for that task.

6. #NM handler allocation failure results in process exit.

If the #NM handler cannot allocate the 8KB buffer, the task will
receive a SIGILL with si_code ILL_ILLOPC at the instruction that took
the #NM fault, typically resulting in process exit.

7. Legacy app signal stack XSTATE support includes AVX-512, and stops
before AMX.

Legacy apps are those which do not request AMX (or subsequent feature)
access. The signal stack ABI continues to be uncompacted XSTATE for both
legacy and new apps.

Existing code to find offsets in XSTATE still works.
Existing code doing XRSTOR/XSAVE on the signal stack buffer will still
work.*

* XSTATE size calculation using CPUID will include
AMX and other supported features, even if the process did not invoke
the new system call. However, the kernel will not XSAVE AMX or later
features onto the signal stack of a legacy process.**

** User-space XSAVE/XRSTOR should size buffers according to CPUID
if they include the bits of xgetbv(XCR0) in RFBM, because XSAVE will
write data (including zeros for INIT state) for all features included in
RFBM.
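
A minimal sketch of that sizing rule, using the architectural
CPUID.(EAX=0xD,ECX=0):EBX value, which reports the standard-format size
for everything currently enabled in XCR0 (including the AMX space once
XCR0.AMX=1, regardless of whether this process requested AMX permission):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void cpuid_count(uint32_t leaf, uint32_t subleaf, uint32_t *eax,
                        uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
{
        asm volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(leaf), "c"(subleaf));
}

int main(void)
{
        uint32_t eax, ebx, ecx, edx;
        void *xsave_buf;

        /* EBX: bytes needed for the standard format covering current XCR0. */
        cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx);

        /* The XSAVE area must be 64-byte aligned. */
        if (posix_memalign(&xsave_buf, 64, ebx))
                return 1;

        printf("XSAVE buffer size for current XCR0: %u bytes\n", ebx);
        free(xsave_buf);
        return 0;
}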

8. New opt-in apps must agree to provide a large enough sigaltstack

1. must invoke the permission system call before touching AMX TMM
2. must guarantee, if using sigaltstack(2), that they have allocated a
signal stack of sufficient size, e.g., by utilizing glibc 2.34's
signal.h or later (see the sketch below).

(glibc 2.34 changed MINSIGSTKSZ and SIGSTKSZ from 2KB/8KB constants
into run-time routines. [8])

Linux will continue to XSAVE/XRSTOR directly to/from the signal stack,
and the stack will always include the 8KB *space* for AMX TMM and
subsequent features.

Linux has an optimization for all XFD-supported features in the INIT
state, so that XSAVE will skip writing zeros for them.
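
As referenced in item 8 above, a minimal sketch of allocating a
sufficiently large sigaltstack, assuming glibc 2.34's run-time
_SC_SIGSTKSZ value [8] (with a fallback to the legacy constant):

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static int setup_sigaltstack(void)
{
        stack_t ss = { 0 };
        long size = -1;

#ifdef _SC_SIGSTKSZ
        size = sysconf(_SC_SIGSTKSZ);   /* run-time value; accounts for AMX */
#endif
        if (size < 0)
                size = SIGSTKSZ;        /* legacy constant; may be too small */

        ss.ss_sp = malloc(size);
        if (!ss.ss_sp)
                return -1;
        ss.ss_size = size;

        return sigaltstack(&ss, NULL);
}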

9. intel_idle for SPR will clear AMX TMM state

This guarantees that AMX use will not prevent the CPU from entering the
idle C6 state, which can be beneficial for power savings, and thus
turbo frequency.

Reviewed-by: Len Brown <[email protected]>

Changes from v9 :
* Simplify and rename helpers for managing XSTATE buffer (Patch9,11).
(Borislav Petkov)
* Simplify and use permission check helpers (Patch15,16).
* Remove access helpers (Patch6). (Borislav Petkov)
* Rename XSTATE address finder helper (Patch11). (Borislav Petkov)
* Simplify ptrace path code (Patch14). (Borislav Petkov)
* Use cpu_feature_enabled() whenever possible (Patch9,13,15,23,26,27).
(Borislav Petkov)
* Add comment for tile_release() use (Patch26). (Dave Hansen)
* Update code comment and/or changelog (Patch6,7). (Borislav Petkov)
* Update the cover letter to indicate SPR explicitly (Patch0). (Dave
Hansen)
* Update XFD enabling code (Patch13). (Borislav Petkov)
* Move the state copy function changes. (Patch1,9,12). (Borislav
Petkov)

Changes from v8 [16]:
* Update arch_prctl prototype for consistency with other arch_prctl's. It
now takes an address of return bitmask as a parameter (Patch14). Update
self-tests to reflect this (Patch23).
* bugfix: Fix off-by-one-error in check_xstate_against_struct() feature
number argument (Patch19).

Changes from v7 [15]:
* Update #NM handler to raise SIGILL rather than SIGSEGV (Patch 12).
(Thiago Macieira)
* Rename the syscalls (Patch 14). (Thiago Macieira and Dave Hansen)
* If XSAVE is disabled, assure that syscall correctly indicates legacy
states (Patch14). (Thiago Macieira and Dave Hansen)
* Update existing self-test to expect SIGILL (Patch23).

Changes from v6 [14]:
* Add state bitmap param to proposed syscall. (Thiago Macieira)
* Add companion syscall to return the current permission bitmap.
* Update the ptrace path to return EFAULT when no permission to write
XTILEDATA.
* Simplify xstate size calculation code. (Dave Hansen)
* Update comments for TILERELEASE code. (Rafael J. Wysocki)

Changes from v5 [11]:
* Updated to require per-process permission for dynamic states (v5 was
per-task).
* Support both legacy and expanded sigframe xstate buffer sizes.
* Moved the TILERELEASE code to intel_idle driver. (Peter Zijlstra)
* Fixed to deactivate fpregs with TILERELEASE. (Andy Lutomirski and Dave
Hansen)
* Rebased on Thomas Gleixner's recent x86 FPU code changes.
* Added XFD sanity check. (Dave Hansen)
* Future proofed __raw_xsave_addr().
* Tighten up task size calculation (previously, it could over-calculate).
* Cleaned invocation memset() for init_fpstate (no functional change).
* Updated selftest to handle latest syscall semantics, plus minor updates.
* Dropped the change for XSTATE restore helper.

Changes from v4 [7]:
* Changed the buffer expansion policy to the access-request based approach
from the transparent #NM-based approach. (Andy Lutomirski, Thomas
Gleixner, et al.)
* Removed the boot parameter patch. (Thomas Gleixner)
* Included code to explicitly initialize AMX state during a context switch.
(Thomas Gleixner)
* Added a new arch_prctl to pre-allocate a buffer for dynamic state. (Andy
Lutomirski)
* Updated the fork() path to initialize all the AMX state.
* Improved ptracer's dynamic user state injection path.
* Add optimization to skip tile data in sigframe when an AMX thread
initialized the state.
* Updated to treat the mismatched state size as an error. (Thomas Gleixner)
* Simplified the xstate feature check routine. (Thomas Gleixner)
* Simplified and updated the selftest.
* Updated some changelogs. (Thomas Gleixner)
* Updated a function description. (Borislav Petkov)

Changes from v3 [6]:
* Updated some commit messages and code comments. (Borislav Petkov)
* Added and removed some helpers. (Borislav Petkov)
* Revised the buffer allocation function. (Borislav Petkov)
* Simplified buffer access. (Borislav Petkov)
* Re-organized some code changes to be more reviewable. (PATCH9/10)
* Reverted unnecessary changes. (PATCH4)
* Fixed typo in the documentation. (Randy Dunlap)

Changes from v2 [5]:
* Removed the patch for the tile data inheritance. Also, updated the
selftest patch. (Andy Lutomirski)
* Changed to taint the kernel when any unknown state is enabled. (Andy
Lutomirski)
* Changed to use the XFD feature only when the compacted format is in use.
* Improved the test code.
* Simplified the cmdline handling.
* Removed 'task->fpu' in changelogs. (Boris Petkov)
* Updated the variable name / comments / changelogs for clarification.

Changes from v1 [4]:
* Added vmalloc() error tracing (Dave Hansen, PeterZ, and Andy Lutomirski)
* Inlined the #NM handling code (Andy Lutomirski)
* Made signal handling optimization revertible
* Revised the new parameter handling code (Andy Lutomirski and Dave Hansen)
* Rebased on the upstream kernel

[1]: Intel Architecture Instruction Set Extension Programming Reference
May 2021, https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
[2]: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html
[3]: https://lore.kernel.org/lkml/[email protected]/
[4]: https://lore.kernel.org/lkml/[email protected]/
[5]: https://lore.kernel.org/lkml/[email protected]/
[6]: https://lore.kernel.org/lkml/[email protected]/
[7]: https://lore.kernel.org/lkml/[email protected]/
[8]: https://sourceware.org/git/?p=glibc.git;a=commit;h=6c57d320484988e87e446e2e60ce42816bf51d53
[9]: https://sourceware.org/git/?p=glibc.git;a=blob;f=NEWS;h=aa0f10a891f8f9b4e6f0f6d25b6a307898c07d82;hb=HEAD#l12
[10]: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/[email protected]/
[12]: https://lore.kernel.org/lkml/CAJvTdKmzN0VMyH8VU_fdzn2UZqmR=_aNrJW01a65BhyLm6YRPg@mail.gmail.com/
[13]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1423e2660cf134a8f21f2451865a04792013e49e
[14]: https://lore.kernel.org/lkml/[email protected]/
[15]: https://lore.kernel.org/lkml/[email protected]/
[16]: https://lore.kernel.org/lkml/[email protected]/
[17]: https://lore.kernel.org/lkml/[email protected]/

Chang S. Bae (28):
x86/fpu/xstate: Fix the state copy function to the XSTATE buffer
x86/fpu/xstate: Modify the initialization helper to handle both static
and dynamic buffers
x86/fpu/xstate: Modify state copy helpers to handle both static and
dynamic buffers
x86/fpu/xstate: Modify address finders to handle both static and
dynamic buffers
x86/fpu/xstate: Add a new variable to indicate dynamic user states
x86/fpu/xstate: Add new variables to indicate dynamic XSTATE buffer
size
x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes
x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer
x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer
dynamically
x86/fpu/xstate: Update the XSTATE save function to support dynamic
states
x86/fpu/xstate: Update the XSTATE buffer address finder to support
dynamic states
x86/fpu/xstate: Update the XSTATE context copy function to support
dynamic states
x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user
state
x86/fpu/xstate: Support ptracer-induced XSTATE buffer expansion
x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
x86/fpu/xstate: Support both legacy and expanded signal XSTATE size
x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in
state component numbers
x86/fpu/xstate: Disable XSTATE support if an inconsistent state is
detected
x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature
bits
x86/fpu/amx: Define AMX state components and have it used for
boot-time checks
x86/fpu/amx: Initialize child's AMX state
x86/fpu/amx: Enable the AMX feature in 64-bit mode
x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user
states if in INIT-state
selftest/x86/amx: Test cases for the AMX state management
x86/insn/amx: Add TILERELEASE instruction to the opcode map
intel_idle/amx: Add SPR support with XTILEDATA capability
x86/fpu/xstate: Add a sanity check for XFD state when saving XSTATE
x86/arch_prctl: ARCH_GET_FEATURES_WITH_KERNEL_ASSISTANCE

arch/x86/include/asm/cpufeatures.h | 4 +
arch/x86/include/asm/fpu/internal.h | 102 ++-
arch/x86/include/asm/fpu/types.h | 77 +-
arch/x86/include/asm/fpu/xstate.h | 71 +-
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/include/asm/processor.h | 10 +-
arch/x86/include/asm/proto.h | 2 +-
arch/x86/include/asm/special_insns.h | 6 +
arch/x86/include/asm/trace/fpu.h | 4 +-
arch/x86/include/uapi/asm/prctl.h | 22 +-
arch/x86/kernel/cpu/cpuid-deps.c | 4 +
arch/x86/kernel/fpu/core.c | 96 ++-
arch/x86/kernel/fpu/init.c | 37 +-
arch/x86/kernel/fpu/regset.c | 52 +-
arch/x86/kernel/fpu/signal.c | 98 ++-
arch/x86/kernel/fpu/xstate.c | 613 ++++++++++++++--
arch/x86/kernel/process.c | 27 +-
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 6 +-
arch/x86/kernel/traps.c | 51 ++
arch/x86/kvm/x86.c | 48 +-
arch/x86/lib/x86-opcode-map.txt | 8 +-
arch/x86/math-emu/fpu_aux.c | 2 +-
arch/x86/math-emu/fpu_entry.c | 4 +-
arch/x86/math-emu/fpu_system.h | 2 +-
drivers/idle/intel_idle.c | 82 +++
tools/arch/x86/lib/x86-opcode-map.txt | 8 +-
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/amx.c | 970 ++++++++++++++++++++++++++
29 files changed, 2165 insertions(+), 247 deletions(-)
create mode 100644 tools/testing/selftests/x86/amx.c


base-commit: e22ce8eb631bdc47a4a4ea7ecf4e4ba499db4f93
--
2.17.1


2021-08-25 19:12:08

by Chang S. Bae

Subject: [PATCH v10 02/28] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers

Have the function initializing the XSTATE buffer take a struct fpu *
pointer in preparation for dynamic state buffer support.

init_fpstate is a special case, which is indicated by a null pointer
parameter to fpstate_init().

Also, fpstate_init_xstate() now accepts the state component bitmap to
customize the compacted format.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Moved fpstate_init_xstate() back to the header (again).
* Massaged the changelog.

Changes from v4:
* Added a proper function description. (Borislav Petkov)
* Added the likely() statement as a null pointer is a special case.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the function comment to use kernel-doc style. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
arch/x86/include/asm/fpu/internal.h | 11 ++++++++++-
arch/x86/kernel/fpu/core.c | 28 +++++++++++++++++-----------
arch/x86/kernel/fpu/init.c | 2 +-
arch/x86/kernel/fpu/xstate.c | 3 +--
arch/x86/kvm/x86.c | 2 +-
5 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 5a18694a89b2..c7a64e2806a9 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -80,7 +80,7 @@ static __always_inline __pure bool use_fxsr(void)

extern union fpregs_state init_fpstate;

-extern void fpstate_init(union fpregs_state *state);
+extern void fpstate_init(struct fpu *fpu);
#ifdef CONFIG_MATH_EMULATION
extern void fpstate_init_soft(struct swregs_state *soft);
#else
@@ -88,6 +88,15 @@ static inline void fpstate_init_soft(struct swregs_state *soft) {}
#endif
extern void save_fpregs_to_fpstate(struct fpu *fpu);

+static inline void fpstate_init_xstate(struct xregs_state *xsave, u64 mask)
+{
+ /*
+ * XRSTORS requires these bits set in xcomp_bv, or it will
+ * trigger #GP:
+ */
+ xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | mask;
+}
+
/* Returns 0 or the negated trap number, which results in -EFAULT for #PF */
#define user_insn(insn, output, input...) \
({ \
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 7ada7bd03a32..c0098f8422de 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -203,15 +203,6 @@ void fpu_sync_fpstate(struct fpu *fpu)
fpregs_unlock();
}

-static inline void fpstate_init_xstate(struct xregs_state *xsave)
-{
- /*
- * XRSTORS requires these bits set in xcomp_bv, or it will
- * trigger #GP:
- */
- xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xfeatures_mask_all;
-}
-
static inline void fpstate_init_fxstate(struct fxregs_state *fx)
{
fx->cwd = 0x37f;
@@ -229,8 +220,23 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
fp->fos = 0xffff0000u;
}

-void fpstate_init(union fpregs_state *state)
+/**
+ *
+ * fpstate_init - initialize the xstate buffer
+ *
+ * If @fpu is NULL, initialize init_fpstate.
+ *
+ * @fpu: A struct fpu * pointer
+ */
+void fpstate_init(struct fpu *fpu)
{
+ union fpregs_state *state;
+
+ if (likely(fpu))
+ state = &fpu->state;
+ else
+ state = &init_fpstate;
+
if (!static_cpu_has(X86_FEATURE_FPU)) {
fpstate_init_soft(&state->soft);
return;
@@ -239,7 +245,7 @@ void fpstate_init(union fpregs_state *state)
memset(state, 0, fpu_kernel_xstate_size);

if (static_cpu_has(X86_FEATURE_XSAVES))
- fpstate_init_xstate(&state->xsave);
+ fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
if (static_cpu_has(X86_FEATURE_FXSR))
fpstate_init_fxstate(&state->fxsave);
else
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 64e29927cc32..e14c72bc8706 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -124,7 +124,7 @@ static void __init fpu__init_system_generic(void)
* Set up the legacy init FPU context. (xstate init might overwrite this
* with a more modern format, if the CPU supports it.)
*/
- fpstate_init(&init_fpstate);
+ fpstate_init(NULL);

fpu__init_system_mxcsr();
}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index fc1d529547e6..0fed7fbcf2e8 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -395,8 +395,7 @@ static void __init setup_init_fpu_buf(void)
print_xstate_features();

if (boot_cpu_has(X86_FEATURE_XSAVES))
- init_fpstate.xsave.header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
- xfeatures_mask_all;
+ fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);

/*
* Init all the features state with header.xfeatures being 0x0
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5d5c5ed7dd4..ce529ab233d7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10608,7 +10608,7 @@ static void fx_init(struct kvm_vcpu *vcpu)
if (!vcpu->arch.guest_fpu)
return;

- fpstate_init(&vcpu->arch.guest_fpu->state);
+ fpstate_init(vcpu->arch.guest_fpu);
if (boot_cpu_has(X86_FEATURE_XSAVES))
vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
host_xcr0 | XSTATE_COMPACTION_ENABLED;
--
2.17.1

2021-08-25 19:12:20

by Chang S. Bae

Subject: [PATCH v10 03/28] x86/fpu/xstate: Modify state copy helpers to handle both static and dynamic buffers

Have all the functions copying XSTATE take a struct fpu * pointer in
preparation for dynamic state buffer support.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Adjusted function prototype changes to the recent renames on the new
base.

Changes from v3:
* Updated the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
arch/x86/include/asm/fpu/xstate.h | 4 ++--
arch/x86/kernel/fpu/regset.c | 2 +-
arch/x86/kernel/fpu/signal.c | 2 +-
arch/x86/kernel/fpu/xstate.c | 12 ++++++------
4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 109dfcc75299..ede166e9d3f2 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -136,8 +136,8 @@ extern void __init update_regset_xstate_info(unsigned int size,

void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
int xfeature_size(int xfeature_nr);
-int copy_uabi_from_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf);
-int copy_sigframe_from_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf);
+int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
+int copy_sigframe_from_user_to_xstate(struct fpu *fpu, const void __user *ubuf);

void xsaves(struct xregs_state *xsave, u64 mask);
void xrstors(struct xregs_state *xsave, u64 mask);
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 66ed317ebc0d..49dd307003ec 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -164,7 +164,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
}

fpu_force_restore(fpu);
- ret = copy_uabi_from_kernel_to_xstate(&fpu->state.xsave, kbuf ?: tmpbuf);
+ ret = copy_uabi_from_kernel_to_xstate(fpu, kbuf ?: tmpbuf);

out:
vfree(tmpbuf);
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 445c57c9c539..bec8c8046888 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -371,7 +371,7 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
fpregs_unlock();

if (use_xsave() && !fx_only) {
- ret = copy_sigframe_from_user_to_xstate(&fpu->state.xsave, buf_fx);
+ ret = copy_sigframe_from_user_to_xstate(fpu, buf_fx);
if (ret)
return ret;
} else {
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 0fed7fbcf2e8..df39085e9d05 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1089,10 +1089,10 @@ static int copy_from_buffer(void *dst, unsigned int offset, unsigned int size,
return 0;
}

-
-static int copy_uabi_to_xstate(struct xregs_state *xsave, const void *kbuf,
+static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
const void __user *ubuf)
{
+ struct xregs_state *xsave = &fpu->state.xsave;
unsigned int offset, size;
struct xstate_header hdr;
u64 mask;
@@ -1161,9 +1161,9 @@ static int copy_uabi_to_xstate(struct xregs_state *xsave, const void *kbuf,
* format and copy to the target thread. This is called from
* xstateregs_set().
*/
-int copy_uabi_from_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
+int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
{
- return copy_uabi_to_xstate(xsave, kbuf, NULL);
+ return copy_uabi_to_xstate(fpu, kbuf, NULL);
}

/*
@@ -1171,10 +1171,10 @@ int copy_uabi_from_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
* XSAVE[S] format and copy to the target thread. This is called from the
* sigreturn() and rt_sigreturn() system calls.
*/
-int copy_sigframe_from_user_to_xstate(struct xregs_state *xsave,
+int copy_sigframe_from_user_to_xstate(struct fpu *fpu,
const void __user *ubuf)
{
- return copy_uabi_to_xstate(xsave, NULL, ubuf);
+ return copy_uabi_to_xstate(fpu, NULL, ubuf);
}

static bool validate_xsaves_xrstors(u64 mask)
--
2.17.1

2021-08-25 19:13:31

by Chang S. Bae

Subject: [PATCH v10 07/28] x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes

The CPUID instruction separately enumerates sizes and alignments of
individual xfeatures. It independently enumerates the required size of an
entire XSAVE buffer to store all enabled features.

calculate_xstate_sizes() currently uses the individual feature
size/alignment enumeration to independently recalculate the required XSAVE
buffer size.

The XSTATE per-task buffer is currently embedded into struct fpu with
static size. To accommodate dynamic user XSTATEs, record the maximum and
minimum buffer sizes.

Extend the function to accept an option to exclude dynamic states. With
that, calculate the maximum size that contains all the enabled states, and
the minimum size that fits in the embedded buffer by excluding them.

Also, move the size comparison with the CPUID value out to the call site.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v9:
* Update the changelog and the code comment. (Borislav Petkov)

Changes from v6:
* Simplify xstate size calculation code. (Dave Hansen)
* Updated the changelog. (Dave Hansen)
* Fixed the v6 changes.

Changes from v5:
* Re-adjusted some local variable names.

Changes from v4:
* Massaged the function description, in preparation for the change
with a return value.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the code comment. (Borislav Petkov)
* Adjusted the calculation function naming.
* Moved out the new variable addition into a new patch.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Renamed the in-line size variable.
* Updated some code comments.
---
arch/x86/kernel/fpu/xstate.c | 61 ++++++++++++++++++------------------
1 file changed, 31 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 058dc9df6b86..2e474fbdc241 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -548,24 +548,31 @@ static void check_xstate_against_struct(int nr)
}
}

-/*
- * This essentially double-checks what the cpu told us about
- * how large the XSAVE buffer needs to be. We are recalculating
- * it to be safe.
+/**
+ * calculate_xstate_size - Calculate the xstate per-task buffer size.
+ *
+ * This essentially double-checks what the CPU told us about how large the
+ * XSAVE buffer needs to be. We are recalculating it to be safe.
+ *
+ * Independent XSAVE features allocate their own buffers and are always
+ * excluded. Only the size of the buffer for task->fpu is checked here.
*
- * Independent XSAVE features allocate their own buffers and are not
- * covered by these checks. Only the size of the buffer for task->fpu
- * is checked here.
+ * @include_dynamic_states: A knob to include dynamic states or not.
+ *
+ * Return: The calculated xstate size.
*/
-static void do_extra_xstate_size_checks(void)
+static unsigned int calculate_xstate_size(bool include_dynamic_states)
{
- int paranoid_xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+ unsigned int xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
int i;

for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
if (!xfeature_enabled(i))
continue;

+ if ((xfeatures_mask_user_dynamic & BIT_ULL(i)) && !include_dynamic_states)
+ continue;
+
check_xstate_against_struct(i);
/*
* Supervisor state components can be managed only by
@@ -576,7 +583,7 @@ static void do_extra_xstate_size_checks(void)

/* Align from the end of the previous feature */
if (xfeature_is_aligned(i))
- paranoid_xstate_size = ALIGN(paranoid_xstate_size, 64);
+ xstate_size = ALIGN(xstate_size, 64);
/*
* The offset of a given state in the non-compacted
* format is given to us in a CPUID leaf. We check
@@ -584,20 +591,16 @@ static void do_extra_xstate_size_checks(void)
* setup_xstate_features(). XSAVES uses compacted format.
*/
if (!cpu_feature_enabled(X86_FEATURE_XSAVES))
- paranoid_xstate_size = xfeature_uncompacted_offset(i);
+ xstate_size = xfeature_uncompacted_offset(i);
/*
* The compacted-format offset always depends on where
* the previous state ended.
*/
- paranoid_xstate_size += xfeature_size(i);
+ xstate_size += xfeature_size(i);
}
- /*
- * The size accounts for all the possible states reserved in the
- * per-task buffer. Check against the maximum size.
- */
- XSTATE_WARN_ON(paranoid_xstate_size != fpu_buf_cfg.max_size);
-}

+ return xstate_size;
+}

/*
* Get total size of enabled xstates in XCR0 | IA32_XSS.
@@ -680,7 +683,7 @@ static bool is_supported_xstate_size(unsigned int test_xstate_size)
static int __init init_xstate_size(void)
{
/* Recompute the context size for enabled features: */
- unsigned int possible_xstate_size;
+ unsigned int possible_xstate_size, xstate_size;
unsigned int xsave_size;

xsave_size = get_xsave_size();
@@ -691,24 +694,22 @@ static int __init init_xstate_size(void)
possible_xstate_size = xsave_size;

/*
- * The size accounts for all the possible states reserved in the
- * per-task buffer. Set the maximum with this value.
+ * Calculate the maximum xstate size, including the dynamic states.
*/
fpu_buf_cfg.max_size = possible_xstate_size;
-
- /* Perform an extra check for the maximum size. */
- do_extra_xstate_size_checks();
+ xstate_size = calculate_xstate_size(true);
+ XSTATE_WARN_ON(possible_xstate_size != xstate_size);

/*
- * Set the minimum to be the same as the maximum. The dynamic
- * user states are not supported yet.
+ * Calculate the minimum xstate size, i.e., excluding the dynamic
+ * xstates.
*/
- fpu_buf_cfg.min_size = possible_xstate_size;
-
- /* Ensure the minimum size fits in the statically-allocated buffer: */
- if (!is_supported_xstate_size(fpu_buf_cfg.min_size))
+ xstate_size = calculate_xstate_size(false);
+ if (!is_supported_xstate_size(xstate_size))
return -EINVAL;

+ fpu_buf_cfg.min_size = xstate_size;
+
/*
* User space is always in standard format.
*/
--
2.17.1

2021-08-25 19:18:10

by Chang S. Bae

Subject: [PATCH v10 04/28] x86/fpu/xstate: Modify address finders to handle both static and dynamic buffers

Have all the functions finding XSTATE address take a struct fpu * pointer
in preparation for dynamic state buffer support.

init_fpstate is a special case, which is indicated by a null pointer
parameter to get_xsave_addr() and __raw_xsave_addr().

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Adjusted some call sites for the new base.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the function comment to use kernel-doc style. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
arch/x86/include/asm/fpu/xstate.h | 2 +-
arch/x86/kernel/fpu/xstate.c | 42 ++++++++++++++++++++++++-------
arch/x86/kvm/x86.c | 10 +++-----
3 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index ede166e9d3f2..2451bccc6cac 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -134,7 +134,7 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
extern void __init update_regset_xstate_info(unsigned int size,
u64 xstate_mask);

-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
int xfeature_size(int xfeature_nr);
int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
int copy_sigframe_from_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index df39085e9d05..0a59df0c48e7 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -841,19 +841,34 @@ void fpu__resume_cpu(void)
}
}

-/*
+/**
+ * __raw_xsave_addr - Find the address where the feature state is saved.
+ *
* Given an xstate feature nr, calculate where in the xsave
* buffer the state is. Callers should ensure that the buffer
* is valid.
+ *
+ * If @fpu is NULL, use init_fpstate.
+ *
+ * @fpu: A struct fpu * pointer
+ *
+ * Return: An address of the feature state in the buffer
*/
-static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
{
+ void *xsave;
+
if (!xfeature_enabled(xfeature_nr)) {
WARN_ON_FPU(1);
return NULL;
}

- return (void *)xsave + xstate_comp_offsets[xfeature_nr];
+ if (fpu)
+ xsave = &fpu->state.xsave;
+ else
+ xsave = &init_fpstate.xsave;
+
+ return xsave + xstate_comp_offsets[xfeature_nr];
}
/*
* Given the xsave area and a state inside, this function returns the
@@ -866,15 +881,18 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
* this will return NULL.
*
* Inputs:
- * xstate: the thread's storage area for all FPU data
+ * fpu: the thread's FPU data to reference xstate buffer(s).
+ * (A null pointer parameter indicates init_fpstate.)
* xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
* XFEATURE_SSE, etc...)
* Output:
* address of the state in the xsave area, or NULL if the
* field is not present in the xsave buffer.
*/
-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
{
+ struct xregs_state *xsave;
+
/*
* Do we even *have* xsave state?
*/
@@ -887,6 +905,12 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
*/
WARN_ONCE(!(xfeatures_mask_all & BIT_ULL(xfeature_nr)),
"get of unsupported state");
+
+ if (fpu)
+ xsave = &fpu->state.xsave;
+ else
+ xsave = &init_fpstate.xsave;
+
/*
* This assumes the last 'xsave*' instruction to
* have requested that 'xfeature_nr' be saved.
@@ -901,7 +925,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
return NULL;

- return __raw_xsave_addr(xsave, xfeature_nr);
+ return __raw_xsave_addr(fpu, xfeature_nr);
}
EXPORT_SYMBOL_GPL(get_xsave_addr);

@@ -1061,8 +1085,8 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
membuf_write(&to, &pkru, sizeof(pkru));
} else {
copy_feature(header.xfeatures & BIT_ULL(i), &to,
- __raw_xsave_addr(xsave, i),
- __raw_xsave_addr(xinit, i),
+ __raw_xsave_addr(&tsk->thread.fpu, i),
+ __raw_xsave_addr(NULL, i),
xstate_sizes[i]);
}
/*
@@ -1129,7 +1153,7 @@ static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
u64 mask = ((u64)1 << i);

if (hdr.xfeatures & mask) {
- void *dst = __raw_xsave_addr(xsave, i);
+ void *dst = __raw_xsave_addr(fpu, i);

if (!dst)
continue;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce529ab233d7..9c7c59317e5e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4726,7 +4726,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
memcpy(dest + offset, &vcpu->arch.pkru,
sizeof(vcpu->arch.pkru));
} else {
- src = get_xsave_addr(xsave, xfeature_nr);
+ src = get_xsave_addr(vcpu->arch.guest_fpu, xfeature_nr);
if (src)
memcpy(dest + offset, src, size);
}
@@ -4769,7 +4769,7 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
memcpy(&vcpu->arch.pkru, src + offset,
sizeof(vcpu->arch.pkru));
} else {
- void *dest = get_xsave_addr(xsave, xfeature_nr);
+ void *dest = get_xsave_addr(vcpu->arch.guest_fpu, xfeature_nr);

if (dest)
memcpy(dest, src + offset, size);
@@ -10840,12 +10840,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
*/
if (init_event)
kvm_put_guest_fpu(vcpu);
- mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
- XFEATURE_BNDREGS);
+ mpx_state_buffer = get_xsave_addr(vcpu->arch.guest_fpu, XFEATURE_BNDREGS);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndreg_state));
- mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
- XFEATURE_BNDCSR);
+ mpx_state_buffer = get_xsave_addr(vcpu->arch.guest_fpu, XFEATURE_BNDCSR);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
if (init_event)
--
2.17.1

2021-08-25 19:19:47

by Chang S. Bae

Subject: [PATCH v10 05/28] x86/fpu/xstate: Add a new variable to indicate dynamic user states

The XSTATE per-task buffer is being prepared to become dynamic for user
states. Introduce a new mask variable to indicate the 'dynamic' user
states. The value is determined at boot time.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Made the variable __ro_after_init.
* Dropped the perf's xstate buffer renaming, as renamed already.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the code comment. (Borislav Petkov)

Changes from v2:
* Updated the changelog for clarification.
---
arch/x86/include/asm/fpu/xstate.h | 2 ++
arch/x86/kernel/fpu/xstate.c | 9 +++++++++
2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 2451bccc6cac..bc4cba62906b 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -129,6 +129,8 @@ static inline u64 xfeatures_mask_independent(void)
return XFEATURE_MASK_INDEPENDENT;
}

+extern u64 xfeatures_mask_user_dynamic;
+
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];

extern void __init update_regset_xstate_info(unsigned int size,
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 0a59df0c48e7..6a658ef913bd 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -62,6 +62,12 @@ static short xsave_cpuid_features[] __initdata = {
u64 xfeatures_mask_all __ro_after_init;
EXPORT_SYMBOL_GPL(xfeatures_mask_all);

+/*
+ * This represents user xstates, a subset of xfeatures_mask_all, saved in a
+ * dynamic kernel XSAVE buffer.
+ */
+u64 xfeatures_mask_user_dynamic __ro_after_init;
+
static unsigned int xstate_offsets[XFEATURE_MAX] __ro_after_init =
{ [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
@@ -709,6 +715,7 @@ static int __init init_xstate_size(void)
static void fpu__init_disable_system_xstate(void)
{
xfeatures_mask_all = 0;
+ xfeatures_mask_user_dynamic = 0;
cr4_clear_bits(X86_CR4_OSXSAVE);
setup_clear_cpu_cap(X86_FEATURE_XSAVE);
}
@@ -780,6 +787,8 @@ void __init fpu__init_system_xstate(void)

/* Store it for paranoia check at the end */
xfeatures = xfeatures_mask_all;
+ /* Do not support the dynamically allocated buffer yet. */
+ xfeatures_mask_user_dynamic = 0;

/* Enable xstate instructions to be able to continue with initialization: */
fpu__init_cpu_xstate();
--
2.17.1

2021-08-25 19:20:52

by Chang S. Bae

Subject: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can defer allocating the large XSAVE buffer until tasks
need it.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
xstate-component bitmap format, used by XCR0.

Use this hardware capability to find the right time to expand the XSTATE
buffer. The #NM handler induces the buffer expansion.

Introduce a helper function to switch the IA32_XFD MSR.

In the event of vzalloc() failure, send SIGILL with si_code ILL_ILLOPC.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v9:
* Mask the XFD flag from /proc/cpuinfo. (Borislav Petkov)
* Remove most helpers. (Borislav Petkov)
* Refactor the XFD handling code. (Borislav Petkov)
* Update the feature enumeration ordering. (Borislav Petkov)
* Rename the XFD support helper. (Borislav Petkov)
* Update the print message for dynamic states. (Borislav Petkov)
* Adjust the changelog.
* Use cpu_feature_enabled() wherever possible. (Borislav Petkov)

Changes from v7:
* Update #NM handler to raise SIGILL rather than SIGSEGV. (Thiago
Macieira)

Changes from v6:
* Update the #NM handler a little bit.
* Clean up the code comment.

Changes from v5:
* Excluded the access request check here and included the buffer allocation
again in the #NM handler. The access request will be dealt with in the
next patch.
* Updated the title. (Dave Hansen)
* Updated the code comment.

Changes from v4:
* Changed to use XFD to support the access request policy. Updated #NM
handler to raise a signal instead of buffer allocation.
* Decoupled XFD from the use of XSAVE compacted format.
* Updated helper functions.
* Updated function descriptions in a proper format.
* Updated some code comments.

Changes from v3:
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Changed to enable XFD only when the compacted format is used.
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Inlined the XFD-induced #NM handling code (Andy Lutomirski)
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/fpu/internal.h | 25 +++++++++++++--
arch/x86/include/asm/msr-index.h | 2 ++
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/fpu/xstate.c | 44 +++++++++++++++++++++++--
arch/x86/kernel/process.c | 10 ++++++
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kernel/traps.c | 50 +++++++++++++++++++++++++++++
9 files changed, 131 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..ab7b3a2de85d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -277,6 +277,7 @@
#define X86_FEATURE_XSAVEC (10*32+ 1) /* XSAVEC instruction */
#define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 instruction */
#define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD (10*32+ 4) /* "" eXtended Feature Disabling */

/*
* Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 263e349ff85a..1aa8bc75b24d 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -535,14 +535,35 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
* Misc helper functions:
*/

+/**
+ * xfd_switch - Switches the MSR IA32_XFD context if needed.
+ * @prev: The previous task's struct fpu pointer
+ * @next: The next task's struct fpu pointer
+ */
+static inline void xfd_switch(struct fpu *prev, struct fpu *next)
+{
+ u64 prev_xfd_mask, next_xfd_mask;
+
+ if (!cpu_feature_enabled(X86_FEATURE_XFD) || !xfeatures_mask_user_dynamic)
+ return;
+
+ prev_xfd_mask = prev->state_mask & xfeatures_mask_user_dynamic;
+ next_xfd_mask = next->state_mask & xfeatures_mask_user_dynamic;
+
+ if (unlikely(prev_xfd_mask != next_xfd_mask))
+ wrmsrl_safe(MSR_IA32_XFD, xfeatures_mask_user_dynamic ^ next_xfd_mask);
+}
+
/*
* Delay loading of the complete FPU state until the return to userland.
* PKRU is handled separately.
*/
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
{
- if (cpu_feature_enabled(X86_FEATURE_FPU))
+ if (cpu_feature_enabled(X86_FEATURE_FPU)) {
set_thread_flag(TIF_NEED_FPU_LOAD);
+ xfd_switch(old_fpu, new_fpu);
+ }
}

#endif /* _ASM_X86_FPU_INTERNAL_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..01e2650b9585 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -625,6 +625,8 @@

#define MSR_IA32_BNDCFGS_RSVD 0x00000ffc

+#define MSR_IA32_XFD 0x000001c4
+#define MSR_IA32_XFD_ERR 0x000001c5
#define MSR_IA32_XSS 0x00000da0

#define MSR_IA32_APICBASE 0x0000001b
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index defda61f372d..7f891d2eb52e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -75,6 +75,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_SGX_LC, X86_FEATURE_SGX },
{ X86_FEATURE_SGX1, X86_FEATURE_SGX },
{ X86_FEATURE_SGX2, X86_FEATURE_SGX1 },
+ { X86_FEATURE_XFD, X86_FEATURE_XSAVE },
{}
};

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index bd6a5c746d3f..1bcab5af0a5a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -139,6 +139,27 @@ static bool xfeature_is_supervisor(int xfeature_nr)
return ecx & 1;
}

+/**
+ * xfeature_supports_xfd - Check if the feature supports Extended Feature
+ * Disable (XFD).
+ * @feature_nr: The feature number.
+ *
+ * Returns: True if supported; otherwise, false.
+ */
+static bool xfeature_supports_xfd(int feature_nr)
+{
+ u32 eax, ebx, ecx, edx;
+
+ if (!cpu_feature_enabled(X86_FEATURE_XFD))
+ return false;
+
+ /*
+ * If state component 'i' supports it, ECX[2] return 1; otherwise, 0.
+ */
+ cpuid_count(XSTATE_CPUID, feature_nr, &eax, &ebx, &ecx, &edx);
+ return ecx & 4;
+}
+
/**
* get_xstate_comp_offset - Find the feature offset in the compacted format.
* @mask: This bitmap tells which components to be saved in the
@@ -236,6 +257,9 @@ void fpu__init_cpu_xstate(void)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
xfeatures_mask_independent());
}
+
+ if (boot_cpu_has(X86_FEATURE_XFD))
+ wrmsrl(MSR_IA32_XFD, xfeatures_mask_user_dynamic);
}

static bool xfeature_enabled(enum xfeature xfeature)
@@ -433,8 +457,9 @@ static void __init print_xstate_offset_size(void)
for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
if (!xfeature_enabled(i))
continue;
- pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
- i, xstate_comp_offsets[i], i, xstate_sizes[i]);
+ pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d %s\n",
+ i, xstate_comp_offsets[i], i, xstate_sizes[i],
+ (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "(dynamic)" : "");
}
}

@@ -880,6 +905,16 @@ void __init fpu__init_system_xstate(void)
/* Do not support the dynamically allocated buffer yet. */
xfeatures_mask_user_dynamic = 0;

+ for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ u64 feature_mask = BIT_ULL(i);
+
+ if (!(xfeatures_mask_uabi() & feature_mask))
+ continue;
+
+ if (xfeature_supports_xfd(i))
+ xfeatures_mask_user_dynamic |= feature_mask;
+ }
+
/* Enable xstate instructions to be able to continue with initialization: */
fpu__init_cpu_xstate();
err = init_xstate_size();
@@ -944,6 +979,11 @@ void fpu__resume_cpu(void)
wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
xfeatures_mask_independent());
}
+
+ if (cpu_feature_enabled(X86_FEATURE_XFD))
+ wrmsrl_safe(MSR_IA32_XFD, (current->thread.fpu.state_mask &
+ xfeatures_mask_user_dynamic) ^
+ xfeatures_mask_user_dynamic);
}

/**
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 33f5d8d07367..6cd4fb098f8f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -97,6 +97,16 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
*size = fpu_buf_cfg.min_size;
}

+void arch_release_task_struct(struct task_struct *task)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_FPU))
+ return;
+
+ /* Free up only the dynamically-allocated memory. */
+ if (task->thread.fpu.state != &task->thread.fpu.__default_state)
+ free_xstate_buffer(&task->thread.fpu);
+}
+
/*
* Free thread data structures etc..
*/
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..7bd5d08eeb41 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

this_cpu_write(current_task, next_p);

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(prev_fpu, next_fpu);

/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ec0d836a13b1..41c9855158d6 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -620,7 +620,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
this_cpu_write(current_task, next_p);
this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(prev_fpu, next_fpu);

/* Reload sp0. */
update_task_stack(next_p);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..8a608f0e107b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1108,10 +1108,60 @@ DEFINE_IDTENTRY(exc_spurious_interrupt_bug)
*/
}

+static __always_inline bool handle_xfd_event(struct fpu *fpu, struct pt_regs *regs)
+{
+ bool handled = false;
+ u64 xfd_err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_XFD))
+ return handled;
+
+ rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
+ wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
+
+ if (xfd_err) {
+ u64 xfd_event = xfd_err & xfeatures_mask_user_dynamic;
+ u64 value;
+
+ if (WARN_ON(!xfd_event)) {
+ /*
+ * Unexpected event is raised. But update XFD state to
+ * unblock the task.
+ */
+ rdmsrl_safe(MSR_IA32_XFD, &value);
+ wrmsrl_safe(MSR_IA32_XFD, value & ~xfd_err);
+ } else {
+ struct fpu *fpu = &current->thread.fpu;
+ int err = -1;
+
+ /*
+ * Make sure not in interrupt context as handling a
+ * trap from userspace.
+ */
+ if (!WARN_ON(in_interrupt())) {
+ err = realloc_xstate_buffer(fpu, xfd_event);
+ if (!err)
+ wrmsrl_safe(MSR_IA32_XFD, (fpu->state_mask &
+ xfeatures_mask_user_dynamic) ^
+ xfeatures_mask_user_dynamic);
+ }
+
+ /* Raise a signal when it failed to handle. */
+ if (err)
+ force_sig_fault(SIGILL, ILL_ILLOPC, error_get_trap_addr(regs));
+ }
+ handled = true;
+ }
+ return handled;
+}
+
DEFINE_IDTENTRY(exc_device_not_available)
{
unsigned long cr0 = read_cr0();

+ if (handle_xfd_event(&current->thread.fpu, regs))
+ return;
+
#ifdef CONFIG_MATH_EMULATION
if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
struct math_emu_info info = { };
--
2.17.1

2021-08-25 19:21:39

by Chang S. Bae

Subject: [PATCH v10 10/28] x86/fpu/xstate: Update the XSTATE save function to support dynamic states

Extend os_xsave() to receive a mask argument of which states to save, in
preparation for dynamic user state handling.

Update KVM to set a valid fpu->state_mask, so it can continue to share with
the core code.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Adjusted the changelog and code for the new base code.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Made the code change more reviewable.

Changes from v2:
* Updated the changelog to clarify the KVM code changes.
---
arch/x86/include/asm/fpu/internal.h | 3 +--
arch/x86/kernel/fpu/core.c | 2 +-
arch/x86/kernel/fpu/signal.c | 2 +-
arch/x86/kvm/x86.c | 9 +++++++--
4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index d2fc19c0e457..263e349ff85a 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -298,9 +298,8 @@ static inline void os_xrstor_booting(struct xregs_state *xstate)
* Uses either XSAVE or XSAVEOPT or XSAVES depending on the CPU features
* and command line options. The choice is permanent until the next reboot.
*/
-static inline void os_xsave(struct xregs_state *xstate)
+static inline void os_xsave(struct xregs_state *xstate, u64 mask)
{
- u64 mask = xfeatures_mask_all;
u32 lmask = mask;
u32 hmask = mask >> 32;
int err;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 2941d03912db..164e75c37dbb 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -99,7 +99,7 @@ EXPORT_SYMBOL(irq_fpu_usable);
void save_fpregs_to_fpstate(struct fpu *fpu)
{
if (likely(use_xsave())) {
- os_xsave(&fpu->state->xsave);
+ os_xsave(&fpu->state->xsave, fpu->state_mask);

/*
* AVX512 state is tracked here because its use is
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 8b333b1a4d07..fe2732db6d6b 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -365,7 +365,7 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
* the right place in memory. It's ia32 mode. Shrug.
*/
if (xfeatures_mask_supervisor())
- os_xsave(&fpu->state->xsave);
+ os_xsave(&fpu->state->xsave, fpu->state_mask);
set_thread_flag(TIF_NEED_FPU_LOAD);
}
__fpu_invalidate_fpregs_state(fpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 74dde635df40..7c46747f6865 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9899,11 +9899,16 @@ static void kvm_save_current_fpu(struct fpu *fpu)
* KVM does not support dynamic user states yet. Assume the buffer
* always has the minimum size.
*/
- if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
memcpy(fpu->state, current->thread.fpu.state,
fpu_buf_cfg.min_size);
- else
+ } else {
+ struct fpu *src_fpu = &current->thread.fpu;
+
+ if (fpu->state_mask != src_fpu->state_mask)
+ fpu->state_mask = src_fpu->state_mask;
save_fpregs_to_fpstate(fpu);
+ }
}

/* Swap (qemu) user FPU context for the guest FPU context. */
--
2.17.1

2021-08-25 19:22:15

by Chang S. Bae

Subject: [PATCH v10 08/28] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer

The XSTATE per-task buffer is embedded into struct fpu. The field 'state'
represents the buffer. When the dynamic user state is in use, the buffer
may be dynamically allocated.

Convert the 'state' field to point either to the embedded buffer or to the
dynamically-allocated buffer. Also, add a new field to represent the
embedded buffer.

The initial task sets it before dealing with soft FPU. Make sure that every
FPU state has a valid pointer value on its creation.

No functional change.

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v9:
* Update the code comment. (Borislav Petkov)

Changes from v5:
* Tightened up task size calculation (previously, it could over-calculate)
* Adjusted the changelog.

Changes from v4:
* Fixed KVM's user_fpu and guest_fpu to initialize the 'state' field correctly.
* Massaged the changelog.

Changes from v3:
* Added as a new patch to simplify the buffer access. (Borislav Petkov)
---
arch/x86/include/asm/fpu/internal.h | 2 +-
arch/x86/include/asm/fpu/types.h | 31 +++++++++++++++++++++-------
arch/x86/include/asm/trace/fpu.h | 4 ++--
arch/x86/kernel/fpu/core.c | 32 +++++++++++++++--------------
arch/x86/kernel/fpu/init.c | 8 +++++---
arch/x86/kernel/fpu/regset.c | 24 +++++++++++-----------
arch/x86/kernel/fpu/signal.c | 24 +++++++++++-----------
arch/x86/kernel/fpu/xstate.c | 8 ++++----
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/x86.c | 22 +++++++++++---------
arch/x86/math-emu/fpu_aux.c | 2 +-
arch/x86/math-emu/fpu_entry.c | 4 ++--
arch/x86/math-emu/fpu_system.h | 2 +-
13 files changed, 94 insertions(+), 71 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index c7a64e2806a9..d2fc19c0e457 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -484,7 +484,7 @@ static inline void fpregs_restore_userregs(void)
*/
mask = xfeatures_mask_restore_user() |
xfeatures_mask_supervisor();
- __restore_fpregs_from_fpstate(&fpu->state, mask);
+ __restore_fpregs_from_fpstate(fpu->state, mask);

fpregs_activate(fpu);
fpu->last_cpu = cpu;
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f5a38a5f3ae1..ad5cbf922e30 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -339,15 +339,32 @@ struct fpu {
/*
* @state:
*
- * In-memory copy of all FPU registers that we save/restore
- * over context switches. If the task is using the FPU then
- * the registers in the FPU are more recent than this state
- * copy. If the task context-switches away then they get
- * saved here and represent the FPU state.
+ * A pointer to the in-memory copy of all FPU registers
+ * that are saved/restored over context switches.
+ *
+ * Initially @state points to @__default_state. When dynamic states
+ * get used, memory is allocated for the larger state copy and
+ * @state is updated to point to it. Then, the state in ->state
+ * supersedes and invalidates the state in @__default_state.
+ *
+ * In general, if the task is using the FPU then the registers in
+ * the FPU are more recent than the state copy. If the task
+ * context-switches away then they get saved in ->state and
+ * represent the FPU state.
+ */
+ union fpregs_state *state;
+
+ /*
+ * @__default_state:
+ *
+ * Initial in-memory copy of all FPU registers that are saved/restored
+ * over context switches. When the task is switched to dynamic
+ * states, this copy is replaced with the new in-memory copy in
+ * ->state.
*/
- union fpregs_state state;
+ union fpregs_state __default_state;
/*
- * WARNING: 'state' is dynamically-sized. Do not put
+ * WARNING: '__default_state' is dynamically-sized. Do not put
* anything after it here.
*/
};
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 879b77792f94..ef82f4824ce7 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -22,8 +22,8 @@ DECLARE_EVENT_CLASS(x86_fpu,
__entry->fpu = fpu;
__entry->load_fpu = test_thread_flag(TIF_NEED_FPU_LOAD);
if (boot_cpu_has(X86_FEATURE_OSXSAVE)) {
- __entry->xfeatures = fpu->state.xsave.header.xfeatures;
- __entry->xcomp_bv = fpu->state.xsave.header.xcomp_bv;
+ __entry->xfeatures = fpu->state->xsave.header.xfeatures;
+ __entry->xcomp_bv = fpu->state->xsave.header.xcomp_bv;
}
),
TP_printk("x86/fpu: %p load: %d xfeatures: %llx xcomp_bv: %llx",
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 62cc993a890a..6b55b8c651f6 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -99,19 +99,19 @@ EXPORT_SYMBOL(irq_fpu_usable);
void save_fpregs_to_fpstate(struct fpu *fpu)
{
if (likely(use_xsave())) {
- os_xsave(&fpu->state.xsave);
+ os_xsave(&fpu->state->xsave);

/*
* AVX512 state is tracked here because its use is
* known to slow the max clock speed of the core.
*/
- if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+ if (fpu->state->xsave.header.xfeatures & XFEATURE_MASK_AVX512)
fpu->avx512_timestamp = jiffies;
return;
}

if (likely(use_fxsr())) {
- fxsave(&fpu->state.fxsave);
+ fxsave(&fpu->state->fxsave);
return;
}

@@ -119,8 +119,8 @@ void save_fpregs_to_fpstate(struct fpu *fpu)
* Legacy FPU register saving, FNSAVE always clears FPU registers,
* so we have to reload them from the memory state.
*/
- asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state.fsave));
- frstor(&fpu->state.fsave);
+ asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state->fsave));
+ frstor(&fpu->state->fsave);
}
EXPORT_SYMBOL(save_fpregs_to_fpstate);

@@ -235,7 +235,7 @@ void fpstate_init(struct fpu *fpu)
u64 mask;

if (likely(fpu)) {
- state = &fpu->state;
+ state = fpu->state;
/* The dynamic user states are not prepared yet. */
mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
size = fpu_buf_cfg.min_size;
@@ -274,6 +274,8 @@ int fpu_clone(struct task_struct *dst)
if (!cpu_feature_enabled(X86_FEATURE_FPU))
return 0;

+ dst_fpu->state = &dst_fpu->__default_state;
+
/*
* Don't let 'init optimized' areas of the XSAVE area
* leak into the child task:
@@ -281,7 +283,7 @@ int fpu_clone(struct task_struct *dst)
* The child does not inherit the dynamic states. So,
* the xstate buffer has the minimum size.
*/
- memset(&dst_fpu->state.xsave, 0, fpu_buf_cfg.min_size);
+ memset(&dst_fpu->state->xsave, 0, fpu_buf_cfg.min_size);

/*
* If the FPU registers are not owned by current just memcpy() the
@@ -290,7 +292,7 @@ int fpu_clone(struct task_struct *dst)
*/
fpregs_lock();
if (test_thread_flag(TIF_NEED_FPU_LOAD))
- memcpy(&dst_fpu->state, &src_fpu->state, fpu_buf_cfg.min_size);
+ memcpy(dst_fpu->state, src_fpu->state, fpu_buf_cfg.min_size);

else
save_fpregs_to_fpstate(dst_fpu);
@@ -377,7 +379,7 @@ static void fpu_reset_fpstate(void)
* user space as PKRU is eagerly written in switch_to() and
* flush_thread().
*/
- memcpy(&fpu->state, &init_fpstate, init_fpstate_copy_size());
+ memcpy(fpu->state, &init_fpstate, init_fpstate_copy_size());
set_thread_flag(TIF_NEED_FPU_LOAD);
fpregs_unlock();
}
@@ -404,7 +406,7 @@ void fpu__clear_user_states(struct fpu *fpu)
*/
if (xfeatures_mask_supervisor() &&
!fpregs_state_valid(fpu, smp_processor_id())) {
- os_xrstor(&fpu->state.xsave, xfeatures_mask_supervisor());
+ os_xrstor(&fpu->state->xsave, xfeatures_mask_supervisor());
}

/* Reset user states in registers. */
@@ -486,11 +488,11 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
* fully reproduce the context of the exception.
*/
if (boot_cpu_has(X86_FEATURE_FXSR)) {
- cwd = fpu->state.fxsave.cwd;
- swd = fpu->state.fxsave.swd;
+ cwd = fpu->state->fxsave.cwd;
+ swd = fpu->state->fxsave.swd;
} else {
- cwd = (unsigned short)fpu->state.fsave.cwd;
- swd = (unsigned short)fpu->state.fsave.swd;
+ cwd = (unsigned short)fpu->state->fsave.cwd;
+ swd = (unsigned short)fpu->state->fsave.swd;
}

err = swd & ~cwd;
@@ -504,7 +506,7 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
unsigned short mxcsr = MXCSR_DEFAULT;

if (boot_cpu_has(X86_FEATURE_XMM))
- mxcsr = fpu->state.fxsave.mxcsr;
+ mxcsr = fpu->state->fxsave.mxcsr;

err = ~(mxcsr >> 7) & mxcsr;
}
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index da7341f95008..cd1f3114f3ca 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -31,10 +31,12 @@ static void fpu__init_cpu_generic(void)
cr0 |= X86_CR0_EM;
write_cr0(cr0);

+ current->thread.fpu.state = &current->thread.fpu.__default_state;
+
/* Flush out any pending x87 state: */
#ifdef CONFIG_MATH_EMULATION
if (!boot_cpu_has(X86_FEATURE_FPU))
- fpstate_init_soft(&current->thread.fpu.state.soft);
+ fpstate_init_soft(&current->thread.fpu.state->soft);
else
#endif
asm volatile ("fninit");
@@ -153,7 +155,7 @@ static void __init fpu__init_task_struct_size(void)
* Subtract off the static size of the register state.
* It potentially has a bunch of padding.
*/
- task_size -= sizeof(((struct task_struct *)0)->thread.fpu.state);
+ task_size -= sizeof(((struct task_struct *)0)->thread.fpu.__default_state);

/*
* Add back the dynamically-calculated register state
@@ -170,7 +172,7 @@ static void __init fpu__init_task_struct_size(void)
* you hit a compile error here, check the structure to
* see if something got added to the end.
*/
- CHECK_MEMBER_AT_END_OF(struct fpu, state);
+ CHECK_MEMBER_AT_END_OF(struct fpu, __default_state);
CHECK_MEMBER_AT_END_OF(struct thread_struct, fpu);
CHECK_MEMBER_AT_END_OF(struct task_struct, thread);

diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 80ee64183c7d..7ea10f98c2b0 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -74,8 +74,8 @@ int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
sync_fpstate(fpu);

if (!use_xsave()) {
- return membuf_write(&to, &fpu->state.fxsave,
- sizeof(fpu->state.fxsave));
+ return membuf_write(&to, &fpu->state->fxsave,
+ sizeof(fpu->state->fxsave));
}

copy_xstate_to_uabi_buf(to, target, XSTATE_COPY_FX);
@@ -110,15 +110,15 @@ int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
fpu_force_restore(fpu);

/* Copy the state */
- memcpy(&fpu->state.fxsave, &newstate, sizeof(newstate));
+ memcpy(&fpu->state->fxsave, &newstate, sizeof(newstate));

/* Clear xmm8..15 */
- BUILD_BUG_ON(sizeof(fpu->state.fxsave.xmm_space) != 16 * 16);
- memset(&fpu->state.fxsave.xmm_space[8], 0, 8 * 16);
+ BUILD_BUG_ON(sizeof(fpu->state->fxsave.xmm_space) != 16 * 16);
+ memset(&fpu->state->fxsave.xmm_space[8], 0, 8 * 16);

/* Mark FP and SSE as in use when XSAVE is enabled */
if (use_xsave())
- fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+ fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;

return 0;
}
@@ -283,7 +283,7 @@ static void __convert_from_fxsr(struct user_i387_ia32_struct *env,
void
convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
{
- __convert_from_fxsr(env, tsk, &tsk->thread.fpu.state.fxsave);
+ __convert_from_fxsr(env, tsk, &tsk->thread.fpu.state->fxsave);
}

void convert_to_fxsr(struct fxregs_state *fxsave,
@@ -326,7 +326,7 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
return fpregs_soft_get(target, regset, to);

if (!cpu_feature_enabled(X86_FEATURE_FXSR)) {
- return membuf_write(&to, &fpu->state.fsave,
+ return membuf_write(&to, &fpu->state->fsave,
sizeof(struct fregs_state));
}

@@ -337,7 +337,7 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
copy_xstate_to_uabi_buf(mb, target, XSTATE_COPY_FP);
fx = &fxsave;
} else {
- fx = &fpu->state.fxsave;
+ fx = &fpu->state->fxsave;
}

__convert_from_fxsr(&env, target, fx);
@@ -366,16 +366,16 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
fpu_force_restore(fpu);

if (cpu_feature_enabled(X86_FEATURE_FXSR))
- convert_to_fxsr(&fpu->state.fxsave, &env);
+ convert_to_fxsr(&fpu->state->fxsave, &env);
else
- memcpy(&fpu->state.fsave, &env, sizeof(env));
+ memcpy(&fpu->state->fsave, &env, sizeof(env));

/*
* Update the header bit in the xsave header, indicating the
* presence of FP.
*/
if (cpu_feature_enabled(X86_FEATURE_XSAVE))
- fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FP;
+ fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FP;

return 0;
}
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index f5ec334c5a4e..8b333b1a4d07 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -67,13 +67,13 @@ static inline int check_xstate_in_sigframe(struct fxregs_state __user *fxbuf,
static inline int save_fsave_header(struct task_struct *tsk, void __user *buf)
{
if (use_fxsr()) {
- struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+ struct xregs_state *xsave = &tsk->thread.fpu.state->xsave;
struct user_i387_ia32_struct env;
struct _fpstate_32 __user *fp = buf;

fpregs_lock();
if (!test_thread_flag(TIF_NEED_FPU_LOAD))
- fxsave(&tsk->thread.fpu.state.fxsave);
+ fxsave(&tsk->thread.fpu.state->fxsave);
fpregs_unlock();

convert_from_fxsr(&env, tsk);
@@ -294,7 +294,7 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
* been restored from a user buffer directly.
*/
if (test_thread_flag(TIF_NEED_FPU_LOAD) && xfeatures_mask_supervisor())
- os_xrstor(&fpu->state.xsave, xfeatures_mask_supervisor());
+ os_xrstor(&fpu->state->xsave, xfeatures_mask_supervisor());

fpregs_mark_activate();
fpregs_unlock();
@@ -365,7 +365,7 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
* the right place in memory. It's ia32 mode. Shrug.
*/
if (xfeatures_mask_supervisor())
- os_xsave(&fpu->state.xsave);
+ os_xsave(&fpu->state->xsave);
set_thread_flag(TIF_NEED_FPU_LOAD);
}
__fpu_invalidate_fpregs_state(fpu);
@@ -377,21 +377,21 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
if (ret)
return ret;
} else {
- if (__copy_from_user(&fpu->state.fxsave, buf_fx,
- sizeof(fpu->state.fxsave)))
+ if (__copy_from_user(&fpu->state->fxsave, buf_fx,
+ sizeof(fpu->state->fxsave)))
return -EFAULT;

/* Reject invalid MXCSR values. */
- if (fpu->state.fxsave.mxcsr & ~mxcsr_feature_mask)
+ if (fpu->state->fxsave.mxcsr & ~mxcsr_feature_mask)
return -EINVAL;

/* Enforce XFEATURE_MASK_FPSSE when XSAVE is enabled */
if (use_xsave())
- fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+ fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
}

/* Fold the legacy FP storage */
- convert_to_fxsr(&fpu->state.fxsave, &env);
+ convert_to_fxsr(&fpu->state->fxsave, &env);

fpregs_lock();
if (use_xsave()) {
@@ -406,10 +406,10 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
*/
u64 mask = user_xfeatures | xfeatures_mask_supervisor();

- fpu->state.xsave.header.xfeatures &= mask;
- ret = os_xrstor_safe(&fpu->state.xsave, xfeatures_mask_all);
+ fpu->state->xsave.header.xfeatures &= mask;
+ ret = os_xrstor_safe(&fpu->state->xsave, xfeatures_mask_all);
} else {
- ret = fxrstor_safe(&fpu->state.fxsave);
+ ret = fxrstor_safe(&fpu->state->fxsave);
}

if (likely(!ret))
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 2e474fbdc241..4496750208a8 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -882,7 +882,7 @@ static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
}

if (fpu)
- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;
else
xsave = &init_fpstate.xsave;

@@ -925,7 +925,7 @@ void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
"get of unsupported state");

if (fpu)
- xsave = &fpu->state.xsave;
+ xsave = &fpu->state->xsave;
else
xsave = &init_fpstate.xsave;

@@ -1017,7 +1017,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
enum xstate_copy_mode copy_mode)
{
const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
- struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+ struct xregs_state *xsave = &tsk->thread.fpu.state->xsave;
struct xregs_state *xinit = &init_fpstate.xsave;
struct xstate_header header;
unsigned int zerofrom;
@@ -1134,7 +1134,7 @@ static int copy_from_buffer(void *dst, unsigned int offset, unsigned int size,
static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
const void __user *ubuf)
{
- struct xregs_state *xsave = &fpu->state.xsave;
+ struct xregs_state *xsave = &fpu->state->xsave;
unsigned int offset, size;
struct xstate_header hdr;
u64 mask;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index d1ca963cb8f7..33f5d8d07367 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -92,7 +92,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)

void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
{
- *offset = offsetof(struct thread_struct, fpu.state);
+ *offset = offsetof(struct thread_struct, fpu.__default_state);
/* The buffer embedded in thread_struct has the minimum size. */
*size = fpu_buf_cfg.min_size;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04e38196da79..74dde635df40 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4694,7 +4694,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,

static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
{
- struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
+ struct xregs_state *xsave = &vcpu->arch.guest_fpu->state->xsave;
u64 xstate_bv = xsave->header.xfeatures;
u64 valid;

@@ -4737,7 +4737,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)

static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
{
- struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
+ struct xregs_state *xsave = &vcpu->arch.guest_fpu->state->xsave;
u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
u64 valid;

@@ -4790,7 +4790,7 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
fill_xsave((u8 *) guest_xsave->region, vcpu);
} else {
memcpy(guest_xsave->region,
- &vcpu->arch.guest_fpu->state.fxsave,
+ &vcpu->arch.guest_fpu->state->fxsave,
sizeof(struct fxregs_state));
*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
XFEATURE_MASK_FPSSE;
@@ -4824,7 +4824,7 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
mxcsr & ~mxcsr_feature_mask)
return -EINVAL;
- memcpy(&vcpu->arch.guest_fpu->state.fxsave,
+ memcpy(&vcpu->arch.guest_fpu->state->fxsave,
guest_xsave->region, sizeof(struct fxregs_state));
}
return 0;
@@ -9900,7 +9900,7 @@ static void kvm_save_current_fpu(struct fpu *fpu)
* always has the minimum size.
*/
if (test_thread_flag(TIF_NEED_FPU_LOAD))
- memcpy(&fpu->state, &current->thread.fpu.state,
+ memcpy(fpu->state, current->thread.fpu.state,
fpu_buf_cfg.min_size);
else
save_fpregs_to_fpstate(fpu);
@@ -9919,7 +9919,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
*/
if (vcpu->arch.guest_fpu)
/* PKRU is separately restored in kvm_x86_ops.run. */
- __restore_fpregs_from_fpstate(&vcpu->arch.guest_fpu->state,
+ __restore_fpregs_from_fpstate(vcpu->arch.guest_fpu->state,
~XFEATURE_MASK_PKRU);

fpregs_mark_activate();
@@ -9940,7 +9940,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_fpu)
kvm_save_current_fpu(vcpu->arch.guest_fpu);

- restore_fpregs_from_fpstate(&vcpu->arch.user_fpu->state);
+ restore_fpregs_from_fpstate(vcpu->arch.user_fpu->state);

fpregs_mark_activate();
fpregs_unlock();
@@ -10529,7 +10529,7 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)

vcpu_load(vcpu);

- fxsave = &vcpu->arch.guest_fpu->state.fxsave;
+ fxsave = &vcpu->arch.guest_fpu->state->fxsave;
memcpy(fpu->fpr, fxsave->st_space, 128);
fpu->fcw = fxsave->cwd;
fpu->fsw = fxsave->swd;
@@ -10552,7 +10552,7 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)

vcpu_load(vcpu);

- fxsave = &vcpu->arch.guest_fpu->state.fxsave;
+ fxsave = &vcpu->arch.guest_fpu->state->fxsave;

memcpy(fxsave->st_space, fpu->fpr, 128);
fxsave->cwd = fpu->fcw;
@@ -10613,7 +10613,7 @@ static void fx_init(struct kvm_vcpu *vcpu)

fpstate_init(vcpu->arch.guest_fpu);
if (boot_cpu_has(X86_FEATURE_XSAVES))
- vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
+ vcpu->arch.guest_fpu->state->xsave.header.xcomp_bv =
host_xcr0 | XSTATE_COMPACTION_ENABLED;

/*
@@ -10693,6 +10693,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
pr_err("kvm: failed to allocate userspace's fpu\n");
goto free_emulate_ctxt;
}
+ vcpu->arch.user_fpu->state = &vcpu->arch.user_fpu->__default_state;

vcpu->arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache,
GFP_KERNEL_ACCOUNT);
@@ -10700,6 +10701,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
pr_err("kvm: failed to allocate vcpu's fpu\n");
goto free_user_fpu;
}
+ vcpu->arch.guest_fpu->state = &vcpu->arch.guest_fpu->__default_state;
fx_init(vcpu);

vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
diff --git a/arch/x86/math-emu/fpu_aux.c b/arch/x86/math-emu/fpu_aux.c
index 034748459482..51432a73024c 100644
--- a/arch/x86/math-emu/fpu_aux.c
+++ b/arch/x86/math-emu/fpu_aux.c
@@ -53,7 +53,7 @@ void fpstate_init_soft(struct swregs_state *soft)

void finit(void)
{
- fpstate_init_soft(&current->thread.fpu.state.soft);
+ fpstate_init_soft(&current->thread.fpu.state->soft);
}

/*
diff --git a/arch/x86/math-emu/fpu_entry.c b/arch/x86/math-emu/fpu_entry.c
index 8679a9d6c47f..6ba56632170e 100644
--- a/arch/x86/math-emu/fpu_entry.c
+++ b/arch/x86/math-emu/fpu_entry.c
@@ -640,7 +640,7 @@ int fpregs_soft_set(struct task_struct *target,
unsigned int pos, unsigned int count,
const void *kbuf, const void __user *ubuf)
{
- struct swregs_state *s387 = &target->thread.fpu.state.soft;
+ struct swregs_state *s387 = &target->thread.fpu.state->soft;
void *space = s387->st_space;
int ret;
int offset, other, i, tags, regnr, tag, newtop;
@@ -691,7 +691,7 @@ int fpregs_soft_get(struct task_struct *target,
const struct user_regset *regset,
struct membuf to)
{
- struct swregs_state *s387 = &target->thread.fpu.state.soft;
+ struct swregs_state *s387 = &target->thread.fpu.state->soft;
const void *space = s387->st_space;
int offset = (S387->ftop & 7) * 10, other = 80 - offset;

diff --git a/arch/x86/math-emu/fpu_system.h b/arch/x86/math-emu/fpu_system.h
index 9b41391867dc..a6291ddfdda6 100644
--- a/arch/x86/math-emu/fpu_system.h
+++ b/arch/x86/math-emu/fpu_system.h
@@ -73,7 +73,7 @@ static inline bool seg_writable(struct desc_struct *d)
return (d->type & SEG_TYPE_EXECUTE_MASK) == SEG_TYPE_WRITABLE;
}

-#define I387 (&current->thread.fpu.state)
+#define I387 (current->thread.fpu.state)
#define FPU_info (I387->soft.info)

#define FPU_CS (*(unsigned short *) &(FPU_info->regs->cs))
--
2.17.1

2021-08-25 19:23:19

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v10 19/28] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits

Intel's Advanced Matrix Extensions (AMX) is a new 64-bit extended feature
consisting of two-dimensional registers and an accelerator unit. The first
implementation of the latter is the tile matrix multiply unit (TMUL). TMUL
performs SIMD dot-products on four bytes (INT8) or two bfloat16
floating-point (BF16) elements.

Enumerate these hardware capabilities so they show up as 'amx_tile',
'amx_bf16', and 'amx_int8' in /proc/cpuinfo.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Massaged the changelog a bit.
---
arch/x86/include/asm/cpufeatures.h | 3 +++
arch/x86/kernel/cpu/cpuid-deps.c | 3 +++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ab7b3a2de85d..dc0fb04cce69 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -386,7 +386,10 @@
#define X86_FEATURE_TSXLDTRK (18*32+16) /* TSX Suspend Load Address Tracking */
#define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
#define X86_FEATURE_ARCH_LBR (18*32+19) /* Intel ARCH LBR */
+#define X86_FEATURE_AMX_BF16 (18*32+22) /* AMX BF16 Support */
#define X86_FEATURE_AVX512_FP16 (18*32+23) /* AVX512 FP16 */
+#define X86_FEATURE_AMX_TILE (18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8 (18*32+25) /* AMX INT8 Support */
#define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
#define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 7f891d2eb52e..9a520ab259ac 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -76,6 +76,9 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_SGX1, X86_FEATURE_SGX },
{ X86_FEATURE_SGX2, X86_FEATURE_SGX1 },
{ X86_FEATURE_XFD, X86_FEATURE_XSAVE },
+ { X86_FEATURE_AMX_TILE, X86_FEATURE_XSAVE },
+ { X86_FEATURE_AMX_INT8, X86_FEATURE_AMX_TILE },
+ { X86_FEATURE_AMX_BF16, X86_FEATURE_AMX_TILE },
{}
};

--
2.17.1

2021-08-25 19:23:24

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v10 25/28] x86/insn/amx: Add TILERELEASE instruction to the opcode map

Add the opcode for TILERELEASE, which returns all AMX state to INIT-state,
to the x86 opcode map.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v4:
* Added as a new patch as preparatory to use the instruction in the kernel.
---
arch/x86/lib/x86-opcode-map.txt | 8 +++++++-
tools/arch/x86/lib/x86-opcode-map.txt | 8 +++++++-
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index ec31f5b60323..dbc5078ccafe 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -690,7 +690,9 @@ AVXcode: 2
45: vpsrlvd/q Vx,Hx,Wx (66),(v)
46: vpsravd Vx,Hx,Wx (66),(v) | vpsravd/q Vx,Hx,Wx (66),(evo)
47: vpsllvd/q Vx,Hx,Wx (66),(v)
-# Skip 0x48-0x4b
+# Skip 0x48
+49: Grp22 (1A)
+# Skip 0x4a-0x4b
4c: vrcp14ps/d Vpd,Wpd (66),(ev)
4d: vrcp14ss/d Vsd,Hpd,Wsd (66),(ev)
4e: vrsqrt14ps/d Vpd,Wpd (66),(ev)
@@ -1082,6 +1084,10 @@ GrpTable: Grp21
7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B)
EndTable

+GrpTable: Grp22
+0: TILERELEASE (!F3),(v1),(11B)
+EndTable
+
# AMD's Prefetch Group
GrpTable: GrpP
0: PREFETCH
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index ec31f5b60323..dbc5078ccafe 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -690,7 +690,9 @@ AVXcode: 2
45: vpsrlvd/q Vx,Hx,Wx (66),(v)
46: vpsravd Vx,Hx,Wx (66),(v) | vpsravd/q Vx,Hx,Wx (66),(evo)
47: vpsllvd/q Vx,Hx,Wx (66),(v)
-# Skip 0x48-0x4b
+# Skip 0x48
+49: Grp22 (1A)
+# Skip 0x4a-0x4b
4c: vrcp14ps/d Vpd,Wpd (66),(ev)
4d: vrcp14ss/d Vsd,Hpd,Wsd (66),(ev)
4e: vrsqrt14ps/d Vpd,Wpd (66),(ev)
@@ -1082,6 +1084,10 @@ GrpTable: Grp21
7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B)
EndTable

+GrpTable: Grp22
+0: TILERELEASE (!F3),(v1),(11B)
+EndTable
+
# AMD's Prefetch Group
GrpTable: GrpP
0: PREFETCH
--
2.17.1

2021-08-25 19:24:59

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v10 23/28] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state

By default, for XSTATE features in the INIT-state, XSAVE writes zeros to
the uncompacted destination buffer.

E.g., if you are not using AVX-512, you will still get a bunch of zeros on
the signal stack where live AVX-512 data would go.

For 'dynamic user state' (currently only XTILEDATA), explicitly skip this
data transfer. The result is that the user buffer for the AMX region will
not be touched by XSAVE.

[ Reading XINUSE takes about 20-30 cycles, but writing zeros costs roughly
five times as many cycles or more, e.g., for XTILEDATA. ]

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v9:
* Use cpu_feature_enabled() instead of boot_cpu_has(). (Borislav Petkov)

Changes from v5:
* Mentioned the optimization trade-offs in the changelog. (Dave Hansen)
* Added code comment.

Changes from v4:
* Added as a new patch.
---
arch/x86/include/asm/fpu/internal.h | 38 +++++++++++++++++++++--------
1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a7e39862df30..0225ab63e5d1 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -337,8 +337,9 @@ static inline void os_xrstor(struct xregs_state *xstate, u64 mask)
*/
static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
{
+ struct fpu *fpu = &current->thread.fpu;
u32 lmask, hmask;
- u64 mask;
+ u64 state_mask;
int err;

/*
@@ -346,21 +347,38 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
* internally, e.g. PKRU. That's user space ABI and also required
* to allow the signal handler to modify PKRU.
*/
- mask = xfeatures_mask_uabi();
+ state_mask = xfeatures_mask_uabi();
+
+ if (!xfeatures_mask_user_dynamic)
+ goto mask_ready;

/*
* Exclude dynamic user states for non-opt-in threads.
*/
- if (xfeatures_mask_user_dynamic) {
- struct fpu *fpu = &current->thread.fpu;
-
- mask &= fpu->dynamic_state_perm ?
- fpu->state_mask :
- ~xfeatures_mask_user_dynamic;
+ if (!fpu->dynamic_state_perm) {
+ state_mask &= ~xfeatures_mask_user_dynamic;
+ } else {
+ u64 dynamic_state_mask;
+
+ state_mask &= fpu->state_mask;
+
+ dynamic_state_mask = state_mask & xfeatures_mask_user_dynamic;
+ if (dynamic_state_mask && cpu_feature_enabled(X86_FEATURE_XGETBV1)) {
+ u64 dynamic_xinuse, dynamic_init;
+ u64 xinuse = xgetbv(1);
+
+ dynamic_xinuse = xinuse & dynamic_state_mask;
+ dynamic_init = ~xinuse & dynamic_state_mask;
+ if (dynamic_init) {
+ state_mask &= ~xfeatures_mask_user_dynamic;
+ state_mask |= dynamic_xinuse;
+ }
+ }
}

- lmask = mask;
- hmask = mask >> 32;
+mask_ready:
+ lmask = state_mask;
+ hmask = state_mask >> 32;

/*
* Clear the xsave header first, so that reserved fields are
--
2.17.1

2021-08-25 19:59:34

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v10 17/28] x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in state component numbers

At compile-time xfeatures_mask_all includes all possible XCR0 features. At
run-time fpu__init_system_xstate() clears features in xfeatures_mask_all
that are not enabled in CPUID. It does this by looping through all possible
XCR0 features.

Update the code to handle the possibility that there will be gaps in the
XCR0 feature bit numbers.

No functional change.

Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v5:
* Folded a few lines.

Changes from v4:
* Simplified the implementation. (Thomas Gleixner)
* Updated the patch title accordingly.

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
arch/x86/kernel/fpu/xstate.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index bb31ef8a45b5..28e4f3254487 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -43,18 +43,17 @@ static const char *xfeature_names[] =
"unknown xstate feature" ,
};

-static short xsave_cpuid_features[] __initdata = {
- X86_FEATURE_FPU,
- X86_FEATURE_XMM,
- X86_FEATURE_AVX,
- X86_FEATURE_MPX,
- X86_FEATURE_MPX,
- X86_FEATURE_AVX512F,
- X86_FEATURE_AVX512F,
- X86_FEATURE_AVX512F,
- X86_FEATURE_INTEL_PT,
- X86_FEATURE_PKU,
- X86_FEATURE_ENQCMD,
+static unsigned short xsave_cpuid_features[] __initdata = {
+ [XFEATURE_SSE] = X86_FEATURE_XMM,
+ [XFEATURE_YMM] = X86_FEATURE_AVX,
+ [XFEATURE_BNDREGS] = X86_FEATURE_MPX,
+ [XFEATURE_BNDCSR] = X86_FEATURE_MPX,
+ [XFEATURE_OPMASK] = X86_FEATURE_AVX512F,
+ [XFEATURE_ZMM_Hi256] = X86_FEATURE_AVX512F,
+ [XFEATURE_Hi16_ZMM] = X86_FEATURE_AVX512F,
+ [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
+ [XFEATURE_PKRU] = X86_FEATURE_PKU,
+ [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
};

/*
@@ -908,7 +907,8 @@ void __init fpu__init_system_xstate(void)
* Clear XSAVE features that are disabled in the normal CPUID.
*/
for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
- if (!boot_cpu_has(xsave_cpuid_features[i]))
+ if (((i == 0) || xsave_cpuid_features[i]) &&
+ !boot_cpu_has(xsave_cpuid_features[i]))
xfeatures_mask_all &= ~BIT_ULL(i);
}

--
2.17.1

2021-08-25 19:59:34

by Chang S. Bae

[permalink] [raw]
Subject: [PATCH v10 26/28] intel_idle/amx: Add SPR support with XTILEDATA capability

Add a custom Sapphire Rapids (SPR) C-state table to the intel_idle driver.
The parameters in this table are preferred over those supplied by ACPI.

SPR supports AMX, and so this custom table uses idle entry points that know
how to initialize AMX TMM state, if necessary.

This guarantees that AMX TMM state will never be the cause of hardware
C-state demotion from C6 to C1E. Under some conditions this may result in
improved power savings, and thus higher available turbo frequency budget.

[ Based on patch by Artem Bityutskiy <[email protected]>. ]

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes from v9:
* Add a comment to use tile_release() after preempt_disable(). (Dave
Hansen)
* Use cpu_feature_enabled() instead of boot_cpu_has(). (Borislav Petkov)
* Add a Suggested-by tag.

Changes from v6:
* Update the changelog and function description. (Rafael J. Wysocki)

Changes from v5:
* Moved the code to intel_idle. (Peter Zijlstra)
* Fixed to deactivate fpregs. (Andy Lutomirski and Dave Hansen)
* Updated the code comment. (Dave Hansen)

Changes from v4:
* Added as a new patch. (Thomas Gleixner)
---
arch/x86/include/asm/special_insns.h | 6 ++
drivers/idle/intel_idle.c | 82 ++++++++++++++++++++++++++++
2 files changed, 88 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index f3fbb84ff8a7..fada1bb82c7b 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -294,6 +294,12 @@ static inline int enqcmds(void __iomem *dst, const void *src)
return 0;
}

+static inline void tile_release(void)
+{
+ /* Instruction opcode for TILERELEASE; supported in binutils >= 2.36. */
+ asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
+}
+
#endif /* __KERNEL__ */

#endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index e6c543b5ee1d..72b72fa0e072 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -54,6 +54,8 @@
#include <asm/intel-family.h>
#include <asm/mwait.h>
#include <asm/msr.h>
+#include <asm/fpu/internal.h>
+#include <asm/special_insns.h>

#define INTEL_IDLE_VERSION "0.5.1"

@@ -155,6 +157,58 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
return 0;
}

+/**
+ * idle_tile - Initialize TILE registers in INIT-state
+ *
+ * Leaving state in the dirty TILE registers may prevent the processor from
+ * entering lower-power idle states. Use TILERELEASE to initialize the
+ * state. Destroying fpregs state is safe after the fpstate update.
+ *
+ * WARNING: It should be called after preemption is disabled; otherwise,
+ * reschedule is possible with the destroyed state.
+ */
+static inline void idle_tile(void)
+{
+ if (cpu_feature_enabled(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
+ tile_release();
+ fpregs_deactivate(&current->thread.fpu);
+ }
+}
+
+/**
+ * intel_idle_tile - Ask the processor to enter the given idle state.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Ensure TILE registers in INIT-state before using intel_idle() to
+ * enter the idle state.
+ */
+static __cpuidle int intel_idle_tile(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
+{
+ idle_tile();
+
+ return intel_idle(dev, drv, index);
+}
+
+/**
+ * intel_idle_s2idle_tile - Ask the processor to enter the given idle state.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Ensure TILE registers in INIT-state before using intel_idle_s2idle() to
+ * enter the idle state.
+ */
+static __cpuidle int intel_idle_s2idle_tile(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
+{
+ idle_tile();
+
+ return intel_idle_s2idle(dev, drv, index);
+}
+
/*
* States are indexed by the cstate number,
* which is also the index into the MWAIT hint array.
@@ -752,6 +806,27 @@ static struct cpuidle_state icx_cstates[] __initdata = {
.enter = NULL }
};

+static struct cpuidle_state spr_cstates[] __initdata = {
+ {
+ .name = "C1",
+ .desc = "MWAIT 0x00",
+ .flags = MWAIT2flg(0x00),
+ .exit_latency = 1,
+ .target_residency = 1,
+ .enter = &intel_idle,
+ .enter_s2idle = intel_idle_s2idle, },
+ {
+ .name = "C6",
+ .desc = "MWAIT 0x20",
+ .flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
+ .exit_latency = 128,
+ .target_residency = 384,
+ .enter = &intel_idle_tile,
+ .enter_s2idle = intel_idle_s2idle_tile, },
+ {
+ .enter = NULL }
+};
+
static struct cpuidle_state atom_cstates[] __initdata = {
{
.name = "C1E",
@@ -1095,6 +1170,12 @@ static const struct idle_cpu idle_cpu_icx __initconst = {
.use_acpi = true,
};

+static const struct idle_cpu idle_cpu_spr __initconst = {
+ .state_table = spr_cstates,
+ .disable_promotion_to_c1e = true,
+ .use_acpi = true,
+};
+
static const struct idle_cpu idle_cpu_avn __initconst = {
.state_table = avn_cstates,
.disable_promotion_to_c1e = true,
@@ -1157,6 +1238,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &idle_cpu_skx),
X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, &idle_cpu_icx),
X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, &idle_cpu_icx),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, &idle_cpu_spr),
X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL, &idle_cpu_knl),
X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM, &idle_cpu_knl),
X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT, &idle_cpu_bxt),
--
2.17.1

2021-09-30 21:40:13

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH v10 00/28] x86: Support Intel Advanced Matrix Extensions

Sep 17 Minutes (tglx, daveh, chang, rafael, lenb)

As a result of a verbal discussion with Thomas, we propose the
following five updates to the v10 AMX patch series. Below this list
of updates are notes from additional discussion that did not result in
changes.

Update #1: Expand XSAVE no-write optimization to cover XSTATE feature17 and later
------------------------------------------------------------------------------------------------------------------

Linux/AMX patchset v10 checks for XINUSE.AMX=0 just before XSAVE to
the signal stack. If that INIT condition is detected, Linux sets
RFBM.AMX=0 before XSAVE. The result is that instead of writing 8K of
zeros for AMX, XSAVE writes nothing. Originally, this was intended as
a performance optimization. However, if this optimization applies to
the last items on the stack, then it also reduces how much stack is
required.

Update this optimization to apply to feature17 and later, rather than
just XFD-enabled features (like feature 18, AMX.data). We do this
starting with feature17 (AMX.TILECONFIG) not because it is large, but
because it is new. Also it sets the precedent for an inevitable
"feature19" that comes after AMX, but may not necessarily be protected
by XFD.

[ note that a future performance enhancement might be to apply this
check to AVX-512. Doing so would save data transfer (of zeros) when
AVX512 is INIT, but due to features after AVX512 and before feature17
it would not necessarily shrink the signal stack use, which must be
assumed to include up through feature16 ]
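
For reference, a simplified sketch of the RFBM adjustment, modeled on the
xsave_to_user_sigframe() change in PATCH 23 of this series (permission
checks and the fpu->state_mask handling are omitted here):

	/*
	 * Drop dynamic user states from the requested-feature bitmap when
	 * XGETBV(1) reports them as INIT, so XSAVE skips writing their
	 * (all-zero) contents to the signal frame.
	 */
	u64 rfbm = xfeatures_mask_uabi();
	u64 dynamic = rfbm & xfeatures_mask_user_dynamic;

	if (dynamic && cpu_feature_enabled(X86_FEATURE_XGETBV1)) {
		u64 xinuse = xgetbv(1);

		if (~xinuse & dynamic) {
			/* Keep only the dynamic states actually in use. */
			rfbm &= ~xfeatures_mask_user_dynamic;
			rfbm |= xinuse & dynamic;
		}
	}

As posted, this covers only the XFD-protected states; the update described
above extends the same idea to feature17 and later.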

Update #2: Enhance run-time stack check upon signal delivery.
------------------------------------------------------------------------------------

Linux 5.14 added a new check just before writing to the signal stack.
If sigaltstack is being used, Linux checks whether the XSTATE will fit. If
not, it delivers SIGSEGV rather than corrupting the user stack. [ It was an
oversight that this check was not present before AVX-512 was deployed,
and the result was that there were situations where too-small
sigaltstacks were corrupted. ]

This new check currently uses the total XSTATE size for its comparison.
However, it doesn’t account for the possibility that feature17 (AMX
TILECONFIG) and later may be in INIT state and never actually written.
And so the check is overly pessimistic, and could kill some legacy
apps that do not need to be killed.

Enhance this check to allow programs that have not requested AMX (and
later) to continue running when their actual stack use fits within
their allocated sigaltstack, even if the addition of feature17 and later
state would not fit.

Update #3: New check upon request to use AMX:
-----------------------------------------------------------------

For an application that registers a too-small sigaltstack, and later
requests access to AMX, the request to access AMX should be denied.

This sequence could occur if a legacy application contains a signal
handler and a hard-coded sigaltstack size, but it links with a new
library that requests use of AMX. The result should be the same as if
the application and library were to run on a system that does not
support AMX.

[And since this application is prohibited from touching AMX, that
guarantees its signal stack will not grow]

Update #4: New check upon sigaltstack(2)
--------------------------------------------------------

For an application that is granted access to AMX, and then later
registers a too-small sigaltstack, sigaltstack(2) should return an
error and have no effect.

This sequence is the reverse of the above, and is more problematic.
If the app looks at the return code of sigaltstack, it will likely
exit. If the app ignores the return code from sigaltstack, then it
will proceed to invoke sigaction(SA_ONSTACK), which will return
failure, because there was no preceding successful sigaltstack(2).
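
Taken together, Updates #3 and #4 amount to two symmetric checks. A rough
sketch with illustrative helper names and error codes (the real checks
would live in the arch_prctl handler and in sigaltstack() validation):

	/* Update #3: at permission-request time */
	if (current->sas_ss_size &&
	    current->sas_ss_size < sigframe_size_with(requested_features))
		return -EINVAL;	/* behave as if AMX were not supported */

	/* Update #4: at sigaltstack(2) time */
	if (dynamic_state_permitted(current) &&
	    ss_size < sigframe_size_with(current_features))
		return -ENOMEM;	/* altstack could not hold the signal frame */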

Update #5: Enhance prctl to support synchronous ctx switch buffer allocation.
------------------------------------------------------------------------------------------------------

Linux/AMX v10 supports per-task transparent allocation for 8KB AMX
context switch buffers using XFD hardware support. The benefit of
this approach is that buffers are allocated only for threads that
actually need them. However, in OOM conditions, an allocation failure
results in a signal, which is an unfriendly way to fail.

Enhance the existing prctl so that a user can not only ask for
permission for all tasks in the process to access AMX, but also ask the
kernel to immediately allocate context switch buffers for all current
and future tasks in that process.

If the kernel has an allocation failure in this call, no buffers are
allocated, no permission is granted, and the system call returns an
error.

Note that use of this optional flag may provoke unnecessary allocation
of buffers, perhaps exacerbating the risk of OOM in the first place.

Note that use of this flag does not add any protection from OOM issues
on subsequent thread creation. With this flag, an allocation failure at
clone time makes the thread clone call itself fail. Without this flag, if
the clone worked, a
signal would be delivered if and when the clone actually touched AMX
and the kernel was unable to allocate a context switch buffer.

I.e., this flag will not necessarily enhance survivability in the face
of run-time OOM issues. All it does is deliver an initial OOM failure
synchronously. Also, use of this flag could provoke future OOM errors
due to unnecessary hogging of per-task kernel memory.

Currently the system call will allocate buffers for all current and
future threads in a process. This notion is cleared on EXEC (just as
AMX access permission is cleared on EXEC).

There was some discussion about not allocating for threads that exit
before the call, or only those cloned from a task with a buffer, but
we decided brute force allocating for all tasks was simplest. We
fully expect most apps to use dynamic-allocation on 1st use, rather
than this flag.
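
A usage sketch from the application side; the command and flag names below
are purely illustrative, not the v10 ABI:

	#include <sys/syscall.h>
	#include <unistd.h>

	/* Illustrative values only -- not actual ABI constants. */
	#define ARCH_SET_STATE_ENABLE	0x1021
	#define ARCH_STATE_ALLOC_NOW	(1UL << 63)
	#define XFEATURE_MASK_XTILEDATA	(1UL << 18)

	static int request_amx_with_preallocation(void)
	{
		/*
		 * Ask for AMX permission and, with the proposed flag, for
		 * immediate allocation of context switch buffers in every
		 * task of the process. On failure nothing is allocated or
		 * granted, and the caller can fall back to a non-AMX path.
		 */
		return syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE,
			       XFEATURE_MASK_XTILEDATA | ARCH_STATE_ALLOC_NOW);
	}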

Requirement: AMX-induced XSTATE growth must not cause run-time app failures
-------------------------------------------------------------------------------------------------------------

The Linux signal ABI uses uncompacted XSTATE format. The same ABI is
in effect, whether the application is using the (large) default stack,
or if it specifies a (user-allocated) sigaltstack(2). AMX.TMM is
defined as 8KB of data in the XSTATE format, and thus there is a real
risk that users who allocate a sigaltstack() will not make it large
enough to hold XSTATE. Note that since uncompacted XSTATE includes
all previous features, this is an issue even on hardware that doesn't
support AMX, but supports a feature after AMX "feature19". Further,
since the uncompacted XSTATE format is dictated by CPUID, its unprotected
use is a risk to programs even though they may not use AMX or
subsequent features. E.g., XSTATE feature19 will appear 8KB after
feature17, even on a CPU that doesn’t support feature18 (AMX.TMM).

Legacy glibc signal.h hard-coded 2K and 8K stack sizes for
applications to use for sigaltstack(). AVX-512 exceeded the 2K limit,
and AMX shall exceed the 8K limit.

Glibc 2.34 shipped on 8/1/2021. It changes the hard-coded stack limits
to values calculated at run time to reflect the current hardware. Apps
that use the ABI and re-compile with the
updated glibc should have no problems, whether they use AMX or not.

However, this does not benefit legacy binaries that were compiled with
an old glibc, or those which ignore signal.h and allocate a signal
stack using a non-ABI size. Those apps run on past hardware with a
small sigaltstack(), and they must be able to run on AMX-capable
hardware.

This patch series includes a new prctl(), which must be invoked for a
task in a process to access AMX and later features.

Legacy apps with legacy libraries never call the prctl; Linux uses XFD
to enforce that AMX.data stays in INIT, Linux doesn't write zeros for
AMX to the signal stack, and so those apps will not experience any
signal stack growth.

Legacy apps linked with a new library may indirectly invoke the
prctl(AMX). If the binary had registered a too-small sigaltstack, the
prctl() will deny access to AMX. If the binary registers a too-small
sigaltstack after a successful prctl(), the sigaltstack() (or subsequent
sigaction()) will return failure.

New sigaltstack apps built with the new glibc properly using the ABI
will allocate a stack large enough for using AMX and later features.
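
For apps that bypass the signal.h constants and size the stack themselves,
the direct route is to ask the kernel via the auxiliary vector (assuming a
kernel and libc that expose AT_MINSIGSTKSZ):

	#include <signal.h>
	#include <stdlib.h>
	#include <sys/auxv.h>

	static int setup_altstack(void)
	{
		/*
		 * AT_MINSIGSTKSZ reflects the signal frame size of the
		 * running hardware, so future XSTATE growth is covered.
		 */
		size_t minsz = getauxval(AT_MINSIGSTKSZ);
		stack_t ss = { 0 };

		if (minsz < SIGSTKSZ)
			minsz = SIGSTKSZ;

		ss.ss_size = minsz + 4096;	/* slack for the handler itself */
		ss.ss_sp = malloc(ss.ss_size);
		if (!ss.ss_sp)
			return -1;

		return sigaltstack(&ss, NULL);
	}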

Requirement: kernel ability to control which apps access AMX
----------------------------------------------------------------------------------

As servers get larger, customers are consolidating more functions onto
fewer systems. The risk of “noisy neighbor” applications interfering
with each other is growing. This has been seen to be an acute problem
where AVX-512 applications have a greater frequency impact than non-
AVX-512 applications. This impact has been optimized in SPR and newer
hardware, as compared to previous generations, so it should be a lot
less noticeable. However, like AVX-512 instructions, AMX instructions
can also switch many transistors, and will thus impact maximum
frequency more than simpler instructions. This impact will be
core-wide, and thus will be seen by threads running on an HT sibling.

In response, Linux requires the ability to exercise “fine grained
control” over which apps can use AMX. The v10 AMX patch set includes
a “permission” system call for this purpose. It would be mandated to
be called by all apps that want to access AMX.

Per-Task XCR0 Proposal
----------------------------------
The benefit of a per-task XCR0 would be that application probing for
AMX could do the traditional CPUID+XGETBV(XCR0) for AMX, just like
they did for previous features.
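
For reference, that traditional probe looks roughly like this in user space
(AMX-TILE is CPUID.(EAX=7,ECX=0):EDX[24], and the tile state bits are
XCR0[18:17]; compile with -mxsave for _xgetbv()):

	#include <stdbool.h>
	#include <cpuid.h>
	#include <immintrin.h>

	#define XFEATURE_MASK_XTILE	(3ULL << 17)	/* XTILECFG | XTILEDATA */

	static bool amx_supported(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* OSXSAVE must be set before XGETBV may be executed. */
		if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
			return false;

		/* CPUID.(EAX=7,ECX=0):EDX[24] is AMX-TILE. */
		if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
		    !(edx & (1u << 24)))
			return false;

		/* The OS must expose the tile state in XCR0. */
		return (_xgetbv(0) & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE;
	}

With the permission prctl in place, this probe remains necessary but is no
longer sufficient; a task must still make the explicit request.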

At the same time, Linux could use a per-task XCR0 to implement its
desired “fine grained policy”. The task asks if XCR0 has AMX, and
Linux can say YES or NO based on policy.

But context switching XCR0 per-task is not viable in a virtualized
environment, as each write would provoke a VMEXIT. As Linux is the
same on bare metal and on top of a VMM, and a huge portion of our
customers now have some sort of VMM, this VMEXIT penalty prohibits
per-task XCR0 via XCR0 context switches.

However, user-space accesses XCR0 via the XGETBV() instruction. If
the HW could be made to trap on that instruction such that the kernel
could spoof XCR0, then the kernel would have a mechanism to tell an
app that it can’t access AMX without mandating a new system call. The
“real” HW XCR0 would still be in effect. (Since XCR0.AMX=1, Linux
would not have a #UD to aid it in blocking access to AMX, but it can
enforce access via XFD/#NM.)

If trapping XGETBV() is not viable, another approach would be
replacing its use with a system call, say, sys_xgetbv(), to return the
spoofed version instead of the hardware version. Apps would have to
cut over to this new system call from the native XGETBV(). A simple
system call returning the SW XCR0 is sufficient to let the app know if
it is approved to use AMX or not. However, if this call is to replace
the v10 prctl(GET/SET) system call, the kernel has to know that the
app is actually requesting to use AMX, rather than just seeing what is
available. Only when it knows this is a “request” can it immediately
sanity check if a too-small sigaltstack() was registered. And so the
proposal would be: allowed-bits = sys_xgetbv(requested-bits). This is
similar to the v10 prctl(GET), prctl(SET). However, it has the
advantage of combining the request and the response into a single
call, and it could also replace use of the native XGETBV() in the
current application probe for AMX and other stateful features.

At the end of the day, it seems that we will not be able to trap on
user xgetbv(), and we don't see a significant benefit of combining
GET/SET, so we are sticking with the current prctl().

What we did not have time to talk about:
------------------------------------------------------
There was a request from Boris and Thiago on the list to add an
additional system call, to allow an app to distinguish between what
the HW supports, and what the app has permission to access. Unclear
if this is part of a larger "Linux needs a better way to probe for
features" ABI, or if this is necessary for initial AMX support. We'll
see where the discussion on the list leads...

Brainstorm for future: Future-proof lightweight signal ABI
---------------------------------------------------------------------------
The current signal ABI with its uncompacted XSTATE format did not
anticipate large XSTATE features.

After AMX, we should investigate a new signal ABI which would be
immune to problems due to new large state features.

At the same time, the new signal ABI could also be designed to be
efficient and fast.

The legacy ABI supports the signal handler running with a “clean
slate” of register state, while being able to access all signaled
thread state in the XSTATE on the stack.

A lightweight signal handler might start execution with the minimum of
state cleared and saved for its benefit, and all else the signal
handler would have to save itself.

Such a handler would need to be built specially (a la user-space
interrupt handlers) so that, for example, memcpy() is not expanded into
AVX-512 code. The handler is also prohibited from calling routines that
break such rules.

Large state would sit in registers throughout the lifetime of the
signal handler, making it immune to performance scaling issues due to
future large state. Indeed, this is the fastest possible solution for
a minimal signal handler.

Another proposal is to simply opt into using the XSAVEC format on the
signal stack. This is the compacted format that the kernel enjoys
during context switch, and has the advantage that it will never grow
due to unused (or unsupported) features. However, for applications
that actually want to examine or modify the state, accessor routines
to find state based on the XSAVEC would need to be used, rather than
today's fixed offsets.
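
For example, a user-space accessor for the compacted format would compute
offsets from CPUID instead of using fixed ones. A rough sketch, where
xstate_size() and xstate_is_aligned() are illustrative wrappers around
CPUID leaf 0xD, sub-leaf i (EAX = size, ECX bit 1 = 64-byte alignment):

	/* Offset of component 'nr' in an XSAVEC buffer; assumes the bit
	 * for 'nr' is set in xcomp_bv. */
	static unsigned int xsavec_offset(unsigned long long xcomp_bv,
					  unsigned int nr)
	{
		unsigned int offset = 512 + 64;	/* legacy area + XSAVE header */
		unsigned int i;

		for (i = 2; i <= nr; i++) {
			if (!(xcomp_bv & (1ULL << i)))
				continue;
			if (xstate_is_aligned(i))
				offset = (offset + 63) & ~63u;
			if (i == nr)
				break;
			offset += xstate_size(i);
		}
		return offset;
	}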

Finally a hybrid is possible, where legacy XSAVE is used up through
feature16, and a second invocation using XSAVEC is used for feature17
and later. This would have the advantage of keeping legacy
compatibility for existing features, and adding no growth for unused
future features. However, decoders to understand the stack format
would need to be updated.

This effort will take a while to develop, and even longer to deploy,
and so it is necessarily a post-AMX effort.

2021-10-01 13:10:07

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 02/28] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers

Chang,

On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
> Have the function initializing the XSTATE buffer take a struct fpu *
> pointer in preparation for dynamic state buffer support.
>
> init_fpstate is a special case, which is indicated by a null pointer
> parameter to fpstate_init().
>
> Also, fpstate_init_xstate() now accepts the state component bitmap to
> customize the compacted format.

That's not a changelog. Changelogs have to explain the WHY not the WHAT.

I can see the WHY when I look at the later changes, but that's not how
it works.

Also the subject of this patch is just wrong. It does not make the
functions handle dynamic buffers, it prepares them to add support for
that later.

> +static inline void fpstate_init_xstate(struct xregs_state *xsave, u64 mask)
> +{
> + /*
> + * XRSTORS requires these bits set in xcomp_bv, or it will
> + * trigger #GP:
> + */
> + xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | mask;
> +}

This wants to be a separate cleanup patch which replaces the open coded
variant here:

> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index fc1d529547e6..0fed7fbcf2e8 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -395,8 +395,7 @@ static void __init setup_init_fpu_buf(void)
> print_xstate_features();
>
> if (boot_cpu_has(X86_FEATURE_XSAVES))
> - init_fpstate.xsave.header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
> - xfeatures_mask_all;
> + fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);

Thanks,

tglx

2021-10-01 13:10:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 03/28] x86/fpu/xstate: Modify state copy helpers to handle both static and dynamic buffers

Chang,

On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
> Have all the functions copying XSTATE take a struct fpu * pointer in
> preparation for dynamic state buffer support.

Which is something different than the subject line claims to do and
still does not explain why this code needs access to struct fpu later
on.

Thanks

tglx

2021-10-01 13:24:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 05/28] x86/fpu/xstate: Add a new variable to indicate dynamic user states

On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
> +/*
> + * This represents user xstates, a subset of xfeatures_mask_all, saved in a
> + * dynamic kernel XSAVE buffer.
> + */
> +u64 xfeatures_mask_user_dynamic __ro_after_init;
> +
> static unsigned int xstate_offsets[XFEATURE_MAX] __ro_after_init =
> { [ 0 ... XFEATURE_MAX - 1] = -1};
> static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
> @@ -709,6 +715,7 @@ static int __init init_xstate_size(void)
> static void fpu__init_disable_system_xstate(void)
> {
> xfeatures_mask_all = 0;
> + xfeatures_mask_user_dynamic = 0;

It's zero at boot already, isn't it?

> cr4_clear_bits(X86_CR4_OSXSAVE);
> setup_clear_cpu_cap(X86_FEATURE_XSAVE);
> }
> @@ -780,6 +787,8 @@ void __init fpu__init_system_xstate(void)
>
> /* Store it for paranoia check at the end */
> xfeatures = xfeatures_mask_all;
> + /* Do not support the dynamically allocated buffer yet. */
> + xfeatures_mask_user_dynamic = 0;
>
> /* Enable xstate instructions to be able to continue with initialization: */
> fpu__init_cpu_xstate();

2021-10-01 13:49:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 04/28] x86/fpu/xstate: Modify address finders to handle both static and dynamic buffers

Chang,

On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:

> Have all the functions finding XSTATE address take a struct fpu * pointer
> in preparation for dynamic state buffer support.
>
> init_fpstate is a special case, which is indicated by a null pointer
> parameter to get_xsave_addr() and __raw_xsave_addr().

Same comment vs. subject. Prepare ...

> + if (fpu)
> + xsave = &fpu->state.xsave;
> + else
> + xsave = &init_fpstate.xsave;
> +
> + return xsave + xstate_comp_offsets[xfeature_nr];

So you have the same conditionals and the same comments vs. that NULL
pointer oddity how many times now all over the place?

That can be completely avoided:

Patch 1:

-union fpregs_state init_fpstate __ro_after_init;
+static union fpregs_state init_fpstate __ro_after_init;
+struct fpu init_fpu = { .state = &init_fpstate } __ro_after_init;

and make all users of init_fpstate access it through init_fpu.

Patches 2..N which change arguments from fpregs_state to fpu:

- fun(init_fpu->state);
+ fun(&init_fpu);

Patch M which adds state_mask:

@fpu__init_system_xstate()
+ init_fpu.state_mask = xfeatures_mask_all;

Hmm?

Thanks,

tglx

2021-10-01 16:05:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Fri, Oct 01 2021 at 17:02, Thomas Gleixner wrote:
> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>> DEFINE_IDTENTRY(exc_device_not_available)
>> {
>> unsigned long cr0 = read_cr0();
>
>> + if (handle_xfd_event(&current->thread.fpu, regs))
>> + return;
>
> As I said before, this is wrong because at that point interrupts are disabled.

So you want something like this:

static bool handle_xfd_event(struct pt_regs *regs)
{
        u64 xfd_err, xfd_event, xfd, mask;
        struct fpu *fpu;

        if (!cpu_feature_enabled(X86_FEATURE_XFD))
                return false;

        rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
        if (!xfd_err)
                return false;

        wrmsrl_safe(MSR_IA32_XFD_ERR, 0);

        xfd_event = xfd_err & xfeatures_mask_user_dynamic;

        /* Die if a non-handled feature raised the exception */
        if (WARN_ON(!xfd_event))
                return true;

        /* Die if that happens in kernel space */
        if (WARN_ON(!user_mode(regs)))
                return false;

        local_irq_enable();

        /* Make sure that dynamic buffer expansion is permitted. */
        if (dynamic_state_permitted(current, xfd_event) &&
            !realloc_xstate_buffer(current, xfd_event)) {
                mask = xfeatures_mask_user_dynamic;
                fpu = &current->thread.fpu;
                xfd_write((fpu->state_mask & mask) ^ mask);
        } else {
                force_sig_fault(SIGILL, ILL_ILLOPC, error_get_trap_addr(regs));
        }

        local_irq_disable();
        return true;
}

Along with a correct implementation of realloc_xstate_buffer().

Thanks,

tglx

2021-10-01 17:17:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
> +/**
> + * xfd_switch - Switches the MSR IA32_XFD context if needed.
> + * @prev: The previous task's struct fpu pointer
> + * @next: The next task's struct fpu pointer
> + */
> +static inline void xfd_switch(struct fpu *prev, struct fpu *next)
> +{
> + u64 prev_xfd_mask, next_xfd_mask;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_XFD) || !xfeatures_mask_user_dynamic)
> + return;

This is context switch, so this wants to be a static key which is turned
on during init when the CPU supports XFD and user dynamic features are
available.

> +
> + prev_xfd_mask = prev->state_mask & xfeatures_mask_user_dynamic;
> + next_xfd_mask = next->state_mask & xfeatures_mask_user_dynamic;
> +
> + if (unlikely(prev_xfd_mask != next_xfd_mask))
> + wrmsrl_safe(MSR_IA32_XFD, xfeatures_mask_user_dynamic ^ next_xfd_mask);
> +}
> +
> /*
> * Delay loading of the complete FPU state until the return to userland.
> * PKRU is handled separately.
> */
> -static inline void switch_fpu_finish(struct fpu *new_fpu)
> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
> {
> - if (cpu_feature_enabled(X86_FEATURE_FPU))
> + if (cpu_feature_enabled(X86_FEATURE_FPU)) {
> set_thread_flag(TIF_NEED_FPU_LOAD);
> + xfd_switch(old_fpu, new_fpu);

Why has this to be done on context switch? Zero explanation provided.

Why can't this be done in exit_to_user() where the FPU state restore is
handled?

> }
> +
> + if (boot_cpu_has(X86_FEATURE_XFD))

s/boot_cpu_has/cpu_feature_enabled/g

> + wrmsrl(MSR_IA32_XFD, xfeatures_mask_user_dynamic);
> }
> +
> + if (cpu_feature_enabled(X86_FEATURE_XFD))
> + wrmsrl_safe(MSR_IA32_XFD, (current->thread.fpu.state_mask &
> + xfeatures_mask_user_dynamic) ^
> + xfeatures_mask_user_dynamic);

Lacks curly braces as it's not a single line of code.

> }
>
> /**
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 33f5d8d07367..6cd4fb098f8f 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -97,6 +97,16 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
> *size = fpu_buf_cfg.min_size;
> }
>
> +void arch_release_task_struct(struct task_struct *task)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_FPU))
> + return;
> +
> + /* Free up only the dynamically-allocated memory. */
> + if (task->thread.fpu.state != &task->thread.fpu.__default_state)

Sigh.

> + free_xstate_buffer(&task->thread.fpu);
>
> +static __always_inline bool handle_xfd_event(struct fpu *fpu, struct pt_regs *regs)
> +{
> + bool handled = false;
> + u64 xfd_err;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_XFD))
> + return handled;
> +
> + rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
> + wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
> +
> + if (xfd_err) {

What's wrong with

if (!xfd_err)
return false;

and spare the full indentation levels below

> + u64 xfd_event = xfd_err & xfeatures_mask_user_dynamic;
> + u64 value;
> +
> + if (WARN_ON(!xfd_event)) {
> + /*
> + * Unexpected event is raised. But update XFD state to
> + * unblock the task.
> + */
> + rdmsrl_safe(MSR_IA32_XFD, &value);
> + wrmsrl_safe(MSR_IA32_XFD, value & ~xfd_err);

Ditto. But returning false here will not unblock the task as
exc_device_not_available() will simply reach "die()".

> + } else {
> + struct fpu *fpu = &current->thread.fpu;

You need this because the fpu argument above is invalid?

> + int err = -1;
> +
> + /*
> + * Make sure not in interrupt context as handling a
> + * trap from userspace.
> + */
> + if (!WARN_ON(in_interrupt())) {

Why would in_interrupt() be necessarily true when the trap comes from
kernel space? The proper check is user_mode(regs) as done anywhere else.

> + err = realloc_xstate_buffer(fpu, xfd_event);
> + if (!err)
> + wrmsrl_safe(MSR_IA32_XFD, (fpu->state_mask &
> + xfeatures_mask_user_dynamic) ^
> + xfeatures_mask_user_dynamic);
> + }
> +
> + /* Raise a signal when it failed to handle. */
> + if (err)
> + force_sig_fault(SIGILL, ILL_ILLOPC, error_get_trap_addr(regs));
> + }
> + handled = true;
> + }
> + return handled;
> +}
> +
> DEFINE_IDTENTRY(exc_device_not_available)
> {
> unsigned long cr0 = read_cr0();

> + if (handle_xfd_event(&current->thread.fpu, regs))
> + return;

As I said before, this is wrong because at that point interrupts are disabled.

Thanks,

tglx
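
A minimal sketch of the static-key arrangement asked for above, applied to the quoted xfd_switch(); the key name xfd_switching_enabled is borrowed from a later mail in this thread, and the init hook name is made up:

DEFINE_STATIC_KEY_FALSE(xfd_switching_enabled);

/* Called once from FPU init when XFD and dynamic user features exist. */
static void __init xfd_switching_init(void)
{
        if (cpu_feature_enabled(X86_FEATURE_XFD) && xfeatures_mask_user_dynamic)
                static_branch_enable(&xfd_switching_enabled);
}

static inline void xfd_switch(struct fpu *prev, struct fpu *next)
{
        u64 prev_xfd_mask, next_xfd_mask;

        /* One patched branch instead of two runtime checks in the hot path. */
        if (!static_branch_unlikely(&xfd_switching_enabled))
                return;

        prev_xfd_mask = prev->state_mask & xfeatures_mask_user_dynamic;
        next_xfd_mask = next->state_mask & xfeatures_mask_user_dynamic;

        if (unlikely(prev_xfd_mask != next_xfd_mask))
                wrmsrl_safe(MSR_IA32_XFD, xfeatures_mask_user_dynamic ^ next_xfd_mask);
}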

2021-10-01 19:48:33

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 10/28] x86/fpu/xstate: Update the XSTATE save function to support dynamic states

On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 74dde635df40..7c46747f6865 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9899,11 +9899,16 @@ static void kvm_save_current_fpu(struct fpu *fpu)
> * KVM does not support dynamic user states yet. Assume the buffer
> * always has the minimum size.
> */
> - if (test_thread_flag(TIF_NEED_FPU_LOAD))
> + if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
> memcpy(fpu->state, current->thread.fpu.state,
> fpu_buf_cfg.min_size);

What happens with the rest of the state?

> - else
> + } else {
> + struct fpu *src_fpu = &current->thread.fpu;
> +
> + if (fpu->state_mask != src_fpu->state_mask)
> + fpu->state_mask = src_fpu->state_mask;

What guarantees that the state size of @fpu is big enough when src_fpu
has dynamic features included?

> save_fpregs_to_fpstate(fpu);

Thanks,

tglx

2021-10-01 21:04:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

Chang,

On Fri, Oct 01 2021 at 17:02, Thomas Gleixner wrote:
> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>> +/**
>> + * xfd_switch - Switches the MSR IA32_XFD context if needed.
>> + * @prev: The previous task's struct fpu pointer
>> + * @next: The next task's struct fpu pointer
>> + */
>> +static inline void xfd_switch(struct fpu *prev, struct fpu *next)
>> +{
>> + u64 prev_xfd_mask, next_xfd_mask;
>> +
>> + if (!cpu_feature_enabled(X86_FEATURE_XFD) || !xfeatures_mask_user_dynamic)
>> + return;
>
> This is context switch, so this wants to be a static key which is turned
> on during init when the CPU supports XFD and user dynamic features are
> available.
>
>> +
>> + prev_xfd_mask = prev->state_mask & xfeatures_mask_user_dynamic;
>> + next_xfd_mask = next->state_mask & xfeatures_mask_user_dynamic;
>> +
>> + if (unlikely(prev_xfd_mask != next_xfd_mask))
>> + wrmsrl_safe(MSR_IA32_XFD, xfeatures_mask_user_dynamic ^ next_xfd_mask);
>> +}
>> +
>> /*
>> * Delay loading of the complete FPU state until the return to userland.
>> * PKRU is handled separately.
>> */
>> -static inline void switch_fpu_finish(struct fpu *new_fpu)
>> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
>> {
>> - if (cpu_feature_enabled(X86_FEATURE_FPU))
>> + if (cpu_feature_enabled(X86_FEATURE_FPU)) {
>> set_thread_flag(TIF_NEED_FPU_LOAD);
>> + xfd_switch(old_fpu, new_fpu);
>
> Why has this to be done on context switch? Zero explanation provided.
>
> Why can't this be done in exit_to_user() where the FPU state restore is
> handled?

DEFINE_PER_CPU(xfd_state);

update_xfd(fpu)
{
        if (__this_cpu_read(xfd_state) != fpu->xfd_state) {
                wrmsrl(XFD, fpu->xfd_state);
                __this_cpu_write(xfd_state, fpu->xfd_state);
        }
}

fpregs_restore_userregs()
{
        if (!fpregs_state_valid(fpu, cpu)) {
                if (static_branch_unlikely(xfd_switching_enabled))
                        update_xfd(fpu);
                ...
        }
}

Hmm?

Thanks,

tglx
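
A slightly more concrete rendering of the sketch above; the types, the MSR name and the fpu->xfd_state field are assumptions, not the actual patch:

static DEFINE_PER_CPU(u64, xfd_state);

static inline void update_xfd(struct fpu *fpu)
{
        /* Skip the MSR write when the cached per-CPU value already matches. */
        if (__this_cpu_read(xfd_state) != fpu->xfd_state) {
                wrmsrl(MSR_IA32_XFD, fpu->xfd_state);
                __this_cpu_write(xfd_state, fpu->xfd_state);
        }
}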

2021-10-02 21:38:54

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 10/28] x86/fpu/xstate: Update the XSTATE save function to support dynamic states

On Fri, Oct 01 2021 at 17:41, Thomas Gleixner wrote:
> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 74dde635df40..7c46747f6865 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -9899,11 +9899,16 @@ static void kvm_save_current_fpu(struct fpu *fpu)
>> * KVM does not support dynamic user states yet. Assume the buffer
>> * always has the minimum size.

I have to come back to this because that assumption is just broken.

create_vcpu()
  vcpu->user_fpu = alloc_default_fpu_size();
  vcpu->guest_fpu = alloc_default_fpu_size();

vcpu_task()
  get_amx_permission()
  use_amx()
    #NM
      alloc_larger_state()
  ...
  kvm_arch_vcpu_ioctl_run()
    kvm_arch_vcpu_ioctl_run()
      kvm_load_guest_fpu()
        kvm_save_current_fpu(vcpu->arch.user_fpu);
          save_fpregs_to_fpstate(fpu); <- Out of bounds write

Adding a comment that KVM does not yet support dynamic user states does
not cut it, really.

Even if the above is unlikely, it is possible and has to be handled
correctly at the point where AMX support is enabled in the kernel
independent of guest support.

You have two options:

1) Always allocate the large buffer size which is required to
accommodate all possible features.

Trivial, but waste of memory.

2) Make the allocation dynamic which seems to be trivial to do in
kvm_load_guest_fpu() at least for vcpu->user_fpu.

The vcpu->guest_fpu handling can probably be postponed to the
point where AMX is actually exposed to guests, but it's probably
not the worst idea to think about the implications now.

Paolo, any opinions?

Thanks,

tglx

2021-10-02 22:58:42

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v10 10/28] x86/fpu/xstate: Update the XSTATE save function to support dynamic states

On Oct 2, 2021, at 14:31, Thomas Gleixner <[email protected]> wrote:
> On Fri, Oct 01 2021 at 17:41, Thomas Gleixner wrote:
>> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 74dde635df40..7c46747f6865 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -9899,11 +9899,16 @@ static void kvm_save_current_fpu(struct fpu *fpu)
>>> * KVM does not support dynamic user states yet. Assume the buffer
>>> * always has the minimum size.
>
> I have to come back to this because that assumption is just broken.
>
> create_vcpu()
>   vcpu->user_fpu = alloc_default_fpu_size();
>   vcpu->guest_fpu = alloc_default_fpu_size();
>
> vcpu_task()
>   get_amx_permission()
>   use_amx()
>     #NM
>       alloc_larger_state()
>   ...
>   kvm_arch_vcpu_ioctl_run()
>     kvm_arch_vcpu_ioctl_run()
>       kvm_load_guest_fpu()
>         kvm_save_current_fpu(vcpu->arch.user_fpu);
>           save_fpregs_to_fpstate(fpu); <- Out of bounds write
>
> Adding a comment that KVM does not yet support dynamic user states does
> not cut it, really.
>
> Even if the above is unlikely, it is possible and has to be handled
> correctly at the point where AMX support is enabled in the kernel
> independent of guest support.

I see.

> You have two options:
>
> 1) Always allocate the large buffer size which is required to
> accommodate all possible features.
>
> Trivial, but waste of memory.
>
> 2) Make the allocation dynamic which seems to be trivial to do in
> kvm_load_guest_fpu() at least for vcpu->user_fpu.
>
> The vcpu->guest_fpu handling can probably be postponed to the
> point where AMX is actually exposed to guests, but it's probably
> not the worst idea to think about the implications now.

FWIW, the proposed KVM patch for AMX looks to take (1) here [1] and
Paolo said [2]:

Most guests will not need the whole xstate feature set. So perhaps you
could set XFD to the host value | the guest value, trap #NM if the
host XFD is zero, and possibly reflect the exception to the guest's XFD
and XFD_ERR.

In addition, loading the guest XFD MSRs should use the MSR autoload
feature (add_atomic_switch_msr).

And then I guess discussion goes on..

Thanks,
Chang

[1] https://lore.kernel.org/kvm/[email protected]/
[2] https://lore.kernel.org/kvm/[email protected]/

2021-10-03 22:38:06

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v10 02/28] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers

On Oct 1, 2021, at 05:45, Thomas Gleixner <[email protected]> wrote:
> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>> Have the function initializing the XSTATE buffer take a struct fpu *
>> pointer in preparation for dynamic state buffer support.
>>
>> init_fpstate is a special case, which is indicated by a null pointer
>> parameter to fpstate_init().
>>
>> Also, fpstate_init_xstate() now accepts the state component bitmap to
>> customize the compacted format.
>
> That's not a changelog. Changelogs have to explain the WHY not the WHAT.
>
> I can see the WHY when I look at the later changes, but that's not how
> it works.

The same feedback was raised before [1]. I thought this changelog had been
settled with Boris [2].

How about:

“To prepare dynamic features, change fpstate_init()’s argument to a struct
fpu * pointer instead of a struct fpregs_state * pointer. A struct fpu
will have new fields to handle dynamic features."

With fpstate_init_xstate() changes in a separate patch and defining init_fpu,
the last two sentences shall be removed.

> Also the subject of this patch is just wrong. It does not make the
> functions handle dynamic buffers, it prepares them to add support for
> that later.

How about “Prepare fpstate_init() to handle dynamic features"

>> +static inline void fpstate_init_xstate(struct xregs_state *xsave, u64 mask)
>> +{
>> + /*
>> + * XRSTORS requires these bits set in xcomp_bv, or it will
>> + * trigger #GP:
>> + */
>> + xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | mask;
>> +}
>
> This wants to be a separate cleanup patch which replaces the open coded
> variant here:

Okay, maybe the change becomes the new patch1.

>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index fc1d529547e6..0fed7fbcf2e8 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -395,8 +395,7 @@ static void __init setup_init_fpu_buf(void)
>> print_xstate_features();
>>
>> if (boot_cpu_has(X86_FEATURE_XSAVES))
>> - init_fpstate.xsave.header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
>> - xfeatures_mask_all;
>> + fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);

Thanks,
Chang

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/


2021-10-03 22:39:02

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v10 05/28] x86/fpu/xstate: Add a new variable to indicate dynamic user states

On Oct 1, 2021, at 06:16, Thomas Gleixner <[email protected]> wrote:
> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>> +/*
>> + * This represents user xstates, a subset of xfeatures_mask_all, saved in a
>> + * dynamic kernel XSAVE buffer.
>> + */
>> +u64 xfeatures_mask_user_dynamic __ro_after_init;
>> +
>> static unsigned int xstate_offsets[XFEATURE_MAX] __ro_after_init =
>> { [ 0 ... XFEATURE_MAX - 1] = -1};
>> static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
>> @@ -709,6 +715,7 @@ static int __init init_xstate_size(void)
>> static void fpu__init_disable_system_xstate(void)
>> {
>> xfeatures_mask_all = 0;
>> + xfeatures_mask_user_dynamic = 0;
>
> It's zero at boot already, isn't it?

I thought of this as a safety net here when it was introduced.

But let me move it to patch13 [1] where the variable is updated.

Thanks,
Chang

[1] https://lore.kernel.org/lkml/[email protected]/

2021-10-03 22:40:22

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Oct 1, 2021, at 08:10, Thomas Gleixner <[email protected]> wrote:
> On Fri, Oct 01 2021 at 17:02, Thomas Gleixner wrote:
>> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>>> DEFINE_IDTENTRY(exc_device_not_available)
>>> {
>>> unsigned long cr0 = read_cr0();
>>
>>> + if (handle_xfd_event(&current->thread.fpu, regs))
>>> + return;
>>
>> As I said before, this is wrong because at that point interrupts are disabled.
>
> So you want something like this:
>
> static bool handle_xfd_event(struct pt_regs *regs)
> {
>         u64 xfd_err, xfd_event, xfd, mask;
>         struct fpu *fpu;
>
>         if (!cpu_feature_enabled(X86_FEATURE_XFD))
>                 return false;
>
>         rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
>         if (!xfd_err)
>                 return false;
>
>         wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
>
>         xfd_event = xfd_err & xfeatures_mask_user_dynamic;
>
>         /* Die if a non-handled feature raised the exception */
>         if (WARN_ON(!xfd_event))
>                 return true;
>
>         /* Die if that happens in kernel space */
>         if (WARN_ON(!user_mode(regs)))
>                 return false;
>
>         local_irq_enable();

v1 had some similar ones (not the same though) [1]. FWIW, I think Andy’s point
is worth noting here:

First, you can't just enable IRQs here. If IRQs are off, they're off for a
reason. Secondly, if they're *on*, you just forgot that fact.

>         /* Make sure that dynamic buffer expansion is permitted. */
>         if (dynamic_state_permitted(current, xfd_event) &&
>             !realloc_xstate_buffer(current, xfd_event)) {
>                 mask = xfeatures_mask_user_dynamic;
>                 fpu = &current->thread.fpu;
>                 xfd_write((fpu->state_mask & mask) ^ mask);
>         } else {
>                 force_sig_fault(SIGILL, ILL_ILLOPC, error_get_trap_addr(regs));
>         }
>
>         local_irq_disable();
>         return true;
> }
>
> Along with a correct implementation of realloc_xstate_buffer().

Thanks,
Chang

[1] https://lore.kernel.org/lkml/CALCETrWjOYd4wM0Mn7fY+t4ztU99GNP77A6skNwjTViJYUEZYQ@mail.gmail.com/

2021-10-03 22:41:09

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v10 04/28] x86/fpu/xstate: Modify address finders to handle both static and dynamic buffers

On Oct 1, 2021, at 06:15, Thomas Gleixner <[email protected]> wrote:
> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>
>> Have all the functions finding XSTATE address take a struct fpu * pointer
>> in preparation for dynamic state buffer support.
>>
>> init_fpstate is a special case, which is indicated by a null pointer
>> parameter to get_xsave_addr() and __raw_xsave_addr().
>
> Same comment vs. subject. Prepare ...

How about:
"Prepare address finders to handle dynamic features"

>> + if (fpu)
>> + xsave = &fpu->state.xsave;
>> + else
>> + xsave = &init_fpstate.xsave;
>> +
>> + return xsave + xstate_comp_offsets[xfeature_nr];
>
> So you have the same conditionals and the same comments vs. that NULL
> pointer oddity how many times now all over the place?
>
> That can be completely avoided:
>
> Patch 1:
>
> -union fpregs_state init_fpstate __ro_after_init;
> +static union fpregs_state init_fpstate __ro_after_init;
> +struct fpu init_fpu = { .state = &init_fpstate } __ro_after_init;
>
> and make all users of init_fpstate access it through init_fpu.
>
> Patches 2..N which change arguments from fpregs_state to fpu:
>
> - fun(init_fpu->state);
> + fun(&init_fpu);
>
> Patch M which adds state_mask:
>
> @fpu__init_system_xstate()
> + init_fpu.state_mask = xfeatures_mask_all;
>
> Hmm?

Okay, a NULL pointer is odd, and passing it as an argument should be avoided.
Defining a separate struct fpu for the initial state can make every function
expect a valid struct fpu pointer.

I think the patch set will then have this order (once [1] is dropped out):
- patch1 (new): a cleanup patch for fpstate_init_xstate(),
- patch2 (new): the above init_fpu goes into this, and
- patch3-5: change arguments to fpu,
- patch6 (new): finally add ->state_mask and ->state_size.


Thanks,
Chang

[1] https://lore.kernel.org/lkml/[email protected]/

2021-10-03 22:41:45

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Oct 1, 2021, at 13:20, Thomas Gleixner <[email protected]> wrote:
> On Fri, Oct 01 2021 at 17:02, Thomas Gleixner wrote:
>> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>>> +/**
>>> + * xfd_switch - Switches the MSR IA32_XFD context if needed.
>>> + * @prev: The previous task's struct fpu pointer
>>> + * @next: The next task's struct fpu pointer
>>> + */
>>> +static inline void xfd_switch(struct fpu *prev, struct fpu *next)
>>> +{
>>> + u64 prev_xfd_mask, next_xfd_mask;
>>> +
>>> + if (!cpu_feature_enabled(X86_FEATURE_XFD) || !xfeatures_mask_user_dynamic)
>>> + return;
>>
>> This is context switch, so this wants to be a static key which is turned
>> on during init when the CPU supports XFD and user dynamic features are
>> available.

I see. With a static key in place, checking it alone is enough since this is the
context switch path.

>>> +
>>> + prev_xfd_mask = prev->state_mask & xfeatures_mask_user_dynamic;
>>> + next_xfd_mask = next->state_mask & xfeatures_mask_user_dynamic;
>>> +
>>> + if (unlikely(prev_xfd_mask != next_xfd_mask))
>>> + wrmsrl_safe(MSR_IA32_XFD, xfeatures_mask_user_dynamic ^ next_xfd_mask);
>>> +}
>>> +
>>> /*
>>> * Delay loading of the complete FPU state until the return to userland.
>>> * PKRU is handled separately.
>>> */
>>> -static inline void switch_fpu_finish(struct fpu *new_fpu)
>>> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
>>> {
>>> - if (cpu_feature_enabled(X86_FEATURE_FPU))
>>> + if (cpu_feature_enabled(X86_FEATURE_FPU)) {
>>> set_thread_flag(TIF_NEED_FPU_LOAD);
>>> + xfd_switch(old_fpu, new_fpu);
>>
>> Why has this to be done on context switch? Zero explanation provided.
>>
>> Why can't this be done in exit_to_user() where the FPU state restore is
>> handled?

Looking at the changelog of the patch to delay XSTATE [1] load:

This gives the kernel the potential to skip loading FPU state for tasks
that stay in kernel mode, or for tasks that end up with repeated
invocations of kernel_fpu_begin() & kernel_fpu_end().

But I think XFD state is different from XSTATE. There is no use case for
XFD-enabled features in kernel mode. So, XFD state was considered to be
switched under switch_to() just like other user states. E.g. user FSBASE is
switched here as kernel does not use it. But user GSBASE is loaded at
returning to userspace. Potentially, it is also beneficial as XFD-armed states
will hold INIT-state [3]:

If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i, the
instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1; instead,
it saves bit i of XSTATE_BV field of the XSAVE header as 0 (indicating
that the state component is in its initialized state).

I would like to make sure of the above point before following this:

> DEFINE_PER_CPU(xfd_state);
>
> update_xfd(fpu)
> {
>         if (__this_cpu_read(xfd_state) != fpu->xfd_state) {
>                 wrmsrl(XFD, fpu->xfd_state);
>                 __this_cpu_write(xfd_state, fpu->xfd_state);
>         }
> }
>
> fpregs_restore_userregs()
> {
>         if (!fpregs_state_valid(fpu, cpu)) {
>                 if (static_branch_unlikely(xfd_switching_enabled))
>                         update_xfd(fpu);
>                 ...
>         }
> }
>
> Hmm?

Yes, I had thought of something similar in [2].

Thanks,
Chang

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] 3.2.6 Extended Feature Disable (XFD), Intel Architecture Instruction Set Extension Programming Reference May 2021, https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf

2021-10-03 22:45:26

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v10 03/28] x86/fpu/xstate: Modify state copy helpers to handle both static and dynamic buffers

On Oct 1, 2021, at 05:47, Thomas Gleixner <[email protected]> wrote:
> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>> Have all the functions copying XSTATE take a struct fpu * pointer in
>> preparation for dynamic state buffer support.
>
> Which is something different than the subject line claims to do and
> still does not explain why this code needs access to struct fpu later
> on.

This was liked by Boris [1]:

What I would've written is:

"Have all functions handling FPU state take a struct fpu * pointer in
preparation for dynamic state buffer support."

Plain and simple.

How about this (similar to [2]):

“To prepare dynamic features, change XSTATE copy functions' arguments to a
struct fpu * pointer instead of a struct fpregs_state * pointer. It will
have new fields for dynamic features.”

Along with this title:

“Prepare XSTATE copy functions to handle dynamic features”

Thanks,
Chang

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

2021-10-03 22:49:18

by Chang S. Bae

[permalink] [raw]
Subject: Re: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Oct 1, 2021, at 08:02, Thomas Gleixner <[email protected]> wrote:
> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>> +/**
>> + * xfd_switch - Switches the MSR IA32_XFD context if needed.
>> + * @prev: The previous task's struct fpu pointer
>> + * @next: The next task's struct fpu pointer
>> + */
>> +static inline void xfd_switch(struct fpu *prev, struct fpu *next)
>> +{
>> + u64 prev_xfd_mask, next_xfd_mask;
>> +
>> + if (!cpu_feature_enabled(X86_FEATURE_XFD) || !xfeatures_mask_user_dynamic)
>> + return;
>
> This is context switch, so this wants to be a static key which is turned
> on during init when the CPU supports XFD and user dynamic features are
> available.

Replied in the later email [1].

>> +
>> + prev_xfd_mask = prev->state_mask & xfeatures_mask_user_dynamic;
>> + next_xfd_mask = next->state_mask & xfeatures_mask_user_dynamic;
>> +
>> + if (unlikely(prev_xfd_mask != next_xfd_mask))
>> + wrmsrl_safe(MSR_IA32_XFD, xfeatures_mask_user_dynamic ^ next_xfd_mask);
>> +}
>> +
>> /*
>> * Delay loading of the complete FPU state until the return to userland.
>> * PKRU is handled separately.
>> */
>> -static inline void switch_fpu_finish(struct fpu *new_fpu)
>> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
>> {
>> - if (cpu_feature_enabled(X86_FEATURE_FPU))
>> + if (cpu_feature_enabled(X86_FEATURE_FPU)) {
>> set_thread_flag(TIF_NEED_FPU_LOAD);
>> + xfd_switch(old_fpu, new_fpu);
>
> Why has this to be done on context switch? Zero explanation provided.
>
> Why can't this be done in exit_to_user() where the FPU state restore is
> handled?

Replied in the later email [1].

>> }
>> +
>> + if (boot_cpu_has(X86_FEATURE_XFD))
>
> s/boot_cpu_has/cpu_feature_enabled/g

I think this is under fpu__init_cpu_xstate(). IIRC, here cpu_feature_enabled()
had caused a build error before. Now it looks okay. Will update.

>> + wrmsrl(MSR_IA32_XFD, xfeatures_mask_user_dynamic);
>> }
>> +
>> + if (cpu_feature_enabled(X86_FEATURE_XFD))
>> + wrmsrl_safe(MSR_IA32_XFD, (current->thread.fpu.state_mask &
>> + xfeatures_mask_user_dynamic) ^
>> + xfeatures_mask_user_dynamic);
>
> Lacks curly braces as it's not a single line of code.

Sorry, I was confused by other examples like this in the mainline. Will fix.

>> }
>>
>> /**
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 33f5d8d07367..6cd4fb098f8f 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -97,6 +97,16 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
>> *size = fpu_buf_cfg.min_size;
>> }
>>
>> +void arch_release_task_struct(struct task_struct *task)
>> +{
>> + if (!cpu_feature_enabled(X86_FEATURE_FPU))
>> + return;
>> +
>> + /* Free up only the dynamically-allocated memory. */
>> + if (task->thread.fpu.state != &task->thread.fpu.__default_state)
>
> Sigh.

Yeah, I will fix it this time. I also responded about the reason for doing
this in the other mail [2].

>> + free_xstate_buffer(&task->thread.fpu);
>>
>> +static __always_inline bool handle_xfd_event(struct fpu *fpu, struct pt_regs *regs)
>> +{
>> + bool handled = false;
>> + u64 xfd_err;
>> +
>> + if (!cpu_feature_enabled(X86_FEATURE_XFD))
>> + return handled;
>> +
>> + rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
>> + wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
>> +
>> + if (xfd_err) {
>
> What's wrong with
>
> if (!xfd_err)
> return false;
>
> and spare the full indentation levels below

I had declared the local variables under this check. But yes, this can save an
indentation level here.

>> + u64 xfd_event = xfd_err & xfeatures_mask_user_dynamic;
>> + u64 value;
>> +
>> + if (WARN_ON(!xfd_event)) {
>> + /*
>> + * Unexpected event is raised. But update XFD state to
>> + * unblock the task.
>> + */
>> + rdmsrl_safe(MSR_IA32_XFD, &value);
>> + wrmsrl_safe(MSR_IA32_XFD, value & ~xfd_err);
>
> Ditto. But returning false here will not unblock the task as
> exc_device_not_available() will simply reach "die()".

Yes, it is. But this "unexpected #NM exception" could cause confusion, as an
XFD-induced #NM needs to be differentiated for users. (Len made this point to
me.)

>> + } else {
>> + struct fpu *fpu = &current->thread.fpu;
>
> You need this because the fpu argument above is invalid?

Ah, so sorry, I should have removed this line when I refactored this function.

>> + int err = -1;
>> +
>> + /*
>> + * Make sure not in interrupt context as handling a
>> + * trap from userspace.
>> + */
>> + if (!WARN_ON(in_interrupt())) {
>
> Why would in_interrupt() be necessarily true when the trap comes from
> kernel space? The proper check is user_mode(regs) as done anywhere else.

I see.

>> + err = realloc_xstate_buffer(fpu, xfd_event);
>> + if (!err)
>> + wrmsrl_safe(MSR_IA32_XFD, (fpu->state_mask &
>> + xfeatures_mask_user_dynamic) ^
>> + xfeatures_mask_user_dynamic);
>> + }
>> +
>> + /* Raise a signal when it failed to handle. */
>> + if (err)
>> + force_sig_fault(SIGILL, ILL_ILLOPC, error_get_trap_addr(regs));
>> + }
>> + handled = true;
>> + }
>> + return handled;
>> +}
>> +
>> DEFINE_IDTENTRY(exc_device_not_available)
>> {
>> unsigned long cr0 = read_cr0();
>
>> + if (handle_xfd_event(&current->thread.fpu, regs))
>> + return;
>
> As I said before, this is wrong because at that point interrupts are disabled.

I saw you suggested the code. Will take that, thanks.

Thanks,
Chang

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/



2021-10-04 12:40:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Sun, Oct 03 2021 at 22:38, Chang Seok Bae wrote:
> On Oct 1, 2021, at 08:10, Thomas Gleixner <[email protected]> wrote:
>> local_irq_enable();
>
> v1 had some similar ones (not the same though) [1]. FWIW, I think Andy’s point
> is worth noting here:
>
> First, you can't just enable IRQs here. If IRQs are off, they're off for a
> reason. Secondly, if they're *on*, you just forgot that fact.

The #NM comes from user space where interrupts are always enabled. So we
can enable interrupts _after_ doing the sanity checks.

Also we reenable interrupts in various other trap handlers when the trap
comes from user space as well. That's perfectly fine and required. How
would e.g. fault handling or single stepping ever work otherwise?

I have no idea where you had placed the local_irq_enable(), but the code
I outlined is correct.

Thanks,

tglx

2021-10-04 12:56:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 04/28] x86/fpu/xstate: Modify address finders to handle both static and dynamic buffers

On Sun, Oct 03 2021 at 22:35, Chang Seok Bae wrote:
> On Oct 1, 2021, at 06:15, Thomas Gleixner <[email protected]> wrote:
>
> Okay, a NULL pointer is odd, and passing it as an argument should be avoided.
> Defining a separate struct fpu for the initial state can make every function
> expect a valid struct fpu pointer.
>
> I think the patch set will then have this order (once [1] is dropped out):
> - patch1 (new): a cleanup patch for fpstate_init_xstate(),
> - patch2 (new): the above init_fpu goes into this, and
> - patch3-5: change arguments to fpu,

So actually I sat down over the weekend and looked at that again. Adding this
to struct fpu is wrong. The size and features information belongs into
something like this:

struct fpstate {
        unsigned int            size;
        u64                     xfeatures;

        union fpregs_state      regs;
};

Why? Simply because fpstate is the container for the dynamically sized
regs and that's where it semantically belongs.

While staring at that I just started to cleanup stuff all over the place
to make the integration of this simpler.

The patches are completely untested and have no changelogs yet, but if
you want a preview, I've uploaded a patch series to:

https://tglx.de/~tglx/patches.tar

I'm still staring at some of the dynamic feature integrations, but this
is roughly where this should be heading.

Thanks,

tglx
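
A rough sketch of the direction described above; only the fpstate container itself comes from this mail, while the __fpstate member and the comments are assumptions:

struct fpstate {
        unsigned int            size;           /* size of @regs in bytes */
        u64                     xfeatures;      /* features covered by @regs */
        union fpregs_state      regs;           /* must stay last: variably sized */
};

struct fpu {
        /* ... */
        struct fpstate          *fpstate;       /* currently active buffer */
        struct fpstate          __fpstate;      /* default, statically sized buffer */
};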

2021-10-04 19:16:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 13/28] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state

On Sun, Oct 03 2021 at 22:39, Chang Seok Bae wrote:
> On Oct 1, 2021, at 13:20, Thomas Gleixner <[email protected]> wrote:
>
> Looking at the changelog of the patch to delay XSTATE [1] load:
>
> This gives the kernel the potential to skip loading FPU state for tasks
> that stay in kernel mode, or for tasks that end up with repeated
> invocations of kernel_fpu_begin() & kernel_fpu_end().

Correct.

> But I think XFD state is different from XSTATE. There is no use case for
> XFD-enabled features in kernel mode.

Correct, but your patch does not ensure that XFD features are disabled
on context switch. You write the XFD mask of the next task when it
differs from the XFD mask of the previous task. So we have the following:

prev XFD     next XFD
DISABLED     DISABLED     XFD features stay disabled
ENABLED      DISABLED     XFD features are disabled
DISABLED     ENABLED      XFD features are enabled
ENABLED      ENABLED      XFD features stay enabled

So it still runs in the kernel with XFD features enabled including
interrupts, soft interrupts, exceptions and NMIs. So what's the problem
when it does a user -> kernel -> user transition with the user XFD on?

> So, XFD state was considered to be switched under switch_to() just
> like other user states. E.g. user FSBASE is switched here as kernel
> does not use it.

That's not really a justification.

> But user GSBASE is loaded at returning to userspace.

And so is XSTATE

> Potentially, it is also beneficial as XFD-armed states will hold
> INIT-state [3]:
>
> If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i, the
> instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1; instead,
> it saves bit i of XSTATE_BV field of the XSAVE header as 0 (indicating
> that the state component is in its initialized state).

How does that matter? The point is that if the FPU registers are
unmodified then a task can return to user space without doing anything
even if it went through five context switches. So how is XFD any
different?

Where is the kernel doing XSAVE / XSAVES:

1) On context switch which sets TIF_NEED_FPU_LOAD

Once TIF_NEED_FPU_LOAD is set the kernel does not do XSAVES in
the context of the task simply because it knows that the content
is in the memory buffer.

2) In signal handling

Only happens when TIF_NEED_FPU_LOAD == 0

Where is the kernel doing XRSTOR / XRSTORS:

1) On return to user space if the FPU registers are not up to date

So this can restore XFD as well

2) In signal handling and related functions

Only happens when TIF_NEED_FPU_LOAD == 0

So what's the win?

No wrmsrl() on context switch, which means for the user -> kthread ->
user context switch scenario for which the register preserving is
optimized you spare two wrmsrl() invocations, run less code with less
conditionals.

What's the price?

A few trivial XFD sanity checks for debug enabled kernels to ensure that
XFD is correct on XSAVE[S] and XRSTOR[S], which have no runtime overhead
on production systems.

Even if we decide that these checks should be permanent then they happen
in code pathes which are doing a slow X* operation anyway.

Thanks,

tglx



2021-10-04 19:45:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 05/28] x86/fpu/xstate: Add a new variable to indicate dynamic user states

On Sun, Oct 03 2021 at 22:35, Chang Seok Bae wrote:
> On Oct 1, 2021, at 06:16, Thomas Gleixner <[email protected]> wrote:
>> On Wed, Aug 25 2021 at 08:53, Chang S. Bae wrote:
>>> +/*
>>> + * This represents user xstates, a subset of xfeatures_mask_all, saved in a
>>> + * dynamic kernel XSAVE buffer.
>>> + */
>>> +u64 xfeatures_mask_user_dynamic __ro_after_init;
>>> +
>>> static unsigned int xstate_offsets[XFEATURE_MAX] __ro_after_init =
>>> { [ 0 ... XFEATURE_MAX - 1] = -1};
>>> static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
>>> @@ -709,6 +715,7 @@ static int __init init_xstate_size(void)
>>> static void fpu__init_disable_system_xstate(void)
>>> {
>>> xfeatures_mask_all = 0;
>>> + xfeatures_mask_user_dynamic = 0;
>>
>> It's zero at boot already, isn't it?
>
> I thought of this as a safety net here when it was introduced.

A safety net for what? The compiler initializing the variable to 4711?

2021-10-05 07:52:43

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v10 10/28] x86/fpu/xstate: Update the XSTATE save function to support dynamic states

On 02/10/21 23:31, Thomas Gleixner wrote:
> You have two options:
>
> 1) Always allocate the large buffer size which is required to
> accommodate all possible features.
>
> Trivial, but waste of memory.
>
> 2) Make the allocation dynamic which seems to be trivial to do in
> kvm_load_guest_fpu() at least for vcpu->user_fpu.
>
> The vcpu->guest_fpu handling can probably be postponed to the
> point where AMX is actually exposed to guests, but it's probably
> not the worst idea to think about the implications now.
>
> Paolo, any opinions?

Unless we're missing something, dynamic allocation should not be hard to
do for both guest_fpu and user_fpu; either near the call sites of
kvm_save_current_fpu, or in the function itself. Basically adding
something like

struct kvm_fpu {
        struct fpu *state;
        unsigned size;
} user_fpu, guest_fpu;

to struct kvm_vcpu. Since the size can vary, it can be done simply with
kzalloc instead of the x86_fpu_cache that KVM has now.

The only small complication is that kvm_save_current_fpu is called
within fpregs_lock; the allocation has to be outside so that you can use
GFP_KERNEL even on RT kernels. If the code looks better with
fpregs_lock moved within kvm_save_current_fpu, go ahead and do it like that.

Paolo
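
A sketch of the reallocation described above; the helper name and the way the required size is obtained are assumptions, not actual KVM code:

static int kvm_ensure_fpu_size(struct kvm_fpu *kfpu, unsigned int need)
{
        struct fpu *bigger;

        if (kfpu->size >= need)
                return 0;

        /* Runs outside fpregs_lock(), so GFP_KERNEL is fine even on RT. */
        bigger = kzalloc(need, GFP_KERNEL);
        if (!bigger)
                return -ENOMEM;

        kfree(kfpu->state);
        kfpu->state = bigger;
        kfpu->size = need;
        return 0;
}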

2021-10-05 08:20:02

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v10 10/28] x86/fpu/xstate: Update the XSTATE save function to support dynamic states

On 03/10/21 00:54, Bae, Chang Seok wrote:
> FWIW, the proposed KVM patch for AMX looks to take (1) here [1] and
> Paolo said [2]:
>
> Most guests will not need the whole xstate feature set. So perhaps you
> could set XFD to the host value | the guest value, trap #NM if the
> host XFD is zero, and possibly reflect the exception to the guest's XFD
^^^^

(better: if the host XFD is nonzero, and the guest XCR0 includes any bit
whose state is optional).

> and XFD_ERR.

This comment is about letting arch/x86/kernel resize current->thread.fpu
while the guest runs. It's not necessary before KVM supports AMX,
because KVM will never let a guest set XCR0[18] (__kvm_set_xcr).

Thomas instead was talking about allocation of vcpu->arch.guest_fpu and
vcpu->arch.user_fpu.

For dynamic allocation in kvm_save_current_fpu, you can retrieve the
XINUSE bitmask (XGETBV with RCX=1). If it contains any bits that have
optional state, you check if KVM's vcpu->arch.guest_fpu or
vcpu->arch.user_fpu are already big enough, and if not do the reallocation.

If X86_FEATURE_XGETBV1 is not available, you will not need to resize.
If XFD is supported but X86_FEATURE_XGETBV1 is not, you can just make
kvm_arch_init fail with -ENODEV. It's a nonsensical combination.

Thanks,

Paolo

> In addition, loading the guest XFD MSRs should use the MSR autoload
> feature (add_atomic_switch_msr).
>
> And then I guess discussion goes on..
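
A sketch of that XINUSE check; xgetbv() here is the kernel helper from asm/fpu/xcr.h, while the size lookup at the end is a hypothetical placeholder:

static bool kvm_fpu_needs_realloc(unsigned int cur_size)
{
        u64 xinuse;

        if (!boot_cpu_has(X86_FEATURE_XGETBV1))
                return false;           /* no optional state can be in use */

        /* XINUSE: which state components are not in their init state. */
        xinuse = xgetbv(1);

        if (!(xinuse & xfeatures_mask_user_dynamic))
                return false;

        /* get_xstate_size_for() stands in for the real size calculation. */
        return cur_size < get_xstate_size_for(xinuse);
}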

2021-10-05 09:58:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 10/28] x86/fpu/xstate: Update the XSTATE save function to support dynamic states

On Tue, Oct 05 2021 at 09:50, Paolo Bonzini wrote:
> On 02/10/21 23:31, Thomas Gleixner wrote:
>> You have two options:
>>
>> 1) Always allocate the large buffer size which is required to
>> accommodate all possible features.
>>
>> Trivial, but waste of memory.
>>
>> 2) Make the allocation dynamic which seems to be trivial to do in
>> kvm_load_guest_fpu() at least for vcpu->user_fpu.
>>
>> The vcpu->guest_fpu handling can probably be postponed to the
>> point where AMX is actually exposed to guests, but it's probably
>> not the worst idea to think about the implications now.
>>
>> Paolo, any opinions?
>
> Unless we're missing something, dynamic allocation should not be hard to
> do for both guest_fpu and user_fpu; either near the call sites of
> kvm_save_current_fpu, or in the function itself. Basically adding
> something like
>
> struct kvm_fpu {
>         struct fpu *state;
>         unsigned size;
> } user_fpu, guest_fpu;
>
> to struct kvm_vcpu. Since the size can vary, it can be done simply with
> kzalloc instead of the x86_fpu_cache that KVM has now.
>
> The only small complication is that kvm_save_current_fpu is called
> within fpregs_lock; the allocation has to be outside so that you can use
> GFP_KERNEL even on RT kernels. If the code looks better with
> fpregs_lock moved within kvm_save_current_fpu, go ahead and do it like that.

I'm reworking quite some of this already and with the new bits you don't
have to do anything in kvm_fpu because the size and allowed feature bits
are going to be part of fpu->fpstate.

Stay tuned.

Thanks,

tglx