2021-06-23 12:24:56

by Thomas Gleixner

[permalink] [raw]
Subject: [patch V4 00/65] x86/fpu: Spring cleaning and PKRU sanitizing

The main parts of this series are:

- Simplification and removal/replacement of redundant and/or
overengineered code.

- Name space cleanup as the existing names were just a permanent source
of confusion.

- Clear seperation of user ABI and kernel internal state handling.

- Removal of PKRU from being XSTATE managed in the kernel because PKRU
has to be eagerly restored on context switch and keeping it in sync
in the xstate buffer is just pointless overhead and fragile.

The kernel still XSAVEs PKRU on context switch but the value in the
buffer is not longer used and never restored from the buffer.

This still needs to be cleaned up, but the series is already 40+
patches large and the cleanup of this is not a functional problem.

The functional issues of PKRU management are fully addressed with the
series as is.

- Cleanup of fpu signal restore

- Make the fast path self contained. Handle #PF directly and skip
the slow path on any other exception as that will just end up
with the same result that the frame is invalid. This allows
the compiler to optimize the slow path out for 64bit kernels
w/o ia32 emulation.

- Reduce code duplication and unnecessary operations

It applies on top of

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

and is also available via git:

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/fpu

This is a follow up to V3 which can be found here:

https://lore.kernel.org/r/[email protected]

Changes vs. V3:

- Dropped the two bugfixes which are applied already and rebased on top

- Addressed review comments (Andy, Boris)

Patches: 13, 35, 36, 37, 46, 58, 62, 63

- Fixed the math-emu fallout which I had stashed safely on the 32bit
testbox (Boris)

Patch: 28

- Picked up tags

Thanks to everyone for review, feedback and testing (various teams
@Intel).

Note: I've not picked up any tested-by tags. It would be nice to have
them on this hopefully final version.

Thanks,

tglx
---
arch/x86/events/intel/lbr.c | 6
arch/x86/include/asm/fpu/internal.h | 202 ++++------
arch/x86/include/asm/fpu/xstate.h | 78 +++-
arch/x86/include/asm/pgtable.h | 57 ---
arch/x86/include/asm/pkeys.h | 9
arch/x86/include/asm/pkru.h | 62 +++
arch/x86/include/asm/processor.h | 9
arch/x86/include/asm/special_insns.h | 14
arch/x86/kernel/cpu/common.c | 34 -
arch/x86/kernel/fpu/core.c | 282 +++++++--------
arch/x86/kernel/fpu/init.c | 15
arch/x86/kernel/fpu/regset.c | 223 ++++++------
arch/x86/kernel/fpu/signal.c | 419 ++++++++++------------
arch/x86/kernel/fpu/xstate.c | 645 +++++++++++++----------------------
arch/x86/kernel/process.c | 22 +
arch/x86/kernel/process_64.c | 28 +
arch/x86/kernel/traps.c | 5
arch/x86/kvm/svm/sev.c | 1
arch/x86/kvm/x86.c | 56 +--
arch/x86/math-emu/fpu_proto.h | 2
arch/x86/math-emu/load_store.c | 2
arch/x86/math-emu/reg_ld_str.c | 2
arch/x86/mm/extable.c | 2
arch/x86/mm/fault.c | 2
arch/x86/mm/pkeys.c | 22 -
include/linux/pkeys.h | 4
26 files changed, 1037 insertions(+), 1166 deletions(-)



2021-06-25 14:35:32

by kernel test robot

[permalink] [raw]
Subject: Re: [patch V4 00/65] x86/fpu: Spring cleaning and PKRU sanitizing

Hi Thomas,

On Wed, Jun 23, 2021 at 02:01:27PM +0200, Thomas Gleixner wrote:
> The main parts of this series are:
>
> - Simplification and removal/replacement of redundant and/or
> overengineered code.
>
> - Name space cleanup as the existing names were just a permanent source
> of confusion.
>
> - Clear seperation of user ABI and kernel internal state handling.
>
> - Removal of PKRU from being XSTATE managed in the kernel because PKRU
> has to be eagerly restored on context switch and keeping it in sync
> in the xstate buffer is just pointless overhead and fragile.
>
> The kernel still XSAVEs PKRU on context switch but the value in the
> buffer is not longer used and never restored from the buffer.
>
> This still needs to be cleaned up, but the series is already 40+
> patches large and the cleanup of this is not a functional problem.
>
> The functional issues of PKRU management are fully addressed with the
> series as is.
>
> - Cleanup of fpu signal restore
>
> - Make the fast path self contained. Handle #PF directly and skip
> the slow path on any other exception as that will just end up
> with the same result that the frame is invalid. This allows
> the compiler to optimize the slow path out for 64bit kernels
> w/o ia32 emulation.
>
> - Reduce code duplication and unnecessary operations
>
> It applies on top of
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
>
> and is also available via git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/fpu

0-Day kernel CI tested this branch from performance view, choosing some sub-tests
from will-it-scale and stress-ng (detail as below)

Test Summary
============
no obvious performance changes found vs v5.13-rc7 so far

Test Environment
================
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/log/?h=x86/fpu
* a263e30a4f111 (tglx-devel/x86/fpu) x86/fpu/signal: Let xrstor handle the features to init
* cf917e53a97e9 x86/fpu/signal: Handle #PF in the direct restore path
* 7b269dff0be66 x86/fpu: Return proper error codes from user access functions
* d2d6be9e16386 x86/fpu/signal: Split out the direct restore code
...
* d3ca29fb9911f x86/fpu: Fix copy_xstate_to_kernel() gap handling
* 58a664af246e6 Merge branch 'x86/fpu' of ../tip into x86/fpu
|\
| * b7c11876d24bd (tip/x86/fpu, peterz-queue/x86/fpu) selftests/x86: Test signal frame XSTATE header corruption handling
...
* | 13311e74253fe (tag: v5.13-rc7


64bit kernel testing, upon below platforms:
(1)
model: Cascade Lake
Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
nr_node: 2
nr_cpu: 88
memory: 128G
(2)
model: Ice Lake
nr_node: 2
nr_cpu: 96
memory: 256G


32bit kernel testing, upon below platform:
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
model: Ivy Bridge
nr_node: 1
nr_cpu: 8
memory: 16G


tested below test suites:
will-it-scale-performance-context_switch1
will-it-scale-performance-page_fault1
will-it-scale-performance-poll1
will-it-scale-performance-pthread_mutex1
will-it-scale-performance-writeseek1

regarding stress-ng, either no gap or data is unstable on
64bit (1) platform

commit:
v5.13-rc7
a263e30a4f111e13a6c859102d1b319a79b072c0

v5.13-rc7 a263e30a4f111e13a6c859102d1
---------------- ---------------------------
%stddev %change %stddev
\ | \
2.786e+08 ? 2% -1.3% 2.751e+08 stress-ng.switch.ops
4643776 ? 2% -1.3% 4585061 stress-ng.switch.ops_per_sec
--
25884344 ? 15% +5.3% 27252860 ? 12% stress-ng.nanosleep.ops
431397 ? 15% +5.3% 454204 ? 12% stress-ng.nanosleep.ops_per_sec
--
5.48e+09 ? 2% +4.3% 5.713e+09 ? 3% stress-ng.yield.ops
91326562 ? 2% +4.3% 95215138 ? 3% stress-ng.yield.ops_per_sec
--
6723205 ? 3% -2.0% 6585690 ? 9% stress-ng.tee.ops
112049 ? 3% -2.0% 109755 ? 9% stress-ng.tee.ops_per_sec
--
3078530 -0.2% 3072610 stress-ng.zombie.ops
49390 -0.2% 49295 stress-ng.zombie.ops_per_sec
--
2261601 -0.1% 2259339 stress-ng.cyclic.ops
37693 -0.1% 37655 stress-ng.cyclic.ops_per_sec
--
1.543e+09 ? 16% +0.4% 1.55e+09 ? 19% stress-ng.fifo.ops
25721633 ? 16% +0.4% 25831339 ? 19% stress-ng.fifo.ops_per_sec
--
192171 +0.6% 193383 stress-ng.clone.ops
2998 +0.6% 3016 stress-ng.clone.ops_per_sec
--
88703410 -0.5% 88257268 stress-ng.sleep.ops
1470841 -0.5% 1463509 stress-ng.sleep.ops_per_sec
--
401598 +0.1% 401990 stress-ng.netlink-proc.ops
6692 +0.1% 6699 stress-ng.netlink-proc.ops_per_sec
--
1.182e+08 +2.5% 1.212e+08 stress-ng.softlockup.ops
1969636 +2.5% 2019730 stress-ng.softlockup.ops_per_sec
--
1.866e+08 -0.2% 1.863e+08 stress-ng.fanotify.ops
4754116 ? 12% -2.8% 4620280 ? 5% stress-ng.fanotify.ops_per_sec
--
17926280 -0.3% 17866738 stress-ng.vforkmany.ops
286256 +0.4% 287262 stress-ng.vforkmany.ops_per_sec
--
1.381e+09 ? 13% +8.7% 1.501e+09 ? 19% stress-ng.fifo.ops
23013090 ? 13% +8.7% 25020209 ? 19% stress-ng.fifo.ops_per_sec
--
1.165e+08 ? 2% -2.0% 1.141e+08 ? 3% stress-ng.splice.ops
1941398 ? 2% -2.0% 1901755 ? 3% stress-ng.splice.ops_per_sec
--
5476091 -0.2% 5466962 stress-ng.vfork.ops
91262 -0.2% 91116 stress-ng.vfork.ops_per_sec
--
21878786 -1.4% 21574984 stress-ng.netlink-task.ops
364646 -1.4% 359582 stress-ng.netlink-task.ops_per_sec
--
5023286 +1.6% 5101458 stress-ng.pthread.ops
83554 +1.6% 84869 stress-ng.pthread.ops_per_sec
--
92688395 ? 2% -8.3% 84967931 ? 4% stress-ng.futex.ops
1544792 ? 2% -8.3% 1416117 ? 4% stress-ng.futex.ops_per_sec
--
5416582 ? 2% +1.6% 5504734 ? 5% stress-ng.wait.ops
90275 ? 2% +1.6% 91744 ? 5% stress-ng.wait.ops_per_sec
--
7.939e+08 ? 4% +0.7% 7.998e+08 stress-ng.mq.ops
13231931 ? 4% +0.7% 13329018 stress-ng.mq.ops_per_sec
--
3.936e+08 ? 53% +51.5% 5.965e+08 ? 37% stress-ng.hrtimers.ops
6416615 ? 52% +49.8% 9609730 ? 38% stress-ng.hrtimers.ops_per_sec
--
35268682 -1.1% 34872339 stress-ng.sendfile.ops
587805 -1.1% 581200 stress-ng.sendfile.ops_per_sec
--
18126587 -0.7% 18000304 ? 3% stress-ng.sem-sysv.ops
302108 -0.7% 300004 ? 3% stress-ng.sem-sysv.ops_per_sec
--
9.953e+08 ? 2% -0.4% 9.912e+08 stress-ng.pipe.ops
16589030 ? 2% -0.4% 16519202 stress-ng.pipe.ops_per_sec
--
1056000 +0.0% 1056000 stress-ng.schedpolicy.ops
17599 +0.0% 17599 stress-ng.schedpolicy.ops_per_sec
--
2.481e+09 +1.2% 2.51e+09 stress-ng.vm-splice.ops
41349053 +1.2% 41828864 stress-ng.vm-splice.ops_per_sec
--
56805932 ? 2% +1.1% 57451245 ? 2% stress-ng.eventfd.ops
946751 ? 2% +1.1% 957505 ? 2% stress-ng.eventfd.ops_per_sec
--
1605 -1.2% 1585 ? 2% stress-ng.mmapfork.ops
25.18 ? 4% -0.8% 24.98 ? 4% stress-ng.mmapfork.ops_per_sec
--
2549139 +1.2% 2578537 stress-ng.fork.ops
42485 +1.2% 42975 stress-ng.fork.ops_per_sec
--
16582541 ? 3% -10.8% 14792067 ? 21% stress-ng.close.ops
276357 ? 3% -10.8% 246515 ? 21% stress-ng.close.ops_per_sec
--
1.847e+09 -1.3% 1.822e+09 stress-ng.pipeherd.ops
30743202 -1.3% 30330538 stress-ng.pipeherd.ops_per_sec
--
6780757 ? 2% +4.3% 7070424 stress-ng.tee.ops
113008 ? 2% +4.3% 117835 stress-ng.tee.ops_per_sec
--
3421466 +0.0% 3422853 stress-ng.dnotify.ops
57024 +0.0% 57047 stress-ng.dnotify.ops_per_sec
--
749895 -0.9% 742801 stress-ng.kill.ops
12498 -0.9% 12379 stress-ng.kill.ops_per_sec
--
3.761e+08 -2.5% 3.668e+08 stress-ng.msg.ops
6266833 -2.5% 6112949 stress-ng.msg.ops_per_sec
--
2258479 +0.1% 2260255 stress-ng.daemon.ops
37641 +0.1% 37670 stress-ng.daemon.ops_per_sec
--
40560445 -0.8% 40239283 stress-ng.affinity.ops
668138 -1.6% 657142 stress-ng.affinity.ops_per_sec
--
21230 ? 4% -0.9% 21047 ? 2% stress-ng.inotify.ops
353.14 ? 4% -0.9% 349.88 ? 2% stress-ng.inotify.ops_per_sec
--
23250444 ? 2% +1.6% 23627954 stress-ng.fault.ops
387495 ? 2% +1.6% 393795 stress-ng.fault.ops_per_sec
--
3.94e+08 -0.2% 3.933e+08 stress-ng.sem.ops
6567438 -0.2% 6555371 stress-ng.sem.ops_per_sec
--
698544 -0.0% 698480 stress-ng.session.ops
11494 +0.0% 11495 stress-ng.session.ops_per_sec

>
> This is a follow up to V3 which can be found here:
>
> https://lore.kernel.org/r/[email protected]
>
> Changes vs. V3:
>
> - Dropped the two bugfixes which are applied already and rebased on top
>
> - Addressed review comments (Andy, Boris)
>
> Patches: 13, 35, 36, 37, 46, 58, 62, 63
>
> - Fixed the math-emu fallout which I had stashed safely on the 32bit
> testbox (Boris)
>
> Patch: 28
>
> - Picked up tags
>
> Thanks to everyone for review, feedback and testing (various teams
> @Intel).
>
> Note: I've not picked up any tested-by tags. It would be nice to have
> them on this hopefully final version.
>
> Thanks,
>
> tglx
> ---
> arch/x86/events/intel/lbr.c | 6
> arch/x86/include/asm/fpu/internal.h | 202 ++++------
> arch/x86/include/asm/fpu/xstate.h | 78 +++-
> arch/x86/include/asm/pgtable.h | 57 ---
> arch/x86/include/asm/pkeys.h | 9
> arch/x86/include/asm/pkru.h | 62 +++
> arch/x86/include/asm/processor.h | 9
> arch/x86/include/asm/special_insns.h | 14
> arch/x86/kernel/cpu/common.c | 34 -
> arch/x86/kernel/fpu/core.c | 282 +++++++--------
> arch/x86/kernel/fpu/init.c | 15
> arch/x86/kernel/fpu/regset.c | 223 ++++++------
> arch/x86/kernel/fpu/signal.c | 419 ++++++++++------------
> arch/x86/kernel/fpu/xstate.c | 645 +++++++++++++----------------------
> arch/x86/kernel/process.c | 22 +
> arch/x86/kernel/process_64.c | 28 +
> arch/x86/kernel/traps.c | 5
> arch/x86/kvm/svm/sev.c | 1
> arch/x86/kvm/x86.c | 56 +--
> arch/x86/math-emu/fpu_proto.h | 2
> arch/x86/math-emu/load_store.c | 2
> arch/x86/math-emu/reg_ld_str.c | 2
> arch/x86/mm/extable.c | 2
> arch/x86/mm/fault.c | 2
> arch/x86/mm/pkeys.c | 22 -
> include/linux/pkeys.h | 4
> 26 files changed, 1037 insertions(+), 1166 deletions(-)
>
>