by Eric Biggers

Subject: Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs

On Mon, Apr 08, 2024 at 07:41:44AM +0000, David Laight wrote:
> From: Eric Biggers
> > Sent: 05 April 2024 20:19
> ...
> > I did some tests on Sapphire Rapids using a system call that I customized to do
> > nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
> >
> > On average the bare syscall took 70 ns. The syscall with the kernel_fpu_begin /
> > kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
> > if it used ymm, or 360 ns if it used zmm...
> >
> > Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
> > instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
> > On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
> > takes about 2235 ns. With xts-aes-vaes-avx10_512 it takes 75 ns...
>
> So most of the cost of a single 512-byte sector is the kernel_fpu_begin().
> But every other way is so much slower that it is still faster.
>

Yes. To clarify, the 75 ns time I mentioned for a 512-byte sector is the
average over repeated calls, which amortizes the XSAVE and XRSTOR. A real single
512-byte sector has to absorb the entire cost of the XSAVE and XRSTOR by itself.
If all state is in use, that should be about 75 + (360 - 70) = 365 ns (based on
the syscall benchmarks I did), with the XSAVE and XRSTOR accounting for about
80% of that time. But yes, that's still over 6 times faster than the scalar
alternative.

- Eric

2024-04-05 07:58:35

by Herbert Xu

Subject: Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs

Eric Biggers <[email protected]> wrote:
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
>
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later. However, an implementation using plain AESNI + AVX is also added
> and provides a small boost on older CPUs too.
>
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro. This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
>
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake. These CPU models default to an
> implementation using ymm registers instead.
>
> This patchset increases the throughput of AES-256-XTS decryption by the
> following amounts on the following CPUs:
>
>                       | 4096-byte messages | 512-byte messages |
> ----------------------+--------------------+-------------------+
> Intel Skylake         |         1%         |        11%        |
> Intel Ice Lake        |        92%         |        59%        |
> Intel Sapphire Rapids |       115%         |        78%        |
> AMD Zen 1             |        25%         |        20%        |
> AMD Zen 2             |        26%         |        20%        |
> AMD Zen 3             |        82%         |        40%        |
> AMD Zen 4             |       118%         |        48%        |
>
> (The results for encryption are very similar to decryption. I just tend
> to measure decryption because decryption performance is more important.)
>
> There's no separate kconfig option for the new AES-XTS implementations,
> as they are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
>
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API. They are prioritized
> appropriately so that the best one for the CPU is used by default.
>
> Open questions:
>
> - Is the policy that I implemented for preferring ymm registers to zmm
>   registers the right one? arch/x86/crypto/poly1305_glue.c thinks that
>   only Skylake has the bad downclocking. My current proposal is a bit
>   more conservative; it also excludes Ice Lake and Tiger Lake. Those
>   CPUs supposedly still have some downclocking, though not as much.
>
> - Should the policy on the use of zmm registers be in a centralized
>   place? It probably doesn't make sense to have random different
>   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
>
> - Are there any other known issues with using AVX512 in kernel mode? It
>   seems to work, and technically it's not new because Poly1305 and ARIA
>   already use AVX512, including the mask registers and zmm registers up
>   to 31. So if there was a major issue, like the new registers not
>   being properly saved and restored, it probably would have already been
>   found. But AES-XTS support would introduce a wider use of it.
>
> Eric Biggers (6):
>   x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
>   crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
>   crypto: x86/aes-xts - wire up AESNI + AVX implementation
>   crypto: x86/aes-xts - wire up VAES + AVX2 implementation
>   crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
>   crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
>
>  arch/x86/Kconfig.assembler           |  10 +
>  arch/x86/crypto/Makefile             |   3 +-
>  arch/x86/crypto/aes-xts-avx-x86_64.S | 796 +++++++++++++++++++++++++++
>  arch/x86/crypto/aesni-intel_glue.c   | 263 ++++++++-
>  4 files changed, 1070 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S
>
>
> base-commit: 4cece764965020c22cff7665b18a012006359095

All applied. Thanks.
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt