From: David Laight <David.Laight@ACULAB.COM>
To: 'Eric Biggers' <ebiggers@kernel.org>
CC: Ard Biesheuvel <ardb@kernel.org>, "linux-crypto@vger.kernel.org"
	<linux-crypto@vger.kernel.org>, "x86@kernel.org" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "Andy
 Lutomirski" <luto@kernel.org>, "Chang S . Bae" <chang.seok.bae@intel.com>
Subject: RE: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
Date: Thu, 4 Apr 2024 07:53:48 +0000
Message-ID: <142077804bee45daac3b0fad8bc4c2fe@AcuMS.aculab.com>
References: <20240326080305.402382-1-ebiggers@kernel.org>
 <CAMj1kXH4fNevFzrbazJptadxh_spEY3W91FZni5eMqD+UKrSUQ@mail.gmail.com>
 <20240326164755.GB1524@sol.localdomain>
 <6629b8120807458ab76e1968056f5e10@AcuMS.aculab.com>
 <20240404013529.GB24248@quark.localdomain>
In-Reply-To: <20240404013529.GB24248@quark.localdomain>

From: Eric Biggers
> Sent: 04 April 2024 02:35
>
> Hi David,
>
> On Wed, Apr 03, 2024 at 08:12:09AM +0000, David Laight wrote:
> > From: Eric Biggers
> > > Sent: 26 March 2024 16:48
> > ....
> > > Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
> > > on 4096-byte messages in MB/s I'm seeing:
> > >
> > >     xts-aes-aesni                  5136
> > >     xts-aes-aesni-avx              5366
> > >     xts-aes-vaes-avx2              9337
> > >     xts-aes-vaes-avx10_256         9876
> > >     xts-aes-vaes-avx10_512         10215
> > >
> > > So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
> > > But taking advantage of AVX512 does help a bit more, first from the parts other
> > > than 512-bit registers, then a bit more from 512-bit registers.
> >
> > How much does the kernel_fpu_begin() cost on real workloads?
> > (ie when the registers are live and it forces an extra save/restore)
>
> x86 Linux does lazy restore of the FPU state.  The first kernel_fpu_begin() can
> have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> causes an XRSTOR (or equivalent) instruction to be issued when returning to
> userspace when it otherwise might not be needed.  Additional kernel_fpu_begin()
> / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> as they don't cause any more saves or restores of the FPU state to be done.
>
> My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> pair per message (if the message doesn't span any page boundaries, which is
> almost always the case).  That's exactly the same as the current xts-aes-aesni.

I realised after sending it that the code almost certainly already did
kernel_fpu_begin() - so there probably isn't a difference because all the
fpu state is always saved.
(I'm sure there should be a way of getting access to (say) 2 ymm registers
by providing an on-stack save area to allow wide data copies or special
instructions - but that is a different issue.)

> I think what you may really be asking is how much the overhead of the XSAVE /
> XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> That's much more relevant to this patchset.

It depends on what has to be saved, not on what is used.
Although, since all the x/y/zmm registers are caller-saved I think they could
be 'zapped' on syscall entry (and restored as zero later).
Trouble is I suspect there is a single piece of code somewhere that relies
on them being preserved across an inlined system call.

> I think the answer is that there is no additional overhead.  This is because the
> XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> and it operates on the userspace state, not the kernel's.  Some of the newer
> variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> they don't save parts of the state that are unmodified since the last XRSTOR;
> however, that is unimportant here because the kernel's FPU state is never saved.
>
> (This would change if x86 Linux were to support preemption of kernel-mode FPU
> code.  In that case, we may need to take more care to minimize use of AVX and
> AVX512 state.  That being said, AES-XTS tends to be used for bulk data anyway.)
>
> This is based on theory, though.  I'll do a test to confirm that there's indeed
> no additional overhead.  And also, even if there's no additional overhead, what
> the existing overhead actually is.

Yes, I was wondering how it is used for 'real applications'.
If a system call that would normally return immediately (or at least without
a full process switch) hits the AES code it gets the cost of the XSAVE added.
Whereas the benchmark probably doesn't do anywhere near as many XSAVEs.

OTOH this is probably no different.
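
To make the accounting concrete, here is a toy userspace model of the lazy
restore behaviour described above. It is only a sketch, not kernel code:
every model_* name and the struct are made up for illustration, and the
only thing it tracks is how many saves and restores would be issued.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not kernel APIs. */
struct fpu_model {
	bool user_state_in_regs;	/* userspace FPU state is live in the registers */
	bool in_kernel_fpu;		/* inside a begin/end section */
	unsigned int saves;		/* modelled XSAVE count */
	unsigned int restores;		/* modelled XRSTOR count */
};

static void model_kernel_fpu_begin(struct fpu_model *m)
{
	assert(!m->in_kernel_fpu);
	/* Only the first begin after entry from userspace has to save. */
	if (m->user_state_in_regs) {
		m->saves++;
		m->user_state_in_regs = false;
	}
	m->in_kernel_fpu = true;
}

static void model_kernel_fpu_end(struct fpu_model *m)
{
	assert(m->in_kernel_fpu);
	/* Nothing is restored here; that is deferred to return to userspace. */
	m->in_kernel_fpu = false;
}

static void model_return_to_user(struct fpu_model *m)
{
	assert(!m->in_kernel_fpu);
	/* Restore only if the registers no longer hold the userspace state. */
	if (!m->user_state_in_regs) {
		m->restores++;
		m->user_state_in_regs = true;
	}
}

/* One modelled system call handling nmsgs messages, one pair per message. */
static void model_syscall(struct fpu_model *m, unsigned int nmsgs)
{
	for (unsigned int i = 0; i < nmsgs; i++) {
		model_kernel_fpu_begin(m);
		/* ... vector registers would be clobbered here ... */
		model_kernel_fpu_end(m);
	}
	model_return_to_user(m);
}
```

Starting with user_state_in_regs set, a modelled syscall that handles 8
messages bumps saves and restores by one each, the same as a syscall that
handles 1 message; a syscall that never touches the FPU bumps neither.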

>
> > I've not looked at the code but I often see what looks like
> > excessive inlining in crypto code.
> > This will speed up benchmarks but can have a negative effect
> > on real code both because of the time taken to load the
> > code and the effect of displacing other code.
> >
> > It might be that this code is a simple loop....
>
> This is a different topic.  By "inlining" I assume that you also mean things
> like loop unrolling.  I totally agree that some of the crypto assembly code goes
> way overboard on this, resulting in an unreasonably large machine code size.
> The AVX implementation of AES-GCM (aesni-intel_avx-x86_64.S), which was written
> by Intel, is the worst offender by far, generating 256011 bytes of machine code.
> In OpenSSL, Intel has even taken that to the next level with their VAES
> optimized implementation of AES-GCM generating 696040 bytes of machine code.

That is truly stunning!
I can't believe anything that big is actually 'optimised'.
Just think of all the TLB misses :-)
Unless it is slightly faster if you are encrypting several TB of data.
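
For a rough sense of scale, the sizes quoted above can be turned into page
counts. This is only back-of-envelope arithmetic: pages_spanned() is a
made-up helper, and it assumes 4096-byte pages and page-aligned code
(a mapping that uses larger pages would need far fewer TLB entries).

```c
#include <stddef.h>

/*
 * Hypothetical helper: minimum number of pages occupied by a contiguous,
 * page-aligned region of 'bytes' bytes.
 */
static size_t pages_spanned(size_t bytes, size_t page_size)
{
	return (bytes + page_size - 1) / page_size;
}
```

With page_size = 4096 that is 170 pages for the 696040-byte OpenSSL code
and 63 pages for the 256011-byte aesni-intel_avx-x86_64.S code, if all of
it were touched.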

...
> So, I think my current proposal is at a reasonable place regarding compiled code
> size, especially when it's compared to the monstrosity that is some of the
> existing crypto assembly code.  But let me know if there are any specific
> choices I've made that you may have a different opinion on.

At least you've thought about code size.

	David
