2021-03-26 04:40:17

by Andy Lutomirski

Subject: Why does glibc use AVX-512?

Hi all-

glibc appears to use AVX512F for memcpy by default. (Unless
Prefer_ERMS is default-on, but I genuinely can't tell if this is the
case. I did some searching.) The commit adding it refers to a 2016
email saying that it's 30% on KNL. Unfortunately, AVX-512 is now
available in normal hardware, and the overhead from switching between
normal and AVX-512 code appears to vary from bad to genuinely
horrible. And, once anything has used the high parts of YMM and/or
ZMM, those states tend to get stuck with XINUSE=1.
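
For reference, XINUSE can be observed from user space: on CPUs that support it, XGETBV with ECX=1 returns XCR0 AND XINUSE. What follows is a minimal sketch, assuming GCC or Clang on x86. The helper names and the `XFEATURE_*` macros are illustrative, not glibc or kernel API.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* XSAVE state-component bits (Intel SDM numbering). */
#define XFEATURE_YMM        (1ull << 2)  /* upper halves of YMM0-15 */
#define XFEATURE_OPMASK     (1ull << 5)  /* k0-k7 */
#define XFEATURE_ZMM_HI256  (1ull << 6)  /* upper halves of ZMM0-15 */
#define XFEATURE_HI16_ZMM   (1ull << 7)  /* ZMM16-31 */

/* Pure helper: does the mask show any AVX-512 component in use? */
static int avx512_state_in_use(uint64_t mask)
{
    return (mask & (XFEATURE_OPMASK | XFEATURE_ZMM_HI256 |
                    XFEATURE_HI16_ZMM)) != 0;
}

/* Stores XCR0 & XINUSE in *mask and returns 0, or returns -1 if
   XGETBV with ECX=1 is not available on this CPU. */
static int read_xinuse(uint64_t *mask)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    uint32_t lo, hi;

    /* CPUID.1:ECX bit 27 (OSXSAVE): the OS has enabled XGETBV. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return -1;
    /* CPUID.(EAX=0xD,ECX=1):EAX bit 2: XGETBV accepts ECX=1. */
    if (!__get_cpuid_count(0xd, 1, &eax, &ebx, &ecx, &edx) ||
        !(eax & (1u << 2)))
        return -1;

    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(1));
    *mask = ((uint64_t)hi << 32) | lo;
    return 0;
#else
    (void)mask;
    return -1;
#endif
}
```

Calling `read_xinuse()` before and after a routine that touches ZMM registers, and passing the result to `avx512_state_in_use()`, shows whether the state stays marked in use.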

I'm wondering whether glibc should stop using AVX-512 by default.

Meanwhile, some of you may have noticed a little ABI break we have.
On AVX-512 hardware, the size of a signal frame is unreasonably large,
and this is causing problems even for existing software that doesn't
use AVX-512. Do any of you have any clever ideas for how to fix it?
We have some kernel patches around to try to fail more cleanly, but we
still fail.

I think we should seriously consider solutions in which, for new
tasks, XCR0 has new giant features (e.g. AMX) and possibly even
AVX-512 cleared, and programs need to explicitly request enablement.
This would allow programs to opt into not saving/restoring across
signals or to save/restore in buffers supplied when the feature is
enabled. This has all kinds of pros and cons, and I'm not sure it's a
great idea. But, in the absence of some change to the ABI, the
default outcome is that, on AMX-enabled kernels on AMX-enabled
hardware, the signal frame will be more than 8kB, and this will affect
*every* signal regardless of whether AMX is in use.

--Andy


2021-03-26 10:11:17

by Borislav Petkov

Subject: Re: Why does glibc use AVX-512?

On Thu, Mar 25, 2021 at 09:38:24PM -0700, Andy Lutomirski wrote:
> I think we should seriously consider solutions in which, for new
> tasks, XCR0 has new giant features (e.g. AMX) and possibly even
> AVX-512 cleared, and programs need to explicitly request enablement.

I totally agree with making this depend on an explicit user request,
but...

> This would allow programs to opt into not saving/restoring across
> signals or to save/restore in buffers supplied when the feature is
> enabled. This has all kinds of pros and cons, and I'm not sure it's a
> great idea. But, in the absence of some change to the ABI, the
> default outcome is that, on AMX-enabled kernels on AMX-enabled
> hardware, the signal frame will be more than 8kB, and this will affect
> *every* signal regardless of whether AMX is in use.

... what's stopping the library from issuing that new ABI call before it
starts the app and get <insert fat feature here> automatically enabled
for everything by default?

And then we'll get the lazy FPU thing all over again.

So the ABI should be explicit user interaction or a kernel cmdline param
or so.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-03-26 12:14:04

by Florian Weimer

Subject: Re: Why does glibc use AVX-512?

* Andy Lutomirski-alpha:

> glibc appears to use AVX512F for memcpy by default. (Unless
> Prefer_ERMS is default-on, but I genuinely can't tell if this is the
> case. I did some searching.) The commit adding it refers to a 2016
> email saying that it's 30% on KNL.

As far as I know, glibc only does that on KNL, and there it is
actually beneficial. The relevant code is:

  /* Since AVX512ER is unique to Xeon Phi, set Prefer_No_VZEROUPPER
     if AVX512ER is available.  Don't use AVX512 to avoid lower CPU
     frequency if AVX512ER isn't available.  */
  if (CPU_FEATURES_CPU_P (cpu_features, AVX512ER))
    cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
      |= bit_arch_Prefer_No_VZEROUPPER;
  else
    cpu_features->preferred[index_arch_Prefer_No_AVX512]
      |= bit_arch_Prefer_No_AVX512;

So it's not just about Prefer_ERMS.
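
The policy in that excerpt can be restated as a standalone check on the CPUID feature bits. This is a simplified sketch, not glibc's code: it ignores OS enablement (XCR0) and every other preference flag, and the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=7,ECX=0):EBX feature bits. */
#define CPUID7_EBX_AVX512F   (1u << 16)
#define CPUID7_EBX_AVX512ER  (1u << 27)

/* Mirrors the quoted policy: AVX-512 string functions are preferred only
   when AVX512ER (Xeon Phi) is present alongside AVX512F. */
static bool prefer_avx512_strings(uint32_t cpuid7_ebx)
{
    bool has_f  = (cpuid7_ebx & CPUID7_EBX_AVX512F) != 0;
    bool has_er = (cpuid7_ebx & CPUID7_EBX_AVX512ER) != 0;

    return has_f && has_er;
}

/* Reads leaf 7 on the current CPU; returns 0 where it is unavailable. */
static uint32_t read_cpuid7_ebx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return ebx;
#endif
    return 0;
}
```

On a mainstream AVX-512 part, which has AVX512F but not AVX512ER, `prefer_avx512_strings(read_cpuid7_ebx())` is false.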

> I think we should seriously consider solutions in which, for new
> tasks, XCR0 has new giant features (e.g. AMX) and possibly even

I think the AMX programming model will be different, yes.

> AVX-512 cleared, and programs need to explicitly request enablement.
> This would allow programs to opt into not saving/restoring across
> signals or to save/restore in buffers supplied when the feature is
> enabled.

Isn't XSAVEOPT already able to handle that?

In glibc, we use XSAVE/XSAVEC for the dynamic loader trampoline, so it
should not needlessly enable AVX-512 state today, while still enabling
AVX-512 calling conventions transparently.

There is a discussion about using the higher (AVX-512-only) %ymm
registers, to avoid the %xmm transition penalty without the need for
VZEROUPPER. (VZEROUPPER is incompatible with RTM from a performance
point of view.) That would perhaps negatively impact XSAVEOPT.

Assuming you can make XSAVEOPT work for you on the kernel side, my
instincts tell me that we should have markup for RTM, not for AVX-512.
This way, we could avoid use of the AVX-512 registers and keep using
VZEROUPPER, without run-time transaction checks, and deal with other
idiosyncrasies needed for transaction support that users might
encounter once this feature sees more use. But the VZEROUPPER vs RTM
issue is currently stuck in some internal process issue on my end (or
two, come to think of it), which I hope to untangle next month.

2021-03-26 13:34:29

by David Laight

Subject: RE: Why does glibc use AVX-512?

From: Andy Lutomirski
> Sent: 26 March 2021 04:38
>
> Hi all-
>
> glibc appears to use AVX512F for memcpy by default. (Unless
> Prefer_ERMS is default-on, but I genuinely can't tell if this is the
> case. I did some searching.) The commit adding it refers to a 2016
> email saying that it's 30% on KNL. Unfortunately, AVX-512 is now
> available in normal hardware, and the overhead from switching between
> normal and AVX-512 code appears to vary from bad to genuinely
> horrible. And, once anything has used the high parts of YMM and/or
> ZMM, those states tend to get stuck with XINUSE=1.

Yes I wonder how much faster 'normal' copies ever get because
of these optimisations.
Not many programs sit in a loop repeatedly copying the same 8k buffer.

Not to mention the CPUs where the 'wide' instructions either
use the 'narrow' execution unit twice or run at half frequency.
So while supported, using them isn't really useful.

IIRC the [XYZ]MM registers are all caller saved?
So system calls (or rather the C wrappers) are allowed to
trash them all.
So the system call entry could zero all the [XYZ]MM registers.
I think the XSAVExxx and later XRSTORxxx are then quick.
In particular they don't need saving on a context switch from
a system call.
This might get them marked 'not in use' more often.
But probably not if memcpy() starts using them.
(This doesn't help signal handlers.)

ISTR one CPU family where VZEROUPPER goes from 'cheap' to
'expensive'.

David


2021-03-26 18:16:28

by Andy Lutomirski

Subject: Re: Why does glibc use AVX-512?

On Fri, Mar 26, 2021 at 5:12 AM Florian Weimer <[email protected]> wrote:
>
> * Andy Lutomirski-alpha:
>
> > glibc appears to use AVX512F for memcpy by default. (Unless
> > Prefer_ERMS is default-on, but I genuinely can't tell if this is the
> > case. I did some searching.) The commit adding it refers to a 2016
> > email saying that it's 30% on KNL.
>
> As far as I know, glibc only does that on KNL, and there it is
> actually beneficial. The relevant code is:
>
>   /* Since AVX512ER is unique to Xeon Phi, set Prefer_No_VZEROUPPER
>      if AVX512ER is available.  Don't use AVX512 to avoid lower CPU
>      frequency if AVX512ER isn't available.  */
>   if (CPU_FEATURES_CPU_P (cpu_features, AVX512ER))
>     cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
>       |= bit_arch_Prefer_No_VZEROUPPER;
>   else
>     cpu_features->preferred[index_arch_Prefer_No_AVX512]
>       |= bit_arch_Prefer_No_AVX512;
>
> So it's not just about Prefer_ERMS.

Phew.

>
> > AVX-512 cleared, and programs need to explicitly request enablement.
> > This would allow programs to opt into not saving/restoring across
> > signals or to save/restore in buffers supplied when the feature is
> > enabled.
>
> Isn't XSAVEOPT already able to handle that?
>

Yes, but we need a place to put the data, and we need to acknowledge
that, with the current save-everything-on-signal model, the amount of
time and memory used is essentially unbounded. This isn't great.

>
> There is a discussion about using the higher (AVX-512-only) %ymm
> registers, to avoid the %xmm transition penalty without the need for
> VZEROUPPER. (VZEROUPPER is incompatible with RTM from a performance
> point of view.) That would perhaps negatively impact XSAVEOPT.
>
> Assuming you can make XSAVEOPT work for you on the kernel side, my
> instincts tell me that we should have markup for RTM, not for AVX-512.
> This way, we could avoid use of the AVX-512 registers and keep using
> VZEROUPPER, without run-time transaction checks, and deal with other
> idiosyncrasies needed for transaction support that users might
> encounter once this feature sees more use. But the VZEROUPPER vs RTM
> issue is currently stuck in some internal process issue on my end (or
> two, come to think of it), which I hope to untangle next month.

Can you elaborate on the issue?

2021-03-26 18:21:21

by Andy Lutomirski

Subject: Re: Why does glibc use AVX-512?

On Fri, Mar 26, 2021 at 3:08 AM Borislav Petkov <[email protected]> wrote:
>
> On Thu, Mar 25, 2021 at 09:38:24PM -0700, Andy Lutomirski wrote:
> > I think we should seriously consider solutions in which, for new
> > tasks, XCR0 has new giant features (e.g. AMX) and possibly even
> > AVX-512 cleared, and programs need to explicitly request enablement.
>
> I totally agree with making this depend on an explicit user request,
> but...
>
> > This would allow programs to opt into not saving/restoring across
> > signals or to save/restore in buffers supplied when the feature is
> > enabled. This has all kinds of pros and cons, and I'm not sure it's a
> > great idea. But, in the absence of some change to the ABI, the
> > default outcome is that, on AMX-enabled kernels on AMX-enabled
> > hardware, the signal frame will be more than 8kB, and this will affect
> > *every* signal regardless of whether AMX is in use.
>
> ... what's stopping the library from issuing that new ABI call before it
> starts the app and get <insert fat feature here> automatically enabled
> for everything by default?
>
> And then we'll get the lazy FPU thing all over again.

At the end of the day, it's not the kernel's job to make userspace be
sane or to make users or programmers make the right decisions. But it
is our job to make sure that it's even possible to make the system
work well, and we are responsible for making sure that old binaries
continue to work, preferably well, on new kernels and new hardware.

2021-03-26 19:35:56

by Florian Weimer

Subject: Re: Why does glibc use AVX-512?

* Andy Lutomirski:

>> > AVX-512 cleared, and programs need to explicitly request enablement.
>> > This would allow programs to opt into not saving/restoring across
>> > signals or to save/restore in buffers supplied when the feature is
>> > enabled.
>>
>> Isn't XSAVEOPT already able to handle that?
>>
>
> Yes, but we need a place to put the data, and we need to acknowledge
> that, with the current save-everything-on-signal model, the amount of
> time and memory used is essentially unbounded. This isn't great.

The size has to have a known upper bound, but the save amount can be
dynamic, right?
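
The upper bound in question is enumerated by CPUID leaf 0xD, sub-leaf 0: EBX gives the XSAVE area size for the features currently enabled in XCR0, and ECX the size if every supported feature were enabled. Here is a minimal sketch of reading both, assuming GCC or Clang on x86. The struct and function names are illustrative.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

struct xsave_sizes {
    uint32_t enabled;    /* bytes for features currently set in XCR0 */
    uint32_t supported;  /* bytes if every supported feature were enabled */
};

/* Returns 0 on success, -1 if CPUID leaf 0xD is unavailable. */
static int query_xsave_sizes(struct xsave_sizes *out)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    out->enabled = ebx;
    out->supported = ecx;
    return 0;
#else
    (void)out;
    return -1;
#endif
}
```

The optimized save instructions may write less than this, but a buffer still has to be sized for the worst case.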

How was the old lazy FPU initialization support for i386 implemented?

>> Assuming you can make XSAVEOPT work for you on the kernel side, my
>> instincts tell me that we should have markup for RTM, not for AVX-512.
>> This way, we could avoid use of the AVX-512 registers and keep using
>> VZEROUPPER, without run-time transaction checks, and deal with other
>> idiosyncrasies needed for transaction support that users might
>> encounter once this feature sees more use. But the VZEROUPPER vs RTM
>> issue is currently stuck in some internal process issue on my end (or
>> two, come to think of it), which I hope to untangle next month.
>
> Can you elaborate on the issue?

This is the bug:

vzeroupper use in AVX2 multiarch string functions cause HTM aborts
<https://sourceware.org/bugzilla/show_bug.cgi?id=27457>

Unfortunately we have a bug (outside of glibc) that makes me wonder if
we can actually roll out RTM transaction checks (or any RTM
instruction) on a large scale:

x86: Sporadic failures in tst-cpu-features-cpuinfo
<https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>

The dynamic RTM check might trap due to this bug. (We have a bit more
information about the nature of the bug, currently missing from
Bugzilla.)

I'm also worried that the new dynamic RTM check in the string
functions has a performance impact. Due to its nature, it will be
enabled for every program once running on RTM-capable hardware, not
just those that actually use RTM.

2021-03-26 19:49:46

by Andy Lutomirski

Subject: Re: Why does glibc use AVX-512?

On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <[email protected]> wrote:
>
> * Andy Lutomirski:
>
> >> > AVX-512 cleared, and programs need to explicitly request enablement.
> >> > This would allow programs to opt into not saving/restoring across
> >> > signals or to save/restore in buffers supplied when the feature is
> >> > enabled.
> >>
> >> Isn't XSAVEOPT already able to handle that?
> >>
> >
> > Yes, but we need a place to put the data, and we need to acknowledge
> > that, with the current save-everything-on-signal model, the amount of
> > time and memory used is essentially unbounded. This isn't great.
>
> The size has to have a known upper bound, but the save amount can be
> dynamic, right?
>
> How was the old lazy FPU initialization support for i386 implemented?
>
> >> Assuming you can make XSAVEOPT work for you on the kernel side, my
> >> instincts tell me that we should have markup for RTM, not for AVX-512.
> >> This way, we could avoid use of the AVX-512 registers and keep using
> >> VZEROUPPER, without run-time transaction checks, and deal with other
> >> idiosyncrasies needed for transaction support that users might
> >> encounter once this feature sees more use. But the VZEROUPPER vs RTM
> >> issue is currently stuck in some internal process issue on my end (or
> >> two, come to think of it), which I hope to untangle next month.
> >
> > Can you elaborate on the issue?
>
> This is the bug:
>
> vzeroupper use in AVX2 multiarch string functions cause HTM aborts
> <https://sourceware.org/bugzilla/show_bug.cgi?id=27457>
>
> Unfortunately we have a bug (outside of glibc) that makes me wonder if
> we can actually roll out RTM transaction checks (or any RTM
> instruction) on a large scale:
>
> x86: Sporadic failures in tst-cpu-features-cpuinfo
> <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>

It's worth noting that recent microcode updates have made RTM
considerably less likely to actually work on many parts. It's
possible you should just disable it. :(

2021-03-26 20:37:00

by Florian Weimer

Subject: Re: Why does glibc use AVX-512?

* Andy Lutomirski:

> On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <[email protected]> wrote:
>> x86: Sporadic failures in tst-cpu-features-cpuinfo
>> <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
>
> It's worth noting that recent microcode updates have made RTM
> considerably less likely to actually work on many parts. It's
> possible you should just disable it. :(

Sorry, I'm not sure who should disable it.

Let me sum up the situation:

We have a request for a performance enhancement in glibc, so that
applications can use it on server parts where RTM actually works.

For CPUs that support AVX-512, we may be able to meet that with a
change that uses the new 256-bit registers, to avoid the %xmm
transition penalty. (This is the easy case, hopefully—there shouldn't
be any frequency issues associated with that, and if the kernel
doesn't optimize the context switch today, that's a nonissue as well.)

For CPUs that do not support AVX-512 but support RTM (and AVX2), we
need a dynamic run-time check whether the string function is invoked
in a transaction. In that case, we need to use VZEROALL instead of
VZEROUPPER. (It's apparently too costly to issue VZEROALL
unconditionally.)
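
In C terms, the dynamic check would look roughly like the sketch below. It assumes GCC or Clang on x86, and `clear_upper_ymm` is a hypothetical stand-in for the epilogue of an AVX2 string function, not glibc's implementation. Note that XTEST raises #UD on a CPU without RTM (or HLE), so the flag passed in has to be trustworthy.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <immintrin.h>

/* CPUID.(EAX=7,ECX=0):EBX bit 11 reports RTM on the CPU we run on now. */
static bool cpu_reports_rtm(void)
{
    unsigned int eax, ebx, ecx, edx;

    return __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) &&
           (ebx & (1u << 11)) != 0;
}

/* Requires AVX at run time. Pass rtm_usable = true only when XTEST is
   known not to trap on any CPU this thread can be scheduled on. */
__attribute__((target("avx,rtm")))
static void clear_upper_ymm(bool rtm_usable)
{
    if (rtm_usable && _xtest())
        _mm256_zeroall();    /* inside a transaction: VZEROALL */
    else
        _mm256_zeroupper();  /* normal path: VZEROUPPER */
}
#else
static bool cpu_reports_rtm(void) { return false; }
static void clear_upper_ymm(bool rtm_usable) { (void)rtm_usable; }
#endif
```

The extra branch runs on every call, whether or not the program ever starts a transaction.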

All this needs to work transparently without user intervention. We
cannot require firmware upgrades to fix the incorrect RTM reporting
issue (the bug I referenced). I think we can require software updates
which tell glibc when to use RTM-enabled string functions if the
dynamic selection does not work (either for performance reasons, or
because of the RTM reporting bug).

I want to avoid a situation where one in eight processes fail to work
correctly because the CPUID checks ran on CPU 0, where RTM is reported
as available, and then we trap when executing XTEST on other CPUs.
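
A rough way to look for that inconsistency is to pin the thread to each CPU in turn and compare what CPUID reports. This is a diagnostic sketch only, assuming Linux with glibc's `sched_setaffinity()` and GCC or Clang; the function names are illustrative.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE         /* for sched_setaffinity() and CPU_SET() */
#endif
#include <sched.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* RTM bit as reported by the CPU this thread is running on right now. */
static bool rtm_bit_here(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return (ebx & (1u << 11)) != 0;
#endif
    return false;
}

/* Returns 1 if every CPU in the thread's affinity mask reports the same
   RTM bit, 0 if they disagree, -1 if the affinity calls fail. */
static int rtm_reporting_consistent(void)
{
    cpu_set_t original, one;
    int seen = -1;          /* -1: nothing sampled yet, else 0 or 1 */
    int consistent = 1;

    if (sched_getaffinity(0, sizeof(original), &original) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &original))
            continue;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (sched_setaffinity(0, sizeof(one), &one) != 0)
            continue;       /* CPU went offline or is not permitted */
        int bit = rtm_bit_here();
        if (seen == -1)
            seen = bit;
        else if (seen != bit)
            consistent = 0;
    }

    /* Restore the mask we started with. */
    if (sched_setaffinity(0, sizeof(original), &original) != 0)
        return -1;
    return consistent;
}
```

This is far too heavy for library startup. It is only meant to show what "CPUID ran on CPU 0" can hide.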

2021-03-26 20:50:32

by H.J. Lu

Subject: Re: Why does glibc use AVX-512?

On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer <[email protected]> wrote:

>
> All this needs to work transparently without user intervention. We
> cannot require firmware upgrades to fix the incorrect RTM reporting
> issue (the bug I referenced). I think we can require software updates
> which tell glibc when to use RTM-enabled string functions if the
> dynamic selection does not work (either for performance reasons, or
> because of the RTM reporting bug).
>
> I want to avoid a situation where one in eight processes fail to work
> correctly because the CPUID checks ran on CPU 0, where RTM is reported
> as available, and then we trap when executing XTEST on other CPUs.

glibc can disable RTM based on CPU model and stepping.
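
A sketch of what such a check involves: decode family, model and stepping from CPUID.1:EAX and compare against a disallow list. The decoding follows Intel's documented extended family/model rules. The `rtm_quirk` table format and all names are hypothetical, and the list contents are left to the caller rather than guessed here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_signature {
    unsigned int family;
    unsigned int model;
    unsigned int stepping;
};

/* Decode CPUID.1:EAX using the extended family/model rules. */
static struct cpu_signature decode_signature(uint32_t eax)
{
    struct cpu_signature sig;
    unsigned int base_family = (eax >> 8) & 0xf;
    unsigned int base_model = (eax >> 4) & 0xf;
    unsigned int ext_family = (eax >> 20) & 0xff;
    unsigned int ext_model = (eax >> 16) & 0xf;

    sig.stepping = eax & 0xf;
    sig.family = base_family == 0xf ? base_family + ext_family : base_family;
    sig.model = (base_family == 0x6 || base_family == 0xf)
                    ? (ext_model << 4) | base_model
                    : base_model;
    return sig;
}

/* Hypothetical disallow-list entry: RTM is treated as unusable on this
   family/model for steppings below min_good_stepping. */
struct rtm_quirk {
    unsigned int family;
    unsigned int model;
    unsigned int min_good_stepping;
};

static bool rtm_disallowed(struct cpu_signature sig,
                           const struct rtm_quirk *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i].family == sig.family && list[i].model == sig.model &&
            sig.stepping < list[i].min_good_stepping)
            return true;
    }
    return false;
}
```

A static table like this cannot cover the case where CPUs within one running system disagree with each other.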

--
H.J.

2021-03-26 20:52:02

by Andy Lutomirski

Subject: Re: Why does glibc use AVX-512?

On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer <[email protected]> wrote:
>
> * Andy Lutomirski:
>
> > On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <[email protected]> wrote:
> >> x86: Sporadic failures in tst-cpu-features-cpuinfo
> >> <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
> >
> > It's worth noting that recent microcode updates have made RTM
> > considerably less likely to actually work on many parts. It's
> > possible you should just disable it. :(
>
> Sorry, I'm not sure who should disable it.
>
> Let me sum up the situation:
>
> We have a request for a performance enhancement in glibc, so that
> applications can use it on server parts where RTM actually works.
>
> For CPUs that support AVX-512, we may be able to meet that with a
> change that uses the new 256-bit registers, to avoid the %xmm
> transition penalty. (This is the easy case, hopefully—there shouldn't
> be any frequency issues associated with that, and if the kernel
> doesn't optimize the context switch today, that's a nonissue as well.)

I would make sure that the transition penalty actually works the way
you think it does. My general experience with the transition
penalties is that the CPU is rather more aggressive about penalizing
you than makes sense.

>
> For CPUs that do not support AVX-512 but support RTM (and AVX2), we
> need a dynamic run-time check whether the string function is invoked
> in a transaction. In that case, we need to use VZEROALL instead of
> VZEROUPPER. (It's apparently too costly to issue VZEROALL
> unconditionally.)

So VZEROALL works in a transaction and VZEROUPPER doesn't? That's bizarre.


> All this needs to work transparently without user intervention. We
> cannot require firmware upgrades to fix the incorrect RTM reporting
> issue (the bug I referenced). I think we can require software updates
> which tell glibc when to use RTM-enabled string functions if the
> dynamic selection does not work (either for performance reasons, or
> because of the RTM reporting bug).
>
> I want to avoid a situation where one in eight processes fail to work
> correctly because the CPUID checks ran on CPU 0, where RTM is reported
> as available, and then we trap when executing XTEST on other CPUs.

What kind of system has that problem? If RTM reports as available,
then it should work in the sense of not trapping. (There is no
guarantee that transactions will *ever* complete, and that part is no
joke.)

2021-03-26 21:16:25

by Florian Weimer

Subject: Re: Why does glibc use AVX-512?

* Andy Lutomirski:

> On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer <[email protected]> wrote:
>>
>> * Andy Lutomirski:
>>
>> > On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <[email protected]> wrote:
>> >> x86: Sporadic failures in tst-cpu-features-cpuinfo
>> >> <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
>> >
>> > It's worth noting that recent microcode updates have made RTM
>> > considerably less likely to actually work on many parts. It's
>> > possible you should just disable it. :(
>>
>> Sorry, I'm not sure who should disable it.
>>
>> Let me sum up the situation:
>>
>> We have a request for a performance enhancement in glibc, so that
>> applications can use it on server parts where RTM actually works.
>>
>> For CPUs that support AVX-512, we may be able to meet that with a
>> change that uses the new 256-bit registers, to avoid the %xmm
>> transition penalty. (This is the easy case, hopefully—there shouldn't
>> be any frequency issues associated with that, and if the kernel
>> doesn't optimize the context switch today, that's a nonissue as well.)
>
> I would make sure that the transition penalty actually works the way
> you think it does. My general experience with the transition
> penalties is that the CPU is rather more aggressive about penalizing
> you than makes sense.

Do you mean the frequency/thermal budget?

I mean the immense slowdown you get if you use %xmm registers after
their %ymm counterparts (doesn't have to be %zmm, that issue is
present starting with AVX) and you have not issued VZEROALL or
VZEROUPPER between the two uses.

It's a bit like EMMS, I guess, only that you don't get corruption, just
really poor performance.
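
To make the pattern concrete, here is a small sketch of an AVX copy routine that clears the upper YMM halves before control returns to code that may use legacy SSE encodings. It assumes GCC or Clang on x86 and is not glibc's memcpy. Compilers typically insert VZEROUPPER on their own when building AVX code, so the explicit call is there for illustration.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Copy n bytes using 256-bit loads and stores, then clear the upper YMM
   halves so later legacy-SSE code does not pay the transition penalty. */
__attribute__((target("avx")))
static void copy_avx(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)s);
        _mm256_storeu_si256((__m256i *)d, v);
        d += 32;
        s += 32;
        n -= 32;
    }
    _mm256_zeroupper();     /* VZEROUPPER */
    memcpy(d, s, n);        /* tail, possibly handled by SSE code */
}
#endif

/* Dispatch on a run-time check, falling back to plain memcpy. */
static void copy_bytes(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {
        copy_avx(dst, src, n);
        return;
    }
#endif
    memcpy(dst, src, n);
}
```

Leaving out the `_mm256_zeroupper()` call does not change the result of the copy, only how fast the following SSE code runs.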

>> For CPUs that do not support AVX-512 but support RTM (and AVX2), we
>> need a dynamic run-time check whether the string function is invoked
>> in a transaction. In that case, we need to use VZEROALL instead of
>> VZEROUPPER. (It's apparently too costly to issue VZEROALL
>> unconditionally.)
>
> So VZEROALL works in a transaction and VZEROUPPER doesn't? That's bizarre.

Apparently yes.

>> All this needs to work transparently without user intervention. We
>> cannot require firmware upgrades to fix the incorrect RTM reporting
>> issue (the bug I referenced). I think we can require software updates
>> which tell glibc when to use RTM-enabled string functions if the
>> dynamic selection does not work (either for performance reasons, or
>> because of the RTM reporting bug).
>>
>> I want to avoid a situation where one in eight processes fail to work
>> correctly because the CPUID checks ran on CPU 0, where RTM is reported
>> as available, and then we trap when executing XTEST on other CPUs.
>
> What kind of system has that problem?

It's a standard laptop after a suspend/resume cycle. It's either a
kernel or firmware bug.

> If RTM reports as available, then it should work in the sense of not
> trapping. (There is no guarantee that transactions will *ever*
> complete, and that part is no joke.)

XTEST doesn't abort transactions, but it traps without RTM support.
If CPU0 has RTM support and we enable XTEST use in glibc based on that
(because the startup code runs on CPU0), then the XTEST instruction
must not trap when running on other CPUs.

Currently, we do not use RTM for anything in glibc by default, even if
it is available according to CPUID. (There are ways to opt in, unless
the CPU is on the disallow list due to the early Haswell bug.) I'm
worried that if we start executing XTEST on all CPUs that indicate RTM
support, we will see lots of weird issues, along the lines of bug 27398.

2021-03-26 21:23:15

by Andy Lutomirski

Subject: Re: Why does glibc use AVX-512?



> On Mar 26, 2021, at 2:11 PM, Florian Weimer <[email protected]> wrote:
>
> * Andy Lutomirski:
>
>>> On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer <[email protected]> wrote:
>>>
> I mean the immense slowdown you get if you use %xmm registers after
> their %ymm counterparts (doesn't have to be %zmm, that issue is
> present starting with AVX) and you have not issued VZEROALL or
> VZEROUPPER between the two uses.

It turns out that it’s not necessary to access the registers in question to trigger this behavior. You just need to make the CPU think it should penalize you. For example, LDMXCSR appears to be a legacy SSE insn for this purpose, and VLDMXCSR is an AVX insn for this purpose. I wouldn’t trust that using ymm9 would avoid the penalty just because common sense says it should.

>> What kind of system has that problem?
>
> It's a standard laptop after a suspend/resume cycle. It's either a
> kernel or firmware bug.

What kernel version? I think fixing the kernel makes more sense than fixing glibc.