2024-04-11 00:56:40

by Hector Martin

Subject: [PATCH 0/4] arm64: Support the TSO memory model

x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
reason, x86 emulation on baseline ARM64 systems requires very expensive
memory model emulation. Having hardware that supports this natively is
therefore very attractive. Such hardware, in fact, exists. This series
adds support for userspace to identify when TSO is available and
toggle it on, if supported.

Some ARM64 CPUs intrinsically implement the TSO memory model, while
others expose it as an IMPDEF control. Apple Silicon SoCs are in the
latter category. Using TSO for x86 emulation on chips that support it
has been shown to provide a massive performance boost [1].

Patch 1 introduces the PR_{SET,GET}_MEM_MODEL userspace control, which
is initially not implemented for any architectures.

Patch 2 implements it for CPUs which are known, to the best of my
knowledge, to implement the TSO memory model unconditionally.
This uses the cpufeature mechanism to only enable this if *all* cores in
the system meet the requirements.
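
The "all cores" rule can be pictured with a minimal sketch. This is not code from the series: `struct core_info` and `system_is_always_tso()` are hypothetical names, and the real check goes through the kernel's cpufeature machinery rather than a loop like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-core record; the series uses cpufeature, not this struct. */
struct core_info {
    bool always_tso; /* core is known to implement TSO unconditionally */
};

/* The system-wide capability is the logical AND over every core. */
bool system_is_always_tso(const struct core_info *cores, size_t n)
{
    assert(cores != NULL || n == 0);

    if (n == 0)
        return false; /* no information: do not advertise TSO */

    for (size_t i = 0; i < n; i++) {
        if (!cores[i].always_tso)
            return false; /* a single non-TSO core disables the feature */
    }
    return true;
}
```

A task may migrate to any core, so one core without the guarantee is enough to make the feature unusable system-wide.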

Patch 3 adds the scaffolding necessary to save/restore the ACTLR_EL1
register across context switches. This register contains IMPDEF flags
related to CPU execution, and on Apple CPUs this is where the runtime
TSO toggle bit is implemented. Other CPUs could conceivably benefit from
this scaffolding if they also use ACTLR_EL1 for things that could
ostensibly be runtime controlled and context-switched. For this to work,
ACTLR_EL1 must have a uniform layout across all cores in the system.
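
As a rough illustration of the save/restore idea, here is a user-space model of it. It is a sketch under assumptions, not the kernel code: `struct cpu`, `struct task` and `actlr_switch()` are made-up names, the register is simulated by a plain field, and the TSO bit position is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit position of the TSO toggle; check the actual patch. */
#define ACTLR_TSO_BIT (UINT64_C(1) << 1)

struct cpu {
    uint64_t actlr_el1; /* stand-in for the hardware register */
};

struct task {
    uint64_t saved_actlr; /* per-thread copy kept in thread state */
};

/* Model of a context switch: save the outgoing value, load the incoming one. */
void actlr_switch(struct cpu *cpu, struct task *prev, struct task *next)
{
    assert(cpu && prev && next);

    prev->saved_actlr = cpu->actlr_el1;

    /* Skip the register write when nothing changes. */
    if (next->saved_actlr != cpu->actlr_el1)
        cpu->actlr_el1 = next->saved_actlr;
}
```

The value saved on one core may later be written to a different core, which is why the sketch only makes sense if the register layout is the same everywhere.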

Finally, patch 4 implements PR_{SET,GET}_MEM_MODEL for Apple CPUs by
hooking it up to flip the appropriate ACTLR_EL1 bit when the Apple TSO
feature is detected (on all CPUs, which also implies the uniform
ACTLR_EL1 layout).
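
For readers wondering how an emulator might consume this, a minimal user-space sketch follows. The numeric constants are placeholders assumed for illustration (they are not stated in this cover letter and are not in mainline headers), and `request_tso()` and `ordering_for_result()` are hypothetical helper names.

```c
#include <assert.h>
#include <sys/prctl.h>

/*
 * Assumed placeholder values: verify against include/uapi/linux/prctl.h
 * in a tree that carries this series before relying on them.
 */
#ifndef PR_SET_MEM_MODEL
#define PR_SET_MEM_MODEL 0x4d4d444c
#endif
#ifndef PR_SET_MEM_MODEL_TSO
#define PR_SET_MEM_MODEL_TSO 1
#endif

enum jit_ordering {
    JIT_PLAIN_ACCESSES,  /* hardware TSO is on: plain loads/stores suffice */
    JIT_ACQ_REL_ACCESSES /* no TSO: emulate ordering in generated code */
};

/* Pure policy: map the prctl() return value to a code generation mode. */
enum jit_ordering ordering_for_result(int prctl_ret)
{
    return prctl_ret == 0 ? JIT_PLAIN_ACCESSES : JIT_ACQ_REL_ACCESSES;
}

/* Try to switch the calling thread to TSO; fall back if unsupported. */
enum jit_ordering request_tso(void)
{
    int ret = prctl(PR_SET_MEM_MODEL, PR_SET_MEM_MODEL_TSO, 0, 0, 0);

    assert(ret == 0 || ret == -1);
    return ordering_for_result(ret);
}
```

On a kernel without the series the call is expected to fail, so the emulator keeps its software ordering path.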

This series has been brewing in the downstream Asahi Linux tree for a
while now, and ships to thousands of users. A subset have been using it
with FEX-Emu, which already supports this feature. This rebase on
v6.9-rc1 is only build-tested (all intermediate commits with and without
the config enabled, on ARM64) but I'll update the downstream branch soon
with this version and get it pushed out to users/testers.

The Apple support works on bare metal and *should* work exactly the same
way on macOS VMs (as alluded to by Zayd in his independent submission [3]),
though I haven't personally verified this. KVM support for this is left
for a future patchset.

(Apologies for the large Cc: list; I want to make sure nobody who got
Cced on Zayd's alternate take is left out of this one.)

[1] https://fex-emu.com/FEX-2306/
[2] https://github.com/AsahiLinux/linux/tree/bits/220-tso
[3] https://lore.kernel.org/lkml/[email protected]/

To: Catalin Marinas <[email protected]>
To: Will Deacon <[email protected]>
To: Marc Zyngier <[email protected]>
To: Mark Rutland <[email protected]>
Cc: Zayd Qumsieh <[email protected]>
Cc: Justin Lu <[email protected]>
Cc: Ryan Houdek <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Miguel Luis <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: Christoph Paasch <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Joel Granados <[email protected]>
Cc: Dawei Li <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Stefan Roesch <[email protected]>
Cc: Andy Chiu <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Zev Weiss <[email protected]>
Cc: Ondrej Mosnacek <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Asahi Linux <[email protected]>

Signed-off-by: Hector Martin <[email protected]>
---
Hector Martin (4):
      prctl: Introduce PR_{SET,GET}_MEM_MODEL
      arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs
      arm64: Introduce scaffolding to add ACTLR_EL1 to thread state
      arm64: Implement Apple IMPDEF TSO memory model control

 arch/arm64/Kconfig                        | 14 ++++++
 arch/arm64/include/asm/apple_cpufeature.h | 15 +++++++
 arch/arm64/include/asm/cpufeature.h       | 10 +++++
 arch/arm64/include/asm/processor.h        |  3 ++
 arch/arm64/kernel/Makefile                |  3 +-
 arch/arm64/kernel/cpufeature.c            | 11 ++---
 arch/arm64/kernel/cpufeature_impdef.c     | 61 ++++++++++++++++++++++++++
 arch/arm64/kernel/process.c               | 71 +++++++++++++++++++++++++++++++
 arch/arm64/kernel/setup.c                 |  8 ++++
 arch/arm64/tools/cpucaps                  |  2 +
 include/linux/memory_ordering_model.h     | 11 +++++
 include/uapi/linux/prctl.h                |  5 +++
 kernel/sys.c                              | 21 +++++++++
 13 files changed, 229 insertions(+), 6 deletions(-)
---
base-commit: 4cece764965020c22cff7665b18a012006359095
change-id: 20240411-tso-e86fdceb94b8

Best regards,
--
Hector Martin <[email protected]>



2024-04-11 01:39:11

by Neal Gompa

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On Wed, Apr 10, 2024 at 8:51 PM Hector Martin <[email protected]> wrote:
> [...]

The series looks good to me.

Reviewed-by: Neal Gompa <[email protected]>



--
真実はいつも一つ!/ Always, there's only one truth!

2024-04-11 13:40:44

by Will Deacon

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

Hi Hector,

On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
> x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
> reason, x86 emulation on baseline ARM64 systems requires very expensive
> memory model emulation. Having hardware that supports this natively is
> therefore very attractive. Such hardware, in fact, exists. This series
> adds support for userspace to identify when TSO is available and
> toggle it on, if supported.

I'm probably going to make myself hugely unpopular here, but I have a
strong objection to this patch series as it stands. I firmly believe
that providing a prctl() to query and toggle the memory model to/from
TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

It's not difficult to envisage this TSO switch being abused for native
arm64 applications:

* A program no longer crashes when TSO is enabled, so the developer
just toggles TSO to meet a deadline.

* Some legacy x86 sources are being ported to arm64 but concurrency
is hard so the developer just enables TSO to (mostly) avoid thinking
about it.

* Some binaries in a distribution exhibit instability which goes away
in TSO mode, so a taskset-like program is used to run them with TSO
enabled.

In all these cases, we end up with native arm64 applications that will
either fail to load or will crash in subtle ways on CPUs without the TSO
feature. Assuming that the application cannot be fixed, a better
approach would be to recompile using stronger instructions (e.g.
LDAR/STLR) so that at least the resulting binary is portable. Now, it's
true that some existing CPUs are TSO by design (this is a perfectly
valid implementation of the arm64 memory model), but I think there's a
big difference between quietly providing more ordering guarantees than
software may be relying on and providing a mechanism to discover,
request and ultimately rely upon the stronger behaviour.
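
To make the "stronger instructions" option concrete, here is a minimal, portable message-passing sketch using C11 atomics. It is an illustration only; `publish()` and `try_consume()` are hypothetical names. On arm64, compilers typically lower the release store to STLR and the acquire load to LDAR (or LDAPR where available), so the resulting binary does not depend on a TSO mode.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;       /* plain data protected by the flag */
static atomic_bool ready; /* zero-initialized, i.e. false */

/* Writer: the release store orders the payload write before the flag. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: the acquire load orders the flag read before the payload read. */
bool try_consume(int *out)
{
    assert(out != NULL);

    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;

    *out = payload;
    return true;
}
```

Code written this way is correct on weakly ordered and TSO implementations alike, which is the portability property being argued for here.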

An alternative option is to go down the SPARC RMO route and just enable
TSO statically (although presumably in the firmware) for Apple silicon.
I'm assuming that has a performance impact for native code?

Will

P.S. I briefly pondered the idea of the kernel toggling the bit in the
ELF loader when e.g. it sees an x86 machine type but I suspect that
doesn't really help with existing emulators and you'd still need a way
to tell the emulator whether or not it was enabled.

2024-04-11 14:19:35

by Hector Martin

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On 2024/04/11 22:28, Will Deacon wrote:
> Hi Hector,
>
> On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
>> x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
>> reason, x86 emulation on baseline ARM64 systems requires very expensive
>> memory model emulation. Having hardware that supports this natively is
>> therefore very attractive. Such hardware, in fact, exists. This series
>> adds support for userspace to identify when TSO is available and
>> toggle it on, if supported.
>
> I'm probably going to make myself hugely unpopular here, but I have a
> strong objection to this patch series as it stands. I firmly believe
> that providing a prctl() to query and toggle the memory model to/from
> TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

I honestly doubt this should be a significant concern right now, given
that only a subset of implementations actually support this. Yes,
developers can do stupid stuff, but we already have gone through this
kind of story with other situations (e.g. 16K and 64K page support on
ARM64 breaking 4K assumptions) and things have been fixed over time.

In particular, I highly suspect Asahi Linux and Apple Silicon have done
a lot more good for the ARM64 ecosystem by getting developers to fix
their page size mess than they will do bad by somehow encouraging TSO
abuse. We've even found new memory model issues thanks to the
architecture's deep out-of-order character (remember that mess with
Linux atomics? :-)). So far, in the year+ we've had this patchset
downstream, not a single developer has proposed abusing it for something
that isn't an x86 emulator.

There's a pragmatic argument here: since we need this, and it absolutely
will continue to ship downstream if rejected, it doesn't make much
difference for fragmentation risk, does it? The vast majority of
Linux-on-Mac users are likely to continue running downstream kernels for
the foreseeable future anyway to get newer features and hardware support
faster than they can be upstreamed. So not allowing this upstream
doesn't really change the landscape vis-a-vis being able to abuse this
or not, it just makes our life harder by forcing us to carry more
patches forever.

> It's not difficult to envisage this TSO switch being abused for native
> arm64 applications:
>
> * A program no longer crashes when TSO is enabled, so the developer
> just toggles TSO to meet a deadline.
>
> * Some legacy x86 sources are being ported to arm64 but concurrency
> is hard so the developer just enables TSO to (mostly) avoid thinking
> about it.

Both of these rely on the developer *knowing* what TSO is and why it
fixes this. I posit that a developer who knows what that is is also likely
to know why this is a stupid hack and they shouldn't be doing this and
that it won't work on all machines.

>
> * Some binaries in a distribution exhibit instability which goes away
> in TSO mode, so a taskset-like program is used to run them with TSO
> enabled.

Since the flag is cleared on execve, this third one isn't generally
possible as far as I know.

> In all these cases, we end up with native arm64 applications that will
> either fail to load or will crash in subtle ways on CPUs without the TSO
> feature. Assuming that the application cannot be fixed, a better
> approach would be to recompile using stronger instructions (e.g.
> LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> true that some existing CPUs are TSO by design (this is a perfectly
> valid implementation of the arm64 memory model), but I think there's a
> big difference between quietly providing more ordering guarantees than
> software may be relying on and providing a mechanism to discover,
> request and ultimately rely upon the stronger behaviour.

The problem is "just" using stronger instructions is much more
expensive, as emulators have demonstrated. If TSO didn't serve a
practical purpose I wouldn't be submitting this, but it does. This is
basically non-negotiable for x86 emulation; if this is rejected
upstream, it will forever live as a downstream patch used by the entire
gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
explicitly targeting, given our efforts with microVMs for 4K page size
support and the upcoming Vulkan drivers).

That said, I have a pragmatic proposal here. The "fixed TSO" part of the
implementation should be harmless, since those CPUs would correctly run
poorly-written applications anyway so the API is moot. That leaves Apple
Silicon. Our native kernels are and likely always will be 16K page size,
due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
natively but with very broken functionality including no GPU
acceleration) plus performance differences that favor 16K. How about we
gate the TSO functionality to only be supported on 4K kernel builds?
This would make them only work in 4K VMs on Asahi Linux. We are very
explicitly discouraging people from trying to use the microVMs to work
around page size problems (which they can already do, another
fragmentation problem, anyway); any application which requires the 4K VM
to run that isn't an emulator is already clearly broken and advertising
that fact openly. So, adding TSO to this should be only a marginal risk
of further fragmentation, and it wouldn't allow apps to "sneakily" "just
work" on Apple machines by abusing TSO.

>
> An alternative option is to go down the SPARC RMO route and just enable
> TSO statically (although presumably in the firmware) for Apple silicon.
> I'm assuming that has a performance impact for native code?

Correct. We already have this as a bootloader option, but it is not
desirable. Plus, userspace code still needs a way to *discover* that TSO
is enabled for correctness, so it can automatically decide whether to
use stronger or weaker instructions.

>
> Will
>
> P.S. I briefly pondered the idea of the kernel toggling the bit in the
> ELF loader when e.g. it sees an x86 machine type but I suspect that
> doesn't really help with existing emulators and you'd still need a way
> to tell the emulator whether or not it was enabled.
>

- Hector

2024-04-11 19:02:06

by Hector Martin

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model



On 2024/04/11 23:19, Hector Martin wrote:
>>
>> An alternative option is to go down the SPARC RMO route and just enable
>> TSO statically (although presumably in the firmware) for Apple silicon.
>> I'm assuming that has a performance impact for native code?
>
> Correct. We already have this as a bootloader option, but it is not
> desirable. Plus, userspace code still needs a way to *discover* that TSO
> is enabled for correctness, so it can automatically decide whether to
> use stronger or weaker instructions.

To add some numbers to this (I was just made aware of this paper):

https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf

Using TSO globally has, on average, a 9% performance hit, so that is
clearly off the table as a general solution.

Meanwhile, more detailed microbenchmarks often show TSO as having better
performance than outright using acquire/release instructions without
TSO. Therefore, just giving up on TSO and using acq/rel semantics for
emulators is also not an acceptable solution.

Additionally, the general load/store instructions on ARM have more
flexible addressing modes than the synchronizing ones, and since general
x86 emulation requires *all* loads and stores to be like this in a
non-TSO model (without much more complex/expensive program analysis to
determine where this can be elided), the perf impact is definitely worse
for emulation (e.g. stack accesses are affected) than for a
microbenchmark where only the "target" test instructions are being modified.
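
The addressing-mode point can be sketched as a toy code emitter. This is an illustration under assumptions, not how any particular emulator is written: `emit_guest_load()` is a hypothetical helper, the register choices are arbitrary, and it only handles small 8-byte-aligned offsets. It relies on the fact that LDAR accepts only a base register, while LDR accepts a base plus an immediate offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Write the host instructions for a 64-bit guest load from [x1 + offset]
 * into buf as text, and return how many host instructions were needed.
 */
int emit_guest_load(char *buf, size_t len, bool tso_enabled, unsigned offset)
{
    assert(buf != NULL && len > 0);
    assert(offset < 4096 && offset % 8 == 0); /* sketch-only restriction */

    if (tso_enabled) {
        /* Hardware provides the ordering: a plain load with offset works. */
        snprintf(buf, len, "ldr x0, [x1, #%u]", offset);
        return 1;
    }

    if (offset == 0) {
        snprintf(buf, len, "ldar x0, [x1]");
        return 1;
    }

    /* LDAR has no offset form, so the address is computed separately. */
    snprintf(buf, len, "add x16, x1, #%u\nldar x0, [x16]", offset);
    return 2;
}
```

Stack-relative and structure-field accesses almost always carry an offset, so in this model the non-TSO path pays the extra instruction on most guest memory operations.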

- Hector

2024-04-16 02:12:05

by Zayd Qumsieh

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

The patch looks great! :) I have one minor suggestion, though:

>static __always_inline bool system_has_actlr_state(void)
>{
> return IS_ENABLED(CONFIG_ARM64_ACTLR_STATE) &&
> alternative_has_cap_unlikely(ARM64_HAS_TSO_APPLE);
>}

ACTLR_EL1.TSO is not exposed for writing on Virtual Machines on all
versions of macOS. However, AIDR_EL1 may still advertise TSO, whether
or not ACTLR_EL1.TSO is writable. Could you modify the patch such that
we check the writability of ACTLR_EL1.TSO in system_has_actlr_state
(or once on startup, and cache it, since reading from AIDR_EL1 causes
a trap to Hypervisor.fwk)?
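
One way to picture the suggested "probe once and cache" check is the user-space model below. It is a sketch under assumptions: the register is simulated by `struct fake_actlr` (a real implementation would use the kernel's system register accessors), the TSO bit position is assumed, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit position of the TSO toggle; check the actual patch. */
#define ACTLR_TSO_BIT (UINT64_C(1) << 1)

/* Simulated register: writes only affect bits set in writable_mask. */
struct fake_actlr {
    uint64_t value;
    uint64_t writable_mask;
};

static uint64_t actlr_read(const struct fake_actlr *r)
{
    return r->value;
}

static void actlr_write(struct fake_actlr *r, uint64_t v)
{
    r->value = (r->value & ~r->writable_mask) | (v & r->writable_mask);
}

/* Cached probe result so the check runs only once. */
struct tso_probe {
    bool done;
    bool writable;
};

/* Flip the TSO bit, see whether the change sticks, then restore it. */
bool tso_bit_writable(struct tso_probe *p, struct fake_actlr *r)
{
    assert(p != NULL && r != NULL);

    if (!p->done) {
        uint64_t orig = actlr_read(r);

        actlr_write(r, orig ^ ACTLR_TSO_BIT);
        p->writable = ((actlr_read(r) ^ orig) & ACTLR_TSO_BIT) != 0;
        actlr_write(r, orig);
        p->done = true;
    }
    return p->writable;
}
```

Later calls return the cached answer without touching the register again, which matters if each access is expensive under a hypervisor.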

Thanks,
Zayd

2024-04-16 03:24:54

by Zayd Qumsieh

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

>I'm probably going to make myself hugely unpopular here, but I have a
>strong objection to this patch series as it stands. I firmly believe
>that providing a prctl() to query and toggle the memory model to/from
>TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

It's definitely not our intent to fragment the ecosystem.
The goal of this memory ordering is to simplify emulation layers that benefit from this.
If you have suggestions to reduce the risk of it being misused outside of emulators, we'd be happy to look into it.

Thanks,
Zayd

2024-04-19 16:58:31

by Will Deacon

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> On 2024/04/11 22:28, Will Deacon wrote:
> > * Some binaries in a distribution exhibit instability which goes away
> > in TSO mode, so a taskset-like program is used to run them with TSO
> > enabled.
>
> Since the flag is cleared on execve, this third one isn't generally
> possible as far as I know.

Ah ok, I'd missed that. Thanks.

> > In all these cases, we end up with native arm64 applications that will
> > either fail to load or will crash in subtle ways on CPUs without the TSO
> > feature. Assuming that the application cannot be fixed, a better
> > approach would be to recompile using stronger instructions (e.g.
> > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > true that some existing CPUs are TSO by design (this is a perfectly
> > valid implementation of the arm64 memory model), but I think there's a
> > big difference between quietly providing more ordering guarantees than
> > software may be relying on and providing a mechanism to discover,
> > request and ultimately rely upon the stronger behaviour.
>
> The problem is "just" using stronger instructions is much more
> expensive, as emulators have demonstrated. If TSO didn't serve a
> practical purpose I wouldn't be submitting this, but it does. This is
> basically non-negotiable for x86 emulation; if this is rejected
> upstream, it will forever live as a downstream patch used by the entire
> gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> explicitly targeting, given our efforts with microVMs for 4K page size
> support and the upcoming Vulkan drivers).

These microVMs sound quite interesting. What exactly are they? Are you
running them under KVM?

Ignoring the mechanism for the time being, would it solve your problem
if you were able to run specific microVMs in TSO mode, or do you *really*
need the VM to have finer-grained control than that? If the whole VM is
running in TSO mode, then my concerns largely disappear, as that's
indistinguishable from running on a hardware implementation that happens
to be TSO.

> That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> implementation should be harmless, since those CPUs would correctly run
> poorly-written applications anyway so the API is moot. That leaves Apple
> Silicon. Our native kernels are and likely always will be 16K page size,
> due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> natively but with very broken functionality including no GPU
> acceleration) plus performance differences that favor 16K. How about we
> gate the TSO functionality to only be supported on 4K kernel builds?
> This would make them only work in 4K VMs on Asahi Linux. We are very
> explicitly discouraging people from trying to use the microVMs to work
> around page size problems (which they can already do, another
> fragmentation problem, anyway); any application which requires the 4K VM
> to run that isn't an emulator is already clearly broken and advertising
> that fact openly. So, adding TSO to this should be only a marginal risk
> of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> work" on Apple machines by abusing TSO.

I appreciate that you're trying to be constructive here, but I don't think
we should tie this to the page size. It's an artificial limitation and I
don't think it really addresses the underlying concerns that I have.

Will

2024-04-19 16:59:44

by Will Deacon

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On Mon, Apr 15, 2024 at 07:22:41PM -0700, Zayd Qumsieh wrote:
> >I'm probably going to make myself hugely unpopular here, but I have a
> >strong objection to this patch series as it stands. I firmly believe
> >that providing a prctl() to query and toggle the memory model to/from
> >TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
>
> It's definitely not our intent to fragment the ecosystem.
> The goal of this memory ordering is to simplify emulation layers that benefit from this.
> If you have suggestions to reduce the risk of it being misused outside of emulators, we'd be happy to look into it.

Once you have exposed this toggle via prctl(), it doesn't really matter
what your intentions were. It will get used outside of emulation layers
and we'll be stuck supporting it.

Will

2024-04-19 18:07:56

by Catalin Marinas

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On Fri, Apr 19, 2024 at 05:58:26PM +0100, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 07:22:41PM -0700, Zayd Qumsieh wrote:
> > >I'm probably going to make myself hugely unpopular here, but I have a
> > >strong objection to this patch series as it stands. I firmly believe
> > >that providing a prctl() to query and toggle the memory model to/from
> > >TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> >
> > It's definitely not our intent to fragment the ecosystem. The goal
> > of this memory ordering is to simplify emulation layers that benefit
> > from this. If you have suggestions to reduce the risk of it being
> > misused outside of emulators, we'd be happy to look into it.
>
> Once you have exposed this toggle via prctl(), it doesn't really matter
> what your intentions were. It will get used outside of emulation layers
> and we'll be stuck supporting it.

Just FTR, I fully agree with Will. I'm strongly against this kind of ABI
for a non-architected, implementation-defined feature. I can't even tell
exactly what TSO means on the Apple hardware. Is it close to the x86
TSO? Is there a formal memory model for it? Are future Apple (or other
Arm vendor) implementations going to follow exactly the same model to be
able to call it some form of "Apple standard" that deserves an ABI?

So, sorry, I'm going to NAK these approaches proposing imp def features
as generic opt-in mechanisms (the microVMs thing sounds doable though,
to my limited understanding; I guess that would mean running the
emulator in a VM).

--
Catalin

2024-04-20 11:37:39

by Marc Zyngier

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On Fri, 19 Apr 2024 17:58:09 +0100,
Will Deacon <[email protected]> wrote:
>
> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > On 2024/04/11 22:28, Will Deacon wrote:
> > > * Some binaries in a distribution exhibit instability which goes away
> > > in TSO mode, so a taskset-like program is used to run them with TSO
> > > enabled.
> >
> > Since the flag is cleared on execve, this third one isn't generally
> > possible as far as I know.
>
> Ah ok, I'd missed that. Thanks.
>
> > > In all these cases, we end up with native arm64 applications that will
> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > feature. Assuming that the application cannot be fixed, a better
> > > approach would be to recompile using stronger instructions (e.g.
> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > true that some existing CPUs are TSO by design (this is a perfectly
> > > valid implementation of the arm64 memory model), but I think there's a
> > > big difference between quietly providing more ordering guarantees than
> > > software may be relying on and providing a mechanism to discover,
> > > request and ultimately rely upon the stronger behaviour.
> >
> > The problem is "just" using stronger instructions is much more
> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > practical purpose I wouldn't be submitting this, but it does. This is
> > basically non-negotiable for x86 emulation; if this is rejected
> > upstream, it will forever live as a downstream patch used by the entire
> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > explicitly targeting, given our efforts with microVMs for 4K page size
> > support and the upcoming Vulkan drivers).
>
> These microVMs sound quite interesting. What exactly are they? Are you
> running them under KVM?
>
> Ignoring the mechanism for the time being, would it solve your problem
> if you were able to run specific microVMs in TSO mode, or do you *really*
> need the VM to have finer-grained control than that? If the whole VM is
> running in TSO mode, then my concerns largely disappear, as that's
> indistinguishable from running on a hardware implementation that happens
> to be TSO.

Since KVM has been mentioned a few times, I'll give my take on this.

Since day 1, it was a conscious decision for KVM/arm64 to emulate the
architecture, and only that -- this is complicated enough. Meaning
that no implementation-defined features should be explicitly exposed
to the guest. So I have no plan to expose any such feature for
userspace to configure TSO or anything else of the sort.

However, that doesn't preclude VMs from running in TSO mode if the HW
is configured as such at boot time. From what I have understood, this
is a per translation regime setting (EL1 and EL2 have separate knobs).

So it should be possible to set ACTLR_EL1.TSO=1 from firmware (using
the non-architected ACTLR_EL12 accessor), and let things work without
touching anything else (KVM doesn't context switch this register and
traps accesses to it). This would keep KVM out of the loop, the host
side would be unaffected, and only VMs would pay the overhead of TSO.

I appreciate that this is not the ideal situation, and very much an
all-or-nothing approach. But that's what we can reasonably manage from
an upstream perspective given the variability of the arm64 ecosystem.

M.

--
Without deviation from the norm, progress is not possible.

2024-04-20 12:14:28

by Eric Curtin

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On Fri, 19 Apr 2024 at 18:08, Will Deacon <[email protected]> wrote:
>
> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > On 2024/04/11 22:28, Will Deacon wrote:
> > > * Some binaries in a distribution exhibit instability which goes away
> > > in TSO mode, so a taskset-like program is used to run them with TSO
> > > enabled.
> >
> > Since the flag is cleared on execve, this third one isn't generally
> > possible as far as I know.
>
> Ah ok, I'd missed that. Thanks.
>
> > > In all these cases, we end up with native arm64 applications that will
> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > feature. Assuming that the application cannot be fixed, a better
> > > approach would be to recompile using stronger instructions (e.g.
> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > true that some existing CPUs are TSO by design (this is a perfectly
> > > valid implementation of the arm64 memory model), but I think there's a
> > > big difference between quietly providing more ordering guarantees than
> > > software may be relying on and providing a mechanism to discover,
> > > request and ultimately rely upon the stronger behaviour.
> >
> > The problem is "just" using stronger instructions is much more
> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > practical purpose I wouldn't be submitting this, but it does. This is
> > basically non-negotiable for x86 emulation; if this is rejected
> > upstream, it will forever live as a downstream patch used by the entire
> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > explicitly targeting, given our efforts with microVMs for 4K page size
> > support and the upcoming Vulkan drivers).
>
> These microVMs sound quite interesting. What exactly are they? Are you
> running them under KVM?

It's the magic of libkrun. This is one of the git repos in the libkrun
family. It has a wide array of use cases, and I personally won't do them
justice by trying to explain them all; this is just one repo/tool/use
case:

https://github.com/containers/krunvm

https://sinrega.org/running-microvms-on-m1/

CC'ing @Sergio Lopez Pascual the lead of krun in general.

Is mise le meas/Regards,

Eric Curtin

>
> Ignoring the mechanism for the time being, would it solve your problem
> if you were able to run specific microVMs in TSO mode, or do you *really*
> need the VM to have finer-grained control than that? If the whole VM is
> running in TSO mode, then my concerns largely disappear, as that's
> indistinguishable from running on a hardware implementation that happens
> to be TSO.
>
> > That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> > implementation should be harmless, since those CPUs would correctly run
> > poorly-written applications anyway so the API is moot. That leaves Apple
> > Silicon. Our native kernels are and likely always will be 16K page size,
> > due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> > natively but with very broken functionality including no GPU
> > acceleration) plus performance differences that favor 16K. How about we
> > gate the TSO functionality to only be supported on 4K kernel builds?
> > This would make them only work in 4K VMs on Asahi Linux. We are very
> > explicitly discouraging people from trying to use the microVMs to work
> > around page size problems (which they can already do, another
> > fragmentation problem, anyway); any application which requires the 4K VM
> > to run that isn't an emulator is already clearly broken and advertising
> > that fact openly. So, adding TSO to this should be only a marginal risk
> > of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> > work" on Apple machines by abusing TSO.
>
> I appreciate that you're trying to be constructive here, but I don't think
> we should tie this to the page size. It's an artificial limitation and I
> don't think it really addresses the underlying concerns that I have.
>
> Will
>


2024-04-20 12:16:35

by Eric Curtin

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On Sat, 20 Apr 2024 at 13:13, Eric Curtin <[email protected]> wrote:
>
> On Fri, 19 Apr 2024 at 18:08, Will Deacon <[email protected]> wrote:
> >
> > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > On 2024/04/11 22:28, Will Deacon wrote:
> > > > * Some binaries in a distribution exhibit instability which goes away
> > > > in TSO mode, so a taskset-like program is used to run them with TSO
> > > > enabled.
> > >
> > > Since the flag is cleared on execve, this third one isn't generally
> > > possible as far as I know.
> >
> > Ah ok, I'd missed that. Thanks.
> >
> > > > In all these cases, we end up with native arm64 applications that will
> > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > feature. Assuming that the application cannot be fixed, a better
> > > > approach would be to recompile using stronger instructions (e.g.
> > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > valid implementation of the arm64 memory model), but I think there's a
> > > > big difference between quietly providing more ordering guarantees than
> > > > software may be relying on and providing a mechanism to discover,
> > > > request and ultimately rely upon the stronger behaviour.
> > >
> > > The problem is "just" using stronger instructions is much more
> > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > practical purpose I wouldn't be submitting this, but it does. This is
> > > basically non-negotiable for x86 emulation; if this is rejected
> > > upstream, it will forever live as a downstream patch used by the entire
> > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > support and the upcoming Vulkan drivers).
> >
> > These microVMs sound quite interesting. What exactly are they? Are you
> > running them under KVM?
>
> It's the magic of libkrun. This is one of the git repos in the libkrun
> family. It has a wide array of use cases, and I personally won't do them
> justice by trying to explain them all; this is just one repo/tool/use
> case:
>
> https://github.com/containers/krunvm
>
> https://sinrega.org/running-microvms-on-m1/

Sorry for the double post; I meant to share this one for the Asahi
emulator use case. Sergio's blogs are great in general:

https://sinrega.org/2023-10-06-using-microvms-for-gaming-on-fedora-asahi/

Is mise le meas/Regards,

Eric Curtin

>
> CC'ing @Sergio Lopez Pascual the lead of krun in general.
>
> Is mise le meas/Regards,
>
> Eric Curtin
>
> >
> > Ignoring the mechanism for the time being, would it solve your problem
> > if you were able to run specific microVMs in TSO mode, or do you *really*
> > need the VM to have finer-grained control than that? If the whole VM is
> > running in TSO mode, then my concerns largely disappear, as that's
> > indistinguishable from running on a hardware implementation that happens
> > to be TSO.
> >
> > > That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> > > implementation should be harmless, since those CPUs would correctly run
> > > poorly-written applications anyway so the API is moot. That leaves Apple
> > > Silicon. Our native kernels are and likely always will be 16K page size,
> > > due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> > > natively but with very broken functionality including no GPU
> > > acceleration) plus performance differences that favor 16K. How about we
> > > gate the TSO functionality to only be supported on 4K kernel builds?
> > > This would make them only work in 4K VMs on Asahi Linux. We are very
> > > explicitly discouraging people from trying to use the microVMs to work
> > > around page size problems (which they can already do, another
> > > fragmentation problem, anyway); any application which requires the 4K VM
> > > to run that isn't an emulator is already clearly broken and advertising
> > > that fact openly. So, adding TSO to this should be only a marginal risk
> > > of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> > > work" on Apple machines by abusing TSO.
> >
> > I appreciate that you're trying to be constructive here, but I don't think
> > we should tie this to the page size. It's an artificial limitation and I
> > don't think it really addresses the underlying concerns that I have.
> >
> > Will
> >


2024-05-02 00:16:57

by Zayd Qumsieh

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

On Thu, 11 Apr 2024 14:28:54 +0100,
Will Deacon <[email protected]> wrote:
> P.S. I briefly pondered the idea of the kernel toggling the bit in the
> ELF loader when e.g. it sees an x86 machine type but I suspect that
> doesn't really help with existing emulators and you'd still need a way
> to tell the emulator whether or not it was enabled.

This seems promising to me. What do people think of adding an opt-in argument,
option, or similar to binfmt that allows users to mark certain file formats as
"must run under TSO"? And then, the kernel would set the TSO bit when invoking
the interpreter for those file formats. If an emulator decides to create a
non-CPU-emulation thread, then it can use a prctl to disable TSO and switch to
the default ARM memory model. Note that this prctl wouldn't be allowed to
enable TSO - it would only disable it. This way, it is much harder to create a
faulty application that relies on TSO, since TSO can only be enabled via a
binfmt handler that the user must explicitly opt into.

It is true that existing emulators wouldn't be able to benefit from this, but
that's the case no matter the activation mechanism. We can, however, expose a
prctl to get the memory model, so emulators can detect if TSO was enabled for
their threads.

To summarize, I propose two prctls (similar to the ones in the current revision
of the patch series). One to switch from the TSO memory model to the default
ARM one (this is a one-way street). And another to query the current memory
model.
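
To make the intended semantics concrete, here is a toy model of the proposal,
not kernel code. All names (MemModel, Thread, disable_tso, exec_with_binfmt)
are hypothetical and only illustrate the state transitions. The hw_has_tso
fallback to the default model is an assumption; the behaviour on hardware
without TSO is not settled above.

```python
from enum import Enum


class MemModel(Enum):
    DEFAULT = "default"  # relaxed arm64 ordering
    TSO = "tso"          # x86-style total store ordering


class Thread:
    """Toy model of one thread's memory-model state (hypothetical, not a kernel API)."""

    def __init__(self, model: MemModel = MemModel.DEFAULT):
        self._model = model

    def get_mem_model(self) -> MemModel:
        # Models the proposed "query" prctl.
        return self._model

    def disable_tso(self) -> None:
        # Models the proposed one-way prctl: it can only move a thread
        # to the default model. There is deliberately no enable call.
        self._model = MemModel.DEFAULT


def exec_with_binfmt(handler_requires_tso: bool, hw_has_tso: bool) -> Thread:
    """Models the kernel starting an interpreter for a registered file format.

    TSO is set only when the user opted the format in. Falling back to
    DEFAULT when the hardware lacks TSO is an assumption of this sketch.
    """
    if handler_requires_tso and hw_has_tso:
        return Thread(MemModel.TSO)
    return Thread(MemModel.DEFAULT)


if __name__ == "__main__":
    cpu_thread = exec_with_binfmt(handler_requires_tso=True, hw_has_tso=True)
    print(cpu_thread.get_mem_model())   # emulator checks whether TSO is active

    helper_thread = Thread(cpu_thread.get_mem_model())
    helper_thread.disable_tso()         # non-CPU-emulation thread opts out
    print(helper_thread.get_mem_model())
```

The point of the sketch is that the only path into TSO is the exec-time
binfmt decision; once a thread drops it, nothing in the thread's own control
can bring it back.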

Thanks,
Zayd

P.S. I forgot to CC you in my most recent email to Marc Zyngier just now.
Sorry, I'm quite new to using mailing lists.

2024-05-02 01:11:38

by Zayd Qumsieh

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

> On Fri, 19 Apr 2024 17:58:09 +0100,
> Will Deacon <[email protected]> wrote:
> >
> > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > On 2024/04/11 22:28, Will Deacon wrote:
> > > > * Some binaries in a distribution exhibit instability which goes away
> > > > in TSO mode, so a taskset-like program is used to run them with TSO
> > > > enabled.
> > >
> > > Since the flag is cleared on execve, this third one isn't generally
> > > possible as far as I know.
> >
> > Ah ok, I'd missed that. Thanks.
> >
> > > > In all these cases, we end up with native arm64 applications that will
> > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > feature. Assuming that the application cannot be fixed, a better
> > > > approach would be to recompile using stronger instructions (e.g.
> > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > valid implementation of the arm64 memory model), but I think there's a
> > > > big difference between quietly providing more ordering guarantees than
> > > > software may be relying on and providing a mechanism to discover,
> > > > request and ultimately rely upon the stronger behaviour.
> > >
> > > The problem is "just" using stronger instructions is much more
> > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > practical purpose I wouldn't be submitting this, but it does. This is
> > > basically non-negotiable for x86 emulation; if this is rejected
> > > upstream, it will forever live as a downstream patch used by the entire
> > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > support and the upcoming Vulkan drivers).
> >
> > These microVMs sound quite interesting. What exactly are they? Are you
> > running them under KVM?
> >
> > Ignoring the mechanism for the time being, would it solve your problem
> > if you were able to run specific microVMs in TSO mode, or do you *really*
> > need the VM to have finer-grained control than that? If the whole VM is
> > running in TSO mode, then my concerns largely disappear, as that's
> > indistinguishable from running on a hardware implementation that happens
> > to be TSO.
>
> Since KVM has been mentioned a few times, I'll give my take on this.
>
> Since day 1, it was a conscious decision for KVM/arm64 to emulate the
> architecture, and only that -- this is complicated enough. Meaning
> that no implementation-defined features should be explicitly exposed
> to the guest. So I have no plan to expose any such feature for
> userspace to configure TSO or anything else of the sort.

Agreed. We do not intend for TSO mode to be used extensively for EL1; the
intention is for TSO mode to be reserved for userspace applications that
request it.

2024-05-02 13:25:26

by Marc Zyngier

Subject: Re: [PATCH 0/4] arm64: Support the TSO memory model

[adding Will back to the thread]

On Thu, 02 May 2024 01:10:35 +0100,
Zayd Qumsieh <[email protected]> wrote:
>
> > On Fri, 19 Apr 2024 17:58:09 +0100,
> > Will Deacon <[email protected]> wrote:
> > >
> > > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > > On 2024/04/11 22:28, Will Deacon wrote:
> > > > > * Some binaries in a distribution exhibit instability which goes away
> > > > > in TSO mode, so a taskset-like program is used to run them with TSO
> > > > > enabled.
> > > >
> > > > Since the flag is cleared on execve, this third one isn't generally
> > > > possible as far as I know.
> > >
> > > Ah ok, I'd missed that. Thanks.
> > >
> > > > > In all these cases, we end up with native arm64 applications that will
> > > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > > feature. Assuming that the application cannot be fixed, a better
> > > > > approach would be to recompile using stronger instructions (e.g.
> > > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > > valid implementation of the arm64 memory model), but I think there's a
> > > > > big difference between quietly providing more ordering guarantees than
> > > > > software may be relying on and providing a mechanism to discover,
> > > > > request and ultimately rely upon the stronger behaviour.
> > > >
> > > > The problem is "just" using stronger instructions is much more
> > > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > > practical purpose I wouldn't be submitting this, but it does. This is
> > > > basically non-negotiable for x86 emulation; if this is rejected
> > > > upstream, it will forever live as a downstream patch used by the entire
> > > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > > support and the upcoming Vulkan drivers).
> > >
> > > These microVMs sound quite interesting. What exactly are they? Are you
> > > running them under KVM?
> > >
> > > Ignoring the mechanism for the time being, would it solve your problem
> > > if you were able to run specific microVMs in TSO mode, or do you *really*
> > > need the VM to have finer-grained control than that? If the whole VM is
> > > running in TSO mode, then my concerns largely disappear, as that's
> > > indistinguishable from running on a hardware implementation that happens
> > > to be TSO.
> >
> > Since KVM has been mentioned a few times, I'll give my take on this.
> >
> > Since day 1, it was a conscious decision for KVM/arm64 to emulate the
> > architecture, and only that -- this is complicated enough. Meaning
> > that no implementation-defined features should be explicitly exposed
> > to the guest. So I have no plan to expose any such feature for
> > userspace to configure TSO or anything else of the sort.
>
> Agreed. We do not intend for TSO mode to be used extensively for EL1; the
> intention is for TSO mode to be reserved for userspace applications that
> request it.

But that's the same thing for a hypervisor.

For userspace in a VM to make use of any feature, it must be exposed
to the VM as a whole by the host VMM (QEMU, kvmtool, whatever). Which
means having a new userspace ABI, specific to KVM, exposing a feature
for which there is no spec whatsoever. Even worse, you cannot discover
whether the instruction you must use to context switch the ACTLR_EL1
register is implemented. Isn't that great?

And I'm not even talking about the joys of migrating such a VM,
because we have no clue what this bit means on other implementations.
For all we know it causes another CPU to catch fire (or go PDP-endian,
which is basically the same).

Which is why my proposal is for this bit to be set statically for
*all* VMs, and leave the kernel (and KVM) out of the picture
altogether. At least that is something we can reason about (although
someone would need to start thinking of how this particular TSO
implementation composes with the relaxed memory ordering used outside
of the VM and show that they actually lead to correct results for
something such as virtio, for example).

Thanks,

M.

--
Without deviation from the norm, progress is not possible.