2021-11-16 00:24:10

by Andy Lutomirski

Subject: Revisiting XFD-based AMX and heterogenous systems

[resend -- first try was HTML. oops.]

Hi all-

I just learned that current Alder Lake review samples are actually heterogenous, at least physically. The performance cores have AVX-512 and the efficiency cores don't have AVX-512. Since no OS supports actual runtime ISA heterogeneity, this feature seems to be hidden in that one must choose, per boot, whether one wants AVX-512 or efficiency cores, but the CPU is physically heterogenous.

All the earlier discussions about Linux AMX architecture happened under the assumption that xfeature-heterogenous systems would never happen, and my grudging acceptance of the XFD model was predicated on that. But now we have obviously heterogenous hardware that is apparently actually shipping at least to reviewers, and I think we should revisit this before we merge AMX support.

The current proposed AMX ABI on Linux has all tasks starting out with the AMX bits set in XCR0. If the tasks want to actually use AMX, they need to issue a prctl asking for permission and, if they don't issue that prctl, they will take XFD faults and get signalled if they try to use the TILE regs. If Intel thinks we might ever have a software-visible heterogenous system, then this will be very awkward -- all tasks will start with AMX set in XCR0 even if they're on efficiency cores and even if they have no intention of ever using AMX. Once a process asks for AMX permission or perhaps after the first XFD fault, the process will be pinned to performance cores and can use AMX. Aside from awkwardness, this requires that efficiency cores actually implement enough of AMX to do this. And keep in mind that Alder Lake is actually heterogenous for AVX-512, and there is no such thing as AVX-512 XFD.
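For concreteness, here is a minimal sketch of what the opt-in step looks like from userspace under the XFD model. It assumes the interface as it is documented for mainline Linux, where the request goes through arch_prctl() rather than prctl() proper; the constants are defined locally in case the installed headers lack them, and the helper name request_amx_permission() is made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values as given in the kernel's x86 xstate documentation; defined
 * here because older libc/kernel headers may not provide them. */
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#ifndef XFEATURE_XTILEDATA
#define XFEATURE_XTILEDATA 18 /* AMX tile data state component */
#endif

/*
 * Hypothetical helper: ask the kernel for permission to use AMX tile
 * data. Returns 0 if permission was granted, -1 with errno set
 * otherwise (for example on hardware or kernels without AMX).
 */
static int request_amx_permission(void)
{
#if defined(__linux__) && defined(__x86_64__)
    return (int)syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM,
                        XFEATURE_XTILEDATA);
#else
    /* Not an x86-64 Linux build: there is nothing to request. */
    errno = ENOSYS;
    return -1;
#endif
}
```

A task would call request_amx_permission() once before touching the TILE regs and fall back to a non-AMX code path if it returns -1. Note that nothing here changes what the task sees in XCR0: the AMX bits are already set before the call, which is exactly the property questioned above.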

So I suggest that we go back and switch to the XCR0 model. Tasks will start out with AMX clear in XCR0. If they want AMX, they issue a prctl asking for AMX, AMX gets set in XCR0, and the tasks need to be able to tolerate the XCR0 change. Then, if Intel ever wants to expose the full Alder Lake physical capabilities and support efficiency cores and AVX-512 on the same boot, we can have a mode in which tasks start with AVX-512 clear in XCR0 and can opt in with prctl. This will require HPC-like apps to be recompiled or run with a special wrapper, but will otherwise expose the full HW capabilities. (Of course this assumes that Intel sets up MSRs or ucode or whatever to support this.)
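To make the XCR0 model concrete, here is a toy userspace model of the per-task state it implies. This is only a sketch: struct toy_task and the toy_task_* helpers are invented for illustration and are not kernel code or a proposed interface. The only real-world facts used are the XCR0 bit positions (x87, SSE, AVX at bits 0-2; the three AVX-512 components at bits 5-7; XTILECFG and XTILEDATA at bits 17-18).

```c
#include <stdbool.h>
#include <stdint.h>

/* XCR0 state-component bits, per the Intel SDM. */
#define XFEATURE_MASK_BASELINE ((1ULL << 0) | (1ULL << 1) | (1ULL << 2)) /* x87, SSE, AVX */
#define XFEATURE_MASK_AVX512   ((1ULL << 5) | (1ULL << 6) | (1ULL << 7)) /* opmask, ZMM_Hi256, Hi16_ZMM */
#define XFEATURE_MASK_AMX      ((1ULL << 17) | (1ULL << 18))             /* XTILECFG, XTILEDATA */

/* Hypothetical per-task view of XCR0; not a real kernel structure. */
struct toy_task {
    uint64_t xcr0;
};

/*
 * New tasks always start with AMX clear. In the hypothetical
 * "heterogenous" boot mode they also start with AVX-512 clear;
 * otherwise AVX-512 is on from the start, as it is today.
 */
static struct toy_task toy_task_new(bool hetero_mode)
{
    struct toy_task t = { .xcr0 = XFEATURE_MASK_BASELINE };

    if (!hetero_mode)
        t.xcr0 |= XFEATURE_MASK_AVX512;
    return t;
}

/*
 * Models the opt-in prctl: set the requested bits in the task's XCR0
 * if the hardware supports all of them. A real kernel would also
 * restrict the task to cores that implement the feature.
 */
static bool toy_task_opt_in(struct toy_task *t, uint64_t features,
                            uint64_t hw_supported)
{
    if ((features & hw_supported) != features)
        return false;
    t->xcr0 |= features;
    return true;
}

/* A feature is usable only if all of its bits are set in XCR0. */
static bool toy_task_can_use(const struct toy_task *t, uint64_t features)
{
    return (t->xcr0 & features) == features;
}
```

The point of the model is that a task that never opts in never has the bits set in XCR0 at all, so it never sees state that an efficiency core would have to partially implement, and the same mechanism covers AVX-512 without needing an XFD equivalent.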

What do you all think?