2024-05-29 19:47:30

by Mark Brown

Subject: [PATCH v6] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state

When we are in a syscall we take the opportunity to discard the SVE state,
saving only the FPSIMD subset of the register state. When we reload the
state from memory we reenable SVE access traps, stopping tracking SVE until
the task takes another SVE access trap. This means that for a task which is
actively using SVE many blocking system calls will have the additional
overhead of a SVE access trap.

As SVE deployment is progressing we are seeing much wider use of the SVE
instruction set, including performance optimised implementations of
operations like memset() and memcpy(), which mean that even tasks which are
not obviously floating point based can end up with substantial SVE usage.

It does not, however, make sense to just unconditionally use the full SVE
register state all the time: it is larger than the FPSIMD register state,
so there is overhead saving and restoring it on context switch, and our
requirement to flush the register state not shared with FPSIMD on syscall
also creates a noticeable overhead on system call.

I did some instrumentation which counted the number of SVE access traps
and the number of times we loaded FPSIMD only register state for each task.
Testing with Debian Bookworm this showed that during boot the overwhelming
majority of tasks triggered another SVE access trap more than 50% of the
time after loading FPSIMD only state, with a substantial number near 100%,
though some programs had a very small number of SVE accesses most likely
from startup. There were few tasks in the range 5-45%; most tasks either
used SVE frequently or used it only a tiny proportion of the time. As expected
older distributions which do not have the SVE performance work available
showed no SVE usage in general applications.

This indicates that there should be some useful benefit from reducing the
number of SVE access traps for blocking system calls like we did for non
blocking system calls in commit 8c845e273104 ("arm64/sve: Leave SVE enabled
on syscall if we don't context switch"). Let's do this with a timeout: when
we take a SVE access trap, record a jiffies value after which we'll reenable
SVE traps, then check this whenever we load FPSIMD only floating point state
from memory. If the time has passed then we reenable traps, otherwise we
leave traps disabled and flush the non-shared register state like we would
on trap.

The timeout is currently set to a second; I pulled this number out of thin
air, so there is doubtless some room for tuning. This means that for a
task which is actively using SVE the number of SVE access traps will be
substantially reduced but applications which use SVE only very
infrequently will avoid the overheads associated with tracking SVE state
after a second. The extra cost from additional tracking of SVE state
only occurs when a task is preempted so short running tasks should be
minimally affected.

There should be no functional change resulting from this, it is purely a
performance optimisation.

Signed-off-by: Mark Brown <[email protected]>
---
Changes in v6:
- Rebase onto v6.10-rc1.
- Link to v5: https://lore.kernel.org/r/[email protected]

Changes in v5:
- Rebase onto v6.9-rc1.
- Use a timeout rather than number of state loads to decide when to
reenable traps.
- Link to v4: https://lore.kernel.org/r/[email protected]

Changes in v4:
- Rebase onto v6.8-rc1.
- Link to v3: https://lore.kernel.org/r/[email protected]

Changes in v3:
- Rebase onto v6.7-rc1.
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- Rebase onto v6.6-rc1.
- Link to v1: https://lore.kernel.org/r/[email protected]
---
arch/arm64/include/asm/processor.h | 1 +
arch/arm64/kernel/fpsimd.c | 42 ++++++++++++++++++++++++++++++++------
2 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index f77371232d8c..7a6ed0551291 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -164,6 +164,7 @@ struct thread_struct {
unsigned int fpsimd_cpu;
void *sve_state; /* SVE registers, if any */
void *sme_state; /* ZA and ZT state, if any */
+ unsigned long sve_timeout; /* jiffies to drop TIF_SVE */
unsigned int vl[ARM64_VEC_MAX]; /* vector length */
unsigned int vl_onexec[ARM64_VEC_MAX]; /* vl after next exec */
unsigned long fault_address; /* fault info */
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 82e8a6017382..4741e4fb612a 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -354,6 +354,7 @@ static void task_fpsimd_load(void)
{
bool restore_sve_regs = false;
bool restore_ffr;
+ unsigned long sve_vq_minus_one;

WARN_ON(!system_supports_fpsimd());
WARN_ON(preemptible());
@@ -365,18 +366,12 @@ static void task_fpsimd_load(void)
if (system_supports_sve() || system_supports_sme()) {
switch (current->thread.fp_type) {
case FP_STATE_FPSIMD:
- /* Stop tracking SVE for this task until next use. */
- if (test_and_clear_thread_flag(TIF_SVE))
- sve_user_disable();
break;
case FP_STATE_SVE:
if (!thread_sm_enabled(&current->thread) &&
!WARN_ON_ONCE(!test_and_set_thread_flag(TIF_SVE)))
sve_user_enable();

- if (test_thread_flag(TIF_SVE))
- sve_set_vq(sve_vq_from_vl(task_get_sve_vl(current)) - 1);
-
restore_sve_regs = true;
restore_ffr = true;
break;
@@ -395,6 +390,15 @@ static void task_fpsimd_load(void)
}
}

+ /*
+ * If SVE has been enabled we may keep it enabled even if
+ * loading only FPSIMD state, so always set the VL.
+ */
+ if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
+ sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1;
+ sve_set_vq(sve_vq_minus_one);
+ }
+
/* Restore SME, override SVE register configuration if needed */
if (system_supports_sme()) {
unsigned long sme_vl = task_get_sme_vl(current);
@@ -421,6 +425,25 @@ static void task_fpsimd_load(void)
} else {
WARN_ON_ONCE(current->thread.fp_type != FP_STATE_FPSIMD);
fpsimd_load_state(&current->thread.uw.fpsimd_state);
+
+ /*
+ * If the task had been using SVE we keep it enabled
+ * when loading FPSIMD only state for a period to
+ * minimise overhead for tasks actively using SVE,
+ * disabling it periodically to ensure that tasks that
+ * use SVE intermittently do eventually avoid the
+ * overhead of carrying SVE state. The timeout is
+ * initialised when we take a SVE trap in
+ * do_sve_acc().
+ */
+ if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
+ if (time_after(jiffies, current->thread.sve_timeout)) {
+ clear_thread_flag(TIF_SVE);
+ sve_user_disable();
+ } else {
+ sve_flush_live(true, sve_vq_minus_one);
+ }
+ }
}
}

@@ -1397,6 +1420,13 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)

get_cpu_fpsimd_context();

+ /*
+ * We will keep SVE enabled when loading FPSIMD only state for
+ * the next second to minimise traps when userspace is
+ * actively using SVE.
+ */
+ current->thread.sve_timeout = jiffies + HZ;
+
if (test_and_set_thread_flag(TIF_SVE))
WARN_ON(1); /* SVE access shouldn't have trapped */


---
base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
change-id: 20230807-arm64-sve-trap-mitigation-2e7e2663c849

Best regards,
--
Mark Brown <[email protected]>



2024-05-30 14:04:17

by Dave Martin

Subject: Re: [PATCH v6] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state

Hi Mark,

On Wed, May 29, 2024 at 08:46:23PM +0100, Mark Brown wrote:
> When we are in a syscall we take the opportunity to discard the SVE state,
> saving only the FPSIMD subset of the register state. When we reload the
> state from memory we reenable SVE access traps, stopping tracking SVE until
> the task takes another SVE access trap. This means that for a task which is
> actively using SVE many blocking system calls will have the additional
> overhead of a SVE access trap.

Playing devil's advocate here: doesn't a blocking syscall already imply
a high overhead (at least in terms of latency for the thread concerned)?

i.e., does letting TIF_SVE linger across some blocking syscalls make a
meaningful difference in some known use case?


(For non-blocking syscalls the argument for allowing TIF_SVE to linger
seems a lot stronger.)

> As SVE deployment is progressing we are seeing much wider use of the SVE
> instruction set, including performance optimised implementations of
> operations like memset() and memcpy(), which mean that even tasks which are
> not obviously floating point based can end up with substantial SVE usage.
>
> It does not, however, make sense to just unconditionally use the full SVE
> register state all the time: it is larger than the FPSIMD register state,
> so there is overhead saving and restoring it on context switch, and our
> requirement to flush the register state not shared with FPSIMD on syscall
> also creates a noticeable overhead on system call.
>
> I did some instrumentation which counted the number of SVE access traps
> and the number of times we loaded FPSIMD only register state for each task.
> Testing with Debian Bookworm this showed that during boot the overwhelming
> majority of tasks triggered another SVE access trap more than 50% of the
> time after loading FPSIMD only state, with a substantial number near 100%,
> though some programs had a very small number of SVE accesses most likely
> from startup. There were few tasks in the range 5-45%; most tasks either
> used SVE frequently or used it only a tiny proportion of the time. As expected
> older distributions which do not have the SVE performance work available
> showed no SVE usage in general applications.
>
> This indicates that there should be some useful benefit from reducing the
> number of SVE access traps for blocking system calls like we did for non
> blocking system calls in commit 8c845e273104 ("arm64/sve: Leave SVE enabled
> on syscall if we don't context switch"). Let's do this with a timeout: when
> we take a SVE access trap, record a jiffies value after which we'll reenable
> SVE traps, then check this whenever we load FPSIMD only floating point state
> from memory. If the time has passed then we reenable traps, otherwise we
> leave traps disabled and flush the non-shared register state like we would
> on trap.
>
> The timeout is currently set to a second; I pulled this number out of thin
> air, so there is doubtless some room for tuning. This means that for a
> task which is actively using SVE the number of SVE access traps will be
> substantially reduced but applications which use SVE only very
> infrequently will avoid the overheads associated with tracking SVE state
> after a second. The extra cost from additional tracking of SVE state
> only occurs when a task is preempted so short running tasks should be
> minimally affected.

Could your instrumentation be extended to build a histogram of the delay
between successive SVE traps per task?

There's a difference here between a task that takes a lot of traps in a
burst (perhaps due to startup or a specific library call), versus a task
that uses SVE sporadically for all time.
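
For concreteness, the kind of thing I have in mind is a log2-bucketed
histogram of the delay between successive traps, along these lines
(purely illustrative; none of these fields exist in the kernel or in
this patch):

	/* hypothetical additions to struct thread_struct */
	unsigned long sve_last_trap;		/* jiffies at previous SVE trap */
	unsigned long sve_trap_hist[10];	/* log2(msec) delay buckets */

	/* in do_sve_acc() */
	if (current->thread.sve_last_trap) {
		unsigned long msecs =
			jiffies_to_msecs(jiffies - current->thread.sve_last_trap);
		int bucket = min(ilog2(msecs + 1), 9);

		current->thread.sve_trap_hist[bucket]++;
	}
	current->thread.sve_last_trap = jiffies;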

I wonder whether the sweet spot for the timeout may be quite a lot
shorter than a second. Still, once we have something we can tune, we
can always tune it later as you suggest.

>
> There should be no functional change resulting from this, it is purely a
> performance optimisation.
>
> Signed-off-by: Mark Brown <[email protected]>
> ---
> Changes in v6:
> - Rebase onto v6.10-rc1.
> - Link to v5: https://lore.kernel.org/r/[email protected]
>
> Changes in v5:
> - Rebase onto v6.9-rc1.
> - Use a timeout rather than number of state loads to decide when to
> reenable traps.
> - Link to v4: https://lore.kernel.org/r/[email protected]
>
> Changes in v4:
> - Rebase onto v6.8-rc1.
> - Link to v3: https://lore.kernel.org/r/[email protected]
>
> Changes in v3:
> - Rebase onto v6.7-rc1.
> - Link to v2: https://lore.kernel.org/r/[email protected]
>
> Changes in v2:
> - Rebase onto v6.6-rc1.
> - Link to v1: https://lore.kernel.org/r/[email protected]
> ---
> arch/arm64/include/asm/processor.h | 1 +
> arch/arm64/kernel/fpsimd.c | 42 ++++++++++++++++++++++++++++++++------
> 2 files changed, 37 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> index f77371232d8c..7a6ed0551291 100644
> --- a/arch/arm64/include/asm/processor.h
> +++ b/arch/arm64/include/asm/processor.h
> @@ -164,6 +164,7 @@ struct thread_struct {
> unsigned int fpsimd_cpu;
> void *sve_state; /* SVE registers, if any */
> void *sme_state; /* ZA and ZT state, if any */
> + unsigned long sve_timeout; /* jiffies to drop TIF_SVE */
> unsigned int vl[ARM64_VEC_MAX]; /* vector length */
> unsigned int vl_onexec[ARM64_VEC_MAX]; /* vl after next exec */
> unsigned long fault_address; /* fault info */
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 82e8a6017382..4741e4fb612a 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -354,6 +354,7 @@ static void task_fpsimd_load(void)
> {
> bool restore_sve_regs = false;
> bool restore_ffr;
> + unsigned long sve_vq_minus_one;
>
> WARN_ON(!system_supports_fpsimd());
> WARN_ON(preemptible());
> @@ -365,18 +366,12 @@ static void task_fpsimd_load(void)
> if (system_supports_sve() || system_supports_sme()) {
> switch (current->thread.fp_type) {
> case FP_STATE_FPSIMD:
> - /* Stop tracking SVE for this task until next use. */
> - if (test_and_clear_thread_flag(TIF_SVE))
> - sve_user_disable();
> break;
> case FP_STATE_SVE:
> if (!thread_sm_enabled(&current->thread) &&
> !WARN_ON_ONCE(!test_and_set_thread_flag(TIF_SVE)))
> sve_user_enable();
>
> - if (test_thread_flag(TIF_SVE))
> - sve_set_vq(sve_vq_from_vl(task_get_sve_vl(current)) - 1);
> -
> restore_sve_regs = true;
> restore_ffr = true;
> break;
> @@ -395,6 +390,15 @@ static void task_fpsimd_load(void)
> }
> }
>
> + /*
> + * If SVE has been enabled we may keep it enabled even if
> + * loading only FPSIMD state, so always set the VL.
> + */
> + if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
> + sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1;
> + sve_set_vq(sve_vq_minus_one);
> + }
> +
> /* Restore SME, override SVE register configuration if needed */
> if (system_supports_sme()) {
> unsigned long sme_vl = task_get_sme_vl(current);
> @@ -421,6 +425,25 @@ static void task_fpsimd_load(void)
> } else {
> WARN_ON_ONCE(current->thread.fp_type != FP_STATE_FPSIMD);
> fpsimd_load_state(&current->thread.uw.fpsimd_state);
> +
> + /*
> + * If the task had been using SVE we keep it enabled
> + * when loading FPSIMD only state for a period to
> + * minimise overhead for tasks actively using SVE,
> + * disabling it periodically to ensure that tasks that
> + * use SVE intermittently do eventually avoid the
> + * overhead of carrying SVE state. The timeout is
> + * initialised when we take a SVE trap in
> + * do_sve_acc().
> + */
> + if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
> + if (time_after(jiffies, current->thread.sve_timeout)) {
> + clear_thread_flag(TIF_SVE);
> + sve_user_disable();
> + } else {
> + sve_flush_live(true, sve_vq_minus_one);

Didn't we already flush Zn[max:128] as a side-effect of loading the
V-regs in fpsimd_load_state() above?

Also, unless I'm missing something, prior to this patch we could just
fall through this code with TIF_SVE still set, suggesting that either
this flush is not needed for some reason, or it is shadowed by another
flush done somewhere else, or a flush is currently needed but missing.
Am I just getting myself confused here?

(Or, do the deletions from the switch in the earlier hunk cancel this
out?)

> + }
> + }
> }
> }
>
> @@ -1397,6 +1420,13 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
>
> get_cpu_fpsimd_context();
>
> + /*
> + * We will keep SVE enabled when loading FPSIMD only state for
> + * the next second to minimise traps when userspace is
> + * actively using SVE.
> + */
> + current->thread.sve_timeout = jiffies + HZ;
> +
> if (test_and_set_thread_flag(TIF_SVE))
> WARN_ON(1); /* SVE access shouldn't have trapped */
>
>
> ---
> base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
> change-id: 20230807-arm64-sve-trap-mitigation-2e7e2663c849
>
> Best regards,
> --
> Mark Brown <[email protected]>

[...]

Cheers
---Dave

2024-05-30 15:41:02

by Mark Brown

Subject: Re: [PATCH v6] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state

On Thu, May 30, 2024 at 03:03:42PM +0100, Dave Martin wrote:
> On Wed, May 29, 2024 at 08:46:23PM +0100, Mark Brown wrote:

> > When we are in a syscall we take the opportunity to discard the SVE state,
> > saving only the FPSIMD subset of the register state. When we reload the
> > state from memory we reenable SVE access traps, stopping tracking SVE until
> > the task takes another SVE access trap. This means that for a task which is
> > actively using SVE many blocking system calls will have the additional
> > overhead of a SVE access trap.

> Playing devil's advocate here: doesn't a blocking syscall already imply
> a high overhead (at least in terms of latency for the thread concerned)?

> i.e., does letting TIF_SVE linger across some blocking syscalls make a
> meaningful difference in some known use case?

I don't have any solid examples since I didn't do physical benchmarking
yet (the only thing I have access to is Graviton; I need to figure out
deploying kernels there for testing). I am always slightly nervous about
only benchmarking on big cores, though for something like this they
might well get more benefit anyway.

> Could your instrumentation be extended to build a histogram of the delay
> between successive SVE traps per task?

You could do that instrumentation; what I had was extremely crude
though.
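
To give an idea of the sort of thing it was, rather than the actual
code (this is an illustrative sketch only; the fields and the
exit-time reporting are invented here, not from any posted patch):

	/* hypothetical additions to struct thread_struct */
	unsigned long sve_trap_cnt;	/* bumped in do_sve_acc() */
	unsigned long fpsimd_load_cnt;	/* bumped in task_fpsimd_load()
					 * on the FPSIMD only path */

	/* reported when the task exits */
	pr_info("%s[%d]: %lu SVE traps, %lu FPSIMD only loads\n",
		current->comm, task_pid_nr(current),
		current->thread.sve_trap_cnt,
		current->thread.fpsimd_load_cnt);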

> There's a difference here between a task that takes a lot of traps in a
> burst (perhaps due to startup or a specific library call), versus a task
> that uses SVE sporadically for all time.

The pattern I was seeing was mostly either a very small number of SVE
uses during startup and then none for the rest of the runtime, or
otherwise SVE used on comfortably over 50% of the restores, so I didn't
look super hard. I'd expect you will see some things more in the middle
where a non-SVE application uses a SVE enabled library (eg, a JIT that
doesn't generate SVE code doing calls out occasionally to a C library
built with SVE) though.

> I wonder whether the sweet spot for the timeout may be quite a lot
> shorter than a second. Still, once we have something we can tune, we
> can always tune it later as you suggest.

My instinct is that a second is probably on the high end, yeah. Given
that it's just pulling numbers out of thin air I went high to minimise
the impact on tasks that are actively using SVE for a long time,
guessing that short running tasks are likely to not get scheduled so
often anyway and they only pay the cost when preempted.

> > + if (time_after(jiffies, current->thread.sve_timeout)) {
> > + clear_thread_flag(TIF_SVE);
> > + sve_user_disable();
> > + } else {
> > + sve_flush_live(true, sve_vq_minus_one);

> Didn't we already flush Zn[max:128] as a side-effect of loading the
> V-regs in fpsimd_load_state() above?

Looking at the pseudocode for V[] that does appear to be true for the Z
registers since we are now setting the vector length, but we do still
need to worry about the P and FFR registers so there still needs to be
some flush. Unless we're both missing something there, there's room for
an optimisation here; it won't make a difference for 128 bit VLs since
we skip the Z register flush there anyway, but it would help with larger
VLs.
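
If we did add that optimisation I'd expect it to end up looking
something like the below, assuming a new P/FFR only flush helper
(sve_flush_predicates() is a name invented here for illustration, no
such helper exists today):

	if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
		if (time_after(jiffies, current->thread.sve_timeout)) {
			clear_thread_flag(TIF_SVE);
			sve_user_disable();
		} else {
			/* hypothetical: zero P0-P15 and FFR, leave Z regs alone */
			sve_flush_predicates();
		}
	}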

> Also, unless I'm missing something, prior to this patch we could just
> fall through this code with TIF_SVE still set, suggesting that either
> this flush is not needed for some reason, or it is shadowed by another
> flush done somewhere else, or a flush is currenly needed but missing.
> Am I just getting myself confused here?

With the current code we will disable SVE so the contents of the state
only accessible with SVE become immaterial; we deal with it when we take
an access trap.

With this patch we will leave SVE enabled so we do need to ensure that
the visible SVE state is flushed. As you point out we no longer need to
flush the Z registers even for larger VLs since SVE is always enabled
for EL1 and this patch means that we always set the vector length if the
task has TIF_SVE. Setting the VL means that the constrained
unpredictable choice between zeroing the maximum or currently visible
upper bits in the Z registers will always zero all the bits that EL0
could access. We do however continue to need to flush the predicate
registers.

