LinuxLists.cc - Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

2023-09-13 12:38:16

by Geert Uytterhoeven

Subject: Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

Hi Evan,

On Fri, Aug 18, 2023 at 9:44 PM Evan Green <[email protected]> wrote:
> Rather than deferring unaligned access speed determinations to a vendor
> function, let's probe them and find out how fast they are. If we
> determine that an unaligned word access is faster than N byte accesses,
> mark the hardware's unaligned access as "fast". Otherwise, we mark
> accesses as slow.
>
> The algorithm itself runs for a fixed amount of jiffies. Within each
> iteration it attempts to time a single loop, and then keeps only the best
> (fastest) loop it saw. This algorithm was found to have lower variance from
> run to run than my first attempt, which counted the total number of
> iterations that could be done in that fixed amount of jiffies. By taking
> only the best iteration in the loop, assuming at least one loop wasn't
> perturbed by an interrupt, we eliminate the effects of interrupts and
> other "warm up" factors like branch prediction. The only downside is it
> depends on having an rdtime granular and accurate enough to measure a
> single copy. If we ever manage to complete a loop in 0 rdtime ticks, we
> leave the unaligned setting at UNKNOWN.
>
> There is a slight change in user-visible behavior here. Previously, all
> boards except the THead C906 reported misaligned access speed of
> UNKNOWN. C906 reported FAST. With this change, since we're now measuring
> misaligned access speed on each hart, all RISC-V systems will have this
> key set as either FAST or SLOW.
>
> Currently, we don't have a way to confidently measure the difference between
> SLOW and EMULATED, so we label anything not fast as SLOW. This will
> mislabel some systems that are actually EMULATED as SLOW. When we get
> support for delegating misaligned access traps to the kernel (as opposed
> to the firmware quietly handling it), we can explicitly test in Linux to
> see if unaligned accesses trap. Those systems will start to report
> EMULATED, though older (today's) systems without that new SBI mechanism
> will continue to report SLOW.
>
> I've updated the documentation for those hwprobe values to reflect
> this, specifically: SLOW may or may not be emulated by software, and FAST
> means being faster than equivalent byte accesses. The change
> in documentation is accurate with respect to both the former and current
> behavior.
>
> Signed-off-by: Evan Green <[email protected]>
> Acked-by: Conor Dooley <[email protected]>

Thanks for your patch, which is now commit 584ea6564bcaead2 ("RISC-V:
Probe for unaligned access speed") in v6.6-rc1.

On the boards I have, I get:

rzfive:
cpu0: Ratio of byte access time to unaligned word access is
1.05, unaligned accesses are fast

icicle:

cpu1: Ratio of byte access time to unaligned word access is
0.00, unaligned accesses are slow
cpu2: Ratio of byte access time to unaligned word access is
0.00, unaligned accesses are slow
cpu3: Ratio of byte access time to unaligned word access is
0.00, unaligned accesses are slow

cpu0: Ratio of byte access time to unaligned word access is
0.00, unaligned accesses are slow

k210:

cpu1: Ratio of byte access time to unaligned word access is
0.02, unaligned accesses are slow
cpu0: Ratio of byte access time to unaligned word access is
0.02, unaligned accesses are slow

starlight:

cpu1: Ratio of byte access time to unaligned word access is
0.01, unaligned accesses are slow
cpu0: Ratio of byte access time to unaligned word access is
0.02, unaligned accesses are slow

vexriscv/orangecrab:

cpu0: Ratio of byte access time to unaligned word access is
0.00, unaligned accesses are slow

I am a bit surprised by the near-zero values. Are these expected?
Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2023-09-14 00:30:50

by Evan Green

Subject: Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

On Wed, Sep 13, 2023 at 5:36 AM Geert Uytterhoeven <[email protected]> wrote:
>
> Hi Evan,
>
> On Fri, Aug 18, 2023 at 9:44 PM Evan Green <[email protected]> wrote:
> > Rather than deferring unaligned access speed determinations to a vendor
> > function, let's probe them and find out how fast they are. If we
> > determine that an unaligned word access is faster than N byte accesses,
> > mark the hardware's unaligned access as "fast". Otherwise, we mark
> > accesses as slow.
> >
> > The algorithm itself runs for a fixed amount of jiffies. Within each
> > iteration it attempts to time a single loop, and then keeps only the best
> > (fastest) loop it saw. This algorithm was found to have lower variance from
> > run to run than my first attempt, which counted the total number of
> > iterations that could be done in that fixed amount of jiffies. By taking
> > only the best iteration in the loop, assuming at least one loop wasn't
> > perturbed by an interrupt, we eliminate the effects of interrupts and
> > other "warm up" factors like branch prediction. The only downside is it
> > depends on having an rdtime granular and accurate enough to measure a
> > single copy. If we ever manage to complete a loop in 0 rdtime ticks, we
> > leave the unaligned setting at UNKNOWN.
> >
> > There is a slight change in user-visible behavior here. Previously, all
> > boards except the THead C906 reported misaligned access speed of
> > UNKNOWN. C906 reported FAST. With this change, since we're now measuring
> > misaligned access speed on each hart, all RISC-V systems will have this
> > key set as either FAST or SLOW.
> >
> > Currently, we don't have a way to confidently measure the difference between
> > SLOW and EMULATED, so we label anything not fast as SLOW. This will
> > mislabel some systems that are actually EMULATED as SLOW. When we get
> > support for delegating misaligned access traps to the kernel (as opposed
> > to the firmware quietly handling it), we can explicitly test in Linux to
> > see if unaligned accesses trap. Those systems will start to report
> > EMULATED, though older (today's) systems without that new SBI mechanism
> > will continue to report SLOW.
> >
> > I've updated the documentation for those hwprobe values to reflect
> > this, specifically: SLOW may or may not be emulated by software, and FAST
> > means being faster than equivalent byte accesses. The change
> > in documentation is accurate with respect to both the former and current
> > behavior.
> >
> > Signed-off-by: Evan Green <[email protected]>
> > Acked-by: Conor Dooley <[email protected]>
>
> Thanks for your patch, which is now commit 584ea6564bcaead2 ("RISC-V:
> Probe for unaligned access speed") in v6.6-rc1.
>
> On the boards I have, I get:
>
> rzfive:
> cpu0: Ratio of byte access time to unaligned word access is
> 1.05, unaligned accesses are fast

Hrm, I'm a little surprised to be seeing this number come out so close
to 1. If you reboot a few times, what kind of variance do you get on
this?

>
> icicle:
>
> cpu1: Ratio of byte access time to unaligned word access is
> 0.00, unaligned accesses are slow
> cpu2: Ratio of byte access time to unaligned word access is
> 0.00, unaligned accesses are slow
> cpu3: Ratio of byte access time to unaligned word access is
> 0.00, unaligned accesses are slow
>
> cpu0: Ratio of byte access time to unaligned word access is
> 0.00, unaligned accesses are slow
>
> k210:
>
> cpu1: Ratio of byte access time to unaligned word access is
> 0.02, unaligned accesses are slow
> cpu0: Ratio of byte access time to unaligned word access is
> 0.02, unaligned accesses are slow
>
> starlight:
>
> cpu1: Ratio of byte access time to unaligned word access is
> 0.01, unaligned accesses are slow
> cpu0: Ratio of byte access time to unaligned word access is
> 0.02, unaligned accesses are slow
>
> vexriscv/orangecrab:
>
> cpu0: Ratio of byte access time to unaligned word access is
> 0.00, unaligned accesses are slow
>
> I am a bit surprised by the near-zero values. Are these expected?
> Thanks!

This could be expected, if firmware is trapping the unaligned accesses
and coming out >100x slower than a native access. If you're interested
in getting a little more resolution, you could try to print a few more
decimal places with something like (sorry gmail mangles the whitespace
on this):

diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 1cfbba65d11a..2c094037658a 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -632,11 +632,11 @@ void check_unaligned_access(int cpu)
 	if (word_cycles < byte_cycles)
 		speed = RISCV_HWPROBE_MISALIGNED_FAST;
 
-	ratio = div_u64((byte_cycles * 100), word_cycles);
-	pr_info("cpu%d: Ratio of byte access time to unaligned word access is %d.%02d, unaligned accesses are %s\n",
+	ratio = div_u64((byte_cycles * 100000), word_cycles);
+	pr_info("cpu%d: Ratio of byte access time to unaligned word access is %d.%05d, unaligned accesses are %s\n",
 		cpu,
-		ratio / 100,
-		ratio % 100,
+		ratio / 100000,
+		ratio % 100000,
 		(speed == RISCV_HWPROBE_MISALIGNED_FAST) ? "fast" : "slow");
 
 	per_cpu(misaligned_access_speed, cpu) = speed;

If you did, I'd be interested to see the results.
-Evan
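
As an aside on why the two-digit print collapses to 0.00: the ratio is computed with integer division, so any ratio below 1/100 truncates to zero at a scale of 100. The sketch below shows the same arithmetic at both scales in plain userspace C. The names `fixed_ratio`, `scaled_ratio` and `print_ratio` are hypothetical helpers made up for illustration; they are not part of cpufeature.c.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Whole and fractional parts of byte_cycles / word_cycles at a given scale. */
struct fixed_ratio {
	uint64_t whole;
	uint64_t frac;
};

/*
 * Hypothetical helper mirroring the div_u64() arithmetic in the diff:
 * scale = 100 gives two fractional digits, scale = 100000 gives five.
 * The caller must pass word_cycles != 0.
 */
static struct fixed_ratio scaled_ratio(uint64_t byte_cycles,
				       uint64_t word_cycles, uint64_t scale)
{
	uint64_t ratio = byte_cycles * scale / word_cycles;
	struct fixed_ratio r = { ratio / scale, ratio % scale };

	return r;
}

/* Print the ratio with `digits` fractional digits (scale must be 10^digits). */
static void print_ratio(uint64_t byte_cycles, uint64_t word_cycles,
			uint64_t scale, int digits)
{
	struct fixed_ratio r = scaled_ratio(byte_cycles, word_cycles, scale);

	printf("%" PRIu64 ".%0*" PRIu64 "\n", r.whole, digits, r.frac);
}
```

With made-up counts of 344 byte cycles and 100000 word cycles, `print_ratio(344, 100000, 100, 2)` prints 0.00, while `print_ratio(344, 100000, 100000, 5)` prints 0.00344.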

2023-09-14 07:33:32

by Geert Uytterhoeven

Subject: Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

Hi Evan,

On Wed, Sep 13, 2023 at 7:46 PM Evan Green <[email protected]> wrote:
> On Wed, Sep 13, 2023 at 5:36 AM Geert Uytterhoeven <[email protected]> wrote:
> > On Fri, Aug 18, 2023 at 9:44 PM Evan Green <[email protected]> wrote:
> > > Rather than deferring unaligned access speed determinations to a vendor
> > > function, let's probe them and find out how fast they are. If we
> > > determine that an unaligned word access is faster than N byte accesses,
> > > mark the hardware's unaligned access as "fast". Otherwise, we mark
> > > accesses as slow.
> > >
> > > The algorithm itself runs for a fixed amount of jiffies. Within each
> > > iteration it attempts to time a single loop, and then keeps only the best
> > > (fastest) loop it saw. This algorithm was found to have lower variance from
> > > run to run than my first attempt, which counted the total number of
> > > iterations that could be done in that fixed amount of jiffies. By taking
> > > only the best iteration in the loop, assuming at least one loop wasn't
> > > perturbed by an interrupt, we eliminate the effects of interrupts and
> > > other "warm up" factors like branch prediction. The only downside is it
> > > depends on having an rdtime granular and accurate enough to measure a
> > > single copy. If we ever manage to complete a loop in 0 rdtime ticks, we
> > > leave the unaligned setting at UNKNOWN.
> > >
> > > There is a slight change in user-visible behavior here. Previously, all
> > > boards except the THead C906 reported misaligned access speed of
> > > UNKNOWN. C906 reported FAST. With this change, since we're now measuring
> > > misaligned access speed on each hart, all RISC-V systems will have this
> > > key set as either FAST or SLOW.
> > >
> > > Currently, we don't have a way to confidently measure the difference between
> > > SLOW and EMULATED, so we label anything not fast as SLOW. This will
> > > mislabel some systems that are actually EMULATED as SLOW. When we get
> > > support for delegating misaligned access traps to the kernel (as opposed
> > > to the firmware quietly handling it), we can explicitly test in Linux to
> > > see if unaligned accesses trap. Those systems will start to report
> > > EMULATED, though older (today's) systems without that new SBI mechanism
> > > will continue to report SLOW.
> > >
> > > I've updated the documentation for those hwprobe values to reflect
> > > this, specifically: SLOW may or may not be emulated by software, and FAST
> > > means being faster than equivalent byte accesses. The change
> > > in documentation is accurate with respect to both the former and current
> > > behavior.
> > >
> > > Signed-off-by: Evan Green <[email protected]>
> > > Acked-by: Conor Dooley <[email protected]>
> >
> > Thanks for your patch, which is now commit 584ea6564bcaead2 ("RISC-V:
> > Probe for unaligned access speed") in v6.6-rc1.
> >
> > On the boards I have, I get:
> >
> > rzfive:
> > cpu0: Ratio of byte access time to unaligned word access is
> > 1.05, unaligned accesses are fast
>
> Hrm, I'm a little surprised to be seeing this number come out so close
> to 1. If you reboot a few times, what kind of variance do you get on
> this?

Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)

> > icicle:
> >
> > cpu1: Ratio of byte access time to unaligned word access is
> > 0.00, unaligned accesses are slow
> > cpu2: Ratio of byte access time to unaligned word access is
> > 0.00, unaligned accesses are slow
> > cpu3: Ratio of byte access time to unaligned word access is
> > 0.00, unaligned accesses are slow
> >
> > cpu0: Ratio of byte access time to unaligned word access is
> > 0.00, unaligned accesses are slow

cpu1: Ratio of byte access time to unaligned word access is 0.00344,
unaligned accesses are slow
cpu2: Ratio of byte access time to unaligned word access is 0.00343,
unaligned accesses are slow
cpu3: Ratio of byte access time to unaligned word access is 0.00343,
unaligned accesses are slow
cpu0: Ratio of byte access time to unaligned word access is 0.00340,
unaligned accesses are slow

> > k210:
> >
> > cpu1: Ratio of byte access time to unaligned word access is
> > 0.02, unaligned accesses are slow
> > cpu0: Ratio of byte access time to unaligned word access is
> > 0.02, unaligned accesses are slow

cpu1: Ratio of byte access time to unaligned word access is 0.02392,
unaligned accesses are slow
cpu0: Ratio of byte access time to unaligned word access is 0.02084,
unaligned accesses are slow

> > starlight:
> >
> > cpu1: Ratio of byte access time to unaligned word access is
> > 0.01, unaligned accesses are slow
> > cpu0: Ratio of byte access time to unaligned word access is
> > 0.02, unaligned accesses are slow

cpu1: Ratio of byte access time to unaligned word access is 0.01872,
unaligned accesses are slow
cpu0: Ratio of byte access time to unaligned word access is 0.01930,
unaligned accesses are slow

> > vexriscv/orangecrab:
> >
> > cpu0: Ratio of byte access time to unaligned word access is
> > 0.00, unaligned accesses are slow

cpu0: Ratio of byte access time to unaligned word access is 0.00417,
unaligned accesses are slow

> > I am a bit surprised by the near-zero values. Are these expected?
>
> This could be expected, if firmware is trapping the unaligned accesses
> and coming out >100x slower than a native access. If you're interested
> in getting a little more resolution, you could try to print a few more
> decimal places with something like (sorry gmail mangles the whitespace
> on this):

Looks like you need to add one digit to get anything useful on half of the
systems.
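
For readers who prefer the inverse view, the ratios above can be turned into an approximate "how many times slower" figure by dividing the scale by the scaled ratio. This is a minimal sketch, assuming the five-digit print (a scale of 100000). `slowdown_factor` and `print_slowdowns` are hypothetical helpers, and the inputs are the cpu0 values reported in this mail.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define RATIO_SCALE 100000u	/* matches the %d.%05d print */

/*
 * Hypothetical helper: given the ratio as printed with five fractional
 * digits (0.00344 -> 344), return roughly how many times slower the
 * misaligned word copy was than the byte copy, truncated to an integer.
 * Returns 0 when the ratio is zero (no usable figure) or above 1.0
 * (the word copy was not slower).
 */
static uint64_t slowdown_factor(uint64_t ratio_scaled)
{
	if (ratio_scaled == 0)
		return 0;
	return RATIO_SCALE / ratio_scaled;
}

struct sample {
	const char *board;
	uint64_t ratio_scaled;
};

/* cpu0 ratios reported in the thread, scaled by RATIO_SCALE. */
static const struct sample samples[] = {
	{ "icicle", 340 },
	{ "k210", 2084 },
	{ "starlight", 1930 },
	{ "vexriscv/orangecrab", 417 },
};

static void print_slowdowns(void)
{
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("%s: about %" PRIu64 "x slower\n", samples[i].board,
		       slowdown_factor(samples[i].ratio_scaled));
}
```

This gives roughly 294x for icicle, 47x for k210, 51x for starlight and 239x for vexriscv/orangecrab. With only two fractional digits, anything more than 100x slower prints as 0.00.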

Gr{oetje,eeting}s,

Geert

2023-09-14 17:26:17

by David Laight

Subject: RE: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

From: Geert Uytterhoeven
> Sent: 14 September 2023 08:33
...
> > > rzfive:
> > > cpu0: Ratio of byte access time to unaligned word access is
> > > 1.05, unaligned accesses are fast
> >
> > Hrm, I'm a little surprised to be seeing this number come out so close
> > to 1. If you reboot a few times, what kind of variance do you get on
> > this?
>
> Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)

Would that match zero overhead unless the access crosses a
cache line boundary?
(I can't remember whether the test is using increasing addresses.)

...
> > > vexriscv/orangecrab:
> > >
> > > cpu0: Ratio of byte access time to unaligned word access is
> > > 0.00, unaligned accesses are slow
>
> cpu0: Ratio of byte access time to unaligned word access is 0.00417,
> unaligned accesses are slow
>
> > > I am a bit surprised by the near-zero values. Are these expected?
> >
> > This could be expected, if firmware is trapping the unaligned accesses
> > and coming out >100x slower than a native access. If you're interested
> > in getting a little more resolution, you could try to print a few more
> > decimal places with something like (sorry gmail mangles the whitespace
> > on this):

I'd expect one of three possible values:
- 1.0x: Basically zero cost except for cache line/page boundaries.
- ~2: Hardware does two reads and merges the values.
- >100: Trap fixed up in software.

I'd think the '2' case could be considered fast.
You only need to time one access to see if it was a fault.

David

2023-09-14 18:27:09

by Evan Green

Subject: Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

On Thu, Sep 14, 2023 at 1:47 AM David Laight <[email protected]> wrote:
>
> From: Geert Uytterhoeven
> > Sent: 14 September 2023 08:33
> ...
> > > > rzfive:
> > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > 1.05, unaligned accesses are fast
> > >
> > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > to 1. If you reboot a few times, what kind of variance do you get on
> > > this?
> >
> > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
>
> Would that match zero overhead unless the access crosses a
> cache line boundary?
> (I can't remember whether the test is using increasing addresses.)

Yes, the test does use increasing addresses; it copies across 4 pages.
We start with a warmup, so caching effects beyond L1 are largely not
taken into account.

>
> ...
> > > > vexriscv/orangecrab:
> > > >
> > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > 0.00, unaligned accesses are slow
> >
> > cpu0: Ratio of byte access time to unaligned word access is 0.00417,
> > unaligned accesses are slow
> >
> > > > I am a bit surprised by the near-zero values. Are these expected?
> > >
> > > This could be expected, if firmware is trapping the unaligned accesses
> > > and coming out >100x slower than a native access. If you're interested
> > > in getting a little more resolution, you could try to print a few more
> > > decimal places with something like (sorry gmail mangles the whitespace
> > > on this):
>
> I'd expect one of three possible values:
> - 1.0x: Basically zero cost except for cache line/page boundaries.
> - ~2: Hardware does two reads and merges the values.
> - >100: Trap fixed up in software.
>
> I'd think the '2' case could be considered fast.
> You only need to time one access to see if it was a fault.

We're comparing misaligned word accesses with byte accesses of the
same total size. So 1.0 means a misaligned word load is basically no
different from eight single-byte loads. The goal was to help people that are
forced to do odd loads and stores decide whether they are better off
moving by bytes or by misaligned words. (In contrast, the answer to
"should I do a misaligned word load or an aligned word load" is
generally always "do the aligned one if you can", so comparing those
two things didn't seem as useful).

We opted for 1.0 as a cutoff, since even at 1.05, you get a boost from
doing misaligned word loads over byte copies. I asked about the
variance because I don't want to see machines that change their mind
from boot to boot. I originally considered trying to create a "gray
zone" where the answer goes back to UNKNOWN, but in the end that just
moves the fiddly point rather than really eliminating it.

You're right that in theory we just need one perfect access to test,
but testing only once makes it susceptible to hiccups. We went with
doing it many times in a fixed period and taking the minimum to
hopefully remove noise like NMI-like things, branch prediction misses,
or cache eviction.
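
As a rough userspace illustration of the approach described above (not the kernel code itself): run each copy routine repeatedly for a fixed time budget, keep only the fastest single run, then compare the two minimums. Everything here is an assumption made for the sketch: the names `best_time_ns`, `classify`, `copy_bytes` and `copy_words`, the use of `clock_gettime()` in place of rdtime, and plain C loops standing in for the kernel's dedicated copy routines. A compiler is free to transform these loops, so treat the result as a demonstration of the best-of-N technique rather than a faithful probe.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

enum access_speed { SPEED_UNKNOWN, SPEED_SLOW, SPEED_FAST };

typedef void (*copy_fn)(unsigned char *dst, const unsigned char *src,
			size_t len);

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Copy one byte at a time. */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] = src[i];
}

/* Copy one machine word at a time; memcpy() keeps unaligned access legal C. */
static void copy_words(unsigned char *dst, const unsigned char *src, size_t len)
{
	size_t i;

	for (i = 0; i + sizeof(unsigned long) <= len; i += sizeof(unsigned long)) {
		unsigned long w;

		memcpy(&w, src + i, sizeof(w));
		memcpy(dst + i, &w, sizeof(w));
	}
}

/* Run fn repeatedly for budget_ns and keep only the fastest single run. */
static uint64_t best_time_ns(copy_fn fn, unsigned char *dst,
			     const unsigned char *src, size_t len,
			     uint64_t budget_ns)
{
	uint64_t best = UINT64_MAX;
	uint64_t deadline;

	fn(dst, src, len);		/* untimed warm-up pass */
	deadline = now_ns() + budget_ns;
	do {
		uint64_t start = now_ns();
		uint64_t elapsed;

		fn(dst, src, len);
		elapsed = now_ns() - start;
		if (elapsed < best)
			best = elapsed;
	} while (now_ns() < deadline);

	return best;
}

/* A zero measurement means the timer was too coarse: report UNKNOWN. */
static enum access_speed classify(uint64_t word_time, uint64_t byte_time)
{
	if (word_time == 0 || byte_time == 0)
		return SPEED_UNKNOWN;
	return word_time < byte_time ? SPEED_FAST : SPEED_SLOW;
}
```

To exercise the misaligned case, pass pointers offset by one byte into a buffer (for example `buf + 1`) and feed the two `best_time_ns()` results to `classify()`. Taking the minimum rather than the average is what discards runs that were disturbed by an interrupt or a cold cache.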

Geert,
Thanks for providing the numbers. Yes, we could add another digit to
the print. Though if you already know you're at least 100x slower,
maybe knowing exactly how much slower isn't super meaningful, just
very much avoid unaligned accesses on these systems :). Hopefully over
time the number of systems like this will dwindle.

-Evan

2023-10-19 06:37:56

by Geert Uytterhoeven

Subject: Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

Hi Prabhakar,

On Thu, Sep 14, 2023 at 9:32 AM Geert Uytterhoeven <[email protected]> wrote:
> On Wed, Sep 13, 2023 at 7:46 PM Evan Green <[email protected]> wrote:
> > On Wed, Sep 13, 2023 at 5:36 AM Geert Uytterhoeven <[email protected]> wrote:
> > > On Fri, Aug 18, 2023 at 9:44 PM Evan Green <[email protected]> wrote:
> > > > Rather than deferring unaligned access speed determinations to a vendor
> > > > function, let's probe them and find out how fast they are. If we
> > > > determine that an unaligned word access is faster than N byte accesses,
> > > > mark the hardware's unaligned access as "fast". Otherwise, we mark
> > > > accesses as slow.
> > > >
> > > > The algorithm itself runs for a fixed amount of jiffies. Within each
> > > > iteration it attempts to time a single loop, and then keeps only the best
> > > > (fastest) loop it saw. This algorithm was found to have lower variance from
> > > > run to run than my first attempt, which counted the total number of
> > > > iterations that could be done in that fixed amount of jiffies. By taking
> > > > only the best iteration in the loop, assuming at least one loop wasn't
> > > > perturbed by an interrupt, we eliminate the effects of interrupts and
> > > > other "warm up" factors like branch prediction. The only downside is it
> > > > depends on having an rdtime granular and accurate enough to measure a
> > > > single copy. If we ever manage to complete a loop in 0 rdtime ticks, we
> > > > leave the unaligned setting at UNKNOWN.
> > > >
> > > > There is a slight change in user-visible behavior here. Previously, all
> > > > boards except the THead C906 reported misaligned access speed of
> > > > UNKNOWN. C906 reported FAST. With this change, since we're now measuring
> > > > misaligned access speed on each hart, all RISC-V systems will have this
> > > > key set as either FAST or SLOW.
> > > >
> > > > Currently, we don't have a way to confidently measure the difference between
> > > > SLOW and EMULATED, so we label anything not fast as SLOW. This will
> > > > mislabel some systems that are actually EMULATED as SLOW. When we get
> > > > support for delegating misaligned access traps to the kernel (as opposed
> > > > to the firmware quietly handling it), we can explicitly test in Linux to
> > > > see if unaligned accesses trap. Those systems will start to report
> > > > EMULATED, though older (today's) systems without that new SBI mechanism
> > > > will continue to report SLOW.
> > > >
> > > > I've updated the documentation for those hwprobe values to reflect
> > > > this, specifically: SLOW may or may not be emulated by software, and FAST
> > > > means being faster than equivalent byte accesses. The change
> > > > in documentation is accurate with respect to both the former and current
> > > > behavior.
> > > >
> > > > Signed-off-by: Evan Green <[email protected]>
> > > > Acked-by: Conor Dooley <[email protected]>
> > >
> > > Thanks for your patch, which is now commit 584ea6564bcaead2 ("RISC-V:
> > > Probe for unaligned access speed") in v6.6-rc1.
> > >
> > > On the boards I have, I get:
> > >
> > > rzfive:
> > > cpu0: Ratio of byte access time to unaligned word access is
> > > 1.05, unaligned accesses are fast
> >
> > Hrm, I'm a little surprised to be seeing this number come out so close
> > to 1. If you reboot a few times, what kind of variance do you get on
> > this?
>
> Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)

After upgrading the firmware from [1] to [2], this changed to
"0.00, unaligned accesses are slow".

[1] RZ-Five-ETH
U-Boot 2020.10-g611c657e43 (Aug 26 2022 - 11:29:06 +0100)

[2] OpenSBI v1.3-75-g3cf0ea4
U-Boot 2023.01-00209-g1804c8ab17 (Oct 04 2023 - 13:18:01 +0100)

Gr{oetje,eeting}s,

Geert

2023-10-19 07:51:38

by Lad, Prabhakar

Subject: Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

Hi Geert,

On Thu, Oct 19, 2023 at 7:40 AM Geert Uytterhoeven <[email protected]> wrote:
>
> Hi Prabhakar,
>
> On Thu, Sep 14, 2023 at 9:32 AM Geert Uytterhoeven <[email protected]> wrote:
> > On Wed, Sep 13, 2023 at 7:46 PM Evan Green <[email protected]> wrote:
> > > On Wed, Sep 13, 2023 at 5:36 AM Geert Uytterhoeven <[email protected]> wrote:
> > > > On Fri, Aug 18, 2023 at 9:44 PM Evan Green <[email protected]> wrote:
> > > > > Rather than deferring unaligned access speed determinations to a vendor
> > > > > function, let's probe them and find out how fast they are. If we
> > > > > determine that an unaligned word access is faster than N byte accesses,
> > > > > mark the hardware's unaligned access as "fast". Otherwise, we mark
> > > > > accesses as slow.
> > > > >
> > > > > The algorithm itself runs for a fixed amount of jiffies. Within each
> > > > > iteration it attempts to time a single loop, and then keeps only the best
> > > > > (fastest) loop it saw. This algorithm was found to have lower variance from
> > > > > run to run than my first attempt, which counted the total number of
> > > > > iterations that could be done in that fixed amount of jiffies. By taking
> > > > > only the best iteration in the loop, assuming at least one loop wasn't
> > > > > perturbed by an interrupt, we eliminate the effects of interrupts and
> > > > > other "warm up" factors like branch prediction. The only downside is it
> > > > > depends on having an rdtime granular and accurate enough to measure a
> > > > > single copy. If we ever manage to complete a loop in 0 rdtime ticks, we
> > > > > leave the unaligned setting at UNKNOWN.
> > > > >
> > > > > There is a slight change in user-visible behavior here. Previously, all
> > > > > boards except the THead C906 reported misaligned access speed of
> > > > > UNKNOWN. C906 reported FAST. With this change, since we're now measuring
> > > > > misaligned access speed on each hart, all RISC-V systems will have this
> > > > > key set as either FAST or SLOW.
> > > > >
> > > > > Currently, we don't have a way to confidently measure the difference between
> > > > > SLOW and EMULATED, so we label anything not fast as SLOW. This will
> > > > > mislabel some systems that are actually EMULATED as SLOW. When we get
> > > > > support for delegating misaligned access traps to the kernel (as opposed
> > > > > to the firmware quietly handling it), we can explicitly test in Linux to
> > > > > see if unaligned accesses trap. Those systems will start to report
> > > > > EMULATED, though older (today's) systems without that new SBI mechanism
> > > > > will continue to report SLOW.
> > > > >
> > > > > I've updated the documentation for those hwprobe values to reflect
> > > > > this, specifically: SLOW may or may not be emulated by software, and FAST
> > > > > means being faster than equivalent byte accesses. The change
> > > > > in documentation is accurate with respect to both the former and current
> > > > > behavior.
> > > > >
> > > > > Signed-off-by: Evan Green <[email protected]>
> > > > > Acked-by: Conor Dooley <[email protected]>
> > > >
> > > > Thanks for your patch, which is now commit 584ea6564bcaead2 ("RISC-V:
> > > > Probe for unaligned access speed") in v6.6-rc1.
> > > >
> > > > On the boards I have, I get:
> > > >
> > > > rzfive:
> > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > 1.05, unaligned accesses are fast
> > >
> > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > to 1. If you reboot a few times, what kind of variance do you get on
> > > this?
> >
> > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
>
> After upgrading the firmware from [1] to [2], this changed to
> "0.00, unaligned accesses are slow".
>
> [1] RZ-Five-ETH
> U-Boot 2020.10-g611c657e43 (Aug 26 2022 - 11:29:06 +0100)
>
> [2] OpenSBI v1.3-75-g3cf0ea4
> U-Boot 2023.01-00209-g1804c8ab17 (Oct 04 2023 - 13:18:01 +0100)
>
Thanks, let me go through the changes.

Cheers,
Prabhakar