From: Evan Green
> Sent: 14 September 2023 16:01
>
> On Thu, Sep 14, 2023 at 1:47 AM David Laight <[email protected]> wrote:
> >
> > From: Geert Uytterhoeven
> > > Sent: 14 September 2023 08:33
> > ...
> > > > > rzfive:
> > > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > > 1.05, unaligned accesses are fast
> > > >
> > > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > > to 1. If you reboot a few times, what kind of variance do you get on
> > > > this?
> > >
> > > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
> >
> > Would that match zero overhead unless the access crosses a
> > cache line boundary?
> > (I can't remember whether the test is using increasing addresses.)
>
> Yes, the test does use increasing addresses, it copies across 4 pages.
> We start with a warmup, so caching effects beyond L1 are largely not
> taken into account.
That seems entirely excessive.
If you want to avoid data cache issues (which you probably do)
then just repeating a single access would almost certainly
suffice.
Repeatedly using a short buffer (say 256 bytes) won't add
much loop overhead.
Although you may want to do a test that avoids transfers
that cross cache line and especially page boundaries.
Either of those could easily be much slower than a read
that is entirely within a cache line.
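A minimal sketch of the loop being suggested (in C, with illustrative names and an assumed 64-byte cache line; this is not the kernel's actual test code):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Illustrative sketch of the suggestion above: hammer one short,
 * L1-resident buffer instead of walking across four pages, and keep
 * each misaligned word access inside a single (assumed 64-byte)
 * cache line.  Names and sizes are made up for illustration. */

#define BUF_LEN   256  /* small enough to stay resident in L1 */
#define LINE_SIZE 64   /* assumed cache-line size */

/* Sum the buffer with misaligned word loads that never cross a line:
 * step one cache line at a time and load at byte offset 1. */
static uint64_t sum_misaligned_within_lines(const uint8_t *buf)
{
    uint64_t sum = 0;
    for (size_t off = 0; off + LINE_SIZE <= BUF_LEN; off += LINE_SIZE) {
        uint64_t w;
        /* memcpy lets the compiler emit a plain (possibly misaligned)
         * word load without undefined behaviour in C. */
        memcpy(&w, buf + off + 1, sizeof(w));
        sum += w;
    }
    return sum;
}
```

In a real measurement this loop would be repeated many times under a cycle counter; only the access pattern is shown here.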
...
> > > > > vexriscv/orangecrab:
> > > > >
> > > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > > 0.00, unaligned accesses are slow
> > >
> > > cpu0: Ratio of byte access time to unaligned word access is 0.00417,
> > > unaligned accesses are slow
> > >
> > > > > I am a bit surprised by the near-zero values. Are these expected?
> > > >
> > > > This could be expected, if firmware is trapping the unaligned accesses
> > > > and coming out >100x slower than a native access. If you're interested
> > > > in getting a little more resolution, you could try to print a few more
> > > > decimal places with something like (sorry gmail mangles the whitespace
> > > > on this):
> >
> > I'd expect one of three possible values:
> > - 1.0x: Basically zero cost except for cache line/page boundaries.
> > - ~2: Hardware does two reads and merges the values.
> > - >100: Trap fixed up in software.
> >
> > I'd think the '2' case could be considered fast.
> > You only need to time one access to see if it was a fault.
>
> We're comparing misaligned word accesses with byte accesses of the
> same total size. So 1.0 means a misaligned load is basically no
> different from 8 byte loads. The goal was to help people that are
> forced to do odd loads and stores decide whether they are better off
> moving by bytes or by misaligned words. (In contrast, the answer to
> "should I do a misaligned word load or an aligned word load" is
> generally always "do the aligned one if you can", so comparing those
> two things didn't seem as useful).
Ah, I'd have compared the cost of aligned accesses with misaligned ones.
That would tell you whether you really need to avoid them.
The cost of byte and aligned word accesses should be much the same
(for each access that is) - if not you've got a real bottleneck.
If a misaligned access is 8 times slower than an aligned one
it is still 'quite slow'.
I'd definitely call that 8 not 1 - even if you treat it as 'fast'.
For comparison you (well I) can write x86-64 asm for the ip-checksum
loop that will execute 1 memory read every clock (8 bytes/clock).
It is very slightly slower for misaligned buffers, but by less
than 1 clock per cache line.
That's what I'd call 1.0 :-)
I'd expect even simple hardware to do misaligned reads as two
reads and then merge the data - so should really be no slower
than two separate aligned reads.
Since you'd expect a cpu to do an L1 data cache read every clock
(probably pipelined) the misaligned read should just add 1 clock.
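In software terms, that two-reads-and-merge looks roughly like this (a little-endian C sketch of the hardware idea, not any particular core's implementation):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Sketch of "two aligned reads, then merge" for a misaligned 64-bit
 * load, as simple hardware might implement it.  Little-endian is
 * assumed; purely illustrative.  byte_offset is 0..7 within the
 * aligned word at aligned_base. */
static uint64_t load64_two_aligned_reads(const uint64_t *aligned_base,
                                         unsigned byte_offset)
{
    unsigned shift = byte_offset * 8;
    uint64_t lo = aligned_base[0];               /* first aligned read */
    if (shift == 0)
        return lo;                               /* actually aligned */
    uint64_t hi = aligned_base[1];               /* second aligned read */
    return (lo >> shift) | (hi << (64 - shift)); /* merge */
}
```

If the L1 cache can service one read per clock, the second aligned read is the only extra work, hence the expectation of roughly one added clock per misaligned access.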
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, Sep 14, 2023 at 8:55 AM David Laight <[email protected]> wrote:
>
> From: Evan Green
> > Sent: 14 September 2023 16:01
> >
> > On Thu, Sep 14, 2023 at 1:47 AM David Laight <[email protected]> wrote:
> > >
> > > From: Geert Uytterhoeven
> > > > Sent: 14 September 2023 08:33
> > > ...
> > > > > > rzfive:
> > > > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > > > 1.05, unaligned accesses are fast
> > > > >
> > > > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > > > to 1. If you reboot a few times, what kind of variance do you get on
> > > > > this?
> > > >
> > > > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
> > >
> > > Would that match zero overhead unless the access crosses a
> > > cache line boundary?
> > > (I can't remember whether the test is using increasing addresses.)
> >
> > Yes, the test does use increasing addresses, it copies across 4 pages.
> > We start with a warmup, so caching effects beyond L1 are largely not
> > taken into account.
>
> That seems entirely excessive.
> If you want to avoid data cache issues (which you probably do)
> then just repeating a single access would almost certainly
> suffice.
> Repeatedly using a short buffer (say 256 bytes) won't add
> much loop overhead.
> Although you may want to do a test that avoids transfers
> that cross cache line and especially page boundaries.
> Either of those could easily be much slower than a read
> that is entirely within a cache line.
We won't be faulting on any of these pages, and they should remain in
the TLB, so I don't expect many page boundary specific effects. If
there is a steep penalty for misaligned loads across a cache line,
such that it's worse than doing byte accesses, I want the test results
to be dinged for that.
>
> ...
> > > > > > vexriscv/orangecrab:
> > > > > >
> > > > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > > > 0.00, unaligned accesses are slow
> > > >
> > > > cpu0: Ratio of byte access time to unaligned word access is 0.00417,
> > > > unaligned accesses are slow
> > > >
> > > > > > I am a bit surprised by the near-zero values. Are these expected?
> > > > >
> > > > > This could be expected, if firmware is trapping the unaligned accesses
> > > > > and coming out >100x slower than a native access. If you're interested
> > > > > in getting a little more resolution, you could try to print a few more
> > > > > decimal places with something like (sorry gmail mangles the whitespace
> > > > > on this):
> > >
> > > I'd expect one of three possible values:
> > > - 1.0x: Basically zero cost except for cache line/page boundaries.
> > > - ~2: Hardware does two reads and merges the values.
> > > - >100: Trap fixed up in software.
> > >
> > > I'd think the '2' case could be considered fast.
> > > You only need to time one access to see if it was a fault.
> >
> > We're comparing misaligned word accesses with byte accesses of the
> > same total size. So 1.0 means a misaligned load is basically no
> > different from 8 byte loads. The goal was to help people that are
> > forced to do odd loads and stores decide whether they are better off
> > moving by bytes or by misaligned words. (In contrast, the answer to
> > "should I do a misaligned word load or an aligned word load" is
> > generally always "do the aligned one if you can", so comparing those
> > two things didn't seem as useful).
>
> Ah, I'd have compared the cost of aligned accesses with misaligned ones.
> That would tell you whether you really need to avoid them.
> The cost of byte and aligned word accesses should be much the same
> (for each access that is) - if not you've got a real bottleneck.
>
> If a misaligned access is 8 times slower than an aligned one
> it is still 'quite slow'.
> I'd definitely call that 8 not 1 - even if you treat it as 'fast'.
The number itself isn't exported or saved anywhere; it's just printed
as a diagnostic explanation alongside the final fast/slow designation.
Misaligned word loads are never going to be faster than aligned ones,
and aren't really going to be equal either. It's also generally not
something that causes software a lot of angst: we align most of our
buffers and structures with help from the compiler, and generally do
an aligned access whenever possible. It's the times when we're forced
to do odd sizes or accesses we know are already misaligned that this
hwprobe bit was designed to help. In those cases, users are forced to
decide if they should do a misaligned word access or byte accesses, so
we aim to provide that result.
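Concretely, the two patterns being compared look something like this (a hedged C sketch with made-up names; the real test lives in the kernel's misaligned-access probe and differs in detail):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Sketch of the two access patterns the probe times against each
 * other: copying a region one byte at a time, versus copying the same
 * bytes with word-sized accesses at a deliberately misaligned
 * address.  Illustrative only, not the kernel's code. */

static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

static void copy_misaligned_words(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    /* The caller passes pointers that sit 1 byte off word alignment,
     * so each word access here is misaligned.  memcpy of one word
     * lets the compiler emit a plain word load/store without UB. */
    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, src + i, sizeof(w));
        memcpy(dst + i, &w, sizeof(w));
    }
    for (; i < n; i++)   /* byte tail */
        dst[i] = src[i];
}
```

The printed ratio is (time for the byte copy) / (time for the misaligned word copy) over the same region, so 1.0 means one misaligned word access costs about the same as the eight byte accesses it replaces.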
If there's a use case for knowing "misaligned accesses are exactly as
fast as aligned ones", we could detect this threshold in the same
test, and add another hwprobe bit for it.
-Evan
>
> For comparison you (well I) can write x86-64 asm for the ip-checksum
> loop that will execute 1 memory read every clock (8 bytes/clock).
> It is very slightly slower for misaligned buffers, but by less
> than 1 clock per cache line.
> That's what I'd call 1.0 :-)
>
> I'd expect even simple hardware to do misaligned reads as two
> reads and then merge the data - so should really be no slower
> than two separate aligned reads.
> Since you'd expect a cpu to do an L1 data cache read every clock
> (probably pipelined) the misaligned read should just add 1 clock.
>
> David
>
From: Evan Green
> Sent: 14 September 2023 17:37
>
> On Thu, Sep 14, 2023 at 8:55 AM David Laight <[email protected]> wrote:
> >
> > From: Evan Green
> > > Sent: 14 September 2023 16:01
> > >
> > > On Thu, Sep 14, 2023 at 1:47 AM David Laight <[email protected]> wrote:
> > > >
> > > > From: Geert Uytterhoeven
> > > > > Sent: 14 September 2023 08:33
> > > > ...
> > > > > > > rzfive:
> > > > > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > > > > 1.05, unaligned accesses are fast
> > > > > >
> > > > > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > > > > to 1. If you reboot a few times, what kind of variance do you get on
> > > > > > this?
> > > > >
> > > > > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
> > > >
> > > > Would that match zero overhead unless the access crosses a
> > > > cache line boundary?
> > > > (I can't remember whether the test is using increasing addresses.)
> > >
> > > Yes, the test does use increasing addresses, it copies across 4 pages.
> > > We start with a warmup, so caching effects beyond L1 are largely not
> > > taken into account.
> >
> > That seems entirely excessive.
> > If you want to avoid data cache issues (which you probably do)
> > then just repeating a single access would almost certainly
> > suffice.
> > Repeatedly using a short buffer (say 256 bytes) won't add
> > much loop overhead.
> > Although you may want to do a test that avoids transfers
> > that cross cache line and especially page boundaries.
> > Either of those could easily be much slower than a read
> > that is entirely within a cache line.
>
> We won't be faulting on any of these pages, and they should remain in
> the TLB, so I don't expect many page boundary specific effects. If
> there is a steep penalty for misaligned loads across a cache line,
> such that it's worse than doing byte accesses, I want the test results
> to be dinged for that.
That is an entirely different issue.
Are you absolutely certain that the reason eight byte loads take
as long as one misaligned 64-bit load isn't because the entire
test is limited by L1 cache fills?
David
On Fri, Sep 15, 2023 at 12:57 AM David Laight <[email protected]> wrote:
>
> From: Evan Green
> > Sent: 14 September 2023 17:37
> >
> > On Thu, Sep 14, 2023 at 8:55 AM David Laight <[email protected]> wrote:
> > >
> > > From: Evan Green
> > > > Sent: 14 September 2023 16:01
> > > >
> > > > On Thu, Sep 14, 2023 at 1:47 AM David Laight <[email protected]> wrote:
> > > > >
> > > > > From: Geert Uytterhoeven
> > > > > > Sent: 14 September 2023 08:33
> > > > > ...
> > > > > > > > rzfive:
> > > > > > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > > > > > 1.05, unaligned accesses are fast
> > > > > > >
> > > > > > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > > > > > to 1. If you reboot a few times, what kind of variance do you get on
> > > > > > > this?
> > > > > >
> > > > > > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
> > > > >
> > > > > Would that match zero overhead unless the access crosses a
> > > > > cache line boundary?
> > > > > (I can't remember whether the test is using increasing addresses.)
> > > >
> > > > Yes, the test does use increasing addresses, it copies across 4 pages.
> > > > We start with a warmup, so caching effects beyond L1 are largely not
> > > > taken into account.
> > >
> > > That seems entirely excessive.
> > > If you want to avoid data cache issues (which you probably do)
> > > then just repeating a single access would almost certainly
> > > suffice.
> > > Repeatedly using a short buffer (say 256 bytes) won't add
> > > much loop overhead.
> > > Although you may want to do a test that avoids transfers
> > > that cross cache line and especially page boundaries.
> > > Either of those could easily be much slower than a read
> > > that is entirely within a cache line.
> >
> > We won't be faulting on any of these pages, and they should remain in
> > the TLB, so I don't expect many page boundary specific effects. If
> > there is a steep penalty for misaligned loads across a cache line,
> > such that it's worse than doing byte accesses, I want the test results
> > to be dinged for that.
>
> That is an entirely different issue.
>
> Are you absolutely certain that the reason eight byte loads take
> as long as one misaligned 64-bit load isn't because the entire
> test is limited by L1 cache fills?
Fair question. I hacked up a little code [1] to retry the test at
several different sizes, as well as printing out the best and worst
times. I only have one piece of real hardware, the THead C906, which
has a 32KB L1 D-cache.
Here are the results at various sizes, starting with the original:
[ 0.047556] cpu0: Ratio of byte access time to unaligned word access is 4.35, unaligned accesses are fast
[ 0.047578] EVAN size 0x1f80 word cycles best 69 worst 29e, byte cycles best 1c9 worst 3b7
[ 0.071549] cpu0: Ratio of byte access time to unaligned word access is 4.29, unaligned accesses are fast
[ 0.071566] EVAN size 0x1000 word cycles best 36 worst 210, byte cycles best e8 worst 2b2
[ 0.095540] cpu0: Ratio of byte access time to unaligned word access is 4.14, unaligned accesses are fast
[ 0.095556] EVAN size 0x200 word cycles best 7 worst 1d9, byte cycles best 1d worst 1d5
[ 0.119539] cpu0: Ratio of byte access time to unaligned word access is 5.00, unaligned accesses are fast
[ 0.119555] EVAN size 0x100 word cycles best 3 worst 1a8, byte cycles best f worst 1b5
[ 0.143538] cpu0: Ratio of byte access time to unaligned word access is 3.50, unaligned accesses are fast
[ 0.143556] EVAN size 0x80 word cycles best 2 worst 1a5, byte cycles best 7 worst 1aa
[1] https://pastebin.com/uwwU2CVn
I don't see any cliffs as the numbers get smaller, so it seems to me
there are no working set issues. Geert, it might be interesting to see
these same results on the rzfive. The thing that made me uncomfortable
with the smaller buffer sizes is that they start to bump up against the
resolution of the timer. Another option would have been to time
several iterations, but I went with the larger buffer instead as I'd
hoped it would minimize other overhead like the function calls, branch
prediction, C loop management, etc.
-Evan