2015-12-20 00:11:25

by Andrew Pinski

Subject: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.

Adding a check for the cache line size is not much overhead.
Special case 128 byte cache line size.
This improves copy_page by 85% on ThunderX compared to the
original implementation.

For LMBench, it improves performance by 4-10%.

Signed-off-by: Andrew Pinski <[email protected]>
---
arch/arm64/lib/copy_page.S | 39 +++++++++++++++++++++++++++++++++++++++
1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 512b9a7..4c28789 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -18,6 +18,7 @@
#include <linux/const.h>
#include <asm/assembler.h>
#include <asm/page.h>
+#include <asm/cachetype.h>

/*
* Copy a page from src to dest (both are page aligned)
@@ -27,8 +28,17 @@
* x1 - src
*/
ENTRY(copy_page)
+ /* Special-case cache lines of 128 bytes or more. */
+ mrs x2, ctr_el0
+ lsr x2, x2, CTR_CWG_SHIFT
+ and w2, w2, CTR_CWG_MASK
+ cmp w2, 5
+ b.ge 2f
+
/* Assume cache line size is 64 bytes. */
prfm pldl1strm, [x1, #64]
+ /* Align the loop so it fits in one cache line. */
+ .balign 64
1: ldp x2, x3, [x1]
ldp x4, x5, [x1, #16]
ldp x6, x7, [x1, #32]
@@ -43,4 +53,33 @@ ENTRY(copy_page)
tst x1, #(PAGE_SIZE - 1)
b.ne 1b
ret
+
+2:
+ /* The cache line size is at least 128 bytes. */
+ prfm pldl1strm, [x1, #128]
+ /* Align the loop so it fits in one cache line */
+ .balign 128
+1: prfm pldl1strm, [x1, #256]
+ ldp x2, x3, [x1]
+ ldp x4, x5, [x1, #16]
+ ldp x6, x7, [x1, #32]
+ ldp x8, x9, [x1, #48]
+ stnp x2, x3, [x0]
+ stnp x4, x5, [x0, #16]
+ stnp x6, x7, [x0, #32]
+ stnp x8, x9, [x0, #48]
+
+ ldp x2, x3, [x1, #64]
+ ldp x4, x5, [x1, #80]
+ ldp x6, x7, [x1, #96]
+ ldp x8, x9, [x1, #112]
+ add x1, x1, #128
+ stnp x2, x3, [x0, #64]
+ stnp x4, x5, [x0, #80]
+ stnp x6, x7, [x0, #96]
+ stnp x8, x9, [x0, #112]
+ add x0, x0, #128
+ tst x1, #(PAGE_SIZE - 1)
+ b.ne 1b
+ ret
ENDPROC(copy_page)
--
1.7.2.5
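
For reference, the cache-line check at the top of the patched copy_page reads
CTR_EL0 and extracts the CWG (Cache Writeback Granule) field, which encodes the
granule as log2 of the number of 4-byte words: CWG = 4 means 64 bytes and
CWG = 5 means 128 bytes, which is what the 'cmp w2, 5' tests for. A minimal
userspace sketch of the same decoding (a sketch only, assuming CTR_EL0 access
is enabled at EL0, as Linux normally arranges via SCTLR_EL1.UCT):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t ctr;

	/* Read the cache type register; Linux permits this from EL0. */
	asm volatile("mrs %0, ctr_el0" : "=r" (ctr));

	/* CWG is bits [27:24]: log2(4-byte words) of the writeback granule.
	 * A value of 0 means the granule is not reported. */
	unsigned int cwg = (ctr >> 24) & 0xf;
	/* DminLine is bits [19:16]: log2(words) of the smallest D-cache line. */
	unsigned int dminline = (ctr >> 16) & 0xf;

	printf("CWG:      %u bytes\n", 4U << cwg);
	printf("DminLine: %u bytes\n", 4U << dminline);
	return 0;
}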


2015-12-21 12:46:40

by Will Deacon

Subject: Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.

On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> Adding a check for the cache line size is not much overhead.
> Special case 128 byte cache line size.
> This improves copy_page by 85% on ThunderX compared to the
> original implementation.

So this patch seems to:

- Align the loop
- Increase the prefetch size
- Unroll the loop once

Do you know where your 85% boost comes from between these? I'd really
like to avoid having multiple versions of copy_page, if possible, but
maybe we could end up with something that works well enough regardless
of cacheline size. Understanding what your bottleneck is would help to
lead us in the right direction.

Also, how are you measuring the improvement? If you can share your
test somewhere, I can see how it affects the other systems I have access
to.

Will

2015-12-21 13:43:49

by Arnd Bergmann

Subject: Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.

On Monday 21 December 2015, Will Deacon wrote:
> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> > Adding a check for the cache line size is not much overhead.
> > Special case 128 byte cache line size.
> > This improves copy_page by 85% on ThunderX compared to the
> > original implementation.
>
> So this patch seems to:
>
> - Align the loop
> - Increase the prefetch size
> - Unroll the loop once
>
> Do you know where your 85% boost comes from between these? I'd really
> like to avoid having multiple versions of copy_page, if possible, but
> maybe we could end up with something that works well enough regardless
> of cacheline size. Understanding what your bottleneck is would help to
> lead us in the right direction.
>
> Also, how are you measuring the improvement? If you can share your
> test somewhere, I can see how it affects the other systems I have access
> to.

A related question would be how other CPU cores are affected by the change.
The test for the cache line size is going to take a few cycles, possibly
a lot on certain implementations, e.g. if we ever get one where 'mrs' is
microcoded or trapped by a hypervisor.

Are there any possible downsides to using the ThunderX version on other
microarchitectures too and skip the check?

Arnd

2015-12-22 23:32:21

by Andrew Pinski

Subject: Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.

On Tue, Dec 21, 2015 at 5:43 AM, Arnd Bergmann <[email protected]> wrote:
>
> On Monday 21 December 2015, Will Deacon wrote:
>> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
>> > Adding a check for the cache line size is not much overhead.
>> > Special case 128 byte cache line size.
>> > This improves copy_page by 85% on ThunderX compared to the original
>> > implementation.
>>
>> So this patch seems to:
>>
>> - Align the loop
>> - Increase the prefetch size
>> - Unroll the loop once
>>
>> Do you know where your 85% boost comes from between these? I'd really
>> like to avoid having multiple versions of copy_page, if possible, but
>> maybe we could end up with something that works well enough regardless
>> of cacheline size. Understanding what your bottleneck is would help to
>> lead us in the right direction.

I think it is the prefetching. ThunderX T88 pass 1 and pass 2 do not
have a hardware prefetcher, so prefetching half a cache line ahead
does not help at all.
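
In other words: with a 128-byte line, a prefetch only 64 bytes ahead lands in
the line the loads are already fetching, so on a core with no hardware
prefetcher it buys nothing. A rough C illustration of the distance argument
(just a sketch mirroring the 128-byte path in the patch, not the actual kernel
routine; LINE and the two-lines-ahead distance are the assumed ThunderX values):

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE	4096
#define LINE		128	/* assumed ThunderX cache line size */

void copy_page_sketch(void *dst, const void *src)
{
	uint64_t *d = dst;
	const uint64_t *s = src;
	size_t lines = PAGE_SIZE / LINE;

	while (lines--) {
		/* Reach two full lines ahead, like the prfm at [x1, #256];
		 * reaching only 64 bytes ahead would stay in the current line. */
		__builtin_prefetch((const char *)s + 2 * LINE, 0, 0);
		for (size_t i = 0; i < LINE / sizeof(*d); i++)
			d[i] = s[i];
		d += LINE / sizeof(*d);
		s += LINE / sizeof(*s);
	}
}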

>>
>> Also, how are you measuring the improvement? If you can share your
>> test somewhere, I can see how it affects the other systems I have
>> access to.

You can find my benchmark at
https://github.com/apinski-cavium/copy_page_benchmark .
copy_page is my previous patch.
copy_page128 is just the unrolled loop with only 128-byte prefetching.
copy_page64 is the original code.
copy_page64unroll is the new patch which I will be sending out soon.
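
A harness along these lines is enough to time the variants against each other
(only a sketch, not the code in the linked repository; the extern names follow
the variant names above and are assumed):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE	4096
#define ITERS		(1 << 18)

/* Provided by the assembly variants being compared (names assumed). */
extern void copy_page64(void *dst, const void *src);
extern void copy_page128(void *dst, const void *src);

static double time_variant(void (*copy)(void *, const void *),
			   void *dst, const void *src)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < ITERS; i++)
		copy(dst, src);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
	void *src = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
	void *dst = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

	if (!src || !dst)
		return 1;

	printf("copy_page64:  %f s\n", time_variant(copy_page64, dst, src));
	printf("copy_page128: %f s\n", time_variant(copy_page128, dst, src));
	return 0;
}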

>
> A related question would be how other CPU cores are affected by the change.
> The test for the cache line size is going to take a few cycles, possibly
> a lot on certain implementations, e.g. if we ever get one where 'mrs' is
> microcoded or trapped by a hypervisor.
>
> Are there any possible downsides to using the ThunderX version on other
> microarchitectures too and skip the check?

Yes, that is a good idea. I will send out a new patch in a little bit
which just unrolls the loop while keeping the two prefetch instructions.

Thanks,
Andrew Pinski

>
> Arnd