2021-11-16 15:10:57

by Guanghui Feng

Subject: [PATCH] arm64: clear_page: use stnp non-temporal instruction for performance optimizing

When clearing a page, there is no need to allocate cache lines for the
zeroed data. copy_page.S already uses the stnp instruction as an
optimization, so rewrite clear_page.S to use stnp as well. In my testing,
the stnp version is about twice as fast.

Signed-off-by: Guanghui Feng <[email protected]>
---
arch/arm64/lib/clear_page.S | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S
index b84b179..e9dc2d6 100644
--- a/arch/arm64/lib/clear_page.S
+++ b/arch/arm64/lib/clear_page.S
@@ -15,13 +15,18 @@
* x0 - dest
*/
SYM_FUNC_START_PI(clear_page)
- mrs x1, dczid_el0
- and w1, w1, #0xf
- mov x2, #4
- lsl x1, x2, x1
-
-1: dc zva, x0
- add x0, x0, x1
+ mov x1, #0
+ mov x2, #0
+1:
+ stnp x1, x2, [x0]
+ stnp x1, x2, [x0, #16]
+ stnp x1, x2, [x0, #32]
+ stnp x1, x2, [x0, #48]
+ stnp x1, x2, [x0, #64]
+ stnp x1, x2, [x0, #80]
+ stnp x1, x2, [x0, #96]
+ stnp x1, x2, [x0, #112]
+ add x0, x0, #128
tst x0, #(PAGE_SIZE - 1)
b.ne 1b
ret
--
1.8.3.1
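
[Editor's sketch] To make the control flow of the patched loop easier to follow, here is a minimal C model of it. It is only an illustration under stated assumptions: the name clear_page_model and the 4 KiB MODEL_PAGE_SIZE are hypothetical, and plain C stores carry no non-temporal hint, so this models the addressing and the loop termination but says nothing about performance.

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. The kernel uses PAGE_SIZE. */
#define MODEL_PAGE_SIZE 4096u

/*
 * Hypothetical C model of the loop in the patch. The destination must be
 * page-aligned, as it is for clear_page(). Ordinary stores are used here;
 * the non-temporal hint of stnp has no portable C equivalent.
 */
static void clear_page_model(void *page)
{
	uint64_t *p = page;

	do {
		/* Eight 16-byte stnp pairs in the patch = sixteen 8-byte stores. */
		for (int i = 0; i < 16; i++)
			p[i] = 0;
		p += 16;	/* add x0, x0, #128 */
		/* tst x0, #(PAGE_SIZE - 1); b.ne 1b */
	} while ((uintptr_t)p & (MODEL_PAGE_SIZE - 1));
}
```

The loop has no explicit counter: it stops when the advancing pointer becomes page-aligned again, which is why the destination has to start on a page boundary.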



2021-11-16 18:17:19

by Catalin Marinas

Subject: Re: [PATCH] arm64: clear_page: use stnp non-temporal instruction for performance optimizing

On Tue, Nov 16, 2021 at 11:08:14PM +0800, Guanghui Feng wrote:
> When clearing a page, there is no need to allocate cache lines for the
> zeroed data.

In theory, DC ZVA is supposed to trigger write streaming mode, so that all
writes go directly to memory and avoid cache allocation.

> copy_page.S already uses the stnp instruction as an optimization, so
> rewrite clear_page.S to use stnp as well. In my testing, the stnp
> version is about twice as fast.

On which CPU implementation? Is the same improvement seen on a wider
range of CPUs?

--
Catalin

2021-11-16 23:12:28

by Robin Murphy

Subject: Re: [PATCH] arm64: clear_page: use stnp non-temporal instruction for performance optimizing

On 2021-11-16 15:08, Guanghui Feng wrote:
> When clearing a page, there is no need to allocate cache lines for the
> zeroed data. copy_page.S already uses the stnp instruction as an
> optimization, so rewrite clear_page.S to use stnp as well. In my testing,
> the stnp version is about twice as fast.
>
> Signed-off-by: Guanghui Feng <[email protected]>
> ---
> arch/arm64/lib/clear_page.S | 19 ++++++++++++-------
> 1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S
> index b84b179..e9dc2d6 100644
> --- a/arch/arm64/lib/clear_page.S
> +++ b/arch/arm64/lib/clear_page.S
> @@ -15,13 +15,18 @@
> * x0 - dest
> */
> SYM_FUNC_START_PI(clear_page)
> - mrs x1, dczid_el0
> - and w1, w1, #0xf
> - mov x2, #4
> - lsl x1, x2, x1
> -
> -1: dc zva, x0
> - add x0, x0, x1
> + mov x1, #0
> + mov x2, #0

Regardless of the bigger question around the architectural intent that
DC ZVA is supposed to be the best way to clear memory (sanity check:
this wasn't under virtualisation with HCR_EL2.TDZ set, was it?) - out of
curiosity, why do this and not just "stnp xzr, xzr, ..."?

Note also that this is liable to conflict with the patch for respecting
DCZID_EL0.DZP. On which note, is DC {GVA,GZVA} performance also a
concern, or does your platform not have MTE? If the performance anomaly
does turn out to be platform-specific, maybe it might be better to quirk
those platforms to set DZP, rather than changing the code for everyone?

Robin.

> +1:
> + stnp x1, x2, [x0]
> + stnp x1, x2, [x0, #16]
> + stnp x1, x2, [x0, #32]
> + stnp x1, x2, [x0, #48]
> + stnp x1, x2, [x0, #64]
> + stnp x1, x2, [x0, #80]
> + stnp x1, x2, [x0, #96]
> + stnp x1, x2, [x0, #112]
> + add x0, x0, #128
> tst x0, #(PAGE_SIZE - 1)
> b.ne 1b
> ret
>
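
[Editor's sketch] The lines removed by the patch derive the DC ZVA block size from DCZID_EL0, and the reply above refers to the DZP bit of the same register. As a hedged illustration, the decoding can be written in C as below. The macro and function names are hypothetical and are not the kernel's own definitions. The field layout (BS in bits [3:0] as log2 of the block size in 4-byte words, DZP in bit 4) follows the Arm architecture definition of the register, and the functions take the raw register value as an argument so the sketch runs on any host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; field positions per DCZID_EL0. */
#define DCZID_BS_MASK	0xfu		/* bits [3:0]: log2(block size in words) */
#define DCZID_DZP_BIT	(1u << 4)	/* bit 4: DC ZVA prohibited */

/* Block size in bytes: same arithmetic as "mov x2, #4; lsl x1, x2, x1". */
static inline unsigned int dczid_block_bytes(uint64_t dczid)
{
	return 4u << (dczid & DCZID_BS_MASK);
}

/* True when the register reports that DC ZVA must not be used. */
static inline bool dczid_zva_prohibited(uint64_t dczid)
{
	return (dczid & DCZID_DZP_BIT) != 0;
}
```

For example, a BS value of 4 corresponds to a 64-byte block, so the original loop advanced x0 by 64 bytes per DC ZVA on such a CPU.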