Currently, mov_q is used to move a constant into a 64-bit register. When
the lower 16 or 32 bits of the constant are all zero, mov_q emits one or
two useless movk instructions. If the mov_q macro is used in a hot code
path, we want to avoid these movk instructions as much as possible. For
example, when CONFIG_ARM64_MTE is 'Y' and CONFIG_KASAN_HW_TAGS is 'N',
the following code in the __cpu_setup() routine is a potential
optimization target:
/* set the TCR_EL1 bits */
mov_q x10, TCR_MTE_FLAGS
Before the patch:
mov x10, #0x10000000000000
movk x10, #0x40, lsl #32
movk x10, #0x0, lsl #16
movk x10, #0x0
After the patch:
mov x10, #0x10000000000000
movk x10, #0x40, lsl #32
Signed-off-by: Jisheng Zhang <[email protected]>
---
arch/arm64/include/asm/assembler.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 8c5a61aeaf8e..09f408424cae 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -568,9 +568,13 @@ alternative_endif
movz \reg, :abs_g3:\val
movk \reg, :abs_g2_nc:\val
.endif
+ .if ((((\val) >> 16) & 0xffff) != 0)
movk \reg, :abs_g1_nc:\val
.endif
+ .endif
+ .if (((\val) & 0xffff) != 0)
movk \reg, :abs_g0_nc:\val
+ .endif
.endm
/*
--
2.34.1
On Sat, Jul 09, 2022 at 04:48:30PM +0800, Jisheng Zhang wrote:
> Currently, mov_q is used to move a constant into a 64-bit register. When
> the lower 16 or 32 bits of the constant are all zero, mov_q emits one or
> two useless movk instructions. If the mov_q macro is used in a hot code
> path, we want to avoid these movk instructions as much as possible. For
> example, when CONFIG_ARM64_MTE is 'Y' and CONFIG_KASAN_HW_TAGS is 'N',
> the following code in the __cpu_setup() routine is a potential
> optimization target:
>
> /* set the TCR_EL1 bits */
> mov_q x10, TCR_MTE_FLAGS
>
> Before the patch:
> mov x10, #0x10000000000000
> movk x10, #0x40, lsl #32
> movk x10, #0x0, lsl #16
> movk x10, #0x0
>
> After the patch:
> mov x10, #0x10000000000000
> movk x10, #0x40, lsl #32
>
> Signed-off-by: Jisheng Zhang <[email protected]>
> ---
> arch/arm64/include/asm/assembler.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index 8c5a61aeaf8e..09f408424cae 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -568,9 +568,13 @@ alternative_endif
> movz \reg, :abs_g3:\val
> movk \reg, :abs_g2_nc:\val
> .endif
> + .if ((((\val) >> 16) & 0xffff) != 0)
> movk \reg, :abs_g1_nc:\val
> .endif
> + .endif
> + .if (((\val) & 0xffff) != 0)
> movk \reg, :abs_g0_nc:\val
> + .endif
Please provide some numbers showing that this is worthwhile.
Will
On Tue, Jul 19, 2022 at 07:13:41PM +0100, Will Deacon wrote:
> On Sat, Jul 09, 2022 at 04:48:30PM +0800, Jisheng Zhang wrote:
> > Currently, mov_q is used to move a constant into a 64-bit register. When
> > the lower 16 or 32 bits of the constant are all zero, mov_q emits one or
> > two useless movk instructions. If the mov_q macro is used in a hot code
> > path, we want to avoid these movk instructions as much as possible. For
> > example, when CONFIG_ARM64_MTE is 'Y' and CONFIG_KASAN_HW_TAGS is 'N',
> > the following code in the __cpu_setup() routine is a potential
> > optimization target:
> >
> > /* set the TCR_EL1 bits */
> > mov_q x10, TCR_MTE_FLAGS
> >
> > Before the patch:
> > mov x10, #0x10000000000000
> > movk x10, #0x40, lsl #32
> > movk x10, #0x0, lsl #16
> > movk x10, #0x0
> >
> > After the patch:
> > mov x10, #0x10000000000000
> > movk x10, #0x40, lsl #32
> >
> > Signed-off-by: Jisheng Zhang <[email protected]>
> > ---
> > arch/arm64/include/asm/assembler.h | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > index 8c5a61aeaf8e..09f408424cae 100644
> > --- a/arch/arm64/include/asm/assembler.h
> > +++ b/arch/arm64/include/asm/assembler.h
> > @@ -568,9 +568,13 @@ alternative_endif
> > movz \reg, :abs_g3:\val
> > movk \reg, :abs_g2_nc:\val
> > .endif
> > + .if ((((\val) >> 16) & 0xffff) != 0)
> > movk \reg, :abs_g1_nc:\val
> > .endif
> > + .endif
> > + .if (((\val) & 0xffff) != 0)
> > movk \reg, :abs_g0_nc:\val
> > + .endif
>
> Please provide some numbers showing that this is worthwhile.
>
No, I have no performance numbers, but here is my opinion on this
patch: the two checks don't add maintenance effort and the readability
stays good, so if they can save two movk instructions, I think they
are worthwhile.
Thanks
On Tue, Jul 26, 2022 at 09:44:40PM +0800, Jisheng Zhang wrote:
> On Tue, Jul 19, 2022 at 07:13:41PM +0100, Will Deacon wrote:
> > On Sat, Jul 09, 2022 at 04:48:30PM +0800, Jisheng Zhang wrote:
> > > Currently, mov_q is used to move a constant into a 64-bit register. When
> > > the lower 16 or 32 bits of the constant are all zero, mov_q emits one or
> > > two useless movk instructions. If the mov_q macro is used in a hot code
> > > path, we want to avoid these movk instructions as much as possible. For
> > > example, when CONFIG_ARM64_MTE is 'Y' and CONFIG_KASAN_HW_TAGS is 'N',
> > > the following code in the __cpu_setup() routine is a potential
> > > optimization target:
> > >
> > > /* set the TCR_EL1 bits */
> > > mov_q x10, TCR_MTE_FLAGS
> > >
> > > Before the patch:
> > > mov x10, #0x10000000000000
> > > movk x10, #0x40, lsl #32
> > > movk x10, #0x0, lsl #16
> > > movk x10, #0x0
> > >
> > > After the patch:
> > > mov x10, #0x10000000000000
> > > movk x10, #0x40, lsl #32
> > >
> > > Signed-off-by: Jisheng Zhang <[email protected]>
> > > ---
> > > arch/arm64/include/asm/assembler.h | 4 ++++
> > > 1 file changed, 4 insertions(+)
> > >
> > > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > > index 8c5a61aeaf8e..09f408424cae 100644
> > > --- a/arch/arm64/include/asm/assembler.h
> > > +++ b/arch/arm64/include/asm/assembler.h
> > > @@ -568,9 +568,13 @@ alternative_endif
> > > movz \reg, :abs_g3:\val
> > > movk \reg, :abs_g2_nc:\val
> > > .endif
> > > + .if ((((\val) >> 16) & 0xffff) != 0)
> > > movk \reg, :abs_g1_nc:\val
> > > .endif
> > > + .endif
> > > + .if (((\val) & 0xffff) != 0)
> > > movk \reg, :abs_g0_nc:\val
> > > + .endif
> >
> > Please provide some numbers showing that this is worthwhile.
> >
>
> No, I have no performance numbers, but here is my opinion on this
> patch: the two checks don't add maintenance effort and the readability
> stays good, so if they can save two movk instructions, I think they
> are worthwhile.
Not unless you can measure a performance increase, no. The code is always
going to be more readable without this stuff added, so we shouldn't clutter
our low-level assembly macros with nested conditionals just for fun.
Will
On Sat, 9 Jul 2022 at 01:58, Jisheng Zhang <[email protected]> wrote:
>
> Currently, mov_q is used to move a constant into a 64-bit register. When
> the lower 16 or 32 bits of the constant are all zero, mov_q emits one or
> two useless movk instructions. If the mov_q macro is used in a hot code
> path, we want to avoid these movk instructions as much as possible. For
> example, when CONFIG_ARM64_MTE is 'Y' and CONFIG_KASAN_HW_TAGS is 'N',
> the following code in the __cpu_setup() routine is a potential
> optimization target:
>
> /* set the TCR_EL1 bits */
> mov_q x10, TCR_MTE_FLAGS
>
> Before the patch:
> mov x10, #0x10000000000000
> movk x10, #0x40, lsl #32
> movk x10, #0x0, lsl #16
> movk x10, #0x0
>
> After the patch:
> mov x10, #0x10000000000000
> movk x10, #0x40, lsl #32
>
> Signed-off-by: Jisheng Zhang <[email protected]>
This is broken for constants that have 0xffff in the top 16 bits, as
in that case, we will emit a MOVN/MOVK/MOVK sequence, and omitting the
MOVKs will set the corresponding field to 0xffff not 0x0.
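To make that concrete, here is a rough sketch of the expansion for a
made-up constant with the top bits set, say 0xffff800000000000 (my own
illustration, not taken from the thread; the exact instructions the
assembler chooses may differ):

	movn	x0, #0x7fff, lsl #32	// x0 = 0xffff8000ffffffff
	movk	x0, #0x0, lsl #16	// x0 = 0xffff80000000ffff
	movk	x0, #0x0		// x0 = 0xffff800000000000

With the proposed checks, both movk instructions would be dropped
because their 16-bit fields are zero, leaving the low 32 bits as
0xffffffff instead of 0x0.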
> ---
> arch/arm64/include/asm/assembler.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index 8c5a61aeaf8e..09f408424cae 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -568,9 +568,13 @@ alternative_endif
> movz \reg, :abs_g3:\val
> movk \reg, :abs_g2_nc:\val
> .endif
> + .if ((((\val) >> 16) & 0xffff) != 0)
> movk \reg, :abs_g1_nc:\val
> .endif
> + .endif
> + .if (((\val) & 0xffff) != 0)
> movk \reg, :abs_g0_nc:\val
> + .endif
> .endm
>
> /*
> --
> 2.34.1
On Wed, Jul 27, 2022 at 08:15:11AM -0700, Ard Biesheuvel wrote:
> On Sat, 9 Jul 2022 at 01:58, Jisheng Zhang <[email protected]> wrote:
> >
> > Currently, mov_q is used to move a constant into a 64-bit register. When
> > the lower 16 or 32 bits of the constant are all zero, mov_q emits one or
> > two useless movk instructions. If the mov_q macro is used in a hot code
> > path, we want to avoid these movk instructions as much as possible. For
> > example, when CONFIG_ARM64_MTE is 'Y' and CONFIG_KASAN_HW_TAGS is 'N',
> > the following code in the __cpu_setup() routine is a potential
> > optimization target:
> >
> > /* set the TCR_EL1 bits */
> > mov_q x10, TCR_MTE_FLAGS
> >
> > Before the patch:
> > mov x10, #0x10000000000000
> > movk x10, #0x40, lsl #32
> > movk x10, #0x0, lsl #16
> > movk x10, #0x0
> >
> > After the patch:
> > mov x10, #0x10000000000000
> > movk x10, #0x40, lsl #32
> >
> > Signed-off-by: Jisheng Zhang <[email protected]>
>
> This is broken for constants that have 0xffff in the top 16 bits, as
> in that case, we will emit a MOVN/MOVK/MOVK sequence, and omitting the
> MOVKs will set the corresponding field to 0xffff not 0x0.
Thanks so much for this hint. I think you are right about the case
where the constant has 0xffff in the top 16 bits.
On Thu, 28 Jul 2022 at 08:26, Jisheng Zhang <[email protected]> wrote:
>
> On Thu, Jul 28, 2022 at 10:49:02PM +0800, Jisheng Zhang wrote:
> > On Wed, Jul 27, 2022 at 08:15:11AM -0700, Ard Biesheuvel wrote:
> > > On Sat, 9 Jul 2022 at 01:58, Jisheng Zhang <[email protected]> wrote:
> > > >
> > > > Currently, mov_q is used to move a constant into a 64-bit register. When
> > > > the lower 16 or 32 bits of the constant are all zero, mov_q emits one or
> > > > two useless movk instructions. If the mov_q macro is used in a hot code
> > > > path, we want to avoid these movk instructions as much as possible. For
> > > > example, when CONFIG_ARM64_MTE is 'Y' and CONFIG_KASAN_HW_TAGS is 'N',
> > > > the following code in the __cpu_setup() routine is a potential
> > > > optimization target:
> > > >
> > > > /* set the TCR_EL1 bits */
> > > > mov_q x10, TCR_MTE_FLAGS
> > > >
> > > > Before the patch:
> > > > mov x10, #0x10000000000000
> > > > movk x10, #0x40, lsl #32
> > > > movk x10, #0x0, lsl #16
> > > > movk x10, #0x0
> > > >
> > > > After the patch:
> > > > mov x10, #0x10000000000000
> > > > movk x10, #0x40, lsl #32
> > > >
> > > > Signed-off-by: Jisheng Zhang <[email protected]>
> > >
> > > This is broken for constants that have 0xffff in the top 16 bits, as
> > > in that case, we will emit a MOVN/MOVK/MOVK sequence, and omitting the
> > > MOVKs will set the corresponding field to 0xffff not 0x0.
> >
> > Thanks so much for this hint. I think you are right about the case
> > where the constant has 0xffff in the top 16 bits.
> >
>
> The patch breaks the following usage:
> mov_q x0, 0xffffffff00000000
>
> I think the reason is that mov_q starts from the high bits; if we
> changed the macro to start from the LSB, that could solve the breakage.
> But this would need a rewrite of mov_q.
No, it has nothing to do with that.
The problem is that the use of MOVN changes the implicit value of the
16-bit fields that are left unspecified, and assigning them in a
different order is not going to make any difference.
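As a minimal illustration of that point (made-up immediates, not taken
from the thread): MOVZ zeroes every 16-bit field it does not write,
while MOVN sets them all to 0xffff, so a follow-up MOVK writing 0x0000
is only redundant after a MOVZ.

	movz	x0, #0x1234, lsl #32	// x0 = 0x0000123400000000
	movn	x0, #0x1234, lsl #32	// x0 = 0xffffedcbffffffff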
I don't think we should further complicate mov_q, and I would argue
that the existing optimization (which I added myself) is premature
already: in the grand scheme of things, one or two instructions more
or less are not going to make a difference anyway, given how rarely
this macro is used. And even if any of these occurrences are on a hot
path, it is not a given that shorter sequences of MOVZ/MOVN/MOVK are
going to execute any faster, as the canonical MOVZ/MOVK/MOVK/MOVK
might well decode to fewer uops.
So in summary, let's leave this code be.
On Thu, Jul 28, 2022 at 10:49:02PM +0800, Jisheng Zhang wrote:
> On Wed, Jul 27, 2022 at 08:15:11AM -0700, Ard Biesheuvel wrote:
> > On Sat, 9 Jul 2022 at 01:58, Jisheng Zhang <[email protected]> wrote:
> > >
> > > Currently, mov_q is used to move a constant into a 64-bit register. When
> > > the lower 16 or 32 bits of the constant are all zero, mov_q emits one or
> > > two useless movk instructions. If the mov_q macro is used in a hot code
> > > path, we want to avoid these movk instructions as much as possible. For
> > > example, when CONFIG_ARM64_MTE is 'Y' and CONFIG_KASAN_HW_TAGS is 'N',
> > > the following code in the __cpu_setup() routine is a potential
> > > optimization target:
> > >
> > > /* set the TCR_EL1 bits */
> > > mov_q x10, TCR_MTE_FLAGS
> > >
> > > Before the patch:
> > > mov x10, #0x10000000000000
> > > movk x10, #0x40, lsl #32
> > > movk x10, #0x0, lsl #16
> > > movk x10, #0x0
> > >
> > > After the patch:
> > > mov x10, #0x10000000000000
> > > movk x10, #0x40, lsl #32
> > >
> > > Signed-off-by: Jisheng Zhang <[email protected]>
> >
> > This is broken for constants that have 0xffff in the top 16 bits, as
> > in that case, we will emit a MOVN/MOVK/MOVK sequence, and omitting the
> > MOVKs will set the corresponding field to 0xffff not 0x0.
>
> Thanks so much for this hint. I think you are right about the case
> where the constant has 0xffff in the top 16 bits.
>
The patch breaks the following usage:
mov_q x0, 0xffffffff00000000

I think the reason is that mov_q starts from the high bits; if we
changed the macro to start from the LSB, that could solve the breakage.
But this would need a rewrite of mov_q.
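For reference, a rough reconstruction of how that case expands (my own
sketch, assuming the assembler turns the :abs_g2_s: MOVZ into a MOVN for
this negative value; the exact encoding may differ):

Current mov_q x0, 0xffffffff00000000:
	movn	x0, #0x0, lsl #32	// x0 = 0xffffffffffffffff
	movk	x0, #0x0, lsl #16	// x0 = 0xffffffff0000ffff
	movk	x0, #0x0		// x0 = 0xffffffff00000000

With the proposed checks, both movk instructions are skipped because
their 16-bit fields are zero, and x0 is left as 0xffffffffffffffff.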