*** BLURB HERE ***
Andrew Pinski (2):
ARM64:VDSO: Improve gettimeofday, don't use udiv
ARM64:VDSO: Improve __do_get_tspec, don't use udiv
arch/arm64/kernel/vdso/gettimeofday.S | 47 ++++++++++++++++++++++++--------
1 files changed, 35 insertions(+), 12 deletions(-)
--
1.7.2.5
On many cores, udiv with a large value is slow, expand instead
the division out to be what GCC would have generated for the
divide by 1000.
On ThunderX, the speeds up gettimeofday by 5%.
Signed-off-by: Andrew Pinski <[email protected]>
---
arch/arm64/kernel/vdso/gettimeofday.S | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
index efa79e8..e5caef9 100644
--- a/arch/arm64/kernel/vdso/gettimeofday.S
+++ b/arch/arm64/kernel/vdso/gettimeofday.S
@@ -64,10 +64,22 @@ ENTRY(__kernel_gettimeofday)
bl __do_get_tspec
seqcnt_check w9, 1b
- /* Convert ns to us. */
- mov x13, #1000
- lsl x13, x13, x12
- udiv x11, x11, x13
+ /* Undo the shift. */
+ lsr x11, x11, x12
+
+ /* Convert ns to us (division by 1000 by using multiply high).
+ * This is how GCC converts the division by 1000 into.
+ * This is faster than divide on most cores.
+ */
+ mov x13, 63439
+ movk x13, 0xe353, lsl 16
+ lsr x11, x11, 3
+ movk x13, 0x9ba5, lsl 32
+ movk x13, 0x20c4, lsl 48
+ /* x13 = 0x20c49ba5e353f7cf */
+ umulh x11, x11, x13
+ lsr x11, x11, 4
+
stp x10, x11, [x0, #TVAL_TV_SEC]
2:
/* If tz is NULL, return 0. */
--
1.7.2.5
In most other targets (x86/tile for an example),
the division in __do_get_tspec is converted into
a simple loop. The main reason for this is
because the result of this division is going
to be either 0 or 1.
This changes the division to the simple loop
and thus speeding up gettimeofday.
On ThunderX, this speeds up gettimeofday by 16.6%.
Signed-off-by: Andrew Pinski <[email protected]>
---
arch/arm64/kernel/vdso/gettimeofday.S | 27 +++++++++++++++++++--------
1 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
index e5caef9..28f4da7 100644
--- a/arch/arm64/kernel/vdso/gettimeofday.S
+++ b/arch/arm64/kernel/vdso/gettimeofday.S
@@ -246,14 +246,25 @@ ENTRY(__do_get_tspec)
mul x10, x10, x11
/* Use the kernel time to calculate the new timespec. */
- mov x11, #NSEC_PER_SEC_LO16
- movk x11, #NSEC_PER_SEC_HI16, lsl #16
- lsl x11, x11, x12
- add x15, x10, x14
- udiv x14, x15, x11
- add x10, x13, x14
- mul x13, x14, x11
- sub x11, x15, x13
+ mov x15, #NSEC_PER_SEC_LO16
+ movk x15, #NSEC_PER_SEC_HI16, lsl #16
+ lsl x15, x15, x12
+ add x11, x10, x14
+ mov x10, x13
+
+ /*
+ * Use a loop instead of a division as this is most
+ * likely going to be only giving a 1 or 0 and that is faster
+ * than a division.
+ */
+ cmp x11, x15
+ b.lt 1f
+2:
+ sub x11, x11, x15
+ add x10, x10, 1
+ cmp x11, x15
+ b.ge 2b
+1:
ret
.cfi_endproc
--
1.7.2.5
On 13 March 2016 at 20:50, Andrew Pinski <[email protected]> wrote:
> On many cores, udiv with a large value is slow, expand instead
> the division out to be what GCC would have generated for the
> divide by 1000.
>
> On ThunderX, the speeds up gettimeofday by 5%.
>
> Signed-off-by: Andrew Pinski <[email protected]>
> ---
> arch/arm64/kernel/vdso/gettimeofday.S | 20 ++++++++++++++++----
> 1 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
> index efa79e8..e5caef9 100644
> --- a/arch/arm64/kernel/vdso/gettimeofday.S
> +++ b/arch/arm64/kernel/vdso/gettimeofday.S
> @@ -64,10 +64,22 @@ ENTRY(__kernel_gettimeofday)
> bl __do_get_tspec
> seqcnt_check w9, 1b
>
> - /* Convert ns to us. */
> - mov x13, #1000
> - lsl x13, x13, x12
> - udiv x11, x11, x13
> + /* Undo the shift. */
> + lsr x11, x11, x12
> +
> + /* Convert ns to us (division by 1000 by using multiply high).
> + * This is how GCC converts the division by 1000 into.
> + * This is faster than divide on most cores.
> + */
> + mov x13, 63439
Please don't mix hex and decimal constants
> + movk x13, 0xe353, lsl 16
> + lsr x11, x11, 3
> + movk x13, 0x9ba5, lsl 32
> + movk x13, 0x20c4, lsl 48
> + /* x13 = 0x20c49ba5e353f7cf */
Could we clean this up a bit? Something along the lines of
.set m, 0x20c49ba5e353f7cf
movz x13,#:abs_g3:m
movk x13, #:abs:g2_nc:m
movk x13, #:abs_g1_nc:m
movk x13, #:abs_g0_nc:m
Actually, the movz/movk sequence should probably be implemented as a
macro in asm/assembler.h, with parameters for the register and the
symbol name. I think Mark proposed such a patch at some point
> + umulh x11, x11, x13
> + lsr x11, x11, 4
> +
> stp x10, x11, [x0, #TVAL_TV_SEC]
> 2:
> /* If tz is NULL, return 0. */
> --
> 1.7.2.5
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Mon, Mar 14, 2016 at 07:55:38AM +0100, Ard Biesheuvel wrote:
> On 13 March 2016 at 20:50, Andrew Pinski <[email protected]> wrote:
> > + movk x13, 0xe353, lsl 16
> > + lsr x11, x11, 3
> > + movk x13, 0x9ba5, lsl 32
> > + movk x13, 0x20c4, lsl 48
> > + /* x13 = 0x20c49ba5e353f7cf */
>
> Could we clean this up a bit? Something along the lines of
>
> .set m, 0x20c49ba5e353f7cf
> movz x13,#:abs_g3:m
> movk x13, #:abs:g2_nc:m
> movk x13, #:abs_g1_nc:m
> movk x13, #:abs_g0_nc:m
>
> Actually, the movz/movk sequence should probably be implemented as a
> macro in asm/assembler.h, with parameters for the register and the
> symbol name.
Agreed.
> I think Mark proposed such a patch at some point
That would be [1], which needs the relocations fixed up [2,3] to match the
above.
I didn't respin that as it turned out to be unnecessary at the time, but I'm
more than happy for someone to pick it up.
Mark.
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/397563.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/397572.html
[3] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/397573.html
On 03/13/2016 03:50 PM, Andrew Pinski wrote:
> On many cores, udiv with a large value is slow, expand instead
> the division out to be what GCC would have generated for the
> divide by 1000.
This like checking object code into version control. Why not write in C
and let GCC perform the generation?
Thanks,
Cov
--
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project