This is a fourth tentative to switch powerpc VDSO to generic C implementation.
This version should work on PPC64 (untested). VDSO32 for PPC64 is
impossible to build and has been de-activated, because the powerpc
ASM header files for C are not prepared to build 32 bits code with CONFIG_PPC64.
powerpc is a bit special for VDSO as well as system calls in the
way that it requires setting CR SO bit which cannot be done in C.
Therefore, entry/exit and fallback need to be performed in ASM.
Note that on previous patches, a buggy version of vdsotest was used which was
underestimating the time in gettimeofday compared to clock-get... functions.
This series applies on a merge of powerpc/merge and tip/timers/core branches,
series "lib/vdso: Bugfix and consolidation"
(https://lore.kernel.org/patchwork/project/lkml/list/?series=425784)
applied after the above merge.
On a powerpc8xx, with current powerpc/32 ASM VDSO:
gettimeofday: vdso: 907 nsec/call
clock-getres-realtime: vdso: 484 nsec/call
clock-gettime-realtime: vdso: 899 nsec/call
The first patch adds VDSO generic C support without any changes to common code.
Performance is as follows:
gettimeofday: vdso: 1211 nsec/call
clock-getres-realtime: vdso: 722 nsec/call
clock-gettime-realtime: vdso: 1216 nsec/call
Then a few changes in the common code have allowed performance improvement. At
the end of the series we have:
gettimeofday: vdso: 974 nsec/call
clock-getres-realtime: vdso: 545 nsec/call
clock-gettime-realtime: vdso: 941 nsec/call
The final result is rather close to pure ASM VDSO:
* 7% more on gettimeofday (9 cycles)
* 5% more on clock-gettime-realtime (6 cycles)
* 12% more on clock-getres-realtime (8 cycles)
Due to the unavoidable ASM trampoline, we won't get much closer but that should be
acceptable for a port from ASM to a generic C code (here, 1 cycle is about 7,5 ns)
Christophe Leroy (11):
powerpc/64: Don't provide time functions in compat VDSO32
powerpc/vdso: Switch VDSO to generic C implementation.
lib: vdso: only read hrtimer_res when needed in __cvdso_clock_getres()
powerpc/vdso: simplify __get_datapage()
lib: vdso: allow arches to provide vdso data pointer
powerpc/vdso: provide inline alternative to __get_datapage()
powerpc/vdso: provide vdso data pointer from the ASM caller.
lib: vdso: allow fixed clock mode
powerpc/vdso: override __arch_vdso_capable()
lib: vdso: Allow arches to override the ns shift operation
powerpc/32: provide vdso_shift_ns()
arch/powerpc/Kconfig | 2 +
arch/powerpc/include/asm/clocksource.h | 6 +
arch/powerpc/include/asm/vdso/gettimeofday.h | 117 ++++++++++++
arch/powerpc/include/asm/vdso/vsyscall.h | 25 +++
arch/powerpc/include/asm/vdso_datapage.h | 52 +++---
arch/powerpc/kernel/asm-offsets.c | 46 +----
arch/powerpc/kernel/time.c | 91 +---------
arch/powerpc/kernel/vdso.c | 58 ++----
arch/powerpc/kernel/vdso32/Makefile | 27 ++-
arch/powerpc/kernel/vdso32/datapage.S | 10 +-
arch/powerpc/kernel/vdso32/gettimeofday.S | 258 ++++-----------------------
arch/powerpc/kernel/vdso32/vdso32.lds.S | 9 +-
arch/powerpc/kernel/vdso32/vgettimeofday.c | 29 +++
arch/powerpc/kernel/vdso64/Makefile | 23 ++-
arch/powerpc/kernel/vdso64/datapage.S | 13 +-
arch/powerpc/kernel/vdso64/gettimeofday.S | 257 ++++----------------------
arch/powerpc/kernel/vdso64/vdso64.lds.S | 7 +-
arch/powerpc/kernel/vdso64/vgettimeofday.c | 29 +++
lib/vdso/gettimeofday.c | 107 ++++++++---
19 files changed, 457 insertions(+), 709 deletions(-)
create mode 100644 arch/powerpc/include/asm/clocksource.h
create mode 100644 arch/powerpc/include/asm/vdso/gettimeofday.h
create mode 100644 arch/powerpc/include/asm/vdso/vsyscall.h
create mode 100644 arch/powerpc/kernel/vdso32/vgettimeofday.c
create mode 100644 arch/powerpc/kernel/vdso64/vgettimeofday.c
--
2.13.3
The generic x >> s gives the following result:
18: 35 25 ff e0 addic. r9,r5,-32
1c: 41 80 00 10 blt 2c <shift+0x14>
20: 7c 64 4c 30 srw r4,r3,r9
24: 38 60 00 00 li r3,0
...
2c: 54 69 08 3c rlwinm r9,r3,1,0,30
30: 21 45 00 1f subfic r10,r5,31
34: 7c 84 2c 30 srw r4,r4,r5
38: 7d 29 50 30 slw r9,r9,r10
3c: 7c 63 2c 30 srw r3,r3,r5
40: 7d 24 23 78 or r4,r9,r4
In our case the shift is always < 32. In addition, the upper 32 bits
of the result are likely nul. Lets GCC know it, it also optimises the
following calculations.
With the patch, we get:
0: 21 25 00 20 subfic r9,r5,32
4: 7c 69 48 30 slw r9,r3,r9
8: 7c 84 2c 30 srw r4,r4,r5
c: 7d 24 23 78 or r4,r9,r4
10: 7c 63 2c 30 srw r3,r3,r5
Performance before the patch:
clock-gettime-realtime: vdso: 1033 nsec/call
After the patch:
clock-gettime-realtime: vdso: 941 nsec/call
Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/include/asm/vdso/gettimeofday.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h
index 74b6eef8fbe9..716a137ab166 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -95,6 +95,23 @@ static __always_inline u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 m
}
#define vdso_calc_delta vdso_calc_delta
+#ifndef __powerpc64__
+static __always_inline u64 vdso_shift_ns(u64 ns, unsigned long shift)
+{
+ u32 hi = ns >> 32;
+ u32 lo = ns;
+
+ lo = (lo >> shift) | (hi << (32 - shift));
+ hi >>= shift;
+
+ if (likely(hi == 0))
+ return lo;
+
+ return ((u64)hi << 32) | lo;
+}
+#define vdso_shift_ns vdso_shift_ns
+#endif
+
#endif /* !__ASSEMBLY__ */
#endif /* __ASM_VDSO_GETTIMEOFDAY_H */
--
2.13.3
Hi!
On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote:
> On a powerpc8xx, with current powerpc/32 ASM VDSO:
>
> gettimeofday: vdso: 907 nsec/call
> clock-getres-realtime: vdso: 484 nsec/call
> clock-gettime-realtime: vdso: 899 nsec/call
>
> The first patch adds VDSO generic C support without any changes to common code.
> Performance is as follows:
>
> gettimeofday: vdso: 1211 nsec/call
> clock-getres-realtime: vdso: 722 nsec/call
> clock-gettime-realtime: vdso: 1216 nsec/call
>
> Then a few changes in the common code have allowed performance improvement. At
> the end of the series we have:
>
> gettimeofday: vdso: 974 nsec/call
> clock-getres-realtime: vdso: 545 nsec/call
> clock-gettime-realtime: vdso: 941 nsec/call
>
> The final result is rather close to pure ASM VDSO:
> * 7% more on gettimeofday (9 cycles)
> * 5% more on clock-gettime-realtime (6 cycles)
> * 12% more on clock-getres-realtime (8 cycles)
Nice! Much better.
It should be tested on more representative hardware, too, but this looks
promising alright :-)
Segher
Le 17/01/2020 à 09:58, Segher Boessenkool a écrit :
> Hi!
>
> On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote:
>> On a powerpc8xx, with current powerpc/32 ASM VDSO:
>>
>> gettimeofday: vdso: 907 nsec/call
>> clock-getres-realtime: vdso: 484 nsec/call
>> clock-gettime-realtime: vdso: 899 nsec/call
>>
>> The first patch adds VDSO generic C support without any changes to common code.
>> Performance is as follows:
>>
>> gettimeofday: vdso: 1211 nsec/call
>> clock-getres-realtime: vdso: 722 nsec/call
>> clock-gettime-realtime: vdso: 1216 nsec/call
>>
>> Then a few changes in the common code have allowed performance improvement. At
>> the end of the series we have:
>>
>> gettimeofday: vdso: 974 nsec/call
>> clock-getres-realtime: vdso: 545 nsec/call
>> clock-gettime-realtime: vdso: 941 nsec/call
>>
>> The final result is rather close to pure ASM VDSO:
>> * 7% more on gettimeofday (9 cycles)
>> * 5% more on clock-gettime-realtime (6 cycles)
>> * 12% more on clock-getres-realtime (8 cycles)
>
> Nice! Much better.
>
> It should be tested on more representative hardware, too, but this looks
> promising alright :-)
>
Yes.
Now the challenge is to get VDSO32 buildable on PPC64. The big issue is
that in most powerpc/include/asm/*.h , CONFIG_PPC64 is used to know if
the build is a 64 bits build or a 32 bits build, so VDSO32 build fails.
I don't know how this could be easily fixed.
Christophe
Hi
On 01/17/2020 08:58 AM, Segher Boessenkool wrote:
> Hi!
>
> On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote:
>> On a powerpc8xx, with current powerpc/32 ASM VDSO:
>>
>> gettimeofday: vdso: 907 nsec/call
>> clock-getres-realtime: vdso: 484 nsec/call
>> clock-gettime-realtime: vdso: 899 nsec/call
>>
>> The first patch adds VDSO generic C support without any changes to common code.
>> Performance is as follows:
>>
>> gettimeofday: vdso: 1211 nsec/call
>> clock-getres-realtime: vdso: 722 nsec/call
>> clock-gettime-realtime: vdso: 1216 nsec/call
>>
>> Then a few changes in the common code have allowed performance improvement. At
>> the end of the series we have:
>>
>> gettimeofday: vdso: 974 nsec/call
>> clock-getres-realtime: vdso: 545 nsec/call
>> clock-gettime-realtime: vdso: 941 nsec/call
>>
>> The final result is rather close to pure ASM VDSO:
>> * 7% more on gettimeofday (9 cycles)
>> * 5% more on clock-gettime-realtime (6 cycles)
>> * 12% more on clock-getres-realtime (8 cycles)
>
> Nice! Much better.
>
> It should be tested on more representative hardware, too, but this looks
> promising alright :-)
>
mpc832x (e300c2 core) at 333 MHz:
Before:
gettimeofday: vdso: 235 nsec/call
clock-getres-realtime-coarse: vdso: 1668 nsec/call
clock-gettime-realtime-coarse: vdso: 1338 nsec/call
clock-getres-realtime: vdso: 135 nsec/call
clock-gettime-realtime: vdso: 244 nsec/call
clock-getres-boottime: vdso: 1232 nsec/call
clock-gettime-boottime: vdso: 1935 nsec/call
clock-getres-tai: vdso: 1257 nsec/call
clock-gettime-tai: vdso: 1898 nsec/call
clock-getres-monotonic-raw: vdso: 1229 nsec/call
clock-gettime-monotonic-raw: vdso: 1541 nsec/call
clock-getres-monotonic-coarse: vdso: 1699 nsec/call
clock-gettime-monotonic-coarse: vdso: 1477 nsec/call
clock-getres-monotonic: vdso: 135 nsec/call
clock-gettime-monotonic: vdso: 283 nsec/call
With the series:
gettimeofday: vdso: 271 nsec/call
clock-getres-realtime-coarse: vdso: 159 nsec/call
clock-gettime-realtime-coarse: vdso: 184 nsec/call
clock-getres-realtime: vdso: 163 nsec/call
clock-gettime-realtime: vdso: 281 nsec/call
clock-getres-boottime: vdso: 169 nsec/call
clock-gettime-boottime: vdso: 274 nsec/call
clock-getres-tai: vdso: 163 nsec/call
clock-gettime-tai: vdso: 277 nsec/call
clock-getres-monotonic-raw: vdso: 166 nsec/call
clock-gettime-monotonic-raw: vdso: 302 nsec/call
clock-getres-monotonic-coarse: vdso: 159 nsec/call
clock-gettime-monotonic-coarse: vdso: 184 nsec/call
clock-getres-monotonic: vdso: 166 nsec/call
clock-gettime-monotonic: vdso: 274 nsec/call
Christophe
On Mon, Jan 20, 2020 at 02:56:00PM +0000, Christophe Leroy wrote:
> >Nice! Much better.
> >
> >It should be tested on more representative hardware, too, but this looks
> >promising alright :-)
>
> mpc832x (e300c2 core) at 333 MHz:
>
> Before:
>
> gettimeofday: vdso: 235 nsec/call
> clock-gettime-realtime: vdso: 244 nsec/call
>
> With the series:
>
> gettimeofday: vdso: 271 nsec/call
> clock-gettime-realtime: vdso: 281 nsec/call
Those are important, and degrade ~15%. That is acceptable IMO, but do
you see a way to optimise this (later)?
Anyway, excellent results, thanks for your persistence!
Segher
Le 20/01/2020 à 16:19, Segher Boessenkool a écrit :
> On Mon, Jan 20, 2020 at 02:56:00PM +0000, Christophe Leroy wrote:
>>> Nice! Much better.
>>>
>>> It should be tested on more representative hardware, too, but this looks
>>> promising alright :-)
>>
>> mpc832x (e300c2 core) at 333 MHz:
>>
>> Before:
>>
>> gettimeofday: vdso: 235 nsec/call
>> clock-gettime-realtime: vdso: 244 nsec/call
>>
>> With the series:
>>
>> gettimeofday: vdso: 271 nsec/call
>> clock-gettime-realtime: vdso: 281 nsec/call
>
> Those are important, and degrade ~15%. That is acceptable IMO, but do
> you see a way to optimise this (later)?
Not easy I think.
First we have the unavoidable ASM entry function that can't be dropped
because of the CR[SO] bit the set on error or clear on no error and that
can't be done in C.
In our ASM VDSO, fixed shifts are used, while in generic C VDSO, shifts
are generic and read from the VDSO data.
And there is still some funny code generated by GCC (8.1), like:
620: 7d 29 3c 30 srw r9,r9,r7
624: 21 87 00 20 subfic r12,r7,32
628: 7d 07 3c 31 srw. r7,r8,r7
62c: 7d 08 60 30 slw r8,r8,r12
630: 7d 0b 4b 78 or r11,r8,r9
634: 39 40 00 00 li r10,0
638: 40 82 00 84 bne 6bc <__c_kernel_clock_gettime+0x114>
63c: 81 23 00 24 lwz r9,36(r3)
640: 81 05 00 00 lwz r8,0(r5)
...
6bc: 7d 69 5b 78 mr r9,r11
6c0: 7c ea 3b 78 mr r10,r7
6c4: 7d 2b 4b 78 mr r11,r9
6c8: 4b ff ff 74 b 63c <__c_kernel_clock_gettime+0x94>
This branch to 6bc is totally useless:
- copying r11 into r9 is pointless as r9 is overwritten in 63c
- copying back r9 into r11 is pointless as r11 has not been modified
inbetween.
- loading r10 with 0 then overwritting r10 with r7 when r7 is not 0 is
pointless as well, could have directly put the result of srw. in r10.
Christophe
On Mon, Jan 20, 2020 at 06:08:23PM +0100, Christophe Leroy wrote:
> Not easy I think.
>
> First we have the unavoidable ASM entry function that can't be dropped
> because of the CR[SO] bit the set on error or clear on no error and that
> can't be done in C.
Yup.
> In our ASM VDSO, fixed shifts are used, while in generic C VDSO, shifts
> are generic and read from the VDSO data.
Does that cost more than just a few cycles?
> And there is still some funny code generated by GCC (8.1), like:
>
> 620: 7d 29 3c 30 srw r9,r9,r7
> 624: 21 87 00 20 subfic r12,r7,32
> 628: 7d 07 3c 31 srw. r7,r8,r7
> 62c: 7d 08 60 30 slw r8,r8,r12
> 630: 7d 0b 4b 78 or r11,r8,r9
(This can be done cheaper for fixed shifts, you can use rlwimi then).
> 634: 39 40 00 00 li r10,0
> 638: 40 82 00 84 bne 6bc <__c_kernel_clock_gettime+0x114>
> 63c: 81 23 00 24 lwz r9,36(r3)
> 640: 81 05 00 00 lwz r8,0(r5)
> ...
> 6bc: 7d 69 5b 78 mr r9,r11
> 6c0: 7c ea 3b 78 mr r10,r7
> 6c4: 7d 2b 4b 78 mr r11,r9
> 6c8: 4b ff ff 74 b 63c <__c_kernel_clock_gettime+0x94>
>
> This branch to 6bc is totally useless:
> - copying r11 into r9 is pointless as r9 is overwritten in 63c
> - copying back r9 into r11 is pointless as r11 has not been modified
> inbetween.
Yeah, huh, how did that happen.
> - loading r10 with 0 then overwritting r10 with r7 when r7 is not 0 is
> pointless as well, could have directly put the result of srw. in r10.
This may be harder to make the compiler do.
But the r9/r11 thing suggests you are preventing optimisation somewhere,
maybe with some asm? Do you have some small testcase I can compile?
Segher