Process clock_gettime(CLOCK_TAI) in vDSO.
This makes the call about as fast as CLOCK_REALTIME and CLOCK_MONOTONIC:
                nanoseconds
    before   after   clockname
    ------   -----   ---------
       233      87   CLOCK_TAI
        96      93   CLOCK_REALTIME
        88      87   CLOCK_MONOTONIC
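For reference, a harness of roughly this shape is enough to reproduce such
numbers (an illustrative sketch, not necessarily the exact benchmark used
here; the iteration count is arbitrary):

#define _GNU_SOURCE		/* CLOCK_TAI needs a reasonably recent libc */
#include <stdio.h>
#include <time.h>

#define ITERS 10000000UL

static void bench(clockid_t clk, const char *name)
{
	struct timespec start, end, ts;
	double ns;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < ITERS; i++)
		clock_gettime(clk, &ts);
	clock_gettime(CLOCK_MONOTONIC, &end);

	ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
	printf("%-16s %6.1f ns/call\n", name, ns / ITERS);
}

int main(void)
{
	bench(CLOCK_TAI, "CLOCK_TAI");
	bench(CLOCK_REALTIME, "CLOCK_REALTIME");
	bench(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
	return 0;
}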
Signed-off-by: Matt Rickard <[email protected]>
---
arch/x86/entry/vdso/vclock_gettime.c | 25 +++++++++++++++++++++++++
arch/x86/entry/vsyscall/vsyscall_gtod.c | 2 ++
arch/x86/include/asm/vgtod.h | 1 +
3 files changed, 28 insertions(+)
diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index f19856d95c60..91ed1bb2a3bb 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -246,6 +246,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts)
return mode;
}
+notrace static int __always_inline do_tai(struct timespec *ts)
+{
+ unsigned long seq;
+ u64 ns;
+ int mode;
+
+ do {
+ seq = gtod_read_begin(gtod);
+ mode = gtod->vclock_mode;
+ ts->tv_sec = gtod->tai_time_sec;
+ ns = gtod->wall_time_snsec;
+ ns += vgetsns(&mode);
+ ns >>= gtod->shift;
+ } while (unlikely(gtod_read_retry(gtod, seq)));
+
+ ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+ ts->tv_nsec = ns;
+
+ return mode;
+}
+
notrace static void do_realtime_coarse(struct timespec *ts)
{
unsigned long seq;
@@ -277,6 +298,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
if (do_monotonic(ts) == VCLOCK_NONE)
goto fallback;
break;
+ case CLOCK_TAI:
+ if (do_tai(ts) == VCLOCK_NONE)
+ goto fallback;
+ break;
case CLOCK_REALTIME_COARSE:
do_realtime_coarse(ts);
break;
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index e1216dd95c04..d61392fe17f6 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -53,6 +53,8 @@ void update_vsyscall(struct timekeeper *tk)
vdata->monotonic_time_snsec = tk->tkr_mono.xtime_nsec
+ ((u64)tk->wall_to_monotonic.tv_nsec
<< tk->tkr_mono.shift);
+ vdata->tai_time_sec = tk->xtime_sec
+ + tk->tai_offset;
while (vdata->monotonic_time_snsec >=
(((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
vdata->monotonic_time_snsec -=
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9f0449..adc9f7b20b9c 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -32,6 +32,7 @@ struct vsyscall_gtod_data {
gtod_long_t wall_time_coarse_nsec;
gtod_long_t monotonic_time_coarse_sec;
gtod_long_t monotonic_time_coarse_nsec;
+ gtod_long_t tai_time_sec;
int tz_minuteswest;
int tz_dsttime;
(Hi, Florian!)
On Fri, Aug 31, 2018 at 6:59 PM, Matt Rickard <[email protected]> wrote:
> Process clock_gettime(CLOCK_TAI) in vDSO.
> This makes the call about as fast as CLOCK_REALTIME and CLOCK_MONOTONIC:
>
>                 nanoseconds
>     before   after   clockname
>     ------   -----   ---------
>        233      87   CLOCK_TAI
>         96      93   CLOCK_REALTIME
>         88      87   CLOCK_MONOTONIC
Are you sure you did this right? With the clocksource set to TSC
(which is the only reasonable choice unless KVM has seriously cleaned
up its act), with retpolines enabled, I get 24ns for CLOCK_MONOTONIC
without your patch and 32ns with your patch. And there is indeed a
retpoline in the disassembled output:
  e5:   e8 07 00 00 00          callq  f1 <__vdso_clock_gettime+0x31>
  ea:   f3 90                   pause
  ec:   0f ae e8                lfence
  ef:   eb f9                   jmp    ea <__vdso_clock_gettime+0x2a>
  f1:   48 89 04 24             mov    %rax,(%rsp)
  f5:   c3                      retq
You're probably going to have to set -fno-jump-tables or do something
clever like adding a whole array of (seconds, nsec) in gtod and
indexing that array by the clock id.
Meanwhile, I wrote the following trivial patch to add a
__vdso_clock_gettime_monotonic export. It runs in 21ns, and I suspect
that the speedup is even a bit bigger when cache-cold because it
avoids some branches. What do you all think? Florian, do you think
glibc would be willing to add some magic to turn
clock_gettime(CLOCK_MONOTONIC, t) into
__vdso_clock_gettime_monotonic(t) when CLOCK_MONOTONIC is a constant?
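As a rough sketch of the shape of such an entry point (a sketch only, not the
patch itself; it reuses the existing do_monotonic() and vdso_fallback_gettime()
helpers in vclock_gettime.c, and the new symbol would also have to be exported
from the vDSO linker script):

notrace int __vdso_clock_gettime_monotonic(struct timespec *ts)
{
	/* do_monotonic() returns VCLOCK_NONE when no usable vclock is
	 * available; fall back to the real syscall in that case. */
	if (unlikely(do_monotonic(ts) == VCLOCK_NONE))
		return vdso_fallback_gettime(CLOCK_MONOTONIC, ts);
	return 0;
}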
On 09/01/2018 05:39 AM, Andy Lutomirski wrote:
> Florian, do you think
> glibc would be willing to add some magic to turn
> clock_gettime(CLOCK_MONOTONIC, t) into
> __vdso_clock_gettime_monotonic(t) when CLOCK_MONOTONIC is a constant?
What's the goal here? Turn the indirect call/conditional jump/indirect
call sequence into a single indirect call, purely for performance reasons?
Thanks,
Florian
On Sat, Sep 1, 2018 at 2:33 AM, Florian Weimer <[email protected]> wrote:
> On 09/01/2018 05:39 AM, Andy Lutomirski wrote:
>>
>> Florian, do you think
>> glibc would be willing to add some magic to turn
>> clock_gettime(CLOCK_MONOTONIC, t) into
>> __vdso_clock_gettime_monotonic(t) when CLOCK_MONOTONIC is a constant?
>
>
> What's the goal here? Turn the indirect call/conditional jump/indirect call
> sequence into a single indirect call, purely for performance reasons?
Almost. It's to bypass some of the branches in
__vdso_clock_gettime(), which is supposed to be very fast. AFAIK most
user code that uses clock_gettime() passes a constant for the first
argument, and we can squeeze out some performance by optimizing that
case. The indirect branches internal to the vDSO are a separate issue
and should be solved separately.
(It's really too bad that x86 doesn't have a 64-bit call instruction.
If it did, then the PLT could get rewritten at dynamic link time to
avoid indirect calls entirely, and presumably glibc could use the same
technique to call into the vDSO without indirect calls.)
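Concretely, the "magic" could be as small as a header-level redirect when the
clock id is a compile-time constant. A rough sketch, where
__vdso_clock_gettime_monotonic() is the proposed helper (it does not exist
today) and __clock_gettime_generic() stands in for the current path:

#include <time.h>

extern int __clock_gettime_generic(clockid_t clk, struct timespec *ts);
extern int __vdso_clock_gettime_monotonic(struct timespec *ts);

static inline int my_clock_gettime(clockid_t clk, struct timespec *ts)
{
	/* Fold the constant CLOCK_MONOTONIC case onto the dedicated entry
	 * point; everything else keeps going through the generic path. */
	if (__builtin_constant_p(clk) && clk == CLOCK_MONOTONIC)
		return __vdso_clock_gettime_monotonic(ts);
	return __clock_gettime_generic(clk, ts);
}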
On Fri, 31 Aug 2018, Andy Lutomirski wrote:
> (Hi, Florian!)
>
> On Fri, Aug 31, 2018 at 6:59 PM, Matt Rickard <[email protected]> wrote:
> > Process clock_gettime(CLOCK_TAI) in vDSO.
> > This makes the call about as fast as CLOCK_REALTIME and CLOCK_MONOTONIC:
> >
> >                 nanoseconds
> >     before   after   clockname
> >     ------   -----   ---------
> >        233      87   CLOCK_TAI
> >         96      93   CLOCK_REALTIME
> >         88      87   CLOCK_MONOTONIC
>
> Are you sure you did this right? With the clocksource set to TSC
> (which is the only reasonable choice unless KVM has seriously cleaned
> up its act), with retpolines enabled, I get 24ns for CLOCK_MONOTONIC
> without your patch and 32ns with your patch. And there is indeed a
> retpoline in the disassembled output:
>
>   e5:   e8 07 00 00 00          callq  f1 <__vdso_clock_gettime+0x31>
>   ea:   f3 90                   pause
>   ec:   0f ae e8                lfence
>   ef:   eb f9                   jmp    ea <__vdso_clock_gettime+0x2a>
>   f1:   48 89 04 24             mov    %rax,(%rsp)
>   f5:   c3                      retq
>
> You're probably going to have to set -fno-jump-tables or do something
> clever like adding a whole array of (seconds, nsec) in gtod and
> indexing that array by the clock id.
See the patch below. It integrates TAI without slowing everything down, and it
definitely does not result in indirect calls.
On a HSW it slows down clock_gettime() by ~0.5ns. On a SKL I get a speedup of
~0.5ns. On an AMD Epyc server it's a 1.2ns speedup. So it somehow depends on
the uarch, and I also observed compiler-version-dependent variations.
Thanks,
tglx
----
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -203,39 +203,18 @@ notrace static inline u64 vgetsns(int *m
return v * gtod->mult;
}
-/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static int __always_inline do_hres(struct timespec *ts, clockid_t clk)
{
- unsigned long seq;
- u64 ns;
+ struct vgtod_ts *base = &gtod->basetime[clk & VGTOD_HRES_MASK];
+ unsigned int seq;
int mode;
-
- do {
- seq = gtod_read_begin(gtod);
- mode = gtod->vclock_mode;
- ts->tv_sec = gtod->wall_time_sec;
- ns = gtod->wall_time_snsec;
- ns += vgetsns(&mode);
- ns >>= gtod->shift;
- } while (unlikely(gtod_read_retry(gtod, seq)));
-
- ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
- ts->tv_nsec = ns;
-
- return mode;
-}
-
-notrace static int __always_inline do_monotonic(struct timespec *ts)
-{
- unsigned long seq;
u64 ns;
- int mode;
do {
seq = gtod_read_begin(gtod);
mode = gtod->vclock_mode;
- ts->tv_sec = gtod->monotonic_time_sec;
- ns = gtod->monotonic_time_snsec;
+ ts->tv_sec = base->sec;
+ ns = base->nsec;
ns += vgetsns(&mode);
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -246,58 +225,50 @@ notrace static int __always_inline do_mo
return mode;
}
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static void do_coarse(struct timespec *ts, clockid_t clk)
{
+ struct vgtod_ts *base = &gtod->basetime[clk];
unsigned long seq;
- do {
- seq = gtod_read_begin(gtod);
- ts->tv_sec = gtod->wall_time_coarse_sec;
- ts->tv_nsec = gtod->wall_time_coarse_nsec;
- } while (unlikely(gtod_read_retry(gtod, seq)));
-}
-notrace static void do_monotonic_coarse(struct timespec *ts)
-{
- unsigned long seq;
do {
seq = gtod_read_begin(gtod);
- ts->tv_sec = gtod->monotonic_time_coarse_sec;
- ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
+ ts->tv_sec = base->sec;
+ ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
}
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
- switch (clock) {
- case CLOCK_REALTIME:
- if (do_realtime(ts) == VCLOCK_NONE)
- goto fallback;
- break;
- case CLOCK_MONOTONIC:
- if (do_monotonic(ts) == VCLOCK_NONE)
- goto fallback;
- break;
- case CLOCK_REALTIME_COARSE:
- do_realtime_coarse(ts);
- break;
- case CLOCK_MONOTONIC_COARSE:
- do_monotonic_coarse(ts);
- break;
- default:
- goto fallback;
- }
+ unsigned int msk;
- return 0;
-fallback:
+ /* Sort out negative (CPU/FD) and invalid clocks */
+ if (unlikely((unsigned int) clock >= MAX_CLOCKS))
+ return vdso_fallback_gettime(clock, ts);
+
+ /*
+ * Convert the clockid to a bitmask and use it to check which
+ * clocks are handled in the VDSO directly.
+ */
+ msk = 1U << clock;
+ if (likely(msk & VGTOD_HRES)) {
+ if (do_hres(ts, clock) != VCLOCK_NONE)
+ return 0;
+ } else if (msk & VGTOD_COARSE) {
+ do_coarse(ts, clock);
+ return 0;
+ }
return vdso_fallback_gettime(clock, ts);
}
+
int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
if (likely(tv != NULL)) {
- if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+ struct timespec *ts = (struct timespec *) tv;
+
+ if (unlikely(do_hres(ts, CLOCK_REALTIME) == VCLOCK_NONE))
return vdso_fallback_gtod(tv, tz);
tv->tv_usec /= 1000;
}
@@ -318,7 +289,7 @@ int gettimeofday(struct timeval *, struc
notrace time_t __vdso_time(time_t *t)
{
/* This is atomic on x86 so we don't need any locks. */
- time_t result = READ_ONCE(gtod->wall_time_sec);
+ time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec);
if (t)
*t = result;
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper *
{
int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+ struct vgtod_ts *base;
+ u64 nsec;
/* Mark the new vclock used. */
BUILD_BUG_ON(VCLOCK_MAX >= 32);
@@ -45,34 +47,37 @@ void update_vsyscall(struct timekeeper *
vdata->mult = tk->tkr_mono.mult;
vdata->shift = tk->tkr_mono.shift;
- vdata->wall_time_sec = tk->xtime_sec;
- vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec;
-
- vdata->monotonic_time_sec = tk->xtime_sec
- + tk->wall_to_monotonic.tv_sec;
- vdata->monotonic_time_snsec = tk->tkr_mono.xtime_nsec
- + ((u64)tk->wall_to_monotonic.tv_nsec
- << tk->tkr_mono.shift);
- while (vdata->monotonic_time_snsec >=
- (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
- vdata->monotonic_time_snsec -=
- ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
- vdata->monotonic_time_sec++;
+ base = &vdata->basetime[CLOCK_REALTIME];
+ base->sec = (gtod_long_t)tk->xtime_sec;
+ base->nsec = tk->tkr_mono.xtime_nsec;
+
+ base = &vdata->basetime[VGTOD_TAI];
+ base->sec = (gtod_long_t)(tk->xtime_sec + (s64)tk->tai_offset);
+ base->nsec = tk->tkr_mono.xtime_nsec;
+
+ base = &vdata->basetime[CLOCK_MONOTONIC];
+ base->sec = (gtod_long_t)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec);
+ nsec = tk->tkr_mono.xtime_nsec;
+ nsec += ((u64)tk->wall_to_monotonic.tv_nsec << tk->tkr_mono.shift);
+ while (nsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
+ nsec -= ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
+ base->sec++;
}
+ base->nsec = nsec;
- vdata->wall_time_coarse_sec = tk->xtime_sec;
- vdata->wall_time_coarse_nsec = (long)(tk->tkr_mono.xtime_nsec >>
- tk->tkr_mono.shift);
-
- vdata->monotonic_time_coarse_sec =
- vdata->wall_time_coarse_sec + tk->wall_to_monotonic.tv_sec;
- vdata->monotonic_time_coarse_nsec =
- vdata->wall_time_coarse_nsec + tk->wall_to_monotonic.tv_nsec;
-
- while (vdata->monotonic_time_coarse_nsec >= NSEC_PER_SEC) {
- vdata->monotonic_time_coarse_nsec -= NSEC_PER_SEC;
- vdata->monotonic_time_coarse_sec++;
+ base = &vdata->basetime[CLOCK_REALTIME_COARSE];
+ base->sec = (gtod_long_t)tk->xtime_sec;
+ base->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+
+ base = &vdata->basetime[CLOCK_MONOTONIC_COARSE];
+ base->sec = (gtod_long_t)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec);
+ nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+ nsec += tk->wall_to_monotonic.tv_nsec;
+ while (nsec >= NSEC_PER_SEC) {
+ nsec -= NSEC_PER_SEC;
+ base->sec++;
}
+ base->nsec = nsec;
gtod_write_end(vdata);
}
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -5,33 +5,41 @@
#include <linux/compiler.h>
#include <linux/clocksource.h>
+#include <uapi/linux/time.h>
+
#ifdef BUILD_VDSO32_64
typedef u64 gtod_long_t;
#else
typedef unsigned long gtod_long_t;
#endif
+
+struct vgtod_ts {
+ gtod_long_t sec;
+ u64 nsec;
+};
+
+#define VGTOD_BASES (CLOCK_MONOTONIC_COARSE + 1)
+#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
+#define VGTOD_COARSE (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
+
+/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */
+#define VGTOD_HRES_MASK 0x3
+#define VGTOD_TAI (CLOCK_TAI & VGTOD_HRES_MASK)
+
/*
* vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
* so be carefull by modifying this structure.
*/
struct vsyscall_gtod_data {
- unsigned seq;
+ unsigned int seq;
+
+ int vclock_mode;
+ u64 cycle_last;
+ u64 mask;
+ u32 mult;
+ u32 shift;
- int vclock_mode;
- u64 cycle_last;
- u64 mask;
- u32 mult;
- u32 shift;
-
- /* open coded 'struct timespec' */
- u64 wall_time_snsec;
- gtod_long_t wall_time_sec;
- gtod_long_t monotonic_time_sec;
- u64 monotonic_time_snsec;
- gtod_long_t wall_time_coarse_sec;
- gtod_long_t wall_time_coarse_nsec;
- gtod_long_t monotonic_time_coarse_sec;
- gtod_long_t monotonic_time_coarse_nsec;
+ struct vgtod_ts basetime[VGTOD_BASES];
int tz_minuteswest;
int tz_dsttime;
@@ -44,9 +52,9 @@ static inline bool vclock_was_used(int v
return READ_ONCE(vclocks_used) & (1 << vclock);
}
-static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s)
+static inline unsigned int gtod_read_begin(const struct vsyscall_gtod_data *s)
{
- unsigned ret;
+ unsigned int ret;
repeat:
ret = READ_ONCE(s->seq);
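The "Abuse CLOCK_THREAD_CPUTIME_ID" trick above relies on the clock id values
from uapi/linux/time.h:

    CLOCK_REALTIME  =  0   ->   0 & VGTOD_HRES_MASK = 0   -> basetime[0]
    CLOCK_MONOTONIC =  1   ->   1 & VGTOD_HRES_MASK = 1   -> basetime[1]
    CLOCK_TAI       = 11   ->  11 & VGTOD_HRES_MASK = 3   -> basetime[3]

Slot 3 would belong to CLOCK_THREAD_CPUTIME_ID, which the vDSO never serves,
so TAI can reuse it and do_hres() can index basetime[clk & VGTOD_HRES_MASK]
without a TAI-specific branch.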
On Sun, 9 Sep 2018, Thomas Gleixner wrote:
> #ifdef BUILD_VDSO32_64
> typedef u64 gtod_long_t;
> #else
> typedef unsigned long gtod_long_t;
> #endif
> +
> +struct vgtod_ts {
> + gtod_long_t sec;
and actually this wants to become u64 unconditionally as we need to provide
the full seconds even on 32bit for the upcoming y2038 support. We still
have to truncate it for the current 32bit interface, but the core code
can be made ready now.
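In other words, the per-clock struct would become something like:

    struct vgtod_ts {
        u64 sec;    /* full 64-bit seconds, y2038-clean on 32-bit too */
        u64 nsec;
    };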
Thanks,
tglx
On 09/09/2018 10:05 PM, Thomas Gleixner wrote:
> See the patch below. It integrates TAI without slowing everything down, and
> it definitely does not result in indirect calls.
>
> On a HSW it slows down clock_gettime() by ~0.5ns. On a SKL I get a speedup of
> ~0.5ns. On an AMD Epyc server it's a 1.2ns speedup. So it somehow depends on
> the uarch, and I also observed compiler-version-dependent variations.
Does this mean glibc can keep using a single vDSO entrypoint, the one we
have today?
Thanks,
Florian
On Wed, 12 Sep 2018, Florian Weimer wrote:
> On 09/09/2018 10:05 PM, Thomas Gleixner wrote:
> > See the patch below. It integrates TAI without slowing everything down,
> > and it definitely does not result in indirect calls.
> >
> > On a HSW it slows down clock_gettime() by ~0.5ns. On a SKL I get a speedup
> > of ~0.5ns. On an AMD Epyc server it's a 1.2ns speedup. So it somehow
> > depends on the uarch, and I also observed compiler-version-dependent
> > variations.
>
> Does this mean glibc can keep using a single vDSO entrypoint, the one we
> have today?
We have no intention to change that.
But we surely could provide separate entry points as an extra to avoid a
bunch of conditionals.
Thanks,
tglx
On 09/12/2018 04:17 PM, Thomas Gleixner wrote:
> On Wed, 12 Sep 2018, Florian Weimer wrote:
>> On 09/09/2018 10:05 PM, Thomas Gleixner wrote:
>>> See the patch below. It integrates TAI without slowing everything down,
>>> and it definitely does not result in indirect calls.
>>>
>>> On a HSW it slows down clock_gettime() by ~0.5ns. On a SKL I get a speedup
>>> of ~0.5ns. On an AMD Epyc server it's a 1.2ns speedup. So it somehow
>>> depends on the uarch, and I also observed compiler-version-dependent
>>> variations.
>>
>> Does this mean glibc can keep using a single vDSO entrypoint, the one we
>> have today?
>
> We have no intention to change that.
Okay, I was wondering because Andy seemed to have proposed just that.
> But we surely could provide separate entry points as an extra to avoid a
> bunch of conditionals.
We could adjust to that, but the benefit would be long-term because it's
an ABI change for glibc, and they tend to take a long time to propagate.
But I must say that clock_gettime is an odd place to start. I would
have expected any of the type-polymorphic multiplexer interfaces (fcntl,
ioctl, ptrace, futex) to be a more natural starting point. 8-)
Thanks,
Florian
On Wed, 12 Sep 2018, Florian Weimer wrote:
> On 09/12/2018 04:17 PM, Thomas Gleixner wrote:
> > On Wed, 12 Sep 2018, Florian Weimer wrote:
> > > Does this mean glibc can keep using a single vDSO entrypoint, the one we
> > > have today?
> >
> > We have no intention to change that.
>
> Okay, I was wondering because Andy seemed to have proposed just that.
>
> > But we surely could provide separate entry points as an extra to avoid a
> > bunch of conditionals.
>
> We could adjust to that, but the benefit would be long-term because it's an
> ABI change for glibc, and they tend to take a long time to propagate.
>
> But I must say that clock_gettime is an odd place to start. I would have
> expected any of the type-polymorphic multiplexer interfaces (fcntl, ioctl,
> ptrace, futex) to be a more natural starting point. 8-)
Well, the starting point of this was to provide clock_tai support in the
vdso. clock_gettime() in the vdso vs. the real syscall is a factor of 10 in
speed. clock_gettime() is a pretty hot function in some workloads.
Andy then noticed that some conditionals could be avoided entirely by using
a different entry point and offered one along with a 10% speedup. We don't
have to go there, but we can.
The multiplexer interfaces need much more surgery and talking about futex,
we'd need to sit down with quite some people and identify the things they
actually care about before just splitting it up and keeping the existing
overloaded trainwreck the same.
Thanks,
tglx
> On Sep 12, 2018, at 7:29 AM, Thomas Gleixner <[email protected]> wrote:
>
>> On Wed, 12 Sep 2018, Florian Weimer wrote:
>>> On 09/12/2018 04:17 PM, Thomas Gleixner wrote:
>>>> On Wed, 12 Sep 2018, Florian Weimer wrote:
>>>> Does this mean glibc can keep using a single vDSO entrypoint, the one we
>>>> have today?
>>>
>>> We have no intention to change that.
>>
>> Okay, I was wondering because Andy seemed to have proposed just that.
>>
>>> But we surely could provide separate entry points as an extra to avoid a
>>> bunch of conditionals.
>>
>> We could adjust to that, but the benefit would be long-term because it's an
>> ABI change for glibc, and they tend to take a long time to propagate.
>>
>> But I must say that clock_gettime is an odd place to start. I would have
>> expected any of the type-polymorphic multiplexer interfaces (fcntl, ioctl,
>> ptrace, futex) to be a more natural starting point. 8-)
>
> Well, the starting point of this was to provide clock_tai support in the
> vdso. clock_gettime() in the vdso vs. the real syscall is a factor of 10 in
> speed. clock_gettime() is a pretty hot function in some workloads.
>
> Andy then noticed that some conditionals could be avoided entirely by using
> a different entry point and offered one along with a 10% speedup. We don't
> have to go there, but we can.
>
> The multiplexer interfaces need much more surgery and talking about futex,
> we'd need to sit down with quite some people and identify the things they
> actually care about before just splitting it up and keeping the existing
> overloaded trainwreck the same.
>
There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal.
On 09/12/2018 07:11 PM, Andy Lutomirski wrote:
>> The multiplexer interfaces need much more surgery and talking about futex,
>> we'd need to sit down with quite some people and identify the things they
>> actually care about before just splitting it up and keeping the existing
>> overloaded trainwreck the same.
>>
> There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal.
My concern is that the userspace system call wrappers currently do not
know how many arguments the individual operations take and what types
the arguments have (hence the “type-polymorphic” nature I mentioned).
This could be a problem for on-stack argument passing (where you might
read values beyond the end of the stack, and glibc avoids that most of
the time by having enough cruft on the stack), and for architectures
which pass pointers and integers in different registers (like some m68k
ABIs do for the return value).
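For illustration only (a generic sketch, not glibc's actual wrapper): a
type-polymorphic wrapper in the style of fcntl() has to fetch a trailing
argument whether or not the particular command takes one.

#include <stdarg.h>
#include <unistd.h>
#include <sys/syscall.h>

static int my_fcntl(int fd, int cmd, ...)
{
	va_list ap;
	void *arg;

	/* Always fetch a third argument, even for commands like F_GETFD
	 * that do not take one; in that case it may be stack garbage. */
	va_start(ap, cmd);
	arg = va_arg(ap, void *);
	va_end(ap);

	return syscall(SYS_fcntl, fd, cmd, arg);
}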
Thanks,
Florian
> On Sep 13, 2018, at 1:07 AM, Florian Weimer <[email protected]> wrote:
>
> On 09/12/2018 07:11 PM, Andy Lutomirski wrote:
>>> The multiplexer interfaces need much more surgery and talking about futex,
>>> we'd need to sit down with quite some people and identify the things they
>>> actually care about before just splitting it up and keeping the existing
>>> overloaded trainwreck the same.
>>>
>> There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal.
>
> My concern is that the userspace system call wrappers currently do not know how many arguments the individual operations take and what types the arguments have (hence the “type-polymorphic” nature I mentioned). This could be a problem for on-stack argument passing (where you might read values beyond the end of the stack, and glibc avoids that most of the time by having enough cruft on the stack), and for architectures which pass pointers and integers in different registers (like some m68k ABIs do for the return value).
>
>
Isn’t clock_gettime already special because of the vDSO entry point, though?
On 09/13/2018 05:22 PM, Andy Lutomirski wrote:
>
>
>> On Sep 13, 2018, at 1:07 AM, Florian Weimer <[email protected]> wrote:
>>
>> On 09/12/2018 07:11 PM, Andy Lutomirski wrote:
>>>> The multiplexer interfaces need much more surgery and talking about futex,
>>>> we'd need to sit down with quite some people and identify the things they
>>>> actually care about before just splitting it up and keeping the existing
>>>> overloaded trainwreck the same.
>>>>
>>> There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal.
>>
>> My concern is that the userspace system call wrappers currently do not know how many arguments the individual operations take and what types the arguments have (hence the “type-polymorphic” nature I mentioned). This could be a problem for on-stack argument passing (where you might read values beyond the end of the stack, and glibc avoids that most of the time by having enough cruft on the stack), and for architectures which pass pointers and integers in different registers (like some m68k ABIs do for the return value).
> Isn’t clock_gettime already special because of the vDSO entry point, though?
Somewhat special, yes, but not overly so, and not in the
type-polymorphic sense. We can't give applications direct access to the
vDSO implementation because the kernel does not know about the userspace
errno variable. We do that for time() on x86_64, where applications call
into the vDSO directly, bypassing glibc completely after binding.
I suspect most Linux libcs that know about the vDSO at all have generic
vsyscall support, just like they have generic support for plain system
calls.
Thanks,
Florian
> On Sep 13, 2018, at 12:07 PM, Florian Weimer <[email protected]> wrote:
>
> On 09/13/2018 05:22 PM, Andy Lutomirski wrote:
>>> On Sep 13, 2018, at 1:07 AM, Florian Weimer <[email protected]> wrote:
>>>
>>> On 09/12/2018 07:11 PM, Andy Lutomirski wrote:
>>>>> The multiplexer interfaces need much more surgery and talking about futex,
>>>>> we'd need to sit down with quite some people and identify the things they
>>>>> actually care about before just splitting it up and keeping the existing
>>>>> overloaded trainwreck the same.
>>>>>
>>>> There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal.
>>>
>>> My concern is that the userspace system call wrappers currently do not know how many arguments the individual operations take and what types the arguments have (hence the “type-polymorphic” nature I mentioned). This could be a problem for on-stack argument passing (where you might read values beyond the end of the stack, and glibc avoids that most of the time by having enough cruft on the stack), and for architectures which pass pointers and integers in different registers (like some m68k ABIs do for the return value).
>
>> Isn’t clock_gettime already special because of the vDSO entry point, though?
>
> Somewhat special, yes, but not overly so, and not in the type-polymorphic sense. We can't give applications direct access to the vDSO implementation because the kernel does not know about the userspace errno variable. We do that for time() on x86_64, where applications call into the vDSO directly, bypassing glibc completely after binding.
If the vDSO adds special helpers for CLOCK_MONOTONIC and CLOCK_REALTIME, I think we can reasonably safely promise that they never fail. (seccomp can obviously break that promise if there’s no TSC, but I think that seccomp users who do that get to keep both pieces.)
On 09/13/2018 09:35 PM, Andy Lutomirski wrote:
>> Somewhat special, yes, but not overly so, and not in the type-polymorphic sense. We can't give applications direct access to the vDSO implementation because the kernel does not know about the userspace errno variable. We do that for time() on x86_64, where applications call into the vDSO directly, bypassing glibc completely after binding.
>
> If the vDSO adds special helpers for CLOCK_MONOTONIC and CLOCK_REALTIME, I think we can reasonably safely promise that they never fail. (seccomp can obviously break that promise if there’s no TSC, but I think that seccomp users who do that get to keep both pieces.)
I agree, I thought about the same thing. We already do not return
EFAULT for invalid pointers, for obvious reasons. And if the clock ID
is fixed, the EINVAL error is impossible.
That would shave off a few nanoseconds more if the calling convention is
identical to what glibc exposes to applications. If the vDSO is not
available or the symbol is missing, we can provide an implementation
based on the current clock_gettime in glibc.
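Roughly, something like the sketch below, where __vdso_clock_gettime_monotonic()
is the proposed (not yet existing) symbol and the lookup of that symbol at
startup is left out:

#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

typedef int (*vdso_mono_fn)(struct timespec *);

/* Set at startup if the vDSO symbol is found; stays NULL otherwise. */
static vdso_mono_fn vdso_mono;

int my_clock_gettime_monotonic(struct timespec *ts)
{
	if (vdso_mono)
		return vdso_mono(ts);	/* promised never to fail */
	/* Fallback with the same behaviour as today's clock_gettime(). */
	return syscall(SYS_clock_gettime, CLOCK_MONOTONIC, ts);
}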
Thanks,
Florian