2014-02-26 19:35:42

by Stefani Seibold

Subject: Final: Add 32 bit VDSO time function support

Hi,

I am still waiting for ACKs for the 32 bit VDSO time function support. What's
the next step? Is there a way to apply it to the Linux git tree or linux-next
in the near future?

- Stefani


2014-02-26 20:10:54

by Andy Lutomirski

Subject: Re: Final: Add 32 bit VDSO time function support

I'm planning on testing this, hopefully today.

--Andy

On Wed, Feb 26, 2014 at 11:34 AM, Stefani Seibold <[email protected]> wrote:
> Hi,
>
> I am still waiting for ACKs for the 32 bit VDSO time function support. What's
> the next step? Is there a way to apply it to the Linux git tree or linux-next
> in the near future?
>
> - Stefani
>
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-02-26 20:43:50

by Greg KH

Subject: Re: Final: Add 32 bit VDSO time function support

On Wed, Feb 26, 2014 at 08:34:58PM +0100, Stefani Seibold wrote:
> Hi,
>
> I am still waiting for ACKs for the 32 bit VDSO time function support. What's
> the next step? Is there a way to apply it to the Linux git tree or linux-next
> in the near future?

I thought this was already in the tip tree. Didn't the emails saying
this happened come by a few days ago? Or was it pulled from there for
some reason?

thanks,

greg k-h

2014-02-26 20:55:12

by Andy Lutomirski

Subject: Re: Final: Add 32 bit VDSO time function support

On Wed, Feb 26, 2014 at 12:45 PM, Greg KH <[email protected]> wrote:
> On Wed, Feb 26, 2014 at 08:34:58PM +0100, Stefani Seibold wrote:
>> Hi,
>>
>> I am still waiting for ACKs for the 32 bit VDSO time function support. What's
>> the next step? Is there a way to apply it to the Linux git tree or linux-next
>> in the near future?
>
> I thought this was already in the tip tree. Didn't the emails saying
> this happened come by a few days ago? Or was it pulled from there for
> some reason?

It is in tip. I think that hpa wants people to test it before it goes
to Linus, and I certainly want to convince myself that there's no
performance regression.

--Andy

>
> thanks,
>
> greg k-h



--
Andy Lutomirski
AMA Capital Management, LLC

2014-02-27 00:55:49

by Andy Lutomirski

Subject: Re: Final: Add 32 bit VDSO time function support

Um. This code doesn't work. I'll send a patch. I can't speak
towards how well it compiles in different configurations. Also,
vdso_fallback_gettime needs .cfi annotations, I think. I could
probably dredge the required incantations from somewhere, but someone
else may know how to do it.

Once I patch it to work, your 32-bit code is considerably faster than
the 64-bit case. It's enough faster that I suspect a bug. Dumping
the in-memory vdso shows some rather suspicious nops before the rdtsc
instruction. I suspect that you've forgotten to run the 32-bit vdso
through the alternatives code. This is a nasty bug: it will appear to
work, but you'll see non-monotonic times on some SMP systems.

In my configuration, with your patches, I get (64-bit):

CLOCK_REALTIME:
100000000 loops in 2.07105s = 20.71 nsec / loop
100000000 loops in 2.06874s = 20.69 nsec / loop
100000000 loops in 2.29415s = 22.94 nsec / loop
CLOCK_MONOTONIC:
100000000 loops in 2.06526s = 20.65 nsec / loop
100000000 loops in 2.10134s = 21.01 nsec / loop
100000000 loops in 2.10615s = 21.06 nsec / loop
CLOCK_REALTIME_COARSE:
100000000 loops in 0.37440s = 3.74 nsec / loop
[ 503.011756] perf samples too long (2550 > 2500), lowering
kernel.perf_event_max_sample_rate to 50000
100000000 loops in 0.37399s = 3.74 nsec / loop
100000000 loops in 0.38445s = 3.84 nsec / loop
CLOCK_MONOTONIC_COARSE:
100000000 loops in 0.40238s = 4.02 nsec / loop
100000000 loops in 0.40939s = 4.09 nsec / loop
100000000 loops in 0.41152s = 4.12 nsec / loop

Without the patches, I get:

CLOCK_REALTIME:
100000000 loops in 2.07348s = 20.73 nsec / loop
100000000 loops in 2.07346s = 20.73 nsec / loop
100000000 loops in 2.06922s = 20.69 nsec / loop
CLOCK_MONOTONIC:
100000000 loops in 1.98955s = 19.90 nsec / loop
100000000 loops in 1.98895s = 19.89 nsec / loop
100000000 loops in 1.98881s = 19.89 nsec / loop
CLOCK_REALTIME_COARSE:
100000000 loops in 0.37462s = 3.75 nsec / loop
100000000 loops in 0.37460s = 3.75 nsec / loop
100000000 loops in 0.37428s = 3.74 nsec / loop
CLOCK_MONOTONIC_COARSE:
100000000 loops in 0.40081s = 4.01 nsec / loop
100000000 loops in 0.39834s = 3.98 nsec / loop
[ 36.706696] perf samples too long (2565 > 2500), lowering
kernel.perf_event_max_sample_rate to 50000
100000000 loops in 0.39949s = 3.99 nsec / loop

This looks like a wash, except for CLOCK_MONOTONIC, which got a bit
slower. Once the bugs are fixed, I'll send a followup patch that
improves the timings to:

CLOCK_REALTIME:
100000000 loops in 2.08621s = 20.86 nsec / loop
100000000 loops in 2.07122s = 20.71 nsec / loop
100000000 loops in 2.07089s = 20.71 nsec / loop
CLOCK_MONOTONIC:
100000000 loops in 2.06831s = 20.68 nsec / loop
100000000 loops in 2.06862s = 20.69 nsec / loop
100000000 loops in 2.06195s = 20.62 nsec / loop
CLOCK_REALTIME_COARSE:
100000000 loops in 0.37274s = 3.73 nsec / loop
100000000 loops in 0.37247s = 3.72 nsec / loop
100000000 loops in 0.37234s = 3.72 nsec / loop
CLOCK_MONOTONIC_COARSE:
100000000 loops in 0.39944s = 3.99 nsec / loop
100000000 loops in 0.39940s = 3.99 nsec / loop
100000000 loops in 0.40054s = 4.01 nsec / loop

I'm not quite sure what causes the remaining loss.

Test code is here:

https://gitorious.org/linux-test-utils/linux-clock-tests

2014-02-27 01:02:26

by Andy Lutomirski

Subject: [PATCH 0/2] Improvements/fixes to 32-bit vdso timing

hpa, please either nuke x86/vdso or apply these. That branch is
certainly *not* ready for merging, but this gets it a little closer.

Andy Lutomirski (2):
x86: Mark __vdso entries as asmlinkage
x86: Inline the CLOCK_MONOTONIC vdso code

arch/x86/vdso/vclock_gettime.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

--
1.8.5.3

2014-02-27 01:02:32

by Andy Lutomirski

Subject: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

This makes no difference for 64-bit, but it's critical for 32-bit code:
these functions are called from outside the kernel, so they need to comply
with the ABI.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/vdso/vclock_gettime.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 828888e..260a422 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -25,9 +25,9 @@

#define gtod (&VVAR(vsyscall_gtod_data))

-extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
-extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);
-extern time_t __vdso_time(time_t *t);
+extern asmlinkage int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
+extern asmlinkage int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);
+extern asmlinkage time_t __vdso_time(time_t *t);

#ifdef CONFIG_HPET_TIMER
static inline u32 read_hpet_counter(const volatile void *addr)
@@ -304,7 +304,7 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
} while (unlikely(gtod_read_retry(gtod, seq)));
}

-notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace asmlinkage int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
switch (clock) {
case CLOCK_REALTIME:
@@ -332,7 +332,7 @@ fallback:
int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));

-notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
+notrace asmlinkage int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
if (likely(tv != NULL)) {
if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
@@ -353,7 +353,7 @@ int gettimeofday(struct timeval *, struct timezone *)
* This will break when the xtime seconds get inaccurate, but that is
* unlikely
*/
-notrace time_t __vdso_time(time_t *t)
+notrace asmlinkage time_t __vdso_time(time_t *t)
{
/* This is atomic on x86 so we don't need any locks. */
time_t result = ACCESS_ONCE(gtod->wall_time_sec);
--
1.8.5.3
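The `__attribute__((weak, alias(...)))` export pattern visible in the diff above can be demonstrated in isolation; the names below are hypothetical stand-ins, and this relies on GCC/Clang alias support on an ELF target:

```c
/* The plain name is a weak alias for the __vdso_-prefixed implementation,
 * mirroring how clock_gettime aliases __vdso_clock_gettime above.
 * Hypothetical names for illustration only. */
int __vdso_answer(int x)
{
	return x + 1;
}

int answer(int x) __attribute__((weak, alias("__vdso_answer")));
```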

2014-02-27 01:02:45

by Andy Lutomirski

Subject: [PATCH 2/2] x86: Inline the CLOCK_MONOTONIC vdso code

This saves a nanosecond or so on my box.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/vdso/vclock_gettime.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 260a422..9234c83 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -263,7 +263,7 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
return mode;
}

-notrace static int do_monotonic(struct timespec *ts)
+notrace static int __always_inline do_monotonic(struct timespec *ts)
{
unsigned long seq;
u64 ns;
--
1.8.5.3
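The `do_monotonic()` reader being inlined here is built around a seqcount retry loop; a simplified single-file sketch of that pattern follows (hypothetical names, C11 atomics; the real kernel code adds the SMP memory barriers this sketch omits):

```c
#include <stdatomic.h>

/* The writer makes seq odd while an update is in flight; readers retry
 * until they observe a stable, even seq around their reads. */
struct vdata {
	_Atomic unsigned int seq;	/* odd while an update is in flight */
	long sec;
	long nsec;
};

static void vdata_write(struct vdata *d, long sec, long nsec)
{
	atomic_fetch_add(&d->seq, 1);	/* begin update: seq becomes odd */
	d->sec = sec;
	d->nsec = nsec;
	atomic_fetch_add(&d->seq, 1);	/* end update: seq even again */
}

static void vdata_read(struct vdata *d, long *sec, long *nsec)
{
	unsigned int seq;

	do {
		seq = atomic_load(&d->seq);
		*sec = d->sec;
		*nsec = d->nsec;
	} while ((seq & 1) || atomic_load(&d->seq) != seq);
}
```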

2014-02-27 03:26:08

by H. Peter Anvin

Subject: Re: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

On 02/26/2014 05:02 PM, Andy Lutomirski wrote:
> This makes no difference for 64-bit, but it's critical for 32-bit code:
> these functions are called from outside the kernel, so they need to comply
> with the ABI.

Or at least with *an* ABI (the i386 syscall vdso uses the syscall
convention, not the i386 ABI convention, for example.) However, it
probably makes the most sense to just stick with the standard i386 ABI.

-hpa

2014-02-27 03:39:29

by Andi Kleen

Subject: Re: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

On Wed, Feb 26, 2014 at 05:02:13PM -0800, Andy Lutomirski wrote:
> This makes no difference for 64-bit, but it's critical for 32-bit code:
> these functions are called from outside the kernel, so they need to comply
> with the ABI.

That's an odd patch. If that was wrong, things couldn't have worked at all.
Probably hidden by inlining? If yes, just make it static.

Also, you would rather need notrace more often.

-Andi

2014-02-27 05:12:58

by H. Peter Anvin

Subject: Re: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

On 02/26/2014 07:39 PM, Andi Kleen wrote:
>
> Also you would rather need notrace more often.
>

Again, can be dealt with in CFLAGS, no?

-hpa

2014-02-27 05:14:10

by H. Peter Anvin

Subject: Re: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

On 02/26/2014 07:39 PM, Andi Kleen wrote:
> On Wed, Feb 26, 2014 at 05:02:13PM -0800, Andy Lutomirski wrote:
>> This makes no difference for 64-bit, but it's critical for 32-bit code:
>> these functions are called from outside the kernel, so they need to comply
>> with the ABI.
>
> That's an odd patch. If that was wrong things couldn't have worked at all.
> Probably hidden by inlining? If yes just make it static
>
> Also you would rather need notrace more often.
>

It has to support *an* ABI... the syscall vdso entry point uses the old
int $0x80 calling convention rather than the normal ABI. It would
depend on the test program and eventual glibc implementation. And sure
enough, the test program has:

int (*vdso_gettimeofday)(struct timeval *tv, struct timezone *tz)
__attribute__ ((regparm (3)));
int (*vdso_clock_gettime)(clockid_t clk_id, struct timespec *tp)
__attribute__ ((regparm (3)));
time_t (*vdso_time)(time_t *t) __attribute__ ((regparm (3)));

That being said, since this code is compiled separately, the compiler
flags there determine what actually matters. However, there we have:

KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=3 -freg-struct-return -fpic

The normal ABI almost certainly makes more sense; as such -mregparm=3 is
probably not what we want, and I suspect it makes more sense to just
drop that from the CFLAGS line?

-hpa
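hpa's point that the calling convention is part of the contract between the vDSO and its callers can be sketched portably; `VDSO_ABI` and the function names are hypothetical, and since the `regparm` attribute only exists on x86-32 it is compiled out elsewhere:

```c
/* On x86-32, regparm(3) is part of the function type: a function pointer
 * and its target must agree on it, just as the vDSO and its callers must.
 * Hypothetical sketch; names are for illustration only. */
#ifdef __i386__
#define VDSO_ABI __attribute__((regparm(3)))
#else
#define VDSO_ABI
#endif

static int VDSO_ABI add3(int a, int b, int c)
{
	return a + b + c;
}

typedef int (*add3_fn)(int, int, int) VDSO_ABI;

static int call_via_ptr(add3_fn fn)
{
	/* The indirect call uses whatever convention the type declares. */
	return fn(1, 2, 3);
}
```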

2014-02-27 05:17:15

by H. Peter Anvin

Subject: Re: Final: Add 32 bit VDSO time function support

On 02/26/2014 12:54 PM, Andy Lutomirski wrote:
> On Wed, Feb 26, 2014 at 12:45 PM, Greg KH <[email protected]> wrote:
>> On Wed, Feb 26, 2014 at 08:34:58PM +0100, Stefani Seibold wrote:
>>> Hi,
>>>
>>> I am still waiting for ACKs for the 32 bit VDSO time function support. What's
>>> the next step? Is there a way to apply it to the Linux git tree or linux-next
>>> in the near future?
>>
>> I thought this was already in the tip tree. Didn't the emails saying
>> this happened come by a few days ago? Or was it pulled from there for
>> some reason?
>
> It is in tip. I think that hpa wants people to test it before it goes
> to Linus, and I certainly want to convince myself that there's no
> performance regression.
>

Absolutely. Your help is *greatly* appreciated.

-hpa

2014-02-27 05:19:37

by Andy Lutomirski

Subject: Re: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

On Wed, Feb 26, 2014 at 10:06 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/26/2014 07:39 PM, Andi Kleen wrote:
>> On Wed, Feb 26, 2014 at 05:02:13PM -0800, Andy Lutomirski wrote:
>>> This makes no difference for 64-bit, but it's critical for 32-bit code:
>>> these functions are called from outside the kernel, so they need to comply
>>> with the ABI.
>>
>> That's an odd patch. If that was wrong things couldn't have worked at all.
>> Probably hidden by inlining? If yes just make it static
>>
>> Also you would rather need notrace more often.
>>
>
> It has to support *an* ABI... the syscall vdso entry point uses the old
> int $0x80 calling convention rather than the normal ABI. It would
> depend on the test program and eventual glibc implementation. And sure
> enough, the test program has:
>
> int (*vdso_gettimeofday)(struct timeval *tv, struct timezone *tz)
> __attribute__ ((regparm (3)));
> int (*vdso_clock_gettime)(clockid_t clk_id, struct timespec *tp)
> __attribute__ ((regparm (3)));
> time_t (*vdso_time)(time_t *t) __attribute__ ((regparm (3)));
>
> That being said, since this code is compiled separately, the compiler
> flags there determine what actually matters. However, there we have:
>
> KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=3 -freg-struct-return -fpic
>
> The normal ABI almost certainly makes more sense; as such -mregparm=3 is
> probably not what we want, and I suspect it makes more sense to just
> drop that from the CFLAGS line?

Hmm. What happens on a native 32-bit build? IIRC the whole kernel is
built with regparm(3).

If we want to save a cycle or two, then regparm(3) is probably faster.
But I think that these functions should either be asmlinkage or (on
32 bit builds) explicitly regparm(3) to avoid confusion.

--Andy

2014-02-27 05:31:14

by H. Peter Anvin

Subject: Re: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

On 02/26/2014 09:19 PM, Andy Lutomirski wrote:
>>
>> The normal ABI almost certainly makes more sense; as such -mregparm=3 is
>> probably not what we want, and I suspect it makes more sense to just
>> drop that from the CFLAGS line?
>
> Hmm. What happens on a native 32-bit build? IIRC the whole kernel is
> built with regparm(3).
>

Well, the vdso is still built separately, so we can use different CFLAGS
if we want to.

> If we want to save a cycle or two, then regparm(3) is probably faster.
> But I think that these functions should either be asmlinkage or (on
> 32 bit builds) explicitly regparm(3) to avoid confusion.

I suggest using the standard ABI, but I suggest doing it via CFLAGS.

It isn't any faster if the C library has to provide a wrapper just to
marshal parameters.

-hpa

2014-02-27 12:14:51

by Ingo Molnar

Subject: Re: Final: Add 32 bit VDSO time function support


* Andy Lutomirski <[email protected]> wrote:

> On Wed, Feb 26, 2014 at 12:45 PM, Greg KH <[email protected]> wrote:
> > On Wed, Feb 26, 2014 at 08:34:58PM +0100, Stefani Seibold wrote:
> >> Hi,
> >>
> >> I am still waiting for ACKs for the 32 bit VDSO time function support. What's
> >> the next step? Is there a way to apply it to the Linux git tree or linux-next
> >> in the near future?
> >
> > I thought this was already in the tip tree. Didn't the emails saying
> > this happened come by a few days ago? Or was it pulled from there for
> > some reason?
>
> It is in tip. [...]

It's excluded from tip:master right now due to testing failures. So
it's neither in tip:master nor propagated towards linux-next.

Thanks,

Ingo

2014-02-27 20:12:15

by Andy Lutomirski

Subject: Re: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

On Wed, Feb 26, 2014 at 9:22 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/26/2014 09:19 PM, Andy Lutomirski wrote:
>>>
>>> The normal ABI almost certainly makes more sense; as such -mregparm=3 is
>>> probably not what we want, and I suspect it makes more sense to just
>>> drop that from the CFLAGS line?
>>
>> Hmm. What happens on a native 32-bit build? IIRC the whole kernel is
>> built with regparm(3).
>>
>
> Well, the vdso is still built separately, so we can use different CFLAGS
> if we want to.
>
>> If we want to save a cycle or two, then regparm(3) is probably faster.
>> But I think that these functions should either be asmlinkage or (on
>> 32 bit builds) explicitly regparm(3) to avoid confusion.
>
> I suggest using the standard ABI, but I suggest doing it via CFLAGS.

Hmm. This sort of goes against existing x86_32 practice where,
AFAICT, things that need a particular calling convention specify
asmlinkage and everything else uses regparm(3) if config/kbuild thinks
it's appropriate.

But I'm happy to resubmit the patch if you prefer the CFLAGS approach
for the 32-bit vdso. I don't think anything will break, since I don't
think that the 32-bit vdso has any other exported C code.

>
> It isn't any faster if the C library has to provide a wrapper just to
> marshal parameters.

Probably true, given that the glibc wrapper could, in principle, use
an optimized tail call. Also, I see no reason why vdso functions,
alone of all userspace code, should be special.

--Andy

2014-02-27 23:13:22

by H. Peter Anvin

Subject: Re: [PATCH 1/2] x86: Mark __vdso entries as asmlinkage

On 02/27/2014 12:11 PM, Andy Lutomirski wrote:
>
> Hmm. This sort of goes against existing x86_32 practice where,
> AFAICT, things that need a particular calling convention specify
> asmlinkage and everything else uses regparm(3) if config/kbuild thinks
> it's appropriate.
>

That is not really true for things that aren't part of the kernel image
proper (e.g. the real mode code and so on.) This is a very special case.

> But I'm happy to resubmit the patch if you prefer the CFLAGS approach
> for the 32-bit vdso. I don't think anything will break, since I don't
> think that the 32-bit vdso has any other exported C code.

I think it is the better way to go.

>> It isn't any faster if the C library has to provide a wrapper just to
>> marshal parameters.
>
> Probably true, given that the glibc wrapper could, in principle, use
> an optimized tail call. Also, I see no reason why vdso functions,
> alone of all userspace code, should be special.

Yes, let's stick to the standard ABI. The syscall entry point was very
special.

-hpa

2014-02-28 00:18:28

by Andy Lutomirski

Subject: [PATCH v2 0/4] vDSO fixes, on top of tip/x86/vdso

I'd still like someone else to confirm that the 32-bit vDSO is working
on all common configurations before I'm happy with it, but this should
be a considerable improvement.

Someone who maintains the clock code should review the changes in:

commit 0fc8a237cbe98a06962f5ea37d24fc2369e23c74
Author: Stefani Seibold <[email protected]>
Date: Wed Feb 19 10:09:10 2014 +0100

x86, vdso: Add 32-bit VDSO time support for the 64-bit kernel

very carefully.

There's still something extremely questionable about VDSO_PAGES in the
32-bit vDSO code. It appears to be terminally screwed up, and AFAICS it
only works at all because none of the 32-bit vDSO images ever exceed
4096 bytes.
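The page-count arithmetic behind that observation, mirroring the `npages` computation in vma.c: an image length is rounded up to whole pages, so any image of at most 4096 bytes needs exactly one page.

```c
/* Round an image length up to whole pages, as the npages computation in
 * vma.c does. Images of at most page_size bytes fit in a single page,
 * which is why the current 32-bit vDSO images squeak by. */
static unsigned long pages_needed(unsigned long len, unsigned long page_size)
{
	return (len + page_size - 1) / page_size;
}
```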

Note: Patch 4 fixes a bug that's present even in -linus, so it might
make sense to send it to Linus more quickly than the rest of this
series.

Changes from v1:
- Adjust CFLAGS instead of using asmlinkage.
- Add more fixes.

Andy Lutomirski (4):
x86: Use the default ABI for the 32-bit vDSO
x86: Inline the CLOCK_MONOTONIC vdso code
x86: Patch alternatives in the 32-bit vDSO
x86: Zero-pad the VVAR page

arch/x86/include/asm/vdso.h | 2 ++
arch/x86/kernel/vmlinux.lds.S | 5 +++++
arch/x86/vdso/Makefile | 2 +-
arch/x86/vdso/vclock_gettime.c | 2 +-
arch/x86/vdso/vdso32-setup.c | 25 +++++++++++++------------
arch/x86/vdso/vma.c | 9 ++++++---
6 files changed, 28 insertions(+), 17 deletions(-)

--
1.8.5.3

2014-02-28 00:18:34

by Andy Lutomirski

Subject: [PATCH v2 1/4] x86: Use the default ABI for the 32-bit vDSO

There's no reason for the vDSO to use a special function call ABI. Use
the platform defaults.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/vdso/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
index 92daaa6..80584f5 100644
--- a/arch/x86/vdso/Makefile
+++ b/arch/x86/vdso/Makefile
@@ -149,7 +149,7 @@ KBUILD_CFLAGS_32 := $(filter-out -m64,$(KBUILD_CFLAGS))
KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
-KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=3 -freg-struct-return -fpic
+KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=0 -fpic
$(vdso32-images:%=$(obj)/%.dbg): KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)

$(vdso32-images:%=$(obj)/%.dbg): $(obj)/vdso32-%.so.dbg: FORCE \
--
1.8.5.3

2014-02-28 00:18:38

by Andy Lutomirski

Subject: [PATCH v2 3/4] x86: Patch alternatives in the 32-bit vDSO

rdtsc_barrier() needs this.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/vdso.h | 2 ++
arch/x86/vdso/vdso32-setup.c | 25 +++++++++++++------------
arch/x86/vdso/vma.c | 9 ++++++---
3 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 6db8b23..a844f90 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -40,4 +40,6 @@ extern const char vdso32_int80_start, vdso32_int80_end;
extern const char vdso32_syscall_start, vdso32_syscall_end;
extern const char vdso32_sysenter_start, vdso32_sysenter_end;

+void __init patch_vdso32(void *vdso, size_t len);
+
#endif /* _ASM_X86_VDSO_H */
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 6b74a46..b37aa1d 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -273,29 +273,30 @@ static void map_compat_vdso(int map)

int __init sysenter_setup(void)
{
- void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
- const void *vsyscall;
- size_t vsyscall_len;
+ void *vdso_page = (void *)get_zeroed_page(GFP_ATOMIC);
+ const void *vdso;
+ size_t vdso_len;

- vdso32_pages[0] = virt_to_page(syscall_page);
+ vdso32_pages[0] = virt_to_page(vdso_page);

#ifdef CONFIG_X86_32
gate_vma_init();
#endif

if (vdso32_syscall()) {
- vsyscall = &vdso32_syscall_start;
- vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
+ vdso = &vdso32_syscall_start;
+ vdso_len = &vdso32_syscall_end - &vdso32_syscall_start;
} else if (vdso32_sysenter()){
- vsyscall = &vdso32_sysenter_start;
- vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
+ vdso = &vdso32_sysenter_start;
+ vdso_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
} else {
- vsyscall = &vdso32_int80_start;
- vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
+ vdso = &vdso32_int80_start;
+ vdso_len = &vdso32_int80_end - &vdso32_int80_start;
}

- memcpy(syscall_page, vsyscall, vsyscall_len);
- relocate_vdso(syscall_page);
+ memcpy(vdso_page, vdso, vdso_len);
+ patch_vdso32(vdso_page, vdso_len);
+ relocate_vdso(vdso_page);

return 0;
}
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 431e875..149caa1 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -28,8 +28,11 @@ static unsigned vdso_size;
extern char vdsox32_start[], vdsox32_end[];
extern struct page *vdsox32_pages[];
static unsigned vdsox32_size;
+#endif

-static void __init patch_vdsox32(void *vdso, size_t len)
+#if defined(CONFIG_X86_32) || defined(CONFIG_X86_X32_ABI) || \
+ defined(CONFIG_COMPAT)
+void __init patch_vdso32(void *vdso, size_t len)
{
Elf32_Ehdr *hdr = vdso;
Elf32_Shdr *sechdrs, *alt_sec = 0;
@@ -52,7 +55,7 @@ static void __init patch_vdsox32(void *vdso, size_t len)
}

/* If we get here, it's probably a bug. */
- pr_warning("patch_vdsox32: .altinstructions not found\n");
+ pr_warning("patch_vdso32: .altinstructions not found\n");
return; /* nothing to patch */

found:
@@ -104,7 +107,7 @@ static int __init init_vdso(void)
vdso_pages[i] = virt_to_page(vdso_start + i*PAGE_SIZE);

#ifdef CONFIG_X86_X32_ABI
- patch_vdsox32(vdsox32_start, vdsox32_end - vdsox32_start);
+ patch_vdso32(vdsox32_start, vdsox32_end - vdsox32_start);
npages = (vdsox32_end - vdsox32_start + PAGE_SIZE - 1) / PAGE_SIZE;
vdsox32_size = npages << PAGE_SHIFT;
for (i = 0; i < npages; i++)
--
1.8.5.3

2014-02-28 00:18:45

by Andy Lutomirski

Subject: [PATCH v2 4/4] x86: Zero-pad the VVAR page

By coincidence, the VVAR page is at the end of an ELF segment. As a
result, if it ends up being a partial page, the kernel loader will
leave garbage behind at the end of the vvar page. Zero-pad it to a
full page to fix this issue.

This has probably been broken since the VVAR page was introduced.
On QEMU, if you dump the run-time contents of the VVAR page, you can
find entertaining strings from seabios left behind.

It's remotely possible that this is a security bug -- conceivably
there's some BIOS out there that leaves something sensitive in the
few K of memory that is exposed to userspace.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/vmlinux.lds.S | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 1d4897b..49edf2d 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -164,6 +164,11 @@ SECTIONS
#undef __VVAR_KERNEL_LDS
#undef EMIT_VVAR

+ /*
+ * Pad the rest of the page with zeros. Otherwise the loader
+ * can leave garbage here.
+ */
+ . = __vvar_beginning_hack + PAGE_SIZE;
} :data

. = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
--
1.8.5.3
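What the linker-script padding in this patch guarantees, in miniature: after the payload, the rest of the page is zeroed so no stale loader or BIOS data survives; `fill_page` is a hypothetical user-space analogue:

```c
#include <string.h>

/* Hypothetical analogue of the zero-padding above: place `len` payload
 * bytes at the start of a page and zero the remainder, so nothing stale
 * (loader leftovers, BIOS strings) is exposed past the end. */
static void fill_page(unsigned char *page, size_t page_size,
		      const unsigned char *payload, size_t len)
{
	memcpy(page, payload, len);
	memset(page + len, 0, page_size - len);
}
```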

2014-02-28 00:19:18

by Andy Lutomirski

Subject: [PATCH v2 2/4] x86: Inline the CLOCK_MONOTONIC vdso code

This saves a nanosecond or so on my box.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/vdso/vclock_gettime.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 828888e..16d6861 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -263,7 +263,7 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
return mode;
}

-notrace static int do_monotonic(struct timespec *ts)
+notrace static int __always_inline do_monotonic(struct timespec *ts)
{
unsigned long seq;
u64 ns;
--
1.8.5.3

2014-02-28 07:22:52

by Stefani Seibold

Subject: Re: [PATCH v2 3/4] x86: Patch alternatives in the 32-bit vDSO

On Thursday, 2014-02-27 at 16:18 -0800, Andy Lutomirski wrote:
> rdtsc_barrier() needs this.
>

Thanks for doing this.

> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/vdso.h | 2 ++
> arch/x86/vdso/vdso32-setup.c | 25 +++++++++++++------------
> arch/x86/vdso/vma.c | 9 ++++++---
> 3 files changed, 21 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> index 6db8b23..a844f90 100644
> --- a/arch/x86/include/asm/vdso.h
> +++ b/arch/x86/include/asm/vdso.h
> @@ -40,4 +40,6 @@ extern const char vdso32_int80_start, vdso32_int80_end;
> extern const char vdso32_syscall_start, vdso32_syscall_end;
> extern const char vdso32_sysenter_start, vdso32_sysenter_end;
>
> +void __init patch_vdso32(void *vdso, size_t len);
> +
> #endif /* _ASM_X86_VDSO_H */
> diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
> index 6b74a46..b37aa1d 100644
> --- a/arch/x86/vdso/vdso32-setup.c
> +++ b/arch/x86/vdso/vdso32-setup.c
> @@ -273,29 +273,30 @@ static void map_compat_vdso(int map)
>
> int __init sysenter_setup(void)
> {
> - void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
> - const void *vsyscall;
> - size_t vsyscall_len;
> + void *vdso_page = (void *)get_zeroed_page(GFP_ATOMIC);
> + const void *vdso;
> + size_t vdso_len;
>
> - vdso32_pages[0] = virt_to_page(syscall_page);
> + vdso32_pages[0] = virt_to_page(vdso_page);
>
> #ifdef CONFIG_X86_32
> gate_vma_init();
> #endif
>
> if (vdso32_syscall()) {
> - vsyscall = &vdso32_syscall_start;
> - vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
> + vdso = &vdso32_syscall_start;
> + vdso_len = &vdso32_syscall_end - &vdso32_syscall_start;
> } else if (vdso32_sysenter()){
> - vsyscall = &vdso32_sysenter_start;
> - vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
> + vdso = &vdso32_sysenter_start;
> + vdso_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
> } else {
> - vsyscall = &vdso32_int80_start;
> - vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
> + vdso = &vdso32_int80_start;
> + vdso_len = &vdso32_int80_end - &vdso32_int80_start;
> }
>
> - memcpy(syscall_page, vsyscall, vsyscall_len);
> - relocate_vdso(syscall_page);
> + memcpy(vdso_page, vdso, vdso_len);
> + patch_vdso32(vdso_page, vdso_len);
> + relocate_vdso(vdso_page);
>
> return 0;
> }
> diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
> index 431e875..149caa1 100644
> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -28,8 +28,11 @@ static unsigned vdso_size;
> extern char vdsox32_start[], vdsox32_end[];
> extern struct page *vdsox32_pages[];
> static unsigned vdsox32_size;
> +#endif
>
> -static void __init patch_vdsox32(void *vdso, size_t len)
> +#if defined(CONFIG_X86_32) || defined(CONFIG_X86_X32_ABI) || \
> + defined(CONFIG_COMPAT)
> +void __init patch_vdso32(void *vdso, size_t len)
> {
> Elf32_Ehdr *hdr = vdso;
> Elf32_Shdr *sechdrs, *alt_sec = 0;
> @@ -52,7 +55,7 @@ static void __init patch_vdsox32(void *vdso, size_t len)
> }
>
> /* If we get here, it's probably a bug. */
> - pr_warning("patch_vdsox32: .altinstructions not found\n");
> + pr_warning("patch_vdso32: .altinstructions not found\n");
> return; /* nothing to patch */
>
> found:
> @@ -104,7 +107,7 @@ static int __init init_vdso(void)
> vdso_pages[i] = virt_to_page(vdso_start + i*PAGE_SIZE);
>
> #ifdef CONFIG_X86_X32_ABI
> - patch_vdsox32(vdsox32_start, vdsox32_end - vdsox32_start);
> + patch_vdso32(vdsox32_start, vdsox32_end - vdsox32_start);
> npages = (vdsox32_end - vdsox32_start + PAGE_SIZE - 1) / PAGE_SIZE;
> vdsox32_size = npages << PAGE_SHIFT;
> for (i = 0; i < npages; i++)

2014-02-28 07:22:50

by Stefani Seibold

Subject: Re: Final: Add 32 bit VDSO time function support

On Wednesday, 2014-02-26 at 16:55 -0800, Andy Lutomirski wrote:
> Um. This code doesn't work. I'll send a patch. I can't speak
> towards how well it compiles in different configurations. Also,
> vdso_fallback_gettime needs .cfi annotations, I think. I could
> probably dredge the required incantations from somewhere, but someone
> else may know how to do it.
>
> Once I patch it to work, your 32-bit code is considerably faster than
> the 64-bit case. It's enough faster that I suspect a bug. Dumping
> the in-memory vdso shows some rather suspicious nops before the rdtsc
> instruction. I suspect that you've forgotten to run the 32-bit vdso
> through the alternatives code. This is a nasty bug: it will appear to
> work, but you'll see non-monotonic times on some SMP systems.
>

I didn't know this. My basic test case is a KVM guest, which defaults
to 1 CPU. Thanks for discovering the issue.

> In my configuration, with your patches, I get (64-bit):
>
> CLOCK_REALTIME:
> 100000000 loops in 2.07105s = 20.71 nsec / loop
> 100000000 loops in 2.06874s = 20.69 nsec / loop
> 100000000 loops in 2.29415s = 22.94 nsec / loop
> CLOCK_MONOTONIC:
> 100000000 loops in 2.06526s = 20.65 nsec / loop
> 100000000 loops in 2.10134s = 21.01 nsec / loop
> 100000000 loops in 2.10615s = 21.06 nsec / loop
> CLOCK_REALTIME_COARSE:
> 100000000 loops in 0.37440s = 3.74 nsec / loop
> [ 503.011756] perf samples too long (2550 > 2500), lowering
> kernel.perf_event_max_sample_rate to 50000
> 100000000 loops in 0.37399s = 3.74 nsec / loop
> 100000000 loops in 0.38445s = 3.84 nsec / loop
> CLOCK_MONOTONIC_COARSE:
> 100000000 loops in 0.40238s = 4.02 nsec / loop
> 100000000 loops in 0.40939s = 4.09 nsec / loop
> 100000000 loops in 0.41152s = 4.12 nsec / loop
>
> Without the patches, I get:
>
> CLOCK_REALTIME:
> 100000000 loops in 2.07348s = 20.73 nsec / loop
> 100000000 loops in 2.07346s = 20.73 nsec / loop
> 100000000 loops in 2.06922s = 20.69 nsec / loop
> CLOCK_MONOTONIC:
> 100000000 loops in 1.98955s = 19.90 nsec / loop
> 100000000 loops in 1.98895s = 19.89 nsec / loop
> 100000000 loops in 1.98881s = 19.89 nsec / loop
> CLOCK_REALTIME_COARSE:
> 100000000 loops in 0.37462s = 3.75 nsec / loop
> 100000000 loops in 0.37460s = 3.75 nsec / loop
> 100000000 loops in 0.37428s = 3.74 nsec / loop
> CLOCK_MONOTONIC_COARSE:
> 100000000 loops in 0.40081s = 4.01 nsec / loop
> 100000000 loops in 0.39834s = 3.98 nsec / loop
> [ 36.706696] perf samples too long (2565 > 2500), lowering
> kernel.perf_event_max_sample_rate to 50000
> 100000000 loops in 0.39949s = 3.99 nsec / loop
>
> This looks like a wash, except for CLOCK_MONOTONIC, which got a bit
> slower. I'll send a followup patch once the bugs are fixed that
> improves the timings to:
>
> CLOCK_REALTIME:
> 100000000 loops in 2.08621s = 20.86 nsec / loop
> 100000000 loops in 2.07122s = 20.71 nsec / loop
> 100000000 loops in 2.07089s = 20.71 nsec / loop
> CLOCK_MONOTONIC:
> 100000000 loops in 2.06831s = 20.68 nsec / loop
> 100000000 loops in 2.06862s = 20.69 nsec / loop
> 100000000 loops in 2.06195s = 20.62 nsec / loop
> CLOCK_REALTIME_COARSE:
> 100000000 loops in 0.37274s = 3.73 nsec / loop
> 100000000 loops in 0.37247s = 3.72 nsec / loop
> 100000000 loops in 0.37234s = 3.72 nsec / loop
> CLOCK_MONOTONIC_COARSE:
> 100000000 loops in 0.39944s = 3.99 nsec / loop
> 100000000 loops in 0.39940s = 3.99 nsec / loop
> 100000000 loops in 0.40054s = 4.01 nsec / loop
>
> I'm not quite sure what causes the remaining loss.
>
> Test code is here:
>
> https://gitorious.org/linux-test-utils/linux-clock-tests

2014-02-28 07:28:50

by Stefani Seibold

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] x86: Use the default ABI for the 32-bit vDSO

Am Donnerstag, den 27.02.2014, 16:18 -0800 schrieb Andy Lutomirski:
> There's no reason for the vDSO to use a special function call ABI. Use
> the platform defaults.
>

The only reason was performance. What is good for the kernel should
also be good for the VDSO. Now all functions inside the VDSO will pass
their arguments on the stack.

> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/vdso/Makefile | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
> index 92daaa6..80584f5 100644
> --- a/arch/x86/vdso/Makefile
> +++ b/arch/x86/vdso/Makefile
> @@ -149,7 +149,7 @@ KBUILD_CFLAGS_32 := $(filter-out -m64,$(KBUILD_CFLAGS))
> KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
> KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
> KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
> -KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=3 -freg-struct-return -fpic
> +KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=0 -fpic
> $(vdso32-images:%=$(obj)/%.dbg): KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
>
> $(vdso32-images:%=$(obj)/%.dbg): $(obj)/vdso32-%.so.dbg: FORCE \

2014-02-28 07:33:45

by Stefani Seibold

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] vDSO fixes, on top of tip/x86/vdso

Am Donnerstag, den 27.02.2014, 16:18 -0800 schrieb Andy Lutomirski:
> I'd still like someone else to confirm that the 32-bit vDSO is working
> on all common configurations before I'm happy with it, but this should
> be a considerable improvement.
>
> Someone who maintains the clock code should review the changes in:
>
> commit 0fc8a237cbe98a06962f5ea37d24fc2369e23c74
> Author: Stefani Seibold <[email protected]>
> Date: Wed Feb 19 10:09:10 2014 +0100
>
> x86, vdso: Add 32-bit VDSO time support for the 64-bit kernel
>
> very carefully.
>
> There's still something extremely questionable about VDSO_PAGES in the
> 32-bit vDSO code. It appears to be terminally screwed up, and AFAICS it
> only works at all because none of the 32-bit vDSO images ever exceed
> 4096 bytes.
>

The 4096 bytes limit was not introduced by the patch set. It is still
there for compatibility reasons.

> Note: Patch 4 fixes a bug that's present even in -linus, so it might
> make sense to send it to Linus more quickly than the rest of this
> series.
>
> Changes from v1:
> - Adjust CFLAGS instead of using asmlinkage.
> - Add more fixes.
>
> Andy Lutomirski (4):
> x86: Use the default ABI for the 32-bit vDSO
> x86: Inline the CLOCK_MONOTONIC vdso code
> x86: Patch alternatives in the 32-bit vDSO
> x86: Zero-pad the VVAR page
>
> arch/x86/include/asm/vdso.h | 2 ++
> arch/x86/kernel/vmlinux.lds.S | 5 +++++
> arch/x86/vdso/Makefile | 2 +-
> arch/x86/vdso/vclock_gettime.c | 2 +-
> arch/x86/vdso/vdso32-setup.c | 25 +++++++++++++------------
> arch/x86/vdso/vma.c | 9 ++++++---
> 6 files changed, 28 insertions(+), 17 deletions(-)
>

2014-02-28 15:08:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] x86: Use the default ABI for the 32-bit vDSO

How many internal function calls are there? It seems we should try to avoid those as much as possible by suitable inlining.

On February 27, 2014 11:28:25 PM PST, Stefani Seibold <[email protected]> wrote:
>Am Donnerstag, den 27.02.2014, 16:18 -0800 schrieb Andy Lutomirski:
>> There's no reason for the vDSO to use a special function call ABI.
>Use
>> the platform defaults.
>>
>
>The only reason was performance. What is good for the kernel should
>also be good for the VDSO. Now all functions inside the VDSO will pass
>their arguments on the stack.
>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/vdso/Makefile | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
>> index 92daaa6..80584f5 100644
>> --- a/arch/x86/vdso/Makefile
>> +++ b/arch/x86/vdso/Makefile
>> @@ -149,7 +149,7 @@ KBUILD_CFLAGS_32 := $(filter-out
>-m64,$(KBUILD_CFLAGS))
>> KBUILD_CFLAGS_32 := $(filter-out
>-mcmodel=kernel,$(KBUILD_CFLAGS_32))
>> KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
>> KBUILD_CFLAGS_32 := $(filter-out -mfentry,$(KBUILD_CFLAGS_32))
>> -KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=3
>-freg-struct-return -fpic
>> +KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=0 -fpic
>> $(vdso32-images:%=$(obj)/%.dbg): KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
>>
>> $(vdso32-images:%=$(obj)/%.dbg): $(obj)/vdso32-%.so.dbg: FORCE \

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-28 20:15:58

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] vDSO fixes, on top of tip/x86/vdso

On Thu, Feb 27, 2014 at 11:33 PM, Stefani Seibold <[email protected]> wrote:
> Am Donnerstag, den 27.02.2014, 16:18 -0800 schrieb Andy Lutomirski:
>> I'd still like someone else to confirm that the 32-bit vDSO is working
>> on all common configurations before I'm happy with it, but this should
>> be a considerable improvement.
>>
>> Someone who maintains the clock code should review the changes in:
>>
>> commit 0fc8a237cbe98a06962f5ea37d24fc2369e23c74
>> Author: Stefani Seibold <[email protected]>
>> Date: Wed Feb 19 10:09:10 2014 +0100
>>
>> x86, vdso: Add 32-bit VDSO time support for the 64-bit kernel
>>
>> very carefully.
>>
>> There's still something extremely questionable about VDSO_PAGES in the
>> 32-bit vDSO code. It appears to be terminally screwed up, and AFAICS it
>> only works at all because none of the 32-bit vDSO images ever exceed
>> 4096 bytes.
>>
>
> The 4096 bytes limit was not introduced by the patch set. It is still
> there for compatibility reasons.

Compatibility with what? And it doesn't look like there's a sane 4k
"limit". It looks like the code will confuse itself and crash if the
image exceeds 4k.

But yes, this issue predates your code.

--Andy

2014-02-28 20:19:56

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] x86: Use the default ABI for the 32-bit vDSO

On Fri, Feb 28, 2014 at 7:06 AM, H. Peter Anvin <[email protected]> wrote:
> How many internal function calls are there? It seems we should try to avoid those as much as possible by suitable inlining.

There are no non-static calls at all, except for __x86.get_pc_thunk.
I imagine that gcc is smart enough to improve the calling convention
for non-externally-visible functions.

Amazingly (to me, anyway), the performance of the 32-bit version seems
to be within 1 ns or so of the 64-bit version on SNB. I suspect that
Intel has optimized the crap out of these things.

--Andy

2014-03-01 02:00:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Final: Add 32 bit VDSO time function support

On Thu, Feb 27, 2014 at 11:22 PM, Stefani Seibold <[email protected]> wrote:
> Am Mittwoch, den 26.02.2014, 16:55 -0800 schrieb Andy Lutomirski:
>>
>> Once I patch it to work, your 32-bit code is considerably faster than
>> the 64-bit case. It's enough faster that I suspect a bug. Dumping
>> the in-memory vdso image shows some rather suspicious nops before the
>> rdtsc instruction. I suspect that you've forgotten to run the 32-bit
>> vdso through the alternatives code. This is a nasty bug: it will
>> appear to work, but you'll see non-monotonic times on some SMP systems.
>>
>
> I didn't know this. My basic test case is a KVM guest, which defaults
> to 1 CPU. Thanks for discovering the issue.

This leads to a potentially interesting question: is rdtsc_barrier()
actually necessary on UP? IIRC the point is that, if an
rdtsc_barrier(); rdtsc in one thread is "before" (in the sense of
being synchronized by some memory operation) an rdtsc_barrier(); rdtsc
in another thread, then the first rdtsc needs to return an earlier or
equal time to the second one.

I assume that no UP CPU is silly enough to execute two rdtsc
instructions out of order relative to each other in the absence of
barriers. So this is a nonissue on UP.

On the other hand, suppose that some code does:

volatile long x = *(something that's not in cache)
clock_gettime

I can imagine a modern CPU speculating far enough ahead that the rdtsc
happens *before* the cache miss. This won't cause visible
non-monotonicity as far as I can see, but it might annoy people who
try to benchmark their code.

Note: actually making this change might be a bit tricky. I don't know
if the alternatives code is smart enough.

--Andy