LinuxLists.cc - [RFC][PATCH] vsyscall-gtod

2004-03-04 00:11:42

Subject: [RFC][PATCH] vsyscall-gtod_B3 (0/3)

All,
This is my port of the x86-64 vsyscall gettimeofday code to i386. This
patch moves gettimeofday into userspace, so it can be calledwithout the
syscall overhead, greatly improving performance. This is important for
any application, like a database, which heavily uses gettimeofday for
time-stamping. It supports both the TSC and IBM x44X cyclone time
source.

Example performance gain (using cyclone timesource):

int80 gettimeofday:
gettimeofday ( 1665576us / 1000000runs ) = 1.665574us

systenter gettimeofday:
gettimeofday ( 1239215us / 1000000runs ) = 1.239214us

vsyscall gettimeofday:
gettimeofday ( 875876us / 1000000runs ) = 0.875875us

I've broken the patch into three logical chuncks for clarity and to make
it easier to cherry pick the desired bits.

o Part 1: Renames variables in timer_cyclone.c and timer_tsc.c to avoid
conflicts in the global namespace.
o Part 2: Core vsyscall-gtod implementation.
o Part 3: vDSO hooks to avoid LD_PRELOADing or needing changes to glibc

Please let me know if you have any comments or suggestions.

thanks
-john

Existing issues:
----------------
o Bad pointers cause segfaults, rather then -EFAULT.

Release History:
----------------
B2 -> B3:
o Broke the patch up into 3 chunks
o Added vsyscall-int80.S hooks (4G disables SEP)

B1 -> B2:
o No LD_PRELOADing or changes to userspace required!
o removed hard-coded address in linker script

B0 -> B1:
o Cleaned up 4/4 split code, so no additional patch is needed.
o Fixed permissions on fixmapped cyclone pageo Improved
alternate_instruction workaround
o Use NTP variables to avoid related time inconsistencieso minor code
cleanups

2004-03-04 00:12:52

by john stultz

[permalink] [raw]

Subject: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part1 (1/3)

All,
This patch renames variables in timer_cyclone.c and timer_tsc.c to
avoid conflicts in the global namespace. This allows the renamed
variables to be used in the vsyscall-gtod implementation, and is a
pre-requisite for the vsyscall-gtod_B3-part2 patch.

thanks
-john

diff -Nru a/arch/i386/kernel/timers/timer_cyclone.c b/arch/i386/kernel/timers/timer_cyclone.c
--- a/arch/i386/kernel/timers/timer_cyclone.c Mon Mar 1 13:15:32 2004
+++ b/arch/i386/kernel/timers/timer_cyclone.c Mon Mar 1 13:15:32 2004
@@ -21,18 +21,17 @@
extern spinlock_t i8253_lock;

/* Number of usecs that the last interrupt was delayed */
-static int delay_at_last_interrupt;
+int cyclone_delay_at_last_interrupt;

#define CYCLONE_CBAR_ADDR 0xFEB00CD0
#define CYCLONE_PMCC_OFFSET 0x51A0
#define CYCLONE_MPMC_OFFSET 0x51D0
#define CYCLONE_MPCS_OFFSET 0x51A8
-#define CYCLONE_TIMER_FREQ 100000000
#define CYCLONE_TIMER_MASK (((u64)1<<40)-1) /* 40 bit mask */
int use_cyclone = 0;

-static u32* volatile cyclone_timer; /* Cyclone MPMC0 register */
-static u32 last_cyclone_low;
+u32* volatile cyclone_timer; /* Cyclone MPMC0 register */
+u32 last_cyclone_low;
static u32 last_cyclone_high;
static unsigned long long monotonic_base;
static seqlock_t monotonic_lock = SEQLOCK_UNLOCKED;
@@ -57,7 +56,7 @@
spin_lock(&i8253_lock);
read_cyclone_counter(last_cyclone_low,last_cyclone_high);

- /* read values for delay_at_last_interrupt */
+ /* read values for cyclone_delay_at_last_interrupt */
outb_p(0x00, 0x43); /* latch the count ASAP */

count = inb_p(0x40); /* read the latched count */
@@ -67,7 +66,7 @@
/* lost tick compensation */
delta = last_cyclone_low - delta;
delta /= (CYCLONE_TIMER_FREQ/1000000);
- delta += delay_at_last_interrupt;
+ delta += cyclone_delay_at_last_interrupt;
lost = delta/(1000000/HZ);
delay = delta%(1000000/HZ);
if (lost >= 2)
@@ -78,16 +77,16 @@
monotonic_base += (this_offset - last_offset) & CYCLONE_TIMER_MASK;
write_sequnlock(&monotonic_lock);

- /* calculate delay_at_last_interrupt */
+ /* calculate cyclone_delay_at_last_interrupt */
count = ((LATCH-1) - count) * TICK_SIZE;
- delay_at_last_interrupt = (count + LATCH/2) / LATCH;
+ cyclone_delay_at_last_interrupt = (count + LATCH/2) / LATCH;

/* catch corner case where tick rollover occured
* between cyclone and pit reads (as noted when
* usec delta is > 90% # of usecs/tick)
*/
- if (lost && abs(delay - delay_at_last_interrupt) > (900000/HZ))
+ if (lost && abs(delay - cyclone_delay_at_last_interrupt) > (900000/HZ))
jiffies_64++;
}

@@ -96,7 +95,7 @@
u32 offset;

if(!cyclone_timer)
- return delay_at_last_interrupt;
+ return cyclone_delay_at_last_interrupt;

/* Read the cyclone timer */
offset = cyclone_timer[0];
@@ -109,7 +108,7 @@
offset = offset/(CYCLONE_TIMER_FREQ/1000000);

/* our adjusted time offset in microseconds */
- return delay_at_last_interrupt + offset;
+ return cyclone_delay_at_last_interrupt + offset;
}

static unsigned long long monotonic_clock_cyclone(void)
diff -Nru a/arch/i386/kernel/timers/timer_tsc.c b/arch/i386/kernel/timers/timer_tsc.c
--- a/arch/i386/kernel/timers/timer_tsc.c Mon Mar 1 13:15:32 2004
+++ b/arch/i386/kernel/timers/timer_tsc.c Mon Mar 1 13:15:32 2004
@@ -33,7 +33,7 @@

static int use_tsc;
/* Number of usecs that the last interrupt was delayed */
-static int delay_at_last_interrupt;
+int tsc_delay_at_last_interrupt;

static unsigned long last_tsc_low; /* lsb 32 bits of Time Stamp Counter */
static unsigned long last_tsc_high; /* msb 32 bits of Time Stamp Counter */
@@ -104,7 +104,7 @@
"0" (eax));

/* our adjusted time offset in microseconds */
- return delay_at_last_interrupt + edx;
+ return tsc_delay_at_last_interrupt + edx;
}

static unsigned long long monotonic_clock_tsc(void)
@@ -223,7 +223,7 @@
"0" (eax));
delta = edx;
}
- delta += delay_at_last_interrupt;
+ delta += tsc_delay_at_last_interrupt;
lost = delta/(1000000/HZ);
delay = delta%(1000000/HZ);
if (lost >= 2) {
@@ -248,15 +248,15 @@
monotonic_base += cycles_2_ns(this_offset - last_offset);
write_sequnlock(&monotonic_lock);

- /* calculate delay_at_last_interrupt */
+ /* calculate tsc_delay_at_last_interrupt */
count = ((LATCH-1) - count) * TICK_SIZE;
- delay_at_last_interrupt = (count + LATCH/2) / LATCH;
+ tsc_delay_at_last_interrupt = (count + LATCH/2) / LATCH;

/* catch corner case where tick rollover occured
* between tsc and pit reads (as noted when
* usec delta is > 90% # of usecs/tick)
*/
- if (lost && abs(delay - delay_at_last_interrupt) > (900000/HZ))
+ if (lost && abs(delay - tsc_delay_at_last_interrupt) > (900000/HZ))
jiffies_64++;
}

@@ -308,7 +308,7 @@
monotonic_base += cycles_2_ns(this_offset - last_offset);
write_sequnlock(&monotonic_lock);

- /* calculate delay_at_last_interrupt */
+ /* calculate tsc_delay_at_last_interrupt */
/*
* Time offset = (hpet delta) * ( usecs per HPET clock )
* = (hpet delta) * ( usecs per tick / HPET clocks per tick)
@@ -316,9 +316,9 @@
* Where,
* hpet_usec_quotient = (2^32 * usecs per tick)/HPET clocks per tick
*/
- delay_at_last_interrupt = hpet_current - offset;
- ASM_MUL64_REG(temp, delay_at_last_interrupt,
- hpet_usec_quotient, delay_at_last_interrupt);
+ tsc_delay_at_last_interrupt = hpet_current - offset;
+ ASM_MUL64_REG(temp, tsc_delay_at_last_interrupt,
+ hpet_usec_quotient, tsc_delay_at_last_interrupt);
}
#endif

diff -Nru a/include/asm-i386/timer.h b/include/asm-i386/timer.h
--- a/include/asm-i386/timer.h Mon Mar 1 13:15:32 2004
+++ b/include/asm-i386/timer.h Mon Mar 1 13:15:32 2004
@@ -20,6 +20,7 @@
};

#define TICK_SIZE (tick_nsec / 1000)
+#define CYCLONE_TIMER_FREQ 100000000

extern struct timer_opts* select_timer(void);
extern void clock_fallback(void);

2004-03-04 00:16:15

by john stultz

[permalink] [raw]

Subject: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

All,
This patch implements the somewhat controversial vDSO hooks for
vsyscall-gtod. This makes LD_PRELOADing or changes to glibc unnecessary,
but requires that the system have a sysenter enabled glibc to see any
performance benifit. However the LD_PRELOAD method will still work as
well with this patch. This patch depends on both vsyscall-gtod_B3-part1
and vsyscall-gtod_B3-part2.

thanks
-john

diff -Nru a/arch/i386/kernel/vsyscall-int80.S b/arch/i386/kernel/vsyscall-int80.S
--- a/arch/i386/kernel/vsyscall-int80.S Wed Mar 3 15:40:05 2004
+++ b/arch/i386/kernel/vsyscall-int80.S Wed Mar 3 15:40:05 2004
@@ -1,3 +1,6 @@
+#include <linux/config.h>
+#include <asm/unistd.h>
+#include <asm/vsyscall-gtod.h>
/*
* Code for the vsyscall page. This version uses the old int $0x80 method.
*/
@@ -7,8 +10,26 @@
.type __kernel_vsyscall,@function
__kernel_vsyscall:
.LSTART_vsyscall:
+#ifdef CONFIG_VSYSCALL_GTOD
+ cmp $__NR_gettimeofday, %eax
+ je .Lvgettimeofday
+#endif /* CONFIG_VSYSCALL_GTOD */
int $0x80
ret
+
+#ifdef CONFIG_VSYSCALL_GTOD
+/* vsyscall-gettimeofday code */
+.Lvgettimeofday:
+ pushl %edx
+ pushl %ecx
+ pushl %ebx
+ call VSYSCALL_GTOD_START
+ popl %ebx
+ popl %ecx
+ popl %edx
+ ret
+#endif /* CONFIG_VSYSCALL_GTOD */
+
.LEND_vsyscall:
.size __kernel_vsyscall,.-.LSTART_vsyscall
.previous
diff -Nru a/arch/i386/kernel/vsyscall-sysenter.S b/arch/i386/kernel/vsyscall-sysenter.S
--- a/arch/i386/kernel/vsyscall-sysenter.S Wed Mar 3 15:40:05 2004
+++ b/arch/i386/kernel/vsyscall-sysenter.S Wed Mar 3 15:40:05 2004
@@ -1,3 +1,6 @@
+#include <linux/config.h>
+#include <asm/unistd.h>
+#include <asm/vsyscall-gtod.h>
/*
* Code for the vsyscall page. This version uses the sysenter instruction.
*/
@@ -7,6 +10,10 @@
.type __kernel_vsyscall,@function
__kernel_vsyscall:
.LSTART_vsyscall:
+#ifdef CONFIG_VSYSCALL_GTOD
+ cmp $__NR_gettimeofday, %eax
+ je .Lvgettimeofday
+#endif /* CONFIG_VSYSCALL_GTOD */
push %ecx
.Lpush_ecx:
push %edx
@@ -31,6 +38,20 @@
pop %ecx
.Lpop_ecx:
ret
+
+#ifdef CONFIG_VSYSCALL_GTOD
+/* vsyscall-gettimeofday code */
+.Lvgettimeofday:
+ pushl %edx
+ pushl %ecx
+ pushl %ebx
+ call VSYSCALL_GTOD_START
+ popl %ebx
+ popl %ecx
+ popl %edx
+ ret
+#endif /* CONFIG_VSYSCALL_GTOD */
+
.LEND_vsyscall:
.size __kernel_vsyscall,.-.LSTART_vsyscall
.previous

2004-03-04 00:15:55

by john stultz

[permalink] [raw]

Subject: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part2 (2/3)

All,
This is the core vsyscall-gtod implementation. It depends on the
vsyscall-gtod_B3-part1 patch.

thanks
-john

diff -Nru a/arch/i386/Kconfig b/arch/i386/Kconfig
--- a/arch/i386/Kconfig Wed Mar 3 15:37:34 2004
+++ b/arch/i386/Kconfig Wed Mar 3 15:37:34 2004
@@ -435,6 +435,10 @@
config HPET_EMULATE_RTC
def_bool HPET_TIMER && RTC=y

+config VSYSCALL_GTOD
+ depends on EXPERIMENTAL
+ bool "VSYSCALL gettimeofday() interface"
+
config SMP
bool "Symmetric multi-processing support"
---help---
diff -Nru a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile
--- a/arch/i386/kernel/Makefile Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/Makefile Wed Mar 3 15:37:34 2004
@@ -32,6 +32,7 @@
obj-$(CONFIG_HPET_TIMER) += time_hpet.o
obj-$(CONFIG_EFI) += efi.o efi_stub.o
obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
+obj-$(CONFIG_VSYSCALL_GTOD) += vsyscall-gtod.o

EXTRA_AFLAGS := -traditional

diff -Nru a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
--- a/arch/i386/kernel/setup.c Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/setup.c Wed Mar 3 15:37:34 2004
@@ -47,6 +47,7 @@
#include <asm/sections.h>
#include <asm/io_apic.h>
#include <asm/ist.h>
+#include <asm/vsyscall-gtod.h>
#include "setup_arch_pre.h"
#include "mach_resources.h"

@@ -1159,6 +1160,7 @@
conswitchp = &dummy_con;
#endif
#endif
+ vsyscall_init();
}

#include "setup_arch_post.h"
diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/time.c Wed Mar 3 15:37:34 2004
@@ -393,5 +393,8 @@
cur_timer = select_timer();
printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name);

+ /* set vsyscall to use selected time source */
+ vsyscall_set_timesource(cur_timer->name);
+
time_init_hook();
}
diff -Nru a/arch/i386/kernel/timers/timer.c b/arch/i386/kernel/timers/timer.c
--- a/arch/i386/kernel/timers/timer.c Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/timers/timer.c Wed Mar 3 15:37:34 2004
@@ -2,6 +2,7 @@
#include <linux/kernel.h>
#include <linux/string.h>
#include <asm/timer.h>
+#include <asm/vsyscall-gtod.h>

#ifdef CONFIG_HPET_TIMER
/*
@@ -44,6 +45,9 @@
void clock_fallback(void)
{
cur_timer = &timer_pit;
+
+ /* set vsyscall to use selected time source */
+ vsyscall_set_timesource(cur_timer->name);
}

/* iterates through the list of timers, returning the first
diff -Nru a/arch/i386/kernel/timers/timer_cyclone.c b/arch/i386/kernel/timers/timer_cyclone.c
--- a/arch/i386/kernel/timers/timer_cyclone.c Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/timers/timer_cyclone.c Wed Mar 3 15:37:34 2004
@@ -23,6 +23,13 @@
/* Number of usecs that the last interrupt was delayed */
int cyclone_delay_at_last_interrupt;

+/* FIXMAP flag */
+#ifdef CONFIG_VSYSCALL_GTOD
+#define PAGE_CYCLONE PAGE_KERNEL_VSYSCALL_NOCACHE
+#else
+#define PAGE_CYCLONE PAGE_KERNEL_NOCACHE
+#endif
+
#define CYCLONE_CBAR_ADDR 0xFEB00CD0
#define CYCLONE_PMCC_OFFSET 0x51A0
#define CYCLONE_MPMC_OFFSET 0x51D0
@@ -192,7 +199,7 @@
/* map in cyclone_timer */
pageaddr = (base + CYCLONE_MPMC_OFFSET)&PAGE_MASK;
offset = (base + CYCLONE_MPMC_OFFSET)&(~PAGE_MASK);
- set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ __set_fixmap(FIX_CYCLONE_TIMER, pageaddr, PAGE_CYCLONE);
cyclone_timer = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
if(!cyclone_timer){
printk(KERN_ERR "Summit chipset: Could not find valid MPMC register.\n");
diff -Nru a/arch/i386/kernel/vmlinux.lds.S b/arch/i386/kernel/vmlinux.lds.S
--- a/arch/i386/kernel/vmlinux.lds.S Wed Mar 3 15:37:34 2004
+++ b/arch/i386/kernel/vmlinux.lds.S Wed Mar 3 15:37:34 2004
@@ -3,11 +3,12 @@
*/

#include <asm-generic/vmlinux.lds.h>
-
+#include <linux/config.h>
+#include <asm/vsyscall-gtod.h>
+
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(startup_32)
-jiffies = jiffies_64;
SECTIONS
{
. = 0xC0000000 + 0x100000;
@@ -47,6 +48,79 @@
.data.cacheline_aligned : { *(.data.cacheline_aligned) }

_edata = .; /* End of data section */
+
+/* VSYSCALL_GTOD data */
+#ifdef CONFIG_VSYSCALL_GTOD
+
+ /* vsyscall entry */
+ . = ALIGN(64);
+ .data.cacheline_aligned : { *(.data.cacheline_aligned) }
+
+ .vsyscall_0 VSYSCALL_GTOD_START: AT ((LOADADDR(.data.cacheline_aligned) + SIZEOF(.data.cacheline_aligned) + 4095) & ~(4095)) { *(.vsyscall_0) }
+ __vsyscall_0 = LOADADDR(.vsyscall_0);
+
+
+ /* generic gtod variables */
+ . = ALIGN(64);
+ .vsyscall_timesource : AT ((LOADADDR(.vsyscall_0) + SIZEOF(.vsyscall_0) + 63) & ~(63)) { *(.vsyscall_timesource) }
+ vsyscall_timesource = LOADADDR(.vsyscall_timesource);
+
+ . = ALIGN(16);
+ .xtime_lock : AT ((LOADADDR(.vsyscall_timesource) + SIZEOF(.vsyscall_timesource) + 15) & ~(15)) { *(.xtime_lock) }
+ xtime_lock = LOADADDR(.xtime_lock);
+
+ . = ALIGN(16);
+ .xtime : AT ((LOADADDR(.xtime_lock) + SIZEOF(.xtime_lock) + 15) & ~(15)) { *(.xtime) }
+ xtime = LOADADDR(.xtime);
+
+ . = ALIGN(16);
+ .jiffies : AT ((LOADADDR(.xtime) + SIZEOF(.xtime) + 15) & ~(15)) { *(.jiffies) }
+ jiffies = LOADADDR(.jiffies);
+
+ . = ALIGN(16);
+ .wall_jiffies : AT ((LOADADDR(.jiffies) + SIZEOF(.jiffies) + 15) & ~(15)) { *(.wall_jiffies) }
+ wall_jiffies = LOADADDR(.wall_jiffies);
+
+ .sys_tz : AT (LOADADDR(.wall_jiffies) + SIZEOF(.wall_jiffies)) { *(.sys_tz) }
+ sys_tz = LOADADDR(.sys_tz);
+
+ /* NTP variables */
+ .tickadj : AT (LOADADDR(.sys_tz) + SIZEOF(.sys_tz)) { *(.tickadj) }
+ tickadj = LOADADDR(.tickadj);
+
+ .time_adjust : AT (LOADADDR(.tickadj) + SIZEOF(.tickadj)) { *(.time_adjust) }
+ time_adjust = LOADADDR(.time_adjust);
+
+ /* TSC variables*/
+ .last_tsc_low : AT (LOADADDR(.time_adjust) + SIZEOF(.time_adjust)) { *(.last_tsc_low) }
+ last_tsc_low = LOADADDR(.last_tsc_low);
+
+ .tsc_delay_at_last_interrupt : AT (LOADADDR(.last_tsc_low) + SIZEOF(.last_tsc_low)) { *(.tsc_delay_at_last_interrupt) }
+ tsc_delay_at_last_interrupt = LOADADDR(.tsc_delay_at_last_interrupt);
+
+ .fast_gettimeoffset_quotient : AT (LOADADDR(.tsc_delay_at_last_interrupt) + SIZEOF(.tsc_delay_at_last_interrupt)) { *(.fast_gettimeoffset_quotient) }
+ fast_gettimeoffset_quotient = LOADADDR(.fast_gettimeoffset_quotient);
+
+
+ /*cyclone values*/
+ .cyclone_timer : AT (LOADADDR(.fast_gettimeoffset_quotient) + SIZEOF(.fast_gettimeoffset_quotient)) { *(.cyclone_timer) }
+ cyclone_timer = LOADADDR(.cyclone_timer);
+
+ .last_cyclone_low : AT (LOADADDR(.cyclone_timer) + SIZEOF(.cyclone_timer)) { *(.last_cyclone_low) }
+ last_cyclone_low = LOADADDR(.last_cyclone_low);
+
+ .cyclone_delay_at_last_interrupt : AT (LOADADDR(.last_cyclone_low) + SIZEOF(.last_cyclone_low)) { *(.cyclone_delay_at_last_interrupt) }
+ cyclone_delay_at_last_interrupt = LOADADDR(.cyclone_delay_at_last_interrupt);
+
+
+ .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT (LOADADDR(.vsyscall_0) + 1024) { *(.vsyscall_1) }
+ . = LOADADDR(.vsyscall_0) + 4096;
+
+ jiffies_64 = jiffies;
+#else
+ jiffies = jiffies_64;
+#endif
+/* END of VSYSCALL_GTOD data*/

. = ALIGN(8192); /* init_task */
.data.init_task : { *(.data.init_task) }
diff -Nru a/arch/i386/kernel/vsyscall-gtod.c b/arch/i386/kernel/vsyscall-gtod.c
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/arch/i386/kernel/vsyscall-gtod.c Wed Mar 3 15:37:34 2004
@@ -0,0 +1,275 @@
+/*
+ * linux/arch/i386/kernel/vsyscall-gtod.c
+ *
+ * Copyright (C) 2001 Andrea Arcangeli <[email protected]> SuSE
+ * Copyright (C) 2003,2004 John Stultz <[email protected]> IBM
+ *
+ * Thanks to [email protected] for some useful hint.
+ * Special thanks to Ingo Molnar for his early experience with
+ * a different vsyscall implementation for Linux/IA32 and for the name.
+ *
+ * vsyscall 0 is located at VSYSCALL_START, vsyscall 1 is located
+ * at virtual address VSYSCALL_START+1024bytes etc...
+ *
+ * Originally written for x86-64 by Andrea Arcangeli <[email protected]>
+ * Ported to i386 by John Stultz <[email protected]>
+ */
+
+
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/timer.h>
+#include <linux/sched.h>
+
+#include <asm/vsyscall-gtod.h>
+#include <asm/pgtable.h>
+#include <asm/page.h>
+#include <asm/fixmap.h>
+#include <asm/msr.h>
+#include <asm/timer.h>
+#include <asm/system.h>
+#include <asm/unistd.h>
+#include <asm/errno.h>
+
+int errno;
+static inline _syscall2(int,gettimeofday,struct timeval *,tv,struct timezone *,tz);
+static int vsyscall_mapped = 0; /* flag variable for remap_vsyscall() */
+
+enum vsyscall_timesource_e vsyscall_timesource;
+enum vsyscall_timesource_e __vsyscall_timesource __section_vsyscall_timesource;
+
+/* readonly clones of generic time values */
+seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
+struct timespec __xtime __section_xtime;
+volatile unsigned long __jiffies __section_jiffies;
+unsigned long __wall_jiffies __section_wall_jiffies;
+struct timezone __sys_tz __section_sys_tz;
+/* readonly clones of ntp time variables */
+int __tickadj __section_tickadj;
+long __time_adjust __section_time_adjust;
+
+/* readonly clones of TSC timesource values*/
+unsigned long __last_tsc_low __section_last_tsc_low;
+int __tsc_delay_at_last_interrupt __section_tsc_delay_at_last_interrupt;
+unsigned long __fast_gettimeoffset_quotient __section_fast_gettimeoffset_quotient;
+
+/* readonly clones of cyclone timesource values*/
+u32* __cyclone_timer __section_cyclone_timer; /* Cyclone MPMC0 register */
+u32 __last_cyclone_low __section_last_cyclone_low;
+int __cyclone_delay_at_last_interrupt __section_cyclone_delay_at_last_interrupt;
+
+
+static inline unsigned long vgettimeoffset_tsc(void)
+{
+ unsigned long eax, edx;
+
+ /* Read the Time Stamp Counter */
+ rdtsc(eax,edx);
+
+ /* .. relative to previous jiffy (32 bits is enough) */
+ eax -= __last_tsc_low; /* tsc_low delta */
+
+ /*
+ * Time offset = (tsc_low delta) * fast_gettimeoffset_quotient
+ * = (tsc_low delta) * (usecs_per_clock)
+ * = (tsc_low delta) * (usecs_per_jiffy / clocks_per_jiffy)
+ *
+ * Using a mull instead of a divl saves up to 31 clock cycles
+ * in the critical path.
+ */
+
+
+ __asm__("mull %2"
+ :"=a" (eax), "=d" (edx)
+ :"rm" (__fast_gettimeoffset_quotient),
+ "0" (eax));
+
+ /* our adjusted time offset in microseconds */
+ return __tsc_delay_at_last_interrupt + edx;
+
+}
+
+static inline unsigned long vgettimeoffset_cyclone(void)
+{
+ u32 offset;
+
+ if (!__cyclone_timer)
+ return 0;
+
+ /* Read the cyclone timer */
+ offset = __cyclone_timer[0];
+
+ /* .. relative to previous jiffy */
+ offset = offset - __last_cyclone_low;
+
+ /* convert cyclone ticks to microseconds */
+ offset = offset/(CYCLONE_TIMER_FREQ/1000000);
+
+ /* our adjusted time offset in microseconds */
+ return __cyclone_delay_at_last_interrupt + offset;
+}
+
+static inline void do_vgettimeofday(struct timeval * tv)
+{
+ long sequence;
+ unsigned long usec, sec;
+ unsigned long lost;
+ unsigned long max_ntp_tick;
+
+ /* If we don't have a valid vsyscall time source,
+ * just call gettimeofday()
+ */
+ if (__vsyscall_timesource == VSYSCALL_GTOD_NONE) {
+ gettimeofday(tv, NULL);
+ return;
+ }
+
+
+ do {
+ sequence = read_seqbegin(&__xtime_lock);
+
+ /* Get the high-res offset */
+ if (__vsyscall_timesource == VSYSCALL_GTOD_CYCLONE)
+ usec = vgettimeoffset_cyclone();
+ else
+ usec = vgettimeoffset_tsc();
+
+ lost = __jiffies - __wall_jiffies;
+
+ /*
+ * If time_adjust is negative then NTP is slowing the clock
+ * so make sure not to go into next possible interval.
+ * Better to lose some accuracy than have time go backwards..
+ */
+ if (unlikely(__time_adjust < 0)) {
+ max_ntp_tick = (USEC_PER_SEC / HZ) - __tickadj;
+ usec = min(usec, max_ntp_tick);
+
+ if (lost)
+ usec += lost * max_ntp_tick;
+ }
+ else if (unlikely(lost))
+ usec += lost * (USEC_PER_SEC / HZ);
+
+ sec = __xtime.tv_sec;
+ usec += (__xtime.tv_nsec / 1000);
+
+ } while (read_seqretry(&__xtime_lock, sequence));
+
+ tv->tv_sec = sec + usec / 1000000;
+ tv->tv_usec = usec % 1000000;
+}
+
+static inline void do_get_tz(struct timezone * tz)
+{
+ long sequence;
+
+ do {
+ sequence = read_seqbegin(&__xtime_lock);
+
+ *tz = __sys_tz;
+
+ } while (read_seqretry(&__xtime_lock, sequence));
+}
+
+static int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
+{
+ if (tv)
+ do_vgettimeofday(tv);
+ if (tz)
+ do_get_tz(tz);
+ return 0;
+}
+
+static time_t __vsyscall(1) vtime(time_t * t)
+{
+ struct timeval tv;
+ vgettimeofday(&tv,NULL);
+ if (t)
+ *t = tv.tv_sec;
+ return tv.tv_sec;
+}
+
+static long __vsyscall(2) venosys_0(void)
+{
+ return -ENOSYS;
+}
+
+static long __vsyscall(3) venosys_1(void)
+{
+ return -ENOSYS;
+}
+
+
+void vsyscall_set_timesource(char* name)
+{
+ if (!strncmp(name, "tsc", 3))
+ vsyscall_timesource = VSYSCALL_GTOD_TSC;
+ else if (!strncmp(name, "cyclone", 7))
+ vsyscall_timesource = VSYSCALL_GTOD_CYCLONE;
+ else
+ vsyscall_timesource = VSYSCALL_GTOD_NONE;
+}
+
+
+static void __init map_vsyscall(void)
+{
+ unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - PAGE_OFFSET;
+
+ /* Initially we map the VSYSCALL page w/ PAGE_KERNEL permissions to
+ * keep the alternate_instruction code from bombing out when it
+ * changes the seq_lock memory barriers in vgettimeofday()
+ */
+ __set_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE, physaddr_page0, PAGE_KERNEL);
+}
+
+static int __init remap_vsyscall(void)
+{
+ unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - PAGE_OFFSET;
+
+ if (!vsyscall_mapped)
+ return 0;
+
+ /* Remap the VSYSCALL page w/ PAGE_KERNEL_VSYSCALL permissions
+ * after the alternate_instruction code has run
+ */
+ clear_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE);
+ __set_fixmap(FIX_VSYSCALL_GTOD_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
+
+ return 0;
+}
+
+int __init vsyscall_init(void)
+{
+ printk("VSYSCALL: consistency checks...");
+ if ((unsigned long) &vgettimeofday != VSYSCALL_ADDR(__NR_vgettimeofday)) {
+ printk("vgettimeofday link addr broken\n");
+ printk("VSYSCALL: vsyscall_init failed!\n");
+ return -EFAULT;
+ }
+ if ((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime)) {
+ printk("vtime link addr broken\n");
+ printk("VSYSCALL: vsyscall_init failed!\n");
+ return -EFAULT;
+ }
+ if (VSYSCALL_ADDR(0) != __fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE)) {
+ printk("fixmap first vsyscall 0x%lx should be 0x%x\n",
+ __fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE),
+ VSYSCALL_ADDR(0));
+ printk("VSYSCALL: vsyscall_init failed!\n");
+ return -EFAULT;
+ }
+
+
+ printk("passed...mapping...");
+ map_vsyscall();
+ printk("done.\n");
+ vsyscall_mapped = 1;
+ printk("VSYSCALL: fixmap virt addr: 0x%lx\n",
+ __fix_to_virt(FIX_VSYSCALL_GTOD_FIRST_PAGE));
+
+ return 0;
+}
+
+__initcall(remap_vsyscall);
diff -Nru a/include/asm-i386/fixmap.h b/include/asm-i386/fixmap.h
--- a/include/asm-i386/fixmap.h Wed Mar 3 15:37:34 2004
+++ b/include/asm-i386/fixmap.h Wed Mar 3 15:37:34 2004
@@ -18,6 +18,7 @@
#include <asm/acpi.h>
#include <asm/apicdef.h>
#include <asm/page.h>
+#include <asm/vsyscall-gtod.h>
#ifdef CONFIG_HIGHMEM
#include <linux/threads.h>
#include <asm/kmap_types.h>
@@ -44,6 +45,17 @@
enum fixed_addresses {
FIX_HOLE,
FIX_VSYSCALL,
+#ifdef CONFIG_VSYSCALL_GTOD
+#ifndef CONFIG_X86_4G
+ FIX_VSYSCALL_GTOD_PAD,
+#endif /* !CONFIG_X86_4G */
+ FIX_VSYSCALL_GTOD_LAST_PAGE,
+ FIX_VSYSCALL_GTOD_FIRST_PAGE = FIX_VSYSCALL_GTOD_LAST_PAGE
+ + VSYSCALL_GTOD_NUMPAGES - 1,
+#ifdef CONFIG_X86_4G
+ FIX_VSYSCALL_GTOD_4GALIGN,
+#endif /* CONFIG_X86_4G */
+#endif /* CONFIG_VSYSCALL_GTOD */
#ifdef CONFIG_X86_LOCAL_APIC
FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */
#endif
diff -Nru a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
--- a/include/asm-i386/pgtable.h Wed Mar 3 15:37:34 2004
+++ b/include/asm-i386/pgtable.h Wed Mar 3 15:37:34 2004
@@ -137,11 +137,15 @@
#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW)
#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
+#define __PAGE_KERNEL_VSYSCALL \
+ (_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)

#define PAGE_KERNEL __pgprot(__PAGE_KERNEL)
#define PAGE_KERNEL_RO __pgprot(__PAGE_KERNEL_RO)
#define PAGE_KERNEL_NOCACHE __pgprot(__PAGE_KERNEL_NOCACHE)
#define PAGE_KERNEL_LARGE __pgprot(__PAGE_KERNEL_LARGE)
+#define PAGE_KERNEL_VSYSCALL __pgprot(__PAGE_KERNEL_VSYSCALL)
+#define PAGE_KERNEL_VSYSCALL_NOCACHE __pgprot(__PAGE_KERNEL_VSYSCALL|(__PAGE_KERNEL_RO | _PAGE_PCD))

/*
* The i386 can't do page protection for execute, and considers that
diff -Nru a/include/asm-i386/vsyscall-gtod.h b/include/asm-i386/vsyscall-gtod.h
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/include/asm-i386/vsyscall-gtod.h Wed Mar 3 15:37:34 2004
@@ -0,0 +1,68 @@
+#ifndef _ASM_i386_VSYSCALL_GTOD_H_
+#define _ASM_i386_VSYSCALL_GTOD_H_
+
+#ifdef CONFIG_VSYSCALL_GTOD
+
+/* VSYSCALL_GTOD_START must be the same as
+ * __fix_to_virt(FIX_VSYSCALL_GTOD FIRST_PAGE)
+ * and must also be same as addr in vmlinux.lds.S */
+#define VSYSCALL_GTOD_START 0xffffc000
+#define VSYSCALL_GTOD_SIZE 1024
+#define VSYSCALL_GTOD_END (VSYSCALL_GTOD_START + PAGE_SIZE)
+#define VSYSCALL_GTOD_NUMPAGES \
+ ((VSYSCALL_GTOD_END-VSYSCALL_GTOD_START) >> PAGE_SHIFT)
+#define VSYSCALL_ADDR(vsyscall_nr) \
+ (VSYSCALL_GTOD_START+VSYSCALL_GTOD_SIZE*(vsyscall_nr))
+
+#ifdef __KERNEL__
+#ifndef __ASSEMBLY__
+#include <linux/seqlock.h>
+
+#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
+
+/* ReadOnly generic time value attributes*/
+#define __section_vsyscall_timesource __attribute__ ((unused, __section__ (".vsyscall_timesource")))
+#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock")))
+#define __section_xtime __attribute__ ((unused, __section__ (".xtime")))
+#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies")))
+#define __section_wall_jiffies __attribute__ ((unused, __section__ (".wall_jiffies")))
+#define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz")))
+
+/* ReadOnly NTP variables */
+#define __section_tickadj __attribute__ ((unused, __section__ (".tickadj")))
+#define __section_time_adjust __attribute__ ((unused, __section__ (".time_adjust")))
+
+
+/* ReadOnly TSC time value attributes*/
+#define __section_last_tsc_low __attribute__ ((unused, __section__ (".last_tsc_low")))
+#define __section_tsc_delay_at_last_interrupt __attribute__ ((unused, __section__ (".tsc_delay_at_last_interrupt")))
+#define __section_fast_gettimeoffset_quotient __attribute__ ((unused, __section__ (".fast_gettimeoffset_quotient")))
+
+/* ReadOnly Cyclone time value attributes*/
+#define __section_cyclone_timer __attribute__ ((unused, __section__ (".cyclone_timer")))
+#define __section_last_cyclone_low __attribute__ ((unused, __section__ (".last_cyclone_low")))
+#define __section_cyclone_delay_at_last_interrupt __attribute__ ((unused, __section__ (".cyclone_delay_at_last_interrupt")))
+
+enum vsyscall_num {
+ __NR_vgettimeofday,
+ __NR_vtime,
+};
+
+enum vsyscall_timesource_e {
+ VSYSCALL_GTOD_NONE,
+ VSYSCALL_GTOD_TSC,
+ VSYSCALL_GTOD_CYCLONE,
+};
+
+int vsyscall_init(void);
+void vsyscall_set_timesource(char* name);
+
+extern char __vsyscall_0;
+#endif /* __ASSEMBLY__ */
+#endif /* __KERNEL__ */
+#else /* CONFIG_VSYSCALL_GTOD */
+#define vsyscall_init()
+#define vsyscall_set_timesource(x)
+#endif /* CONFIG_VSYSCALL_GTOD */
+#endif /* _ASM_i386_VSYSCALL_GTOD_H_ */
+

2004-03-04 00:19:03

by john stultz

[permalink] [raw]

Subject: [RFC] vsyscall-gtod_test_B3.tar.gz

All,
This tarball shows an example on how to use the LD_PRELOAD method of
calling vsyscall-gtod. This method is used in the absense of
vsyscall-gtod_B3-part3 patch, or on systems without a sysenter enabled
glibc. The example provides a micro-benchmark that compares the
performance of gettimeofday() when calling it normally, or through the
LD_PRELOAD method.

thanks
-john

Attachments:

vsyscall-gtod_test_B3.tar.gz (941.00 B)

2004-03-04 01:51:58

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Wed, Mar 03, 2004 at 04:14:08PM -0800, john stultz wrote:
> All,
> This patch implements the somewhat controversial vDSO hooks for
> vsyscall-gtod. This makes LD_PRELOADing or changes to glibc unnecessary,

the reason it's controversial is just because it's microslowing down
all syscalls to speedup gettimeofday, when you can avoid this kernel
change completely and implementing it zerocost like in x86-64. glibc
should simply call into the vsyscall directly. Why don't you simply
provide a patch against glibc, instead of proposing a patch against the
kernel? Of course this patch will depend on your vsyscall patch on the
kernel side, and that's fine. Another elf bitflag can be used to tell
glibc to use vgettimeofday or whatever, just like it happens with the
sysenter vsyscall.

This is just like the kernel patches people proposes when they get
vmalloc LDT allocation failure, because they run with the i686 glibc
instead of the only possibly supported i586 configuration. It makes no
sense to hide a glibc inefficiency in the kernel when you can fix it in
glibc and avoid the LDT 4k allocation completely since nobody will ever
call into pthread_create. It's not that wasting 4k of zone-normal per
task is a good thing, and wasting 64k of vmalloc per task is a bad
thing. they're both bad things, you just only can see the latter one
unless you're a kernel hacker, so people actually think the kmalloc LDT
thing is a bugfix, while it's just a bad band-aid (I mean, it's a good
thing at large, but not as the fix of the vmalloc LDT faliures). I bet
if the LDT allocation was visible in /proc as easily as the manger
thread was visible with `ps` in linuxthreads, the LDT allocation would
been deferred to pthread_create too in the first place. As a matter of
fact I spent a few hours trying to fixup glibc some month ago, but the
flood of #ifdefs and the fact linuxthreads is dead made me desist and I
will try again with NTPL since it seems they didn't fix it (at least
last time I checked the code the LDT waste as still there).

2004-03-04 02:18:18

by Ulrich Drepper

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrea Arcangeli wrote:

> This is just like the kernel patches people proposes when they get
> vmalloc LDT allocation failure, because they run with the i686 glibc
> instead of the only possibly supported i586 configuration. It makes no
> sense to hide a glibc inefficiency

You apparently still haven't gotten any clue since your whining the last
time around. Absolute addresses are a fatal mistake.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQFARpGU2ijCOnn/RHQRAhs3AJ0XEZ5VGb40VAIPuO4negyo7cx/WwCbBrN6
EFZ7UnY7W/it0sUiq6Dodeg=
=KSMr
-----END PGP SIGNATURE-----

2004-03-04 02:43:47

by john stultz

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Wed, 2004-03-03 at 18:16, Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Andrea Arcangeli wrote:
>
> > This is just like the kernel patches people proposes when they get
> > vmalloc LDT allocation failure, because they run with the i686 glibc
> > instead of the only possibly supported i586 configuration. It makes no
> > sense to hide a glibc inefficiency
>
> You apparently still haven't gotten any clue since your whining the last
> time around. Absolute addresses are a fatal mistake.

Before we start up this larger debate again, might there be some short
term solution for my patch that would satisfy both of you?

If I understand the earlier arguments, if we're going to have the
dynamically relocated segments at some point, I agree that absolute
addresses are going to have problems. However, if I'm not mistaken, this
problem already exists w/ the vsyscall-sysenter code, correct?

What is the plan for avoiding the absolute address issue there? If I
implemented the vsyscall-gettimeofday code in a similar manner (as
Andrea suggested), could the planned solution for vsyscall-sysenter be
applied here as well?

thanks
-john

2004-03-04 02:47:04

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Wed, Mar 03, 2004 at 06:16:52PM -0800, Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Andrea Arcangeli wrote:
>
> > This is just like the kernel patches people proposes when they get
> > vmalloc LDT allocation failure, because they run with the i686 glibc
> > instead of the only possibly supported i586 configuration. It makes no
> > sense to hide a glibc inefficiency
>
> You apparently still haven't gotten any clue since your whining the last
> time around. Absolute addresses are a fatal mistake.

the above ldt issue has nothing to do with any address at all, it's all
about deferring the ldt allocation after pthread_create, like
linuxthreads also defer the genreation of the manager thread post the
first pthread_create.

about the vsyscall part (the only thing with a relation with "fixed
addresses"), you can pass the address of vgettimeofday via elf or in any
other way, I'm not forcing you to setup a fixed address, I never spoke
about fixed addresses (infact I specified the elf bit) you can do the
same as the vsysenter, if you're fine with the way vsysenter works you
must be fine using the same way for vgettimeofday too. My only point
(and the only reason I'm against this patch) is that glibc should call
into the vgettimeofday without passing through vsysenter, and in turn
glibc should have _knowledge_ of the existence of vgettimeofday,
otherwise every other regular syscall invoked through vsysenter would
need to pay for it. Now probably we'll never have more than two vsyscall
in x86, so wasting a few nanoseconds for a conditional jump at every
vsysenter may not be measurable but it doesn't look the right design.

And sysenter is at a fixed address in 2.6 x86 too (it doesn't even
change between different kernel compiles).

2004-03-04 02:55:24

by john stultz

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Wed, 2004-03-03 at 18:47, Andrea Arcangeli wrote:
> And sysenter is at a fixed address in 2.6 x86 too (it doesn't even
> change between different kernel compiles).

Actually, the 4G patch pushes vsysenter down a page, and glibc seems to
handle this properly.

-john

2004-03-04 03:13:33

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Wed, Mar 03, 2004 at 06:43:18PM -0800, john stultz wrote:
> On Wed, 2004-03-03 at 18:16, Ulrich Drepper wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Andrea Arcangeli wrote:
> >
> > > This is just like the kernel patches people proposes when they get
> > > vmalloc LDT allocation failure, because they run with the i686 glibc
> > > instead of the only possibly supported i586 configuration. It makes no
> > > sense to hide a glibc inefficiency
> >
> > You apparently still haven't gotten any clue since your whining the last
> > time around. Absolute addresses are a fatal mistake.
>
> Before we start up this larger debate again, might there be some short
> term solution for my patch that would satisfy both of you?

For a production release short term solutions like this would be ok, but
the mainline source that will fork off in 2.7 should have the best
design IMHO, and the same for glibc.

> If I understand the earlier arguments, if we're going to have the
> dynamically relocated segments at some point, I agree that absolute
> addresses are going to have problems. However, if I'm not mistaken, this
> problem already exists w/ the vsyscall-sysenter code, correct?

this is exactly my point, the fixed address trouble mentioned by Ulirch
make little sense to me because of this (especially in reply to the ldt
part).

and in practice the sysenter instruction is already at a fixed address
in any 2.6 kernel out there (yeah, we could change that number without
breaking glibc, but the attacker won't care less).

> What is the plan for avoiding the absolute address issue there? If I
> implemented the vsyscall-gettimeofday code in a similar manner (as
> Andrea suggested), could the planned solution for vsyscall-sysenter be
> applied here as well?

I think yes but thinking twice my preferred way is not to pass another
variable address to userspace (that was the first solution that come to
mind, and I wrote that just to show there's no real "fixed address"
trouble). Fixed _offsets_ (not virtual addresses) are perfectly fine
w.r.t. security. So we can just assume the vgettimeofday is at a fixed
offset after the vsysenter code. This should result in the most
efficient code possible while providing flexiblity to the address like
vsysenter does (vgettimeofday will move together with vsysenter).
However it could be the second value in elf is a cleaner way to pass the
vgettimeofday address, I don't mind either ways.

thanks.

2004-03-04 03:15:24

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

2004-03-04 08:06:44

by Jamie Lokier

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

Andrea Arcangeli wrote:
> I will try again with NTPL since it seems they didn't fix it (at
> least last time I checked the code the LDT waste as still there).

Does NPTL use the LDT at all? sys_set_thread_area was created
specifically so that the LDT isn't needed.

-- Jamie

2004-03-04 08:10:44

by Ulrich Drepper

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

john stultz wrote:

> Before we start up this larger debate again, might there be some short
> term solution for my patch that would satisfy both of you?

I suggest the following:

~ define a symbol __kernel_gettimeofday_offset in the vdso's symbol
table. This should be an absolute symbol containing the offset of the
gettimeofday implementation from the beginning of the vdso (the address
passed up in the auxiliary vector)

~ glibc can then use the equivalent of
dlsym("__kernel_gettimeofday_offset"). If the symbol is not defined,
it's not used (doh). If it is defined, the final function address is
computed by adding the offset to the vdso address.

This ensures a direct jump and it still keeps the vdso relocatable
without modifying the symbol table.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD4DBQFARuRa2ijCOnn/RHQRAnITAKCeS30ShpbeadFA5n/TlaTOXYNzygCVG3tg
2HCPVqo5DRtQfUoKwLY6vQ==
=ST37
-----END PGP SIGNATURE-----

2004-03-04 08:45:27

by Jakub Jelinek

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Thu, Mar 04, 2004 at 08:00:56AM +0000, Jamie Lokier wrote:
> Andrea Arcangeli wrote:
> > I will try again with NTPL since it seems they didn't fix it (at
> > least last time I checked the code the LDT waste as still there).
>
> Does NPTL use the LDT at all? sys_set_thread_area was created
> specifically so that the LDT isn't needed.

It doesn't and neither does LinuxThreads when run on a recent kernel
(which has set_thread_area).

Jakub

2004-03-04 08:59:35

by Jakub Jelinek

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Wed, Mar 03, 2004 at 06:54:49PM -0800, john stultz wrote:
> On Wed, 2004-03-03 at 18:47, Andrea Arcangeli wrote:
> > And sysenter is at a fixed address in 2.6 x86 too (it doesn't even
> > change between different kernel compiles).
>
> Actually, the 4G patch pushes vsysenter down a page, and glibc seems to
> handle this properly.

But the 4G/4G patch relinks the vDSO to the address it uses, this is no
big problem for glibc which of course doesn't use hardcoded address but
reads AT_SYSINFO{,_EHDR} values kernel passes to it.

But the fixed vDSO location is a problem, exploits certainly appreciate
a fixed address at which they with high probability can enter the kernel.

Ingo Molnar recently wrote a patch to randomize the vDSO address on
IA-32. Unfortunately it revealed some bugs in glibc where ld.so did not
handle properly vDSOs linked to one address, but mmaped to a different one
(which is a must if kernel wants to share one vDSO page for each process).
So now the problem is if kernel randomizes vDSO, it will not even boot
with glibcs >= 2003-04-22 and <= 2004-02-27. There are 2 possible solutions
for this IMHO:
1) tell users of the glibc's which don't handle this they must upgrade glibc
first before booting a newer kernel and add kernel cmdline option to turn
vDSO off, so that a user can turn it off, upgrade glibc and then on next
boot use vDSO again
2) start using a different AT_SYSINFO_* value (just one is enough,
ATM AT_SYSINFO is ((ElfNN_Ehdr *)AT_SYSINFO_EHDR)->e_entry), stop using
the old 2 values. This would mean old glibcs will stop using vDSO, but hey,
it is just an optimization. Upgrading glibc would use vDSO again.

Jakub

2004-03-04 16:45:11

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Thu, Mar 04, 2004 at 03:57:36AM -0500, Jakub Jelinek wrote:
> On Wed, Mar 03, 2004 at 06:54:49PM -0800, john stultz wrote:
> > On Wed, 2004-03-03 at 18:47, Andrea Arcangeli wrote:
> > > And sysenter is at a fixed address in 2.6 x86 too (it doesn't even
> > > change between different kernel compiles).
> >
> > Actually, the 4G patch pushes vsysenter down a page, and glibc seems to
> > handle this properly.
>
> But the 4G/4G patch relinks the vDSO to the address it uses, this is no
> big problem for glibc which of course doesn't use hardcoded address but
> reads AT_SYSINFO{,_EHDR} values kernel passes to it.
>
> But the fixed vDSO location is a problem, exploits certainly appreciate
> a fixed address at which they with high probability can enter the kernel.
>
> Ingo Molnar recently wrote a patch to randomize the vDSO address on
> IA-32. Unfortunately it revealed some bugs in glibc where ld.so did not

do you have a link to the patch? (I don't see it in his homepage) just
curious to see how much precious address space you're throwing at this
randomization and in turn how many tries are needed to brute force.

if you really care so much about randomization vs performance, it would
been a lot better if you implemented vsysenter in a completely different
way: by exporting some position indipendent bytecode to userspace via a
syscall, and have glibc loadup this code somewhere in the address space
(truly random thing making it trivial to do the intra-page offsets with
byte granularity) and have kernel exporting only data, not exeuctables
in the address space. The executable bytecode would be returned by a
syscall. That is something slower at startup but fairly secure, and it
doesn't waste kernel or user address space. The max-performance way of
x86 vsysenter and x86-64 vgettimeofday simply don't fit for your
security object best IMHO, it wasn't designed for that and it's an hack
to try to randomize it with page-offsets. Keeping a local copy of the
vsyscall bytecode at different intra-page offsets is something sanely
doable in userspace, doing it in kernel is hairy and non-natural thing
to do for kernel (involve copies, replacation of the vsyscall page etc.,
so one can as well do the copy in a load_usyscall syscall when the
dynamic linker is asked to run gettimeofday, so this page can also be
swapped out). What kernel can do more or less naturally is to share the
same _physical_ vsyscall "page" (w/o intra page offset differences
making it trivial to brut), and export it at a different addresses for
each task, that is what Ingo implemented I guess, but it's 4096 times *
3.5G faster to crack.

2004-03-04 17:48:18

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Thu, Mar 04, 2004 at 03:37:55AM -0500, Jakub Jelinek wrote:
> On Thu, Mar 04, 2004 at 08:00:56AM +0000, Jamie Lokier wrote:
> > Andrea Arcangeli wrote:
> > > I will try again with NTPL since it seems they didn't fix it (at
> > > least last time I checked the code the LDT waste as still there).
> >
> > Does NPTL use the LDT at all? sys_set_thread_area was created
> > specifically so that the LDT isn't needed.
>
> It doesn't and neither does LinuxThreads when run on a recent kernel
> (which has set_thread_area).

I really thought set_thread_area would depend on a LDT being allocated
first, I was wrong sorry, the parameter passed is in the same format of
modify_ldt (that's what fooled me) but it seems used only to write
directly into the gdt, it's not an entry in the ldt but it's an entry
into the gdt, and the gdt is overwritten at every thread switch (not at
every mm switch). So as far as glibc doesn't allocate the LDT and only
uses this logic the problem I mentioned (using 2.4 kernels) is just
solved and I really appreciate this. The thing that scared me most (and
that hurted me most in deferring the allocation to pthread_create with
2.4 kernels) is that at least in linuxthreads the ldt allocation spreads
out of the linuxthread package (it spreads into the dynamic linker IIRC,
but I looked at last time a few months ago), I hope with nptl this is
fixed too. clearly the ldt had to be nuked somehow to get past some
thousand threads anyways, but avoiding the ldt allocation is an even more
important feature in real life with real apps not using threads but just
linking with pthreads by mistake like /bin/ls.

2004-03-04 19:02:43

by john stultz

[permalink] [raw]

Subject: Re: [RFC][PATCH] linux-2.6.4-pre1_vsyscall-gtod_B3-part3 (3/3)

On Thu, 2004-03-04 at 00:09, Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> john stultz wrote:
>
> > Before we start up this larger debate again, might there be some short
> > term solution for my patch that would satisfy both of you?
>
> I suggest the following:
>
> ~ define a symbol __kernel_gettimeofday_offset in the vdso's symbol
> table. This should be an absolute symbol containing the offset of the
> gettimeofday implementation from the beginning of the vdso (the address
> passed up in the auxiliary vector)
>
> ~ glibc can then use the equivalent of
> dlsym("__kernel_gettimeofday_offset"). If the symbol is not defined,
> it's not used (doh). If it is defined, the final function address is
> computed by adding the offset to the vdso address.
>
>
> This ensures a direct jump and it still keeps the vdso relocatable
> without modifying the symbol table.

Excellent, this sounds similar to what Andrea was suggesting.
I'll start working to implement this.

thanks for the momentary truce ;)
-john