2002-10-18 23:00:29

by john stultz

Subject: [PATCH] linux-2.5.43_vsyscall_A0

Linus, All

Since no one really brought up any issues with the code itself (correct
me if I'm wrong), here is the i386 vsyscall gettimeofday port I sent
last night, synced up and ready for integration. This patch implements
gettimeofday in a user readable page, allowing for calls to gettimeofday
to execute completely in userspace, giving a significant performance
boost.
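
For readers who want the locking idea without reading the diff: the
reader side never takes a lock. It samples one counter, reads the time
data, then compares against a second counter and retries if a writer
got in between. What follows is only a userspace sketch of that
protocol using C11 atomics; the variable and function names are made up
for illustration and are not the symbols the patch defines, and the
fences merely stand in for the kernel's wmb()/rmb().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the patch's two sequence counters and the
 * time data they guard; the names are illustrative only. */
static atomic_long seq_begin;   /* bumped before a write, like vxtime_lock() */
static atomic_long seq_end;     /* bumped after a write, like vxtime_unlock() */
static atomic_long shared_sec;
static atomic_long shared_usec;

/* Writer: mark the start, update the data, mark the end. */
static void publish_time(long sec, long usec)
{
    atomic_fetch_add_explicit(&seq_begin, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);      /* stands in for wmb() */
    atomic_store_explicit(&shared_sec, sec, memory_order_relaxed);
    atomic_store_explicit(&shared_usec, usec, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);      /* stands in for wmb() */
    atomic_fetch_add_explicit(&seq_end, 1, memory_order_relaxed);
}

/* Reader: never blocks; retries if a write overlapped the read. */
static void snapshot_time(long *sec, long *usec)
{
    long seen;

    assert(sec && usec);
    do {
        seen = atomic_load_explicit(&seq_end, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* stands in for rmb() */
        *sec = atomic_load_explicit(&shared_sec, memory_order_relaxed);
        *usec = atomic_load_explicit(&shared_usec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* stands in for rmb() */
    } while (seen != atomic_load_explicit(&seq_begin, memory_order_relaxed));
}
```

The reader loads the "end" counter first and the "begin" counter last,
so any write that starts or finishes during the read leaves the two
values unequal and forces another pass.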

Changes to glibc are unnecessary, because users who want to use the
vsyscall can do so by LD_PRELOADing a library which aliases gettimeofday
before executing their application. This will not affect any other
application and allows the backward compatibility issue to be ignored.
I've created an example library (to be attached in a following email)
and ran a quick performance test w/ and w/o the preloaded library,
giving the following results:

Normal gettimeofday
gettimeofday ( 1403307us / 1000000runs ) = 1.403306us
vsyscall LD_PRELOAD gettimeofday
gettimeofday ( 285423us / 1000000runs ) = 0.285423us
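
A measurement of this kind can be approximated with a few lines of C.
This is a rough sketch, not the tool from the tarball: the function
names are invented here, and it simply times a loop of gettimeofday()
calls with gettimeofday() itself, then prints a line in the same shape
as the results above.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Time `runs` back-to-back gettimeofday() calls and return the total
 * elapsed microseconds, measured with gettimeofday() itself. */
static long time_gettimeofday_calls(long runs)
{
    struct timeval start, end, scratch;

    assert(runs > 0);
    gettimeofday(&start, NULL);
    for (long i = 0; i < runs; i++)
        gettimeofday(&scratch, NULL);
    gettimeofday(&end, NULL);

    return (end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_usec - start.tv_usec);
}

/* Print one result line in the same shape as the figures quoted above. */
static void report(long total_us, long runs)
{
    printf("gettimeofday ( %ldus / %ldruns ) = %fus\n",
           total_us, runs, (double)total_us / (double)runs);
}
```

Calling report(time_gettimeofday_calls(1000000), 1000000) once normally
and once with the preload library active gives the two lines to
compare. With the numbers quoted above, the preloaded case is roughly
4.9 times faster per call.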

Since this code uses the TSC for calculating time of day, this patch
will not help systems that suffer from TSC skew (ie: many NUMA systems,
etc). However, for UP and SMP boxes this is a pretty major win.
Alternative methods to use the cyclone/HPET registers for NUMA boxes are
also feasible in the future.
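
The TSC arithmetic involved is small enough to show in portable C. The
sketch below assumes only what the timer_tsc.c comment in the patch
states: the cached quotient is 2^32 / (clocks per usec), and the offset
in microseconds is the high 32 bits of (tsc delta * quotient). The
helper names and the clocks_per_usec parameter are illustrative; the
kernel derives the real value during TSC calibration.

```c
#include <assert.h>
#include <stdint.h>

/* quotient = 2^32 / (clocks per usec), per the timer_tsc.c comment.
 * clocks_per_usec is an illustrative input, not a kernel variable. */
static uint32_t make_quotient(uint32_t clocks_per_usec)
{
    assert(clocks_per_usec > 1);    /* 2^32 / 1 would not fit in 32 bits */
    return (uint32_t)((UINT64_C(1) << 32) / clocks_per_usec);
}

/* Portable equivalent of "mull" followed by using %edx:
 * the high 32 bits of the 64-bit product. */
static uint32_t tsc_delta_to_usec(uint32_t tsc_delta, uint32_t quotient)
{
    return (uint32_t)(((uint64_t)tsc_delta * quotient) >> 32);
}
```

For a hypothetical 1024 MHz part the quotient is 2^22, and a delta of
2048000 cycles converts to exactly 2000 usec. Because the conversion
only sees the delta since the last timer interrupt on whichever CPU
executes it, TSCs that disagree between CPUs produce inconsistent
results, which is the skew problem mentioned above.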

Please consider for inclusion.

(And if linker scripts aren't your idea of a scary costume, I don't know
what is :)

thanks
-john


diff -Nru a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile
--- a/arch/i386/kernel/Makefile Fri Oct 18 15:52:11 2002
+++ b/arch/i386/kernel/Makefile Fri Oct 18 15:52:11 2002
@@ -9,7 +9,7 @@
obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \
ptrace.o i8259.o ioport.o ldt.o setup.o time.o sys_i386.o \
pci-dma.o i386_ksyms.o i387.o bluesmoke.o dmi_scan.o \
- bootflag.o
+ bootflag.o vsyscall.o

obj-y += cpu/
obj-y += timers/
diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c Fri Oct 18 15:52:11 2002
+++ b/arch/i386/kernel/time.c Fri Oct 18 15:52:11 2002
@@ -69,7 +69,10 @@
unsigned long cpu_khz; /* Detected as we calibrate the TSC */

extern rwlock_t xtime_lock;
-extern unsigned long wall_jiffies;
+struct timespec __xtime __section_xtime;
+unsigned long __wall_jiffies __section_wall_jiffies;
+struct timezone __sys_tz __section_sys_tz;
+volatile unsigned long __jiffies __section_jiffies;

spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED;

@@ -110,6 +113,7 @@
void do_settimeofday(struct timeval *tv)
{
write_lock_irq(&xtime_lock);
+ vxtime_lock();
/*
* This is revolting. We need to set "xtime" correctly. However, the
* value in this location is the value at the most recent update of
@@ -126,6 +130,8 @@

xtime.tv_sec = tv->tv_sec;
xtime.tv_nsec = (tv->tv_usec * 1000);
+ vxtime_unlock();
+
time_adjust = 0; /* stop active adjtime() */
time_status |= STA_UNSYNC;
time_maxerror = NTP_PHASE_LIMIT;
@@ -277,11 +283,11 @@
* locally disabled. -arca
*/
write_lock(&xtime_lock);
-
+ vxtime_lock();
timer->mark_offset();

do_timer_interrupt(irq, NULL, regs);
-
+ vxtime_unlock();
write_unlock(&xtime_lock);

}
diff -Nru a/arch/i386/kernel/timers/timer_tsc.c b/arch/i386/kernel/timers/timer_tsc.c
--- a/arch/i386/kernel/timers/timer_tsc.c Fri Oct 18 15:52:12 2002
+++ b/arch/i386/kernel/timers/timer_tsc.c Fri Oct 18 15:52:12 2002
@@ -10,6 +10,7 @@
#include <linux/cpufreq.h>

#include <asm/timer.h>
+#include <asm/vsyscall.h>
#include <asm/io.h>

extern int x86_udelay_tsc;
@@ -17,16 +18,16 @@

static int use_tsc;
/* Number of usecs that the last interrupt was delayed */
-static int delay_at_last_interrupt;
+int __delay_at_last_interrupt __section_delay_at_last_interrupt;

-static unsigned long last_tsc_low; /* lsb 32 bits of Time Stamp Counter */
+unsigned long __last_tsc_low __section_last_tsc_low; /* lsb 32 bits of Time Stamp Counter */

/* Cached *multiplier* to convert TSC counts to microseconds.
* (see the equation below).
* Equal to 2^32 * (1 / (clocks per usec) ).
* Initialized in time_init.
*/
-unsigned long fast_gettimeoffset_quotient;
+unsigned long __fast_gettimeoffset_quotient __section_fast_gettimeoffset_quotient;

static unsigned long get_offset_tsc(void)
{
diff -Nru a/arch/i386/kernel/vsyscall.c b/arch/i386/kernel/vsyscall.c
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/arch/i386/kernel/vsyscall.c Fri Oct 18 15:52:12 2002
@@ -0,0 +1,203 @@
+/*
+ * linux/arch/x86_64/kernel/vsyscall.c
+ *
+ * Copyright (C) 2001 Andrea Arcangeli <[email protected]> SuSE
+ *
+ * Thanks to [email protected] for some useful hint.
+ * Special thanks to Ingo Molnar for his early experience with
+ * a different vsyscall implementation for Linux/IA32 and for the name.
+ *
+ * vsyscall 1 is located at -10Mbyte, vsyscall 2 is located
+ * at virtual address -10Mbyte+1024bytes etc... There are at max 8192
+ * vsyscalls. One vsyscall can reserve more than 1 slot to avoid
+ * jumping out of line if necessary.
+ *
+ * $Id: vsyscall.c,v 1.9 2002/03/21 13:42:58 ak Exp $
+ *
+ * Ported to i386 by John Stultz <[email protected]>
+ */
+
+/*
+ * TODO 2001-03-20:
+ *
+ * 1) make page fault handler detect faults on page1-page-last of the vsyscall
+ * virtual space, and make it increase %rip and write -ENOSYS in %rax (so
+ * we'll be able to upgrade to a new glibc without upgrading kernel after
+ * we add more vsyscalls.
+ * 2) Possibly we need a fixmap table for the vsyscalls too if we want
+ * to avoid SIGSEGV and we want to return -EFAULT from the vsyscalls as well.
+ * Can we segfault inside a "syscall"? We can fix this anytime and those fixes
+ * won't be visible for userspace. Not fixing this is a noop for correct programs,
+ * broken programs will segfault and there's no security risk until we choose to
+ * fix it.
+ *
+ * These are not urgent things that we need to address only before shipping the first
+ * production binary kernels.
+ */
+
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/timer.h>
+#include <linux/sched.h>
+
+#include <asm/vsyscall.h>
+#include <asm/pgtable.h>
+#include <asm/page.h>
+#include <asm/fixmap.h>
+#include <asm/errno.h>
+#include <asm/msr.h>
+#include <asm/system.h>
+
+#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
+
+//#define NO_VSYSCALL 1
+
+#ifdef NO_VSYSCALL
+#include <asm/unistd.h>
+
+static int errno __section_vxtime_sequence;
+
+static inline _syscall2(int,gettimeofday,struct timeval *,tv,struct timezone *,tz)
+
+#else
+long __vxtime_sequence[2] __section_vxtime_sequence;
+
+static inline void do_vgettimeofday(struct timeval * tv)
+{
+ long sequence;
+ unsigned long usec, sec;
+
+ do {
+ unsigned long eax, edx;
+
+ sequence = __vxtime_sequence[1];
+ rmb();
+
+ /* Read the Time Stamp Counter */
+ rdtsc(eax,edx);
+
+ /* .. relative to previous jiffy (32 bits is enough) */
+ eax -= __last_tsc_low; /* tsc_low delta */
+
+ /*
+ * Time offset = (tsc_low delta) * fast_gettimeoffset_quotient
+ * = (tsc_low delta) * (usecs_per_clock)
+ * = (tsc_low delta) * (usecs_per_jiffy / clocks_per_jiffy)
+ *
+ * Using a mull instead of a divl saves up to 31 clock cycles
+ * in the critical path.
+ */
+
+
+ __asm__("mull %2"
+ :"=a" (eax), "=d" (edx)
+ :"rm" (__fast_gettimeoffset_quotient),
+ "0" (eax));
+
+ /* our adjusted time offset in microseconds */
+ usec = __delay_at_last_interrupt + edx;
+
+ {
+ unsigned long lost = __jiffies - __wall_jiffies;
+ if (lost)
+ usec += lost * (1000000 / HZ);
+ }
+ sec = __xtime.tv_sec;
+ usec += (__xtime.tv_nsec / 1000);
+
+ rmb();
+ } while (sequence != __vxtime_sequence[0]);
+
+ tv->tv_sec = sec + usec / 1000000;
+ tv->tv_usec = usec % 1000000;
+}
+
+static inline void do_get_tz(struct timezone * tz)
+{
+ long sequence;
+
+ do {
+ sequence = __vxtime_sequence[1];
+ rmb();
+
+ *tz = __sys_tz;
+
+ rmb();
+ } while (sequence != __vxtime_sequence[0]);
+}
+#endif
+
+static int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
+{
+#ifdef NO_VSYSCALL
+ return gettimeofday(tv,tz);
+#else
+ if (tv)
+ do_vgettimeofday(tv);
+ if (tz)
+ do_get_tz(tz);
+ return 0;
+#endif
+}
+
+static time_t __vsyscall(1) vtime(time_t * t)
+{
+ struct timeval tv;
+ vgettimeofday(&tv,NULL);
+ if (t)
+ *t = tv.tv_sec;
+ return tv.tv_sec;
+}
+
+static long __vsyscall(2) venosys_0(void)
+{
+ return -ENOSYS;
+}
+
+static long __vsyscall(3) venosys_1(void)
+{
+ return -ENOSYS;
+}
+static void __init map_vsyscall(void)
+{
+ extern char __vsyscall_0;
+ unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - __START_KERNEL_map;
+ pgd_t* pd;
+ pmd_t* pm;
+
+ __set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
+
+ /*
+ * Set vsyscall's fixmap pmd to be user readable:
+ * XXX HACK ALERT: this assumes non-vsyscall kernel space
+ * pte's will not have their userbit set. Otherwise this could
+ * be a security problem. Is this ok? Please advise [email protected]
+ */
+ pd = pgd_offset_k((unsigned long)&__last_tsc_low);
+ pm = pmd_offset(pd,(unsigned long)&__last_tsc_low);
+ pm->pmd |= _PAGE_USER;
+
+}
+
+static int __init vsyscall_init(void)
+{
+ printk("VSYSCALL: consistency checks...");
+ if ((unsigned long) &vgettimeofday != VSYSCALL_ADDR(__NR_vgettimeofday))
+ panic("vgettimeofday link addr broken");
+ if ((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime))
+ panic("vtime link addr broken");
+ if (VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE))
+ panic("fixmap first vsyscall %lx should be %lx", __fix_to_virt(VSYSCALL_FIRST_PAGE),
+ VSYSCALL_ADDR(0));
+ printk("passed...mapping...");
+ map_vsyscall();
+ printk("done.\n");
+
+ printk("VSYSCALL: fixmap virt addr: 0x%lx\n",
+ __fix_to_virt(VSYSCALL_FIRST_PAGE));
+
+ return 0;
+}
+
+__initcall(vsyscall_init);
diff -Nru a/arch/i386/vmlinux.lds.S b/arch/i386/vmlinux.lds.S
--- a/arch/i386/vmlinux.lds.S Fri Oct 18 15:52:11 2002
+++ b/arch/i386/vmlinux.lds.S Fri Oct 18 15:52:11 2002
@@ -4,7 +4,7 @@
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
-jiffies = jiffies_64;
+jiffies_64 = jiffies;
SECTIONS
{
. = 0xC0000000 + 0x100000;
@@ -90,6 +90,33 @@
__bss_stop = .;

_end = . ;
+
+ . = ALIGN(64);
+ .data.cacheline_aligned : { *(.data.cacheline_aligned) }
+
+ .vsyscall_0 0xffffe000: AT ((LOADADDR(.data.cacheline_aligned) + SIZEOF(.data.cacheline_aligned) + 4095) & ~(4095)) { *(.vsyscall_0) }
+ __vsyscall_0 = LOADADDR(.vsyscall_0);
+ . = ALIGN(64);
+ .vxtime_sequence : AT ((LOADADDR(.vsyscall_0) + SIZEOF(.vsyscall_0) + 63) & ~(63)) { *(.vxtime_sequence) }
+ vxtime_sequence = LOADADDR(.vxtime_sequence);
+ .last_tsc_low : AT (LOADADDR(.vxtime_sequence) + SIZEOF(.vxtime_sequence)) { *(.last_tsc_low) }
+ last_tsc_low = LOADADDR(.last_tsc_low);
+ .delay_at_last_interrupt : AT (LOADADDR(.last_tsc_low) + SIZEOF(.last_tsc_low)) { *(.delay_at_last_interrupt) }
+ delay_at_last_interrupt = LOADADDR(.delay_at_last_interrupt);
+ .fast_gettimeoffset_quotient : AT (LOADADDR(.delay_at_last_interrupt) + SIZEOF(.delay_at_last_interrupt)) { *(.fast_gettimeoffset_quotient) }
+ fast_gettimeoffset_quotient = LOADADDR(.fast_gettimeoffset_quotient);
+ .wall_jiffies : AT (LOADADDR(.fast_gettimeoffset_quotient) + SIZEOF(.fast_gettimeoffset_quotient)) { *(.wall_jiffies) }
+ wall_jiffies = LOADADDR(.wall_jiffies);
+ .sys_tz : AT (LOADADDR(.wall_jiffies) + SIZEOF(.wall_jiffies)) { *(.sys_tz) }
+ sys_tz = LOADADDR(.sys_tz);
+ . = ALIGN(16);
+ .jiffies : AT ((LOADADDR(.sys_tz) + SIZEOF(.sys_tz) + 15) & ~(15)) { *(.jiffies) }
+ jiffies = LOADADDR(.jiffies);
+ . = ALIGN(16);
+ .xtime : AT ((LOADADDR(.jiffies) + SIZEOF(.jiffies) + 15) & ~(15)) { *(.xtime) }
+ xtime = LOADADDR(.xtime);
+ .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT (LOADADDR(.vsyscall_0) + 1024) { *(.vsyscall_1) }
+ . = LOADADDR(.vsyscall_0) + 4096;

/* Sections to be discarded */
/DISCARD/ : {
diff -Nru a/include/asm-i386/fixmap.h b/include/asm-i386/fixmap.h
--- a/include/asm-i386/fixmap.h Fri Oct 18 15:52:11 2002
+++ b/include/asm-i386/fixmap.h Fri Oct 18 15:52:11 2002
@@ -18,6 +18,7 @@
#include <asm/acpi.h>
#include <asm/apicdef.h>
#include <asm/page.h>
+#include <asm/vsyscall.h>
#ifdef CONFIG_HIGHMEM
#include <linux/threads.h>
#include <asm/kmap_types.h>
@@ -49,6 +50,8 @@
* fix-mapped?
*/
enum fixed_addresses {
+ VSYSCALL_LAST_PAGE,
+ VSYSCALL_FIRST_PAGE = VSYSCALL_LAST_PAGE + ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1,
#ifdef CONFIG_X86_LOCAL_APIC
FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */
#endif
diff -Nru a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
--- a/include/asm-i386/pgtable.h Fri Oct 18 15:52:12 2002
+++ b/include/asm-i386/pgtable.h Fri Oct 18 15:52:12 2002
@@ -139,11 +139,14 @@
#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW)
#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
-
+#define __PAGE_KERNEL_VSYSCALL \
+ (_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
+
#define PAGE_KERNEL __pgprot(__PAGE_KERNEL)
#define PAGE_KERNEL_RO __pgprot(__PAGE_KERNEL_RO)
#define PAGE_KERNEL_NOCACHE __pgprot(__PAGE_KERNEL_NOCACHE)
#define PAGE_KERNEL_LARGE __pgprot(__PAGE_KERNEL_LARGE)
+#define PAGE_KERNEL_VSYSCALL __pgprot(__PAGE_KERNEL_VSYSCALL)

/*
* The i386 can't do page protection for execute, and considers that
diff -Nru a/include/asm-i386/vsyscall.h b/include/asm-i386/vsyscall.h
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/include/asm-i386/vsyscall.h Fri Oct 18 15:52:12 2002
@@ -0,0 +1,49 @@
+#ifndef _ASM_i386_VSYSCALL_H_
+#define _ASM_i386_VSYSCALL_H_
+
+enum vsyscall_num {
+ __NR_vgettimeofday,
+ __NR_vtime,
+};
+
+#define VSYSCALL_START 0xffffe000
+#define VSYSCALL_SIZE 1024
+#define VSYSCALL_END (0xffffe000 + PAGE_SIZE)
+#define VSYSCALL_ADDR(vsyscall_nr) (VSYSCALL_START+VSYSCALL_SIZE*(vsyscall_nr))
+
+#ifdef __KERNEL__
+#define __START_KERNEL_map 0xC0000000
+
+#define __section_last_tsc_low __attribute__ ((unused, __section__ (".last_tsc_low")))
+#define __section_delay_at_last_interrupt __attribute__ ((unused, __section__ (".delay_at_last_interrupt")))
+#define __section_fast_gettimeoffset_quotient __attribute__ ((unused, __section__ (".fast_gettimeoffset_quotient")))
+#define __section_wall_jiffies __attribute__ ((unused, __section__ (".wall_jiffies")))
+#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies")))
+#define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz")))
+#define __section_xtime __attribute__ ((unused, __section__ (".xtime")))
+#define __section_vxtime_sequence __attribute__ ((unused, __section__ (".vxtime_sequence")))
+
+/* vsyscall space (readonly) */
+extern long __vxtime_sequence[2];
+extern int __delay_at_last_interrupt;
+extern unsigned long __last_tsc_low;
+extern unsigned long __fast_gettimeoffset_quotient;
+extern struct timespec __xtime;
+extern volatile unsigned long __jiffies;
+extern unsigned long __wall_jiffies;
+extern struct timezone __sys_tz;
+
+/* kernel space (writeable) */
+extern unsigned long last_tsc_low;
+extern int delay_at_last_interrupt;
+extern unsigned long fast_gettimeoffset_quotient;
+extern unsigned long wall_jiffies;
+extern struct timezone sys_tz;
+extern long vxtime_sequence[2];
+
+#define vxtime_lock() do { vxtime_sequence[0]++; wmb(); } while(0)
+#define vxtime_unlock() do { wmb(); vxtime_sequence[1]++; } while (0)
+
+#endif /* __KERNEL__ */
+
+#endif /* _ASM_i386_VSYSCALL_H_ */


2002-10-18 23:02:08

by john stultz

Subject: [EXAMPLE CODE] linux-2.5.43_vsyscall_A0

Linus, All

Here again is the tarball containing the example vsyscall LD_PRELOAD
library, and the quick/rough performance measuring tool.

thanks
-john


Attachments:
vsyscall_test.tar.gz (787.00 B)

2002-10-19 02:44:24

by Jeff Dike

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

[email protected] said:
> Since no one really brought up any issues with the code itself
> (correct me if I'm wrong), here is the i386 vsyscall gettimeofday port
> I sent last night, synced up and ready for integration.

This vsyscall implementation breaks UML. Any app that's run inside UML
that uses vsyscalls will get the host's vsyscalls rather than the UML
vsyscalls.

This needs to be virtualizable somehow, which means that apps run inside
UML, without being changed, get the UML vsyscalls. There are a couple of
possibilities that I can think of:
        a get_vsyscall system call which is executed by libc on startup -
UML would return a different value from the host
        some mechanism for UML to map its own vsyscall area in place of
the host's - it wouldn't necessarily need to be able to unmap it when it's
running its own kernel code because it can probably arrange not to use the
host's vsyscalls.
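
To make the first possibility concrete, here is a rough sketch of what
the libc side could look like. Nothing in it exists today: get_vsyscall
is only the proposed name, and install_gtod, my_gettimeofday and the
fixed-time stub are invented for illustration. The point is just that
libc resolves an entry point once and calls through a pointer
afterwards, so a UML guest can be handed a different entry than the
host would give.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

typedef int (*gtod_fn)(struct timeval *, struct timezone *);

/* Ordinary system-call path, used when no vsyscall entry is offered. */
static int gtod_via_syscall(struct timeval *tv, struct timezone *tz)
{
    return gettimeofday(tv, tz);
}

/* Stand-in for an entry point a UML guest kernel might hand back:
 * it reports a fixed guest time instead of the host's clock. */
static int gtod_fixed_guest_time(struct timeval *tv, struct timezone *tz)
{
    (void)tz;
    if (tv) {
        tv->tv_sec = 1000;
        tv->tv_usec = 0;
    }
    return 0;
}

static gtod_fn gtod_impl = gtod_via_syscall;

/* Called once at startup with whatever the (hypothetical) get_vsyscall
 * query returned; NULL means "no vsyscall area, keep using the syscall". */
static void install_gtod(gtod_fn entry)
{
    gtod_impl = entry ? entry : gtod_via_syscall;
}

/* What applications would end up calling. */
static int my_gettimeofday(struct timeval *tv, struct timezone *tz)
{
    assert(gtod_impl != NULL);
    return gtod_impl(tv, tz);
}
```

The cost of this scheme is the indirect call on every invocation, in
exchange for never hard-coding the vsyscall address into applications.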

Jeff

2002-10-19 03:04:52

by Andi Kleen

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, Oct 19, 2002 at 05:52:53AM +0200, Jeff Dike wrote:
> [email protected] said:
> > Since no one really brought up any issues with the code itself
> > (correct me if I'm wrong), here is the i386 vsyscall gettimeofday port
> > I sent last night, synced up and ready for integration.
>
> This vsyscall implementation breaks UML. Any app that's run inside UML
> that uses vsyscalls will get the host's vsyscalls rather than the UML
> vsyscalls.

Ugh.

Guess you'll have some problems then with UML on x86-64, which always uses
vgettimeofday. But it's only used for gettimeofday() currently, perhaps it's
not that bad when the UML child runs with the host's time.

I guess it would be possible to add some support for UML
to map own code over the vsyscall reserved locations. UML would need
to use the syscalls then. But it'll be likely ugly.

-Andi

2002-10-19 03:41:03

by Jeff Dike

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

[email protected] said:
> Guess you'll have some problems then with UML on x86-64, which always
> uses vgettimeofday. But it's only used for gettimeofday() currently,
> perhaps it's not that bad when the UML child runs with the host's
> time.

It's not horrible, but it's still broken. There are people who depend
on UML being able to keep its own time separately from the host.

> I guess it would be possible to add some support for UML to map own
> code over the vsyscall reserved locations. UML would need to use the
> syscalls then. But it'll be likely ugly.

Yeah, it would be.

My preferred solution would be for libc to ask the kernel where the vsyscall
area is. That's reasonably clean and virtualizable. Andrea doesn't like it
because it adds a few instructions to the vsyscall address calculation.

Jeff

2002-10-19 03:57:12

by Andi Kleen

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

[full quote for context]

On Sat, Oct 19, 2002 at 06:49:59AM +0200, Jeff Dike wrote:
> [email protected] said:
> > Guess you'll have some problems then with UML on x86-64, which always
> > uses vgettimeofday. But it's only used for gettimeofday() currently,
> > perhaps it's not that bad when the UML child runs with the host's
> > time.
>
> It's not horrible, but it's still broken. There are people who depend
> on UML being able to keep its own time separately from the host.
>
> > I guess it would be possible to add some support for UML to map own
> > code over the vsyscall reserved locations. UML would need to use the
> > syscalls then. But it'll be likely ugly.
>
> Yeah, it would be.
>
> My preferred solution would be for libc to ask the kernel where the vsyscall
> area is. That's reasonably clean and virtualizable. Andrea doesn't like it
> because it adds a few instructions to the vsyscall address calculation.

I would have no problems with adding that to the x86-64 kernel. It could
be passed in by the ELF environment vector and added to the ABI.
Overhead should be negligible, it just needs a single table lookup.
Andreas, what do you think ?
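
As a sketch of what such a lookup amounts to in userspace: glibc's
getauxval() returns the value for a given tag from the ELF auxiliary
vector, or 0 if the tag is absent. No tag for a vsyscall base is
defined by this patch, so the example borrows AT_SYSINFO_EHDR, the tag
present-day Linux uses to advertise its vDSO, purely as a stand-in. The
function name is made up.

```c
#include <assert.h>
#include <elf.h>
#include <sys/auxv.h>

/* The address is carried as an unsigned long and converted to a pointer. */
static_assert(sizeof(unsigned long) == sizeof(void *),
              "auxv values are expected to be pointer-sized");

/* Return the address the kernel advertised in the auxiliary vector,
 * or NULL if the entry is absent. AT_SYSINFO_EHDR stands in for
 * whatever tag an ABI addition like the one proposed would define. */
static const void *vsyscall_base_from_auxv(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);

    return (const void *)base;
}
```

The lookup happens once per process, so any per-call cost comes from
calling through the saved address rather than from the lookup itself.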

-Andi

2002-10-19 04:04:24

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Fri, Oct 18, 2002 at 11:49:59PM -0500, Jeff Dike wrote:
> [email protected] said:
> > Guess you'll have some problems then with UML on x86-64, which always
> > uses vgettimeofday. But it's only used for gettimeofday() currently,
> > perhaps it's not that bad when the UML child runs with the host's
> > time.
>
> It's not horrible, but it's still broken. There are people who depend
> on UML being able to keep its own time separately from the host.
>
> > I guess it would be possible to add some support for UML to map own
> > code over the vsyscall reserved locations. UML would need to use the
> > syscalls then. But it'll be likely ugly.
>
> Yeah, it would be.
>
> My preferred solution would be for libc to ask the kernel where the vsyscall
> area is. That's reasonably clean and virtualizable. Andrea doesn't like it
> because it adds a few instructions to the vsyscall address calculation.

yes, my preferred solution is still a runtime /proc entry that turns off
vsyscalls completely by root so you could trap gettimeofday/time via the
usual ptrace. That would be zero cost. Of course this would be needed
only for the special users needing a revirtualized time. I tend to think
most people don't need a revirtualized time in uml, the exceptions can
run slower [not slower than x86 of course except for an additional
call/ret pair that won't matter compared to the ptrace overhead of
every syscall]. I mean, I would prefer to optimize for the people who
need fast performance, if you can deal with the ptrace overhead at
every sysenter/sysret most probably you don't need the vsyscalls in the
first place.

My argument is that whatever solution to this problem has a penalty of
some kind, and I prefer to keep the penalty on the side of the most
unlikely case, and as far as I can tell it's the case of people needing
uml running with revirtualized real time. certainly I want to make it
possible, but I don't care to optimize for it, I want it (not the
others) to pay for the additional feature it needs.

If the global sysctl is unacceptable, the next fallback would be to have
a per-task information that defaults to vsyscall to execute the syscall,
a check in switch_to could replace the fixmap entry with the one of the
other vsyscall. That would be an additional single unlikely branch in
switch_to unless I'm overlooking something and I could live with it
even though it's not an absolute zero cost. That should still be *much*
better than glibc asking the kernel for the address of the vsyscalls using a
syscall after execve and using pointers to functions later at runtime to
invoke the vsyscalls, I don't really like that solution.

So I would go with:

1) global sysctl to turn off vgettimeofday/vtime
2) if 1) is unacceptable then per-task turnoff of vsyscalls would be the
next viable solution

Comments?

Andrea

2002-10-19 04:11:08

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, Oct 19, 2002 at 06:02:38AM +0200, Andi Kleen wrote:
> [full quote for context]
>
> On Sat, Oct 19, 2002 at 06:49:59AM +0200, Jeff Dike wrote:
> > [email protected] said:
> > > Guess you'll have some problems then with UML on x86-64, which always
> > > uses vgettimeofday. But it's only used for gettimeofday() currently,
> > > perhaps it's not that bad when the UML child runs with the host's
> > > time.
> >
> > It's not horrible, but it's still broken. There are people who depend
> > on UML being able to keep its own time separately from the host.
> >
> > > I guess it would be possible to add some support for UML to map own
> > > code over the vsyscall reserved locations. UML would need to use the
> > > syscalls then. But it'll be likely ugly.
> >
> > Yeah, it would be.
> >
> > My preferred solution would be for libc to ask the kernel where the vsyscall
> > area is. That's reasonably clean and virtualizable. Andrea doesn't like it
> > because it adds a few instructions to the vsyscall address calculation.
>
> I would have no problems with adding that to the x86-64 kernel. It could
> be passed in by the ELF environment vector and added to the ABI.
> Overhead should be negligible, it just needs a single table lookup.
> Andreas, what do you think ?

see my last email. And I think he needed it as an additional syscall
after execve that he could trap and revirtualize with ptrace as usual
and that would return variable addresses of pointer to functions (that
would be revirtualized inside the uml kernel of course), not an ELF
information that should be valid for both UML and host kernel.

Andrea

2002-10-19 04:40:34

by Andi Kleen

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, Oct 19, 2002 at 06:10:19AM +0200, Andrea Arcangeli wrote:
> On Fri, Oct 18, 2002 at 11:49:59PM -0500, Jeff Dike wrote:
> > [email protected] said:
> > > Guess you'll have some problems then with UML on x86-64, which always
> > > uses vgettimeofday. But it's only used for gettimeofday() currently,
> > > perhaps it's not that bad when the UML child runs with the host's
> > > time.
> >
> > It's not horrible, but it's still broken. There are people who depend
> > on UML being able to keep its own time separately from the host.
> >
> > > I guess it would be possible to add some support for UML to map own
> > > code over the vsyscall reserved locations. UML would need to use the
> > > syscalls then. But it'll be likely ugly.
> >
> > Yeah, it would be.
> >
> > My preferred solution would be for libc to ask the kernel where the vsyscall
> > area is. That's reasonably clean and virtualizable. Andrea doesn't like it
> > because it adds a few instructions to the vsyscall address calculation.
>
> yes, my preferred solution is still a runtime /proc entry that turns off
> vsyscalls completely by root so you could trap gettimeofday/time via the
> usual ptrace. That would be zero cost. Of course this would be needed

Ok, a sysctl that modifies a variable in the vsyscall page and is
tested by the code. That would be an option, I agree.

For the locked TSC code we will need something like that anyways,
so that locked TSC can force a syscall.

-Andi

2002-10-19 04:56:05

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, Oct 19, 2002 at 06:45:56AM +0200, Andi Kleen wrote:
> On Sat, Oct 19, 2002 at 06:10:19AM +0200, Andrea Arcangeli wrote:
> > On Fri, Oct 18, 2002 at 11:49:59PM -0500, Jeff Dike wrote:
> > > [email protected] said:
> > > > Guess you'll have some problems then with UML on x86-64, which always
> > > > uses vgettimeofday. But it's only used for gettimeofday() currently,
> > > > perhaps it's not that bad when the UML child runs with the host's
> > > > time.
> > >
> > > It's not horrible, but it's still broken. There are people who depend
> > > on UML being able to keep its own time separately from the host.
> > >
> > > > I guess it would be possible to add some support for UML to map own
> > > > code over the vsyscall reserved locations. UML would need to use the
> > > > syscalls then. But it'll be likely ugly.
> > >
> > > Yeah, it would be.
> > >
> > > My preferred solution would be for libc to ask the kernel where the vsyscall
> > > area is. That's reasonably clean and virtualizable. Andrea doesn't like it
> > > because it adds a few instructions to the vsyscall address calculation.
> >
> > yes, my preferred solution is still a runtime /proc entry that turns off
> > vsyscalls completely by root so you could trap gettimeofday/time via the
> > usual ptrace. That would be zero cost. Of course this would be needed
>
> Ok, a sysctl that modifies a variable in the vsyscall page and is
> tested by the code. That would be an option, I agree.

the sysctl would replace the vsyscall fixmap entry for the
current cpu entirely at switch_to time, I certainly don't want to add
unnecessary branches in the core vsyscall code. Doing it globally is
zero cost but it would probably need privileges as said, per-task could
be more dynamic without privileges and it would be an unlikely branch
added in switch_to, still a very low cost so still acceptable.

> For the locked TSC code we will need something like that anyways,
> so that locked TSC can force a syscall.

If we use a per-cpu TSC we don't need the syscall, the cpuid encoded in
each 64bit variable will be enough (see my email of yesterday
evening, I realized a way to handle per-cpu info with vsyscalls). the
main problem is as usual that the TSC isn't a real time source, it
changes frequency all the time, but as usual all the problems in the
gettimeofday implementation have little to do with the vsyscall details,
in particular now that I realized how to handle per-cpu data, they're
generic issues that need solving even if vsyscalls would redirect to
the syscalls.

The only thing we definitely can't do in the vsyscalls is to read the
PIT because it's an old I/O mapped device, but who could ever live only
with the PIT anyways these days? If you've only the PIT, vsyscalls
or not, the gettimeofday functionality would suck.

Andrea

2002-10-19 19:10:06

by Bill Davidsen

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, 19 Oct 2002, Andrea Arcangeli wrote:

> So I would go with:
>
> 1) global sysctl to turn off vgettimeofday/vtime
> 2) if 1) is unacceptable then per-task turnoff of vsyscalls would be the
> next viable solution
>
> Comments?

I think (2) is a more general solution. Let UML handle any cases it
wishes, not limited to just this particular vsyscall.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-10-19 22:27:53

by linux-kernel

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

In article <[email protected]>,
Jeff Dike <[email protected]> writes:
>
> This needs to be virtualizable somehow, which means that apps run inside
> UML, without being changed, get the UML vsyscalls. There are a couple of
> possibilities that I can think of:
> a get_vsyscall system call which is executed by libc on startup -
> UML would return a different value from the host
> some mechanism for UML to map its own vsyscall area in place of
> the host's - it wouldn't necessarily need to be able to unmap it when it's
> running its own kernel code because it can probably arrange not to use the
> host's vsyscalls.
>
> Jeff
>
In case you want UML to also be able to work as a jail, it should
actually be impossible to get to the "real" system calls. In that case
just asking libc is not acceptable if the other area remains available.

2002-10-19 22:33:34

by Jeff Dike

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

[email protected] said:
> the sysctl would replace the vsyscall fixmap entry for the
> current cpu entirely at switch_to time, I certainly don't want to add
> unnecessary branches in the core vsyscall code. Doing it globally
> is zero cost but it would probably need privileges as said, per-task
> could be more dynamic without privileges and it would be an unlikely
> branch added in switch_to, still a very low cost so still acceptable.

If I'm understanding this (and reading the code) correctly, this would
allow UML to specify that, for a given process, it should have a page of
its choosing mapped into the vsyscall area. Correct?

If so, I can go along with this. It makes vsyscalls virtualizable, and thus
available to UML, which needs them more than the other arches :-)

The one suggestion I'd make is to make it per-mm rather than per-task and
put it in switch_mm rather than switch_to.

Jeff

2002-10-20 00:11:32

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, Oct 19, 2002 at 06:43:28PM -0500, Jeff Dike wrote:
> [email protected] said:
> > the sysctl would replace the vsyscall fixmap entry for the
> > current cpu entirely at switch_to time, I certainly don't want to add
> > unnecessary branches in the core vsyscall code. Doing it globally
> > is zero cost but it would probably need privileges as said, per-task
> > could be more dynamic without privileges and it would be an unlikely
> > branch added in switch_to, still a very low cost so still acceptable.
>
> If I'm understanding this (and reading the code) correctly, this would
> allow UML to specify that, for a given process, it should have a page of
> its choosing mapped into the vsyscall area. Correct?

What I suggested is an arch specific syscall to shut down vsyscalls
entirely for the current task and its children, the vsyscall will call
into the real syscall with sysenter, and you will be able to
revirtualize gettimeofday/time like you do on x86 with ptrace.

> If so, I can go along with this. It makes vsyscalls virtualizable, and thus
> available to UML, which needs them more than the other arches :-)

what do you mean that uml needs the vsyscalls more than the other archs?
You can use the vsyscall in-kernel by executing such syscall to turn the
vsyscall on and then turning them off again before returning to
the uml userspace. Remapping the vsyscall address is messy, we would
need to define a fallback place where to put them and we'd need to deal
with mapping your userspace bytecode in the place of the kernel code
containing the vsyscalls, it's an overdesign mess and it breaks so many
things about the mm design. I much prefer you to keep trapping the
gettimeofday and time with ptrace after shutting down the vsyscalls for
the current task, it's so much cleaner. The overhead of ptrace cannot be
your point, if that overhead is a showstopper uml isn't an option in the
first place.

> The one suggestion I'd make is to make it per-mm rather than per-task and
> put it in switch_mm rather than switch_to.

doesn't make sense to me, the time of the day has nothing specific to
the mm. The fact that turning off the vsyscall involves an invlpg
doesn't matter either since they're global and they need the explicit
flush anyway; furthermore the per-task info would be in the task struct
not in the mm_struct. I don't see any point in binding a time-of-day
information with the mm. It sounds quite natural to make it a per-task
information not per-mm (also think strace/ptrace, maybe strace/ptrace
could like to still handle gettimeofday, but there's no point to force
all other threads to use the syscall just because we're playing with one
task in the process group).

Andrea

2002-10-20 00:52:51

by Jeff Dike

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

[email protected] said:
> What I suggested is an arch specific syscall to shut down vsyscalls
> entirely for the current task and its children,

Then I misunderstood.

> the vsyscall will call
> into the real syscall with sysenter, and you will be able to
> revirtualize gettimeofday/time like you do on x86 with ptrace.

And the task-specific fixmap entry would point to a page that makes the normal
system call?

> what do you mean that uml needs the vsyscalls more than the other
> archs?

Because its system calls are much slower than the host's. It would benefit
more from vsyscalls.

> I much prefer you to keep trapping the gettimeofday and time with
> ptrace after shutting down the vsyscalls for the current task, it's so
> much cleaner.

And so much slower.

> The overhead of ptrace cannot be your point, if that
> overhead is a showstopper uml isn't an option in the first place.

I don't plan on using ptrace forever. That overhead is going to shrink, and
vsyscalls are one way to make it shrink.

I intend to make UML perform by grabbing whatever improvements from wherever
I can get them, and if I can't get vsyscalls because they're not virtualizable,
then, from my point of view, their design is broken.

Jeff

2002-10-20 01:45:05

by Rik van Riel

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Fri, 18 Oct 2002, Jeff Dike wrote:

> My preferred solution would be for libc to ask the kernel where the
> vsyscall area is. That's reasonably clean and virtualizable. Andrea
> doesn't like it because it adds a few instructions to the vsyscall
> address calculation.

Sounds like the best solution indeed, especially when keeping
in mind the strange people who want to run with a different
user:kernel split or statically linked binaries at fun addresses
so they've got more space for their fortran arrays ;)

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: [email protected]

2002-10-20 02:29:03

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, Oct 19, 2002 at 09:03:09PM -0500, Jeff Dike wrote:
> [email protected] said:
> > What I suggested is an arch specific syscall to shut down vsyscalls
> > entirely for the current task and its children,
>
> Then I misunderstood.
>
> > the vsyscall will call
> > into the real syscall with sysenter, and you will be able to
> > revirtualize gettimeofday/time like you do on x86 with ptrace.
>
> And the task-specific fixmap entry would point to a page that makes the normal
> system call?

correct.

>
> > what do you mean that uml needs the vsyscalls more than the other
> > archs?
>
> Because its system calls are much slower than the host's. It would benefit
> more from vsyscalls.

yes, this is true for all the syscalls, if that's a problem uml isn't an
option for the user in the first place.

> > I much prefer you to keep trapping the gettimeofday and time with
> > ptrace after shutting down the vsyscalls for the current task, it's so
> > much cleaner.
>
> And so much slower.

so much slower like all other syscalls under uml.

> > The overhead of ptrace cannot be your point, if that
> > overhead is a showstopper uml isn't an option in the first place.
>
> I don't plan on using ptrace forever. That overhead is going to shrink, and
> vsyscalls are one way to make it shrink.

what do you plan to do to make all other syscalls faster? gettimeofday is
the only thing that can be implemented as a vsyscall.

> I intend to make UML perform by grabbing whatever improvements from wherever
> I can get them, and if I can't get vsyscalls because they're not virtualizable,
> then, from my point of view, their design is broken.

By the same argument you could claim x86 and several other archs broken
because they don't support revirtualization in hardware like s390, not
even vmware is claiming anything like that.

And let's assume you're right, let's declare vsyscall design broken,
let's back them out, so we will force you to take the ptrace hit always,
just like in x86, so even the 99% of uml users like me that are fine to
live with system time being in sync with the host will have to take the
ptrace hit at every gettimeofday.

Your claim is obviously broken, not the vsyscall design.

The fact is that the vsyscalls design is the only way for you to run as
fast as the host kernel for the very big exception of gettimeofday that
is a purely read-only operation, go figure how much vsyscalls are broken
(yeah, you could optimize getpid too in UP [smp would make it slower by
having to add some code to the scheduler, scheduler run much more
frequently than getpid(2)] like for the host kernel, but that's not
worth the complexity even outside uml).

My problem is that mapping user code into the vsyscall fixmap is complex
and not very clean at all, breaks various concepts in the mm and
last but not the least it is slow, if you want to run fast you must
_NOT_ revirtualize.

Everybody who needs an ultra optimized uml can simply run with the uml system
time in sync with the host, it will run faster than the revirtualization
since you won't get the tlb flush at every switch, you need zero kernel
changes, it will run at full speed like the host kernel. revirtualized
vsyscalls sound like overdesign; before I consider adding that
feature I want to know _who_ needs this optimization for gettimeofday
that can't live with the system time in sync with the host. vsyscalls
just improves automatically uml at full speed if only you will set the
default to have the system time in sync with the host. The other 1% of
uml users will be fine to take the usual ptrace hit like on x86
currently and just like for all other syscalls out there. I just want to
make sure not to overdesign, maintenance is hard enough with
simple concepts.

As far as I can tell the guy using uml to fake the system time for its
preferred evaluation demo will be certainly fine to ptrace around
gettimeofday. And OTOH the production guys using uml for security
reasons to run Oracle or the apache cgi on top of it will want _NOT_ to
pay the tlb flush at switch_to either and they will want uml _not_
revirtualizing the vsyscalls regardless of that feature provided by the
kernel or not. Hence what you ask for is plain overdesign IMHO.

If only you could raise a valid interesting real life scenario where
they can't live with the system time being in sync with the host and
where gettimeofday performance is critical I would love to hear.

In any case the libc option is a no-way, the whole point of the current
vsyscall api is not to deal with pointer to functions so the libc way is
not worth considering IMHO. revirtualized vsyscalls are certainly worth
discussing but I'm not for it.

Andrea

2002-10-20 02:53:26

by Andi Kleen

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, Oct 19, 2002 at 06:16:59AM +0200, Andrea Arcangeli wrote:
> see my last email. And I think he needed it as an additional syscall
> after execve that he could trap and revirtualize with ptrace as usual
> and that would return variable addresses of pointer to functions (that
> would be revirtualized inside the uml kernel of course), not an ELF
> information that should be valid for both UML and host kernel.

Implementing it per process is tricky. How do you access the per process
state in the vsyscall area ? To do it properly you would need one dedicated
page per mm_struct that is mapped in there. But it could not be in the
normal vsyscall area (otherwise you couldn't share the kernel pagetables
anymore), but somewhere else in the address space.

I think a global sysctl that just modifies the global vsyscall pages is more
suitable here. It avoids the overhead of needing a per process page.
I see no real need anyways to do it per process. When you have one process
that cannot deal with vsyscalls the whole system will get a bit slower,
but the slowdown shouldn't be noticeable anyways. If you run your highend
database which does thousands of gettimeofday each second just don't set
the sysctl.

-Andi

2002-10-20 02:50:43

by Andi Kleen

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sun, Oct 20, 2002 at 03:50:37AM +0200, Rik van Riel wrote:
> On Fri, 18 Oct 2002, Jeff Dike wrote:
>
> > My preferred solution would be for libc to ask the kernel where the
> > vsyscall area is. That's reasonably clean and virtualizable. Andrea
> > doesn't like it because it adds a few instructions to the vsyscall
> > address calculation.
>
> Sounds like the best solution indeed, especially when keeping
> in mind the strange people who want to run with a different
> user:kernel split or statically linked binaries at fun addresses
> so they've got more space for their fortran arrays ;)

x86-64 addresses that (not that it would need it) by putting vsyscalls at
the top of the virtual mapping after the direct mapped area. This part
never moves even with changed __PAGE_OFFSET. The same technique should work
on i386 too.

-Andi

2002-10-20 06:39:05

by Elladan

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sun, Oct 20, 2002 at 04:59:14AM +0200, Andi Kleen wrote:
> On Sat, Oct 19, 2002 at 06:16:59AM +0200, Andrea Arcangeli wrote:
> > see my last email. And I think he needed it as an additional syscall
> > after execve that he could trap and revirtualize with ptrace as usual
> > and that would return variable addresses of pointer to functions (that
> > would be revirtualized inside the uml kernel of course), not an ELF
> > information that should be valid for both UML and host kernel.
>
> Implementing it per process is tricky. How do you access the per process
> state in the vsyscall area ? To do it properly you would need one dedicated
> page per mm_struct that is mapped in there. But it could not be in the
> normal vsyscall area (otherwise you couldn't share the kernel pagetables
> anymore), but somewhere else in the address space.
>
> I think a global sysctl that just modifies the global vsyscall pages is more
> suitable here. It avoids the overhead of needing a per process page.
> I see no real need anyways to do it per process. When you have one process
> that cannot deal with vsyscalls the whole system will get a bit slower,
> but the slowdown shouldn't be noticeable anyways. If you run your highend
> database which does thousands of gettimeofday each second just don't set
> the sysctl.

The problem with modifying the executable code/pages in the vsyscall
area is that it's going to be very tricky to implement, if I understand
this discussion properly.

There may be any number of user processes idling in these pages on the
runqueue (or off it if say one received a SIGSTOP), and if you just go
change the instruction code on them, unless you're incredibly careful
and come up with some subtly safe machine code sequence, they're going
to crash when you call this sysctl().

It seems like this indicates that you have to start getting crazy at
that point. That is, what you need to do is scan through all processes
on the runqueue (and also any that might be eg. frozen) and examine
their pc. If it's in the vsyscall area, either complete the system call
for them, or somehow roll-back their register state and reset their PC
to the start of the vsyscall function.

Just using a test in the vsyscall to check a variable seems like a much
cleaner global approach. It has its own problem though, since processes
that are idling in the vsyscall pages may wake up after vsyscalls have
been disabled. It seems like they could then be prone to return the
wrong result, if say the offset data was no longer being updated
properly by the kernel because of the mode change.

Making it per-process should avoid these problems nicely, at least, so
long as the process disabling vsyscalls knows what it's doing and
doesn't try to call the sysctl from a signal handler or something.

-J

2002-10-20 09:21:28

by Andi Kleen

Subject: Re: [discuss] Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sat, Oct 19, 2002 at 11:44:33PM -0700, Elladan wrote:
> The problem with modifying the executable code/pages in the vsyscall
> area is that it's going to be very tricky to implement, if I understand
> this discussion properly.

Modifying the pages or variables in the pages from the kernel is no
problem. It would just affect all processes on the system.

What's tricky is to give it per-process state (which would
be needed to make a vsyscall/novsyscall flag process-local).
>
> There may be any number of user processes idling in these pages on the
> runqueue (or off it if say one received a SIGSTOP), and if you just go
> change the instruction code on them, unless you're incredibly careful
> and come up with some subtly safe machine code sequence, they're going
> to crash when you call this sysctl().

Nobody proposed to use self modifying code, it would just be a global
variable located in the vsyscall area that is tested by the vsyscall
code.
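
A minimal user-space sketch of that idea, just to make the shape of the
check concrete. Everything here is an assumption for illustration: the
names vsyscall_enabled, fast_gettimeofday and slow_gettimeofday are made
up, and the "fast" path simply calls libc instead of reading the TSC as
the real patch would.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical stand-in for a flag the kernel would keep in the
 * vsyscall page and flip through a sysctl. */
volatile int vsyscall_enabled = 1;

/* Counts slow-path entries so the fallback is observable. */
int slow_path_calls;

/* Hypothetical fast path: the real code would read the TSC plus
 * kernel-exported time variables without entering the kernel. */
static int fast_gettimeofday(struct timeval *tv)
{
    return gettimeofday(tv, NULL);
}

/* Hypothetical slow path: stands in for the real system call. */
static int slow_gettimeofday(struct timeval *tv)
{
    slow_path_calls++;
    return gettimeofday(tv, NULL);
}

/* One test and one branch per call decide which path is taken. */
int vgettimeofday_sketch(struct timeval *tv)
{
    assert(tv != NULL);
    if (!vsyscall_enabled)
        return slow_gettimeofday(tv);
    return fast_gettimeofday(tv);
}
```

With the flag cleared every caller on the system takes the slow path,
which is the "affects all processes" property of the global approach.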

-Andi

2002-10-20 10:52:59

by Elladan

Subject: Re: [discuss] Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sun, Oct 20, 2002 at 11:27:30AM +0200, Andi Kleen wrote:
> On Sat, Oct 19, 2002 at 11:44:33PM -0700, Elladan wrote:
> > The problem with modifying the executable code/pages in the vsyscall
> > area is that it's going to be very tricky to implement, if I understand
> > this discussion properly.
>
> Modifying the pages or variables in the pages from the kernel is no
> problem. It just would affect all processes on the system
>
> What's tricky is to give it per process state (which would
> be needed to make a vsyscall/novsyscall flag process local)

Not really any more tricky than turning it off globally with a flag.
It's just more expensive, because you have to propagate the flag into
vsyscall space on the context switch. In the per-process case,
self-modifying code would be a less non-viable approach than it is
globally.

> > There may be any number of user processes idling in these pages on the
> > runqueue (or off it if say one received a SIGSTOP), and if you just go
> > change the instruction code on them, unless you're incredibly careful
> > and come up with some subtly safe machine code sequence, they're going
> > to crash when you call this sysctl().
>
> Nobody proposed to use self modifying code, it would just be a global
> variable located in the vsyscall area that is tested by the vsyscall
> code.

Well, self modifying code certainly looked like what Andrea was talking
about, to avoid the branch overhead on the userspace gettimeofday()
call.

I suspect that outside of a few database apps, it's pretty unlikely that
there will be any vsyscalls in your average time slice, so putting the
overhead into the vsyscall itself seems like a better idea than paying
the price during every context switch.


Of course, if you're really strictly worried about being able to fully
virtualize the vsyscalls, a global flag isn't really enough. A user app
running under the virtual machine will still be able to manually access
the data on the vsyscall pages, and if it wants, jump into the middle of
a function or something like that. So eg. a UML instance being used as
a sandbox would still expose the host time and such to its hostile
userspace, which could then execute subsets of vsyscall code at will.

It seems that to fix this with proper data-hiding, it would really need
to be possible to set the vsyscall pages as invalid for the UML process
(so it could manually emulate the vsyscall), which would then either
require expensive context-switching costs to make it a per-process flag,
or we're back to global self-modifying-code fixups.

Better to just ignore that particular issue.

-J

2002-10-20 11:14:38

by Andi Kleen

Subject: Re: [discuss] Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sun, Oct 20, 2002 at 03:58:41AM -0700, Elladan wrote:
> Not really any more tricky than turning it off globally with a flag.
> It's just more expensive, because you have to propagate the flag into
> vsyscall space on the context switch. In the per-process case,
> self-modifying code would be a less non-viable approach than it is
> globally.

Umm, you forgot about SMP. Modifying it on context switch would require
per CPU mappings and vsyscall pages. Currently the kernel page tables
are shared globally, changing them to be CPU local would be somewhat
involved. While it would be doable as long as x86-64 is limited to three
level page tables for user space it would be a major pain as soon
as four level page tables were supported. In this case multiple threads
sharing the same mm but running on multiple CPUs couldn't share the same
page table anymore, and changing that would make threading significantly
more complicated. On i386 this problem is always there.

Also the context switch is a very critical path, we don't want to add
random junk like this in there.

> Well, self modifying code certainly looked like what Andrea was talking
> about, to avoid the branch overhead on the userspace gettimeofday()
> call.

A test+branch here is completely lost in the noise, not even
worth thinking about.

> It seems that to fix this with proper data-hiding, it would really need
> to be possible to set the vsyscall pages as invalid for the UML process
> (so it could manually emulate the vsyscall), which would then either
> require expensive context-switching costs to make it a per-process flag,
> or we're back to global self-modifying-code fixups.

I have no problems at all with a system global flag. After all vsyscalls
are just an optimization. When you use UML you can't use that optimization,
very simple. It's similar to other circumstances that have an impact
on the whole system, e.g. when you use floating point in any process
then the context switch becomes slower for everybody. Not a big deal.


-Andi

2002-10-20 13:13:31

by Andreas Jaeger

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

Andi Kleen <[email protected]> writes:

> [full quote for context]
>
> On Sat, Oct 19, 2002 at 06:49:59AM +0200, Jeff Dike wrote:
>> [email protected] said:
>> > Guess you'll have some problems then with UML on x86-64, which always
>> > uses vgettimeofday. But it's only used for gettimeofday() currently,
>> > perhaps it's not that bad when the UML child runs with the host's
>> > time.
>>
>> It's not horrible, but it's still broken. There are people who depend
>> on UML being able to keep its own time separately from the host.
>>
>> > I guess it would be possible to add some support for UML to map own
>> > code over the vsyscall reserved locations. UML would need to use the
>> > syscalls then. But it'll be likely ugly.
>>
>> Yeah, it would be.
>>
>> My preferred solution would be for libc to ask the kernel where the vsyscall
>> area is. That's reasonably clean and virtualizable. Andrea doesn't like it
>> because it adds a few instructions to the vsyscall address calculation.
>
> I would have no problems with adding that to the x86-64 kernel. It could
> be passed in by the ELF environment vector and added to the ABI.
> Overhead should be negligible, it just needs a single table lookup.
> Andreas, what do you think ?

Create a new AT_ constant, and pass it via the auxiliary vector and we
can use it in glibc.
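
Roughly, the lookup on the libc side could look like the sketch below.
It is only an illustration: the tag AT_VSYSCALL_SKETCH and its value are
invented here, no such constant is defined in this thread, and the
auxv_entry struct is a simplified stand-in for the real ELF auxv type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag value, chosen only for this sketch. */
#define AT_VSYSCALL_SKETCH 0x1000UL
/* The auxiliary vector is terminated by an entry of type 0 (AT_NULL). */
#define AT_NULL_SKETCH 0UL

/* Simplified (type, value) pair as passed on the initial stack. */
struct auxv_entry {
    unsigned long type;
    unsigned long value;
};

/* Walk the vector once and return the value for the requested tag,
 * or 0 if the kernel did not supply it. */
unsigned long auxv_lookup(const struct auxv_entry *auxv, unsigned long type)
{
    assert(auxv != NULL);
    for (; auxv->type != AT_NULL_SKETCH; auxv++)
        if (auxv->type == type)
            return auxv->value;
    return 0;
}
```

A result of 0 would let libc fall back to the ordinary system call, so
kernels that do not pass the entry keep working.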

Andreas
--
Andreas Jaeger
SuSE Labs [email protected]
private [email protected]
http://www.suse.de/~aj

2002-10-20 14:45:51

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sun, Oct 20, 2002 at 04:59:14AM +0200, Andi Kleen wrote:
> On Sat, Oct 19, 2002 at 06:16:59AM +0200, Andrea Arcangeli wrote:
> > see my last email. And I think he needed it as an additional syscall
> > after execve that he could trap and revirtualize with ptrace as usual
> > and that would return variable addresses of pointer to functions (that
> > would be revirtualized inside the uml kernel of course), not an ELF
> > information that should be valid for both UML and host kernel.
>
> Implementing it per process is tricky. How do you access the per process
> state in the vsyscall area ? To do it properly you would need one dedicated

you don't need to access the per-process state of course.

the code will do:

switch_to() {
	[..]
	if (prev->vsyscall_disabled != next->vsyscall_disabled) {
		/* set the right fixmap page and invlpg */
	}
	[..]
}

the vsyscall code itself will not need to know anything about
current->vsyscall_disabled. The only overhead is the branch above.
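
As a self-contained user-space model of that branch (not kernel code):
the struct, the enum and set_vsyscall_page() are hypothetical names, and
a counter stands in for the fixmap update plus invlpg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task state: only the flag matters here. */
struct task_sketch {
    int vsyscall_disabled;
};

/* Which page the per-cpu vsyscall fixmap slot currently points at. */
enum vsyscall_page {
    VSYSCALL_FAST_PAGE,    /* userspace gettimeofday */
    VSYSCALL_SYSCALL_PAGE  /* stub that enters the real syscall */
};

enum vsyscall_page cpu_vsyscall_page = VSYSCALL_FAST_PAGE;
int tlb_flushes;  /* stands in for the invlpg cost */

static void set_vsyscall_page(enum vsyscall_page page)
{
    cpu_vsyscall_page = page;
    tlb_flushes++;
}

/* The remap and flush happen only when the two tasks disagree. */
void switch_to_sketch(const struct task_sketch *prev,
                      const struct task_sketch *next)
{
    assert(prev != NULL && next != NULL);
    if (prev->vsyscall_disabled != next->vsyscall_disabled)
        set_vsyscall_page(next->vsyscall_disabled ? VSYSCALL_SYSCALL_PAGE
                                                  : VSYSCALL_FAST_PAGE);
}
```

Switching between two tasks with the same setting costs one compare and
no flush; only a switch into or out of a vsyscall-disabled task pays for
the remap.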

> page per mm_struct that is mapped in there. But it could not be in the
> normal vsyscall area (otherwise you couldn't share the kernel pagetables
> anymore), but somewhere else in the address space.
>
> I think a global sysctl that just modifies the global vsyscall pages is more
> suitable here. It avoids the overhead of needing a per process page.

I don't see the need of a per-process page. We only need a per-cpu
first level pagetable at the very end of the address space, but that's
not a problem at all, it's one more page _per_cpu_ only (not per task).
It's much less overhead than per-cpu keventd etc...

> I see no real need anyways to do it per process. When you have one process
> that cannot deal with vsyscalls the whole system will get a bit slower,
> but the slowdown shouldn't be noticeable anyways. If you run your highend
> database which does thousands of gettimeofday each second just don't set
> the sysctl.

yep. Having the per-task switch was only a way to make the host system
more efficient for the 1% of uml users that have to fake the time and that
cannot live with the max possible performance of having uml running with
the system time in sync with the host. the per-task switch is meant to
give uml the possibility of being fully revirtualized without hurting
the host too much, in the unlikely case you need it.

Andrea

2002-10-20 14:53:13

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Sun, Oct 20, 2002 at 03:19:32PM +0200, Andreas Jaeger wrote:
> Andi Kleen <[email protected]> writes:
>
> > [full quote for context]
> >
> > On Sat, Oct 19, 2002 at 06:49:59AM +0200, Jeff Dike wrote:
> >> [email protected] said:
> >> > Guess you'll have some problems then with UML on x86-64, which always
> >> > uses vgettimeofday. But it's only used for gettimeofday() currently,
> >> > perhaps it's not that bad when the UML child runs with the host's
> >> > time.
> >>
> >> It's not horrible, but it's still broken. There are people who depend
> >> on UML being able to keep its own time separately from the host.
> >>
> >> > I guess it would be possible to add some support for UML to map own
> >> > code over the vsyscall reserved locations. UML would need to use the
> >> > syscalls then. But it'll be likely ugly.
> >>
> >> Yeah, it would be.
> >>
> >> My preferred solution would be for libc to ask the kernel where the vsyscall
> >> area is. That's reasonably clean and virtualizable. Andrea doesn't like it
> >> because it adds a few instructions to the vsyscall address calculation.
> >
> > I would have no problems with adding that to the x86-64 kernel. It could
> > be passed in by the ELF environment vector and added to the ABI.
> > Overhead should be negligible, it just needs a single table lookup.
> > Andreas, what do you think ?
>
> Create a new AT_ constant, and pass it via the auxiliary vector and we
> can use it in glibc.

will it be a pointer to function?

Andrea

2002-10-21 15:37:36

by Stephen Hemminger

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

Would it be possible to allow UML to use mmap to put an alternative text
page at the fixed address with the UML appropriate version of the
vsyscall?

On Fri, 2002-10-18 at 22:01, Andrea Arcangeli wrote:
> On Sat, Oct 19, 2002 at 06:45:56AM +0200, Andi Kleen wrote:
> > On Sat, Oct 19, 2002 at 06:10:19AM +0200, Andrea Arcangeli wrote:
> > > On Fri, Oct 18, 2002 at 11:49:59PM -0500, Jeff Dike wrote:
> > > > [email protected] said:
> > > > > Guess you'll have some problems then with UML on x86-64, which always
> > > > > uses vgettimeofday. But it's only used for gettimeofday() currently,
> > > > > perhaps it's not that bad when the UML child runs with the host's
> > > > > time.
> > > >
> > > > It's not horrible, but it's still broken. There are people who depend
> > > > on UML being able to keep its own time separately from the host.
> > > >
> > > > > I guess it would be possible to add some support for UML to map own
> > > > > code over the vsyscall reserved locations. UML would need to use the
> > > > > syscalls then. But it'll be likely ugly.
> > > >
> > > > Yeah, it would be.
> > > >
> > > > My preferred solution would be for libc to ask the kernel where the vsyscall
> > > > area is. That's reasonably clean and virtualizable. Andrea doesn't like it
> > > > because it adds a few instructions to the vsyscall address calculation.
> > >
> > > yes, my preferred solution is still a runtime /proc entry that turns off
> > > vsyscalls completely by root so you could trap gettimeofday/time via the
> > > usual ptrace. That would be zero cost. Of course this would be needed
> >
> > Ok, a sysctl that modifies a variable in the vsyscall page and is
> > tested by the code. That would be an option, I agree.
>
> the sysctl would replace the vsyscall fixmap entry for the
> current cpu entirely at switch_to time, I certainly don't want to add
> unnecessary branches in the core vsyscall code. Doing it globally is
> zero cost but it would probably need privileges as said, per-task could
> be more dynamic without privileges and it would be an unlikely branch
> added in switch_to, still a very low cost so still acceptable.
>
> > For the locked TSC code we will need something like that anyways,
> > so that locked TSC can force a syscall.
>
> If we use a per-cpu TSC we don't need the syscall, the cpuid encoded in
> each 64bit variable will be enough (see my past email of yesterday
> evening, I realized a way to handle per-cpu info with vsyscalls). the
> main problem is as usual that the TSC isn't a real time source, it
> changes frequency all the time, but as usual all the problems in the
> gettimeofday implementation have little to do with the vsyscall details,
> in particular now that I realized how to handle per-cpu data, they're
> generic issues that needs solving even if vsyscalls would redirect to
> the syscalls.
>
> The only thing we definitely can't do in the vsyscalls is to read the
> PIT because it's an old I/O mapped device, but who could ever live only
> with the PIT anyways these computing days? If you've only the PIT
> vsyscalls or not the gettimeofday functionality would suck.
>
> Andrea


2002-10-21 16:22:02

by Andi Kleen

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Mon, Oct 21, 2002 at 05:43:29PM +0200, Stephen Hemminger wrote:
> Would it be possible to allow UML to use mmap to put an alternative text
> page at the fixed address with the UML appropriate version of the
> vsyscall?
Only with lots of special cases in the VM. vsyscall is currently located
in the kernel fixmap; giving user mmap access to kernel mappings
would need quite some changes.

Global sysctl is much easier and safer. As soon as x86-64 gets useful
vsyscalls again I will implement it (as I said earlier it is currently
turned off because of problems with the timers)

-Andi

2002-10-21 16:44:40

by George Anzinger

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

This may be way out there, but has any consideration been
given to high speed system calls? I have worked on systems
where this was done for selected calls. Only selected state
was saved. Only fast calls such as getpid, gettimeofday,
etc. were allowed. The calls were still executed in system
pages so there was still the context switch (i.e. the trap
overhead) but very little state save/ restore. It used a
different trap number so it did not impact the standard
calls.

You would need to defeat the standard way of checking if in
the system so the standard system calls and interrupt code
would think user space was executing. As I recall there was
some mapping tricks played so that while not actually in the
system map, it was still available.

I had only a brief encounter with these calls so I may not
have all the details right :(
--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-10-21 17:06:00

by john stultz

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Fri, 2002-10-18 at 21:45, Andi Kleen wrote:
> For the locked TSC code we will need something like that anyways,
> so that locked TSC can force a syscall.

Andi,

Can you further explain what this locked TSC issue is that you're
mentioning (this is why vsyscalls are disabled in x86-64, right?). I'm
not sure I'm following you completely. I apologize if I'm just being
daft.

thanks
-john

2002-10-22 03:57:51

by Jeff Dike

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

[email protected] said:
> yes, this is true for all the syscalls, if that's a problem uml isn't
> an option for the user in the first place.

Not true. Any marginal increase in performance will make a number of
applications fast enough that they become practical in UML. Since there
are apps which, to a first order approximation, do nothing but call
gettimeofday, they are not usable in UML today, but could become usable if
UML had vgettimeofday. I've had complaints about this, so the need is
definitely there.

> what do you plan to do to make all other syscall faster?

Right now, a UML syscall involves four host context switches and a host
signal delivery and return. I'm merging some changes which will reduce
that to two host context switches and no signals. Once that's done, I'm
going to look for more improvements.

> My problem is that mapping user code into the vsyscall fixmap is
> complex and not very clean at all, breaks various concepts in the mm
> and last but not the least it is slow

Can you explain, in small words, why mapping user code is so horrible?

Jeff

2002-10-22 04:11:09

by Andi Kleen

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

> > My problem is that mapping user code into the vsyscall fixmap is
> > complex and not very clean at all, breaks various concepts in the mm
> > and last but not the least it is slow
>
> Can you explain, in small words, why mapping user code is so horrible?

Currently Linux has neatly separated kernel and user page tables.
On architectures which have tree type tables in hardware you just have
a user level table and you stick a pointer to the kernel level tables
somewhere at the end of the first page. The normal user level page
handling doesn't know about the kernel pages. The vsyscall code is in
the kernel mapping in the fixmaps. Allowing the user to map arbitrary
pages into the vsyscall area would blur this clear separation and
require much more special case handling.

In addition it would break a lot of assumptions that user mappings are
only < __PAGE_OFFSET, probably having security implications. For example
you would need to special case this in uaccess.h's access_ok(), which
would be quite a lot of overhead (any change to this function causes
many KB of binary bloat because *_user is so heavily used all over the kernel)
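
To make the access_ok() concern concrete, here is a minimal userspace
model of the range check, not the kernel's actual implementation. The
constants and the model_* names are hypothetical (a 3G/1G split and a
made-up vsyscall slot); the point is only that allowing user mappings
above __PAGE_OFFSET turns a single comparison into a check with an
extra special case, repeated at every *_user call site.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for a 32-bit 3G/1G split; not taken from the patch. */
#define MODEL_PAGE_OFFSET   UINT32_C(0xC0000000)
#define MODEL_VSYSCALL_BASE UINT32_C(0xFFFFE000) /* assumed fixmap slot */
#define MODEL_VSYSCALL_SIZE UINT32_C(0x1000)

/* Current rule: a user range is valid only if it lies entirely
 * below PAGE_OFFSET. Written so the arithmetic cannot overflow. */
static bool model_access_ok(uint32_t addr, uint32_t size)
{
    return size <= MODEL_PAGE_OFFSET && addr <= MODEL_PAGE_OFFSET - size;
}

/* If user pages could live in the vsyscall area, every check would
 * need a second, special-cased range test like this one. */
static bool model_access_ok_with_vsyscall(uint32_t addr, uint32_t size)
{
    if (model_access_ok(addr, size))
        return true;
    return addr >= MODEL_VSYSCALL_BASE &&
           size <= MODEL_VSYSCALL_SIZE &&
           addr - MODEL_VSYSCALL_BASE <= MODEL_VSYSCALL_SIZE - size;
}
```

With these assumed values, a range starting at 0xFFFFE000 is rejected
by the first function and accepted by the second.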

-Andi

2002-10-22 04:23:20

by Andrew Morton

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

Andi Kleen wrote:
>
> For example
> you would need to special case this in uaccess.h's access_ok(), which
> would be quite a lot of overhead (any change to this function causes
> many KB of binary bloat because *_user is so heavily used all over the kernel)

That's all uninlined in the -mm patches. Saves 33k of text.

2002-10-22 05:04:35

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Tue, Oct 22, 2002 at 06:15:24AM +0200, Andi Kleen wrote:
> > > My problem is that mapping user code into the vsyscall fixmap is
> > > complex and not very clean at all, breaks various concepts in the mm
> > > and last but not the least it is slow
> >
> > Can you explain, in small words, why mapping user code is so horrible?
>
> Currently Linux has neatly separated kernel and user page tables.
> On architectures which have tree type tables in hardware you just have
> a user level table and you stick a pointer to the kernel level tables
> somewhere at the end of the first page. The normal user level page
> handling doesn't know about the kernel pages. The vsyscall code is in
> the kernel mapping in the fixmaps. Allowing the user to map arbitary
> pages into the vsyscall area would blur this clear separation and
> require much more special case handling.
>
> In addition it would break a lot of assumptions that user mappings are
> only < __PAGE_OFFSET, probably having security implications. For example
> you would need to special case this in uaccess.h's access_ok(), which
> would be quite a lot of overhead (any change to this function causes
> many KB of binary bloat because *_user is so heavily used all over the kernel)

that's not the problem. you would use get_user_pages to pin the page,
after that such a page is like a normal plain kernel page allocated
with GFP_KERNEL provided that you never write to it (that's fine, it's
mapped readonly like the regular vsyscall page) and that you don't
assume it to be constant (again that's fine since if uml changes it
under itself that's its own problem ;). It's nowhere near as messy as
breaking the whole copy-user concept. See my other email for more
details on this issue.

Andrea

2002-10-22 05:22:54

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Tue, Oct 22, 2002 at 12:07:16AM -0500, Jeff Dike wrote:
> [email protected] said:
> > yes, this is true for all the syscalls, if that's a problem uml isn't
> > an option for the user in the first place.
>
> Not true. Any marginal increase in performance will make a number of
> applications fast enough that they become practical in UML. Since there
> are apps which, to a first order approximation, do nothing but call
> gettimeofday, they are not usable in UML today, but could become usable if
> UML had vgettimeofday. I've had complaints about this, so the need is
> definitely there.

This is not the point. You just have vsyscalls at full speed (completely
equivalent to the host speed, so only a few nanoseconds or more
depending on the hardware used) and the user code won't even notice that
it is running inside uml, the user will get the full host speedup by
default inside uml, you don't need to change one single line of uml
codebase while porting to x86-64.

The point is that such a full speedup comes with a feature loss
from the uml point of view, that means uml programs will be in sync with
the real time of the host, and the time won't be revirtualized anymore
(unless we make some change to both kernel and uml).

So in short the problem is:

without any change uml runs too fast ;), exactly as fast as the host

You want to slow it down a bit (invlpg and branch in switch_to) to
resurrect the time revirtualization somehow.

See this sentence in my last email that you didn't answer:

If only you could raise a valid interesting real life scenario where
they can't live with the system time being in sync with the host and
where gettimeofday performance is critical I would love to hear.

This is the *key* point. I would prefer to avoid moving pinned user
pages into the end of the kernel address space unless you provide a
valid real world case where tons of users will benefit from it at large.

> > what do you plan to do to make all other syscall faster?
>
> Right now, a UML syscall involves four host context switches and a host
> signal delivery and return. I'm merging some changes which will reduce
> that to two host context switches and no signals. Once that's done, I'm
> going to look for more improvements.

Sounds good.

> > My problem is that mapping user code into the vsyscall fixmap is
> > complex and not very clean at all, breaks various concepts in the mm
> > and last but not the least it is slow
>
> Can you explain, in small words, why mapping user code is so horrible?

Mapping user code above PAGE_OFFSET is messy yes. We could
use get_user_pages, pin the page, set the fixmap to point to it, set the
old vsyscall code in some other kernel space, tell userspace the new
location of the native vsyscalls, keep track of the whole thing in the
task struct and switch both old vsyscall and new vsyscall during
switch_to, and clean up the mess when the task exits without telling us
anything.

Actually the second copy of the vsyscalls could stay mapped all the
time, we only return to userspace its fixed address so we don't waste
address space. Plus it would be simpler and cleaner to have the vsyscall
calling into a user-specified address, that sounds like a much more usable
API too in fact, you pass a pointer to a function rather than a user page
address to remap, it would be handier on your side too so you don't
need to build a magic vsyscall page to remap but you only care about the
callee.
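
A rough userspace sketch of what such a "pass a pointer to function"
interface could look like. Everything here is hypothetical: the type
vgtod_fn, set_vgettimeofday_callee() and the entry function are
illustrative names, and the native routine simply calls gettimeofday()
instead of reading the TSC. No such kernel interface is proposed in
code in this thread.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical signature shared by the native routine and any override. */
typedef int (*vgtod_fn)(struct timeval *tv);

/* Stand-in for the native vsyscall body; here it just calls gettimeofday(). */
static int native_vgettimeofday(struct timeval *tv)
{
    return gettimeofday(tv, NULL);
}

/* The callee that the fixed vsyscall entry currently jumps to. */
static vgtod_fn active_vgettimeofday = native_vgettimeofday;

/* Hypothetical registration call: pass a function pointer,
 * or NULL to restore the native routine. */
static void set_vgettimeofday_callee(vgtod_fn fn)
{
    active_vgettimeofday = fn ? fn : native_vgettimeofday;
}

/* Fixed entry point: one indirect call, no page to build or remap. */
static int vsyscall_gettimeofday_entry(struct timeval *tv)
{
    return active_vgettimeofday(tv);
}

/* Example override of the kind a virtualizing program might supply:
 * report the native time shifted by an arbitrary one hour. */
static int shifted_gettimeofday(struct timeval *tv)
{
    int ret = native_vgettimeofday(tv);
    if (ret == 0)
        tv->tv_sec += 3600;
    return ret;
}
```

A caller would register shifted_gettimeofday once and keep calling the
same fixed entry point; only the callee changes.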

So with these new ideas (to keep the second copy constantly mapped above
the last -ENOSYS and to have userspace specifying the address to call)
it sounds much simpler than the idea of mapping user code in kernel
space and not much more complicated than just redirecting the vsyscalls
to kernel.

Andrea

2002-10-22 07:19:04

by Elladan

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Tue, Oct 22, 2002 at 07:27:17AM +0200, Andrea Arcangeli wrote:
> On Tue, Oct 22, 2002 at 12:07:16AM -0500, Jeff Dike wrote:
>
> > > My problem is that mapping user code into the vsyscall fixmap is
> > > complex and not very clean at all, breaks various concepts in the mm
> > > and last but not the least it is slow
> >
> > Can you explain, in small words, why mapping user code is so horrible?
>
> Mapping user code above PAGE_OFFSET is messy yes. We could
> use get_user_pages, pin the page set the fixmap to point to it, set the
> old vsyscall code in some other kernel space, tell userspace the new
> location of the native vsyscalls, keep track of the whole thing in the
> task struct and switch both old vsyscall and new vsyscall during
> switch_to, and cleanup the mess when the task exits tell us nothing.
>
> Actually the second copy of the vsyscalls could stay mapped all the
> time, we only return to userspace its fixed address so we don't waste
> address space. Plus it would be simpler and cleaner to have the vsyscall
> calling into an user specified address, that sounds a much more usable
> API too infact, you pass an pointer to function rather than a user page
> address to remap, it would be more handy from your part too so you don't
> need to build a magic vsyscall page to remap but you only care about the
> callee.
>
> So with these new ideas (to keep the second copy constantly mapped above
> the last -ENOSYS and to have userspace specifying the address to call)
> it sounds much simpler than the idea of mapping user code in kernel
> space and not much more complicated than just redirecting the vsyscalls
> to kernel.

This seems somewhat painful all-around, since if I'm reading this right,
you take a switch_to hit to find out whether the user redirected the
vsyscall, and a vsyscall branch hit as well.

A switch_to seems like something to avoid, since it slows down the
system at all times, even when vgettimeofday is not being used very
often. Does the average system call gettimeofday() more than once per
context switch? If not, don't change switch_to...

If you don't mind a somewhat nasty looking tactic, another way you could
implement something like this and give a boost to virtualizing programs
would be to do a special case in the syscall_trace handler for
gettimeofday.

Just do the global flag test in the vgettimeofday code, and when it does
the fallback system call, if the process is being traced, we check to
see if the control process requested some special handling for the
syscall (obviously very easy to do in kernel mode). If so, do a special
fixup which, say, returns execution back to user space in a
user-specified location without notifying the tracing task of the system
call event.

This way, the main system just sees vsyscalls degrade to normal system
calls, which is ok, and programs that want to virtualize like UML get to
bounce execution into some special user-specified vsyscall code of their
own, with the cost being just one system call transition for UML as
well - a big speedup.

This sort of tactic might be interesting in general for a virtualizing
program, since you could bounce many of the system calls in the traced
process into user-specified code pages (especially if security in the
virtualized program isn't too big a concern).

It also would have the advantage of only mangling the trace path in the
kernel, which isn't the most performance-critical one around, without
overly complicating the vsyscalls in the average case.

Of course, it's kind of... ugly.

-J

2002-10-22 07:35:50

by Andrea Arcangeli

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Tue, Oct 22, 2002 at 12:24:38AM -0700, Elladan wrote:
> On Tue, Oct 22, 2002 at 07:27:17AM +0200, Andrea Arcangeli wrote:
> > On Tue, Oct 22, 2002 at 12:07:16AM -0500, Jeff Dike wrote:
> >
> > > > My problem is that mapping user code into the vsyscall fixmap is
> > > > complex and not very clean at all, breaks various concepts in the mm
> > > > and last but not the least it is slow
> > >
> > > Can you explain, in small words, why mapping user code is so horrible?
> >
> > Mapping user code above PAGE_OFFSET is messy yes. We could
> > use get_user_pages, pin the page set the fixmap to point to it, set the
> > old vsyscall code in some other kernel space, tell userspace the new
> > location of the native vsyscalls, keep track of the whole thing in the
> > task struct and switch both old vsyscall and new vsyscall during
> > switch_to, and cleanup the mess when the task exits tell us nothing.
> >
> > Actually the second copy of the vsyscalls could stay mapped all the
> > time, we only return to userspace its fixed address so we don't waste
> > address space. Plus it would be simpler and cleaner to have the vsyscall
> > calling into an user specified address, that sounds a much more usable
> > API too infact, you pass an pointer to function rather than a user page
> > address to remap, it would be more handy from your part too so you don't
> > need to build a magic vsyscall page to remap but you only care about the
> > callee.
> >
> > So with these new ideas (to keep the second copy constantly mapped above
> > the last -ENOSYS and to have userspace specifying the address to call)
> > it sounds much simpler than the idea of mapping user code in kernel
> > space and not much more complicated than just redirecting the vsyscalls
> > to kernel.
>
> This seems somewhat painful all-around, since if I'm reading this right,
> you take a switch_to hit to find out whether the user redirected the
> vsyscall, and a vsyscall branch hit as well.

there's no vsyscall branch hit, and no switch_to hit, just a single
unlikely branch in switch_to, that's minor overhead, for instance the
segmentation checks (also unlikely) are more expensive.
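
As an illustration of the "single unlikely branch", here is a small
userspace model. The struct, the vsyscall_redirected flag and the
model_* functions are hypothetical stand-ins, not code from any patch;
unlikely() is defined locally on top of the GCC/Clang builtin
__builtin_expect.

```c
#include <stdbool.h>

/* Branch hint in the style of the kernel macro (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

struct model_task {
    const char *name;
    bool vsyscall_redirected;   /* hypothetical per-task flag */
};

/* Counts how often the slow path runs in this model. */
static unsigned long remap_count;

/* Stands in for the fixmap update plus invlpg on the real hardware. */
static void model_remap_vsyscall(const struct model_task *next)
{
    (void)next;
    remap_count++;
}

static void model_switch_to(const struct model_task *prev,
                            const struct model_task *next)
{
    /* ... register and stack switching elided ... */

    /* The only added cost for tasks that never redirect vsyscalls
     * is this one predictable, not-taken branch. */
    if (unlikely(prev->vsyscall_redirected || next->vsyscall_redirected))
        model_remap_vsyscall(next);
}
```

Switching between two ordinary tasks never reaches
model_remap_vsyscall(); only switches into or out of a redirecting task
pay for the remap.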

> A switch_to seems like something to avoid, since it slows down the
> system at all times, even when vgettimeofday is not being used very
> often. Does the average system call gettimeofday() more than once per

switch_to only gets an additional check, the real hit it takes is when
Jeff revirtualizes the vsyscalls, not when you run in production w/o uml
or with uml in speed-mode without vsyscall revirtualization.

> context switch? If not, don't change switch_to...
>
> If you don't mind a somewhat nasty looking tactic, another way you could
> implement something like this and give a boost to virtualizing programs
> would be to do a special case in the syscall_trace handler for
> gettimeofday.
>
> Just do the global flag test in the vgettimeofday code, and when it does

I prefer not to have branches in vgettimeofday, it is better to have a
single branch in switch_to where it is certainly hidden in the scheduler
and generic context switch noise, in fact if we put it in the right place it
could have zero l1 cache impact, gettimeofday call frequency may be very
high, much higher than the context switch frequency and the size of
gettimeofday is much smaller than that of the scheduler, so there's
less stuff to hide the branch in the noise.

> the fallback system call, if the process is being traced, we check to
> see if the control process requested some special handling for the
> syscall (obviously very easy to do in kernel mode). If so, do a special
> fixup which, say, returns execution back to user space in a
> user-specified location without notifying the tracing task of the system
> call event.
>
> This way, the main system just sees vsyscalls degrade to normal system
> calls, which is ok, and programs that want to virtualize like UML get to
> bounce execution into some special user-specified vsyscall code of their
> own, with the cost being just one system call transition for UML as
> well - a big speedup.

you're optimizing the system for strace? What's the point of optimizing
strace and penalizing the normal syscall fast path?

>
> This sort of tactic might be interesting in general for a virtualizing
> program, since you could bounce many of the system calls in the traced
> process into user-specified code pages (especially if security in the
> virtualized program isn't too big a concern).
>
> It also would have the advantage of only mangling the trace path in the
> kernel, which isn't the most performance-critical one around, without
> overly complicating the vsyscalls in the average case.
>
> Of course, it's kind of... ugly.
>
> -J


Andrea

2002-10-22 09:19:26

by Alan

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Tue, 2002-10-22 at 05:29, Andrew Morton wrote:
> Andi Kleen wrote:
> >
> > For example
> > you would need to special case this in uaccess.h's access_ok(), which
> > would be quite a lot of overhead (any change to this function causes
> > many KB of binary bloat because *_user is so heavily used all over the kernel)
>
> That's all uninlined in the -mm patches. Saves 33k of text.

I assume it saves a sizable amount of exception tables too ?

2002-10-22 16:06:47

by Andrew Morton

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

Alan Cox wrote:
>
> On Tue, 2002-10-22 at 05:29, Andrew Morton wrote:
> > Andi Kleen wrote:
> > >
> > > For example
> > > you would need to special case this in uaccess.h's access_ok(), which
> > > would be quite a lot of overhead (any change to this function causes
> > > many KB of binary bloat because *_user is so heavily used all over the kernel)
> >
> > That's all uninlined in the -mm patches. Saves 33k of text.
>
> I assume it saves a sizable amount of exception tables too ?

Strangely, no. 3/4 came from .text and 1/4 from __ex_table.

2002-10-23 05:06:48

by Elladan

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Tue, Oct 22, 2002 at 09:40:06AM +0200, Andrea Arcangeli wrote:
> On Tue, Oct 22, 2002 at 12:24:38AM -0700, Elladan wrote:
> >
> > This seems somewhat painful all-around, since if I'm reading this right,
> > you take a switch_to hit to find out whether the user redirected the
> > vsyscall, and a vsyscall branch hit as well.
>
> there's no vsyscall branch hit, and no switch_to hit, just a single
> unlikely branch in switch_to, that's minor overehad, for istance the
> segmentation checks (as well unlikely) are more expensive.

It's still more expensive than nothing, so it's still penalizing the
context switch for an obscure vsyscall UML feature.

> > Just do the global flag test in the vgettimeofday code, and when it does
>
> I prefer not to have branches in vgettimeoday, it is better to have a
> single branch in switch_to where it is certainly hidden in the scheduler
> and generic context switch noise, infact if we put in the right place it
> could have zero l1 cache impact, gettimeofday call frequency may be very
> high, much higher than the context switch frequency and the size of the
> gettimeofday is much smaller than the one of the scheduler, so there's
> less stuff to hide the branch in the noise.

Both the vgettimeofday and switch_to are fast paths. So, the right
thing to do, if one of them has to take an unlikely branch penalty, is
to ask which one is entered more often.

gettimeofday() call frequency *can* be very high, but let's test it...

On my system, under a basic desktop user load (eg., browsing the web,
running folding@home), I see numbers like this (sampled every 10
seconds):

initial:
ctx 349728 gett 307946

ctx 21937 gett 11173
ctx 24791 gett 15761
ctx 2715 gett 3714
ctx 6748 gett 3789
ctx 4334 gett 2423
ctx 1575 gett 1002
ctx 4295 gett 2508
ctx 14913 gett 8220
ctx 600 gett 350
ctx 3821 gett 4860
ctx 4948 gett 7601
ctx 4800 gett 10720
ctx 3064 gett 6547
ctx 7716 gett 9592
ctx 3518 gett 4600
ctx 2760 gett 3505
ctx 4892 gett 4349
ctx 6258 gett 6547
ctx 7331 gett 9577
ctx 21701 gett 11674
ctx 3113 gett 1854
ctx 1502 gett 1063
ctx 27841 gett 14406

final:
ctx 536121 gett 454398

So, there were more context switches than calls to gettimeofday.
However, the numbers were close to each other. Any idea what the
numbers are like for other workloads?
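
For reference, the overall ratio can be worked out from the initial and
final counters alone. A small sketch, with hypothetical helper names,
that subtracts the two snapshots and reports gettimeofday calls per
context switch:

```c
#include <stdio.h>

/* One snapshot of the two counters sampled above. */
struct counters {
    unsigned long ctx;   /* context switches */
    unsigned long gett;  /* gettimeofday calls */
};

/* Events that occurred between two snapshots. */
static struct counters delta(struct counters first, struct counters last)
{
    struct counters d = { last.ctx - first.ctx, last.gett - first.gett };
    return d;
}

/* gettimeofday calls per context switch; 0 if no switches were seen. */
static double gett_per_ctx(struct counters d)
{
    return d.ctx ? (double)d.gett / (double)d.ctx : 0.0;
}

static void print_summary(struct counters first, struct counters last)
{
    struct counters d = delta(first, last);
    printf("ctx %lu gett %lu ratio %.3f\n", d.ctx, d.gett, gett_per_ctx(d));
}
```

With the initial (349728, 307946) and final (536121, 454398) values,
the deltas are 186393 context switches and 146452 gettimeofday calls,
or a little under 0.8 calls per context switch on this desktop load.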

> > This way, the main system just sees vsyscalls degrade to normal system
> > calls, which is ok, and programs that want to virtualize like UML get to
> > bounce execution into some special user-specified vsyscall code of their
> > own, with the cost being just one system call transition for UML as
> > well - a big speedup.
>
> you're optimizing the system for strace? What's the point of optimizing
> strace and penalizing the normal syscall fast path?

No, I'm penalizing strace to provide UML with a fast(er) syscall
mechanism. This is totally optional, but may be interesting for
virtualization in general. strace is not in the normal syscall fast
path, so this is a reasonable place to put optimizations for
virtualizing programs.

I'm also penalizing the vsyscall fast path, but that was just to avoid
the switch_to penalty. Since both are rather critical, here's another,
even uglier scheme, which should have no overhead on switch_to or
vgettimeofday, but adds a bit of overhead to the page fault handler
(though, with the -ENOSYS fixup mentioned in the comment there, maybe
nothing relevant).

Of course, it hurts systems which run UML with virtualized time.

Try 2:

Create a second mapping of the vsyscall page in some special location
above the normal page. Make a new sysctl, which globally invalidates
the page that the standard mapping is on. Basically, this disables
vsyscalls for everyone when turned on.

Now, obviously this won't work without some trick. What we do now is,
we make the page fault handler path for vsyscalls (to be added anyway)
work like so:

If the pc is within the allocated vsyscall page(s), then:

If the pc is on the entrypoint to a vsyscall function, check whether the
process is being traced. If so, turn this into a somewhat normal
looking syscall so it can be virtualized (or do something else, if you
want - have userspace jump somewhere, etc).

If not traced, or if the pc is not at the entrypoint, reset the pc to be
on the second vsyscall copy, with the same offset, and return to
userspace.

This lets us do a global vsyscall disable, but (I hope) fixes up the
problem of userspace going to sleep inside a vsyscall. The process
wakes up, faults, and gets shunted off to identical code in another
location, which should have the same behavior.
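
The decision logic above could be sketched roughly as follows. This is
a userspace model only: the addresses, the entry offset and all names
are made up for illustration, and a real fault handler would of course
operate on the saved register state rather than plain integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: two page-sized copies of the vsyscall code. */
#define VSYS_PAGE_SIZE  UINT32_C(0x1000)
#define VSYS_PRIMARY    UINT32_C(0xFFFFD000) /* invalidated by the sysctl */
#define VSYS_SECONDARY  UINT32_C(0xFFFFE000) /* always-mapped second copy */
#define VSYS_ENTRY_GTOD UINT32_C(0x000)      /* assumed entry-point offset */

enum fixup_action {
    FIXUP_NOT_VSYSCALL, /* fault is unrelated; normal handling */
    FIXUP_AS_SYSCALL,   /* traced task at an entry point: present as a syscall */
    FIXUP_REDIRECT      /* resume at the same offset in the second copy */
};

/* Decide what to do with a fault at 'pc'. On FIXUP_REDIRECT the new
 * program counter is stored in *new_pc; otherwise *new_pc is untouched. */
static enum fixup_action vsyscall_fault_fixup(uint32_t pc, bool traced,
                                              uint32_t *new_pc)
{
    if (pc < VSYS_PRIMARY || pc - VSYS_PRIMARY >= VSYS_PAGE_SIZE)
        return FIXUP_NOT_VSYSCALL;

    uint32_t offset = pc - VSYS_PRIMARY;

    /* Only an entry point can be safely turned into a syscall. */
    if (traced && offset == VSYS_ENTRY_GTOD)
        return FIXUP_AS_SYSCALL;

    /* Untraced, or already partway through the routine. */
    *new_pc = VSYS_SECONDARY + offset;
    return FIXUP_REDIRECT;
}
```

A task that went to sleep in the middle of the routine takes the
redirect path when it wakes up, whether or not it is traced.
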

Downside: vgettimeofday takes a performance penalty for everyone in the
special case where UML is running with full time virtualization, because
of the page fault. This is the very unusual case, so who cares?

Downside 2: Would this actually work? It's a bit scary sounding...

-J

2002-10-23 05:37:53

by Elladan

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

On Tue, Oct 22, 2002 at 10:12:08PM -0700, Elladan wrote:
>
> Try 2:
>
> Create a second mapping of the vsyscall page in some special location
> above the normal page. Make a new sysctl, which globally invalidates
> the page that the standard mapping is on. Basically, this disables
> vsyscalls for everyone when turned on.
>
> Now, obviously this won't work without some trick. What we do now is,
> we make the page fault handler path for vsyscalls (to be added anyway)
> work like so:
>
> If the pc is within the allocated vsyscall page(s), then:
>
> If the pc is on the entrypoint to a vsyscall function, check whether the
> process is being traced. If so, turn this into a somewhat normal
> looking syscall so it can be virtualized (or do something else, if you
> want - have userspace jump somewhere, etc).
>
> If not traced, or if the pc is not at the entrypoint, reset the pc to be
> on the second vsyscall copy, with the same offset, and return to
> userspace.
>
> This lets us do a global vsyscall disable, but (I hope) fixes up the
> problem of userspace going to sleep inside a vsyscall. The process
> wakes up, faults, and gets shunted off to identical code in another
> location, which should have the same behavior.
>
> Downside: vgettimeofday takes a performance penalty for everyone in the
> special case where UML is running with full time virtualization, because
> of the page fault. This is the very unusual case, so who cares?
>
> Downside 2: Would this actually work? It's a bit scary sounding...

One caveat to this, I suppose, is that the vsyscall itself would need to
be position-independent code (which might not add overhead, if done very
carefully), or else the code would have to be modified inside the
sysctl() at invalidation time. Either of which makes the implementation
ugly.

-J

2002-10-23 17:47:52

by Gerrit Huizenga

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

In message <[email protected]>, Elladan writes:
> gettimeofday() call frequency *can* be very high, but let's test it...

Oracle and possibly DB2 call gettimeofday to timestamp each potential
roll back transaction. With a large machine, the number of calls to
gettimeofday() can be enormous, as well as one of the top few bottlenecks
for TPC-C style workloads (OLTP).

gerrit

2002-10-26 10:26:49

by Pavel Machek

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

Hi!

> > > Since no one really brought up any issues with the code itself
> > > (correct me if I'm wrong), here is the i386 vsyscall gettimeofday port
> > > I sent last night, synced up and ready for integration.
> >
> > This vsyscall implementation breaks UML. Any app that's run inside UML
> > that uses vsyscalls will get the host's vsyscalls rather than the UML
> > vsyscalls.
>
> Ugh.
>
> Guess you'll have some problems then with UML on x86-64, which always uses
> vgettimeofday. But it's only used for gettimeofday() currently, perhaps it's
> not that bad when the UML child runs with the host's time.

Well, if you want to use UML for time shifting (pretty common use,
AFAICS)...

> I guess it would be possible to add some support for UML
> to map own code over the vsyscall reserved locations. UML would need
> to use the syscalls then. But it'll be likely ugly.

I guess this is the right solution. [Or UML could simply unmap that
area and handle the faults...].
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-10-26 10:26:49

by Pavel Machek

Subject: Re: [PATCH] linux-2.5.43_vsyscall_A0

Hi!

> > Guess you'll have some problems then with UML on x86-64, which always
> > uses vgettimeofday. But it's only used for gettimeofday() currently,
> > perhaps it's not that bad when the UML child runs with the host's
> > time.
>
> It's not horrible, but it's still broken. There are people who depend
> on UML being able to keep its own time separately from the host.
>
> > I guess it would be possible to add some support for UML to map own
> > code over the vsyscall reserved locations. UML would need to use the
> > syscalls then. But it'll be likely ugly.
>
> Yeah, it would be.
>
> My preferred solution would be for libc to ask the kernel where the vsyscall
> area is. That's reasonably clean and virtualizable. Andrea doesn't like it
> because it adds a few instructions to the vsyscall address calculation.

But a sandboxed application could still "guess" where the vsyscall address
is and get data it is not supposed to get, right?
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?