From: Stefani Seibold <[email protected]>
This patch add the functions vdso_gettimeofday(), vdso_clock_gettime()
and vdso_time() to the 32 bit VDSO.
The reason to do this was to get a fast reliable time stamp. Many developers
uses TSC to get a fast time stamp, without knowing the pitfalls. VDSO
time functions a fast and a reliable way, because the kernel knows the
best time source and the P- and C-state of the CPU.
The helper library to use the VDSO functions can be download at
http://http://seibold.net/vdso.c
The libary is very small, only 228 lines of code. Compile it with
gcc -Wall -O3 -fpic vdso.c -lrt -shared -o libvdso.so
and use it with LD_PRELOAD=<path>/libvdso.so
This kind of helper must be integrated into glibc, for x86 64 bit and
PowerPC it is already there.
Some linux 32 bit kernel benchmark results (all measurements are in nano
seconds):
Intel(R) Celeron(TM) CPU 400MHz
Average time kernel call:
gettimeofday(): 1039
clock_gettime(): 1578
time(): 526
Average time VDSO call:
gettimeofday(): 378
clock_gettime(): 303
time(): 60
Celeron(R) Dual-Core CPU T3100 1.90GHz
Average time kernel call:
gettimeofday(): 209
clock_gettime(): 406
time(): 135
Average time VDSO call:
gettimeofday(): 51
clock_gettime(): 43
time(): 10
So you can see a performance increase between 4 and 13, depending on the
CPU and the function.
The address layout of the VDSO has changed, because there is no fixed
address space available on a x86 32 bit kernel, despite the name. Because
someone decided to add an offset to the __FIXADDR_TOP for virtualization.
Also the IA32 Emulation uses the whole 4 GB address space, so there is no
fixed address available.
This was the reason not depend on this kind of address and change the layout
of the VDSO. The VDSO for a 32 bit application has now three pages:
+----------------------------------------+
+ VDSO page (includes code) ro+x +
+----------------------------------------+
+ VVAR page (export kernel variables) ro +
+----------------------------------------+
+ HPET page (mapped registers) ro
+----------------------------------------+
The VDSO page for a 32 bit resided now on 0xffffc000, followed by the VVAR and
HPET page.
In the non compat mode the VMA of the VDSO is now 3 pages for a 32 bit kernel.
So this decrease the available logical address room by 2 pages.
The patch is against kernel 3.14 (e7651b819e90da924991d727d3c007200a18670d)
Changelog:
25.11.2012 - first release and proof of concept for linux 3.4
11.12.2012 - Port to linux 3.7 and code cleanup
12.12.2012 - fixes suggested by Andy Lutomirski
- fixes suggested by John Stultz
- use call VDSO32_vsyscall instead of int 80
- code cleanup
17.12.2012 - support for IA32_EMULATION, this includes
- code cleanup
- include cleanup to fix compile warnings and errors
- move out seqcount from seqlock, enable use in VDSO
- map FIXMAP and HPET into the 32 bit address space
18.12.2012 - split into separate patches
30.01.2013 - revamp the code
- code clean up
- VDSO layout changed
- no fixed addresses
- port to 3.14
01.02.2013 - code cleanup
From: Stefani Seibold <[email protected]>
This patch add the VDSO time support for the IA32 Emulation Layer.
Due the nature of the kernel headers and the LP64 compiler where the
size of a long and a pointer differs against a 32 bit compiler, there
is a lot of type hacking necessary.
This kind of type hacking could be prevent in the future by doing a call to the
64 bit code by the following sequence:
- Compile the arch/x86/vdso/vclock_gettime.c as 64 bit, but only generate
the assembler output.
- Next compile a 32 bit object by including the 64 bit vclock_gettime.s
prefixed with .code64
- At least we need a trampolin code which invokes the 64 bit code and do
the API conversation (64 bit longs to 32 bit longs), like the
followig snipped:
ENTRY(call64)
push %ebp
movl %esp, %ebp
ljmp $__USER_CS, $1f
.code64
1:
andq -0x10, %rsp
movl $__USER_DS, %ecx
movl %ecx, %ds
movl %ecx, %ss
movl %ecx, %es
call *%rax
movl $__USER32_DS, %ecx
movl %ecx, %ds
movl %ecx, %ss
movl %ecx, %es
leaq ret32(%rip), %rdx
movl $__USER32_CS, %ecx
salq $32, %rcx
leaq (%rcx, %rdx), %rcx
push %rcx
ljmp *(%esp)
.code32
ret32:
movl %ebp, %esp
pop %ebp
ret
.code32
ENTRY(gettimeofday32)
push %edi
movl gettimeofday64, %eax
movl 16(%esp), %edi
call call64
pop %edi
ret
Signed-off-by: Stefani Seibold <[email protected]>
---
arch/x86/vdso/vclock_gettime.c | 112 ++++++++++++++++++++++++++--------
arch/x86/vdso/vdso32/vclock_gettime.c | 7 +++
2 files changed, 95 insertions(+), 24 deletions(-)
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index d163bb5..53eea37 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -31,12 +31,24 @@
#define gtod (&VVAR(vsyscall_gtod_data))
+struct api_timeval {
+ long tv_sec; /* seconds */
+ long tv_usec; /* microseconds */
+};
+
+struct api_timespec {
+ long tv_sec; /* seconds */
+ long tv_nsec; /* microseconds */
+};
+
+typedef long api_time_t;
+
static notrace cycle_t vread_hpet(void)
{
return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + HPET_COUNTER);
}
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+notrace static long vdso_fallback_gettime(long clock, struct api_timespec *ts)
{
long ret;
asm("syscall" : "=a" (ret) :
@@ -44,7 +56,8 @@ notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
return ret;
}
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+notrace static long vdso_fallback_gtod(struct api_timeval *tv,
+ struct timezone *tz)
{
long ret;
@@ -54,20 +67,68 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
}
#else
+#ifdef CONFIG_IA32_EMULATION
+typedef s64 arch_time_t;
+
+struct arch_timespec {
+ s64 tv_sec;
+ s64 tv_nsec;
+};
+
+#define ALIGN8 __attribute__ ((aligned (8)))
+
+struct arch_vsyscall_gtod_data {
+ seqcount_t seq ALIGN8;
+
+ struct { /* extract of a clocksource struct */
+ int vclock_mode ALIGN8;
+ cycle_t cycle_last ALIGN8;
+ cycle_t mask ALIGN8;
+ u32 mult;
+ u32 shift;
+ } clock;
+
+ /* open coded 'struct timespec' */
+ arch_time_t wall_time_sec;
+ u64 wall_time_snsec;
+ u64 monotonic_time_snsec;
+ arch_time_t monotonic_time_sec;
+
+ struct timezone sys_tz;
+ struct arch_timespec wall_time_coarse;
+ struct arch_timespec monotonic_time_coarse;
+};
+
+struct arch_vsyscall_gtod_data vvar_vsyscall_gtod_data
+ __attribute__((visibility("hidden")));
+#else
struct vsyscall_gtod_data vvar_vsyscall_gtod_data
__attribute__((visibility("hidden")));
+#endif
u32 hpet_counter
__attribute__((visibility("hidden")));
#define gtod (&vvar_vsyscall_gtod_data)
+struct api_timeval {
+ s32 tv_sec; /* seconds */
+ s32 tv_usec; /* microseconds */
+};
+
+struct api_timespec {
+ s32 tv_sec; /* seconds */
+ s32 tv_nsec; /* microseconds */
+};
+
+typedef s32 api_time_t;
+
static notrace cycle_t vread_hpet(void)
{
return readl(&hpet_counter);
}
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+notrace static long vdso_fallback_gettime(long clock, struct api_timespec *ts)
{
long ret;
@@ -77,12 +138,12 @@ notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
"call VDSO32_vsyscall \n"
"pop %%ebx \n"
: "=a" (ret)
- : "0" (__NR_clock_gettime), "d" (clock), "c" (ts)
+ : "0" (__NR_ia32_clock_gettime), "d" (clock), "c" (ts)
: "memory");
return ret;
}
-notrace static long vdso_fallback_gtod(struct timeval *tv,
+notrace static long vdso_fallback_gtod(struct api_timeval *tv,
struct timezone *tz)
{
long ret;
@@ -93,7 +154,7 @@ notrace static long vdso_fallback_gtod(struct timeval *tv,
"call VDSO32_vsyscall \n"
"pop %%ebx \n"
: "=a" (ret)
- : "0" (__NR_gettimeofday), "d" (tv), "c" (tz)
+ : "0" (__NR_ia32_gettimeofday), "d" (tv), "c" (tz)
: "memory");
return ret;
}
@@ -280,43 +341,48 @@ notrace static void do_monotonic_coarse(struct timespec *ts)
} while (unlikely(read_seqcount_retry(>od->seq, seq)));
}
-notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace int __vdso_clock_gettime(clockid_t clock, struct api_timespec *ts)
{
+ struct timespec tmp;
+
switch (clock) {
case CLOCK_REALTIME:
- if (do_realtime(ts) == VCLOCK_NONE)
+ if (do_realtime(&tmp) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_MONOTONIC:
- if (do_monotonic(ts) == VCLOCK_NONE)
+ if (do_monotonic(&tmp) == VCLOCK_NONE)
goto fallback;
break;
case CLOCK_REALTIME_COARSE:
- do_realtime_coarse(ts);
+ do_realtime_coarse(&tmp);
break;
case CLOCK_MONOTONIC_COARSE:
- do_monotonic_coarse(ts);
+ do_monotonic_coarse(&tmp);
break;
default:
goto fallback;
}
+ ts->tv_sec = tmp.tv_sec;
+ ts->tv_nsec = tmp.tv_nsec;
+
return 0;
fallback:
return vdso_fallback_gettime(clock, ts);
}
-int clock_gettime(clockid_t, struct timespec *)
+int clock_gettime(clockid_t, struct api_timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
-notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
+notrace int __vdso_gettimeofday(struct api_timeval *tv, struct timezone *tz)
{
- long ret = VCLOCK_NONE;
+ struct timespec tmp;
if (likely(tv != NULL)) {
- BUILD_BUG_ON(offsetof(struct timeval, tv_usec) !=
- offsetof(struct timespec, tv_nsec) ||
- sizeof(*tv) != sizeof(struct timespec));
- ret = do_realtime((struct timespec *)tv);
+ if (do_realtime(&tmp) == VCLOCK_NONE)
+ return vdso_fallback_gtod(tv, tz);
+ tv->tv_sec = tmp.tv_sec;
+ tv->tv_usec = tmp.tv_nsec;
tv->tv_usec /= 1000;
}
if (unlikely(tz != NULL)) {
@@ -325,25 +391,23 @@ notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
tz->tz_dsttime = gtod->sys_tz.tz_dsttime;
}
- if (ret == VCLOCK_NONE)
- return vdso_fallback_gtod(tv, tz);
return 0;
}
-int gettimeofday(struct timeval *, struct timezone *)
+int gettimeofday(struct api_timeval *, struct timezone *)
__attribute__((weak, alias("__vdso_gettimeofday")));
/*
* This will break when the xtime seconds get inaccurate, but that is
* unlikely
*/
-notrace time_t __vdso_time(time_t *t)
+notrace api_time_t __vdso_time(api_time_t *t)
{
/* This is atomic on x86 so we don't need any locks. */
- time_t result = ACCESS_ONCE(gtod->wall_time_sec);
+ api_time_t result = ACCESS_ONCE(gtod->wall_time_sec);
if (t)
*t = result;
return result;
}
-int time(time_t *t)
+int time(api_time_t *t)
__attribute__((weak, alias("__vdso_time")));
diff --git a/arch/x86/vdso/vdso32/vclock_gettime.c b/arch/x86/vdso/vdso32/vclock_gettime.c
index fab4ec6..b6df952 100644
--- a/arch/x86/vdso/vdso32/vclock_gettime.c
+++ b/arch/x86/vdso/vdso32/vclock_gettime.c
@@ -2,6 +2,8 @@
#ifdef CONFIG_X86_64
+#include <generated/asm/unistd_32_ia32.h>
+
#define _ASM_X86_PAGE_H
#define __pa(x) 0
@@ -10,6 +12,11 @@
#undef CONFIG_ILLEGAL_POINTER_VALUE
#define CONFIG_ILLEGAL_POINTER_VALUE 0
+#else
+
+#define __NR_ia32_clock_gettime __NR_clock_gettime
+#define __NR_ia32_gettimeofday __NR_gettimeofday
+
#endif
#include "../vclock_gettime.c"
--
1.8.5.3
From: Stefani Seibold <[email protected]>
The _install_special_mapping() is the new base function for
install_special_mapping(). This function will return a pointer of the
created VMA or a error code in an ERR_PTR()
This new function will be needed by the for the vdso 32 bit support to map the
additonal vvar and hpet pages into the 32 bit address space. This will be done
with io_remap_pfn_range() and remap_pfn_range, which requieres a vm_area_struct.
Signed-off-by: Stefani Seibold <[email protected]>
---
include/linux/mm.h | 3 +++
mm/mmap.c | 20 ++++++++++++++++----
2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..55342aa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1740,6 +1740,9 @@ extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
extern struct file *get_mm_exe_file(struct mm_struct *mm);
extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
+extern struct vm_area_struct *_install_special_mapping(struct mm_struct *mm,
+ unsigned long addr, unsigned long len,
+ unsigned long flags, struct page **pages);
extern int install_special_mapping(struct mm_struct *mm,
unsigned long addr, unsigned long len,
unsigned long flags, struct page **pages);
diff --git a/mm/mmap.c b/mm/mmap.c
index 20ff0c3..81ba54f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2918,7 +2918,7 @@ static const struct vm_operations_struct special_mapping_vmops = {
* The array pointer and the pages it points to are assumed to stay alive
* for as long as this mapping might exist.
*/
-int install_special_mapping(struct mm_struct *mm,
+struct vm_area_struct *_install_special_mapping(struct mm_struct *mm,
unsigned long addr, unsigned long len,
unsigned long vm_flags, struct page **pages)
{
@@ -2927,7 +2927,7 @@ int install_special_mapping(struct mm_struct *mm,
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (unlikely(vma == NULL))
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = mm;
@@ -2948,11 +2948,23 @@ int install_special_mapping(struct mm_struct *mm,
perf_event_mmap(vma);
- return 0;
+ return vma;
out:
kmem_cache_free(vm_area_cachep, vma);
- return ret;
+ return ERR_PTR(ret);
+}
+
+int install_special_mapping(struct mm_struct *mm,
+ unsigned long addr, unsigned long len,
+ unsigned long vm_flags, struct page **pages)
+{
+ struct vm_area_struct *vma = _install_special_mapping(mm,
+ addr, len, vm_flags, pages);
+
+ if (IS_ERR(vma))
+ return PTR_ERR(vma);
+ return 0;
}
static DEFINE_MUTEX(mm_all_locks_mutex);
--
1.8.5.3
From: Stefani Seibold <[email protected]>
This patch add the time support for 32 bit a VDSO to a 32 bit kernel.
For 32 bit programs running on a 32 bit kernel, the same mechanism is
used as for 64 bit programs running on a 64 bit kernel.
Signed-off-by: Stefani Seibold <[email protected]>
---
arch/x86/include/asm/elf.h | 2 +-
arch/x86/include/asm/vdso.h | 3 +
arch/x86/include/asm/vdso32.h | 10 +++
arch/x86/vdso/Makefile | 7 ++
arch/x86/vdso/vclock_gettime.c | 165 ++++++++++++++++++++++------------
arch/x86/vdso/vdso-layout.lds.S | 25 ++++++
arch/x86/vdso/vdso32-setup.c | 47 ++++++++--
arch/x86/vdso/vdso32/vclock_gettime.c | 16 ++++
arch/x86/vdso/vdso32/vdso32.lds.S | 11 ++-
9 files changed, 218 insertions(+), 68 deletions(-)
create mode 100644 arch/x86/include/asm/vdso32.h
create mode 100644 arch/x86/vdso/vdso32/vclock_gettime.c
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 9c999c1..21bae90 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -289,7 +289,7 @@ do { \
#else /* CONFIG_X86_32 */
-#define VDSO_HIGH_BASE 0xffffe000U /* CONFIG_COMPAT_VDSO address */
+#define VDSO_HIGH_BASE 0xffffc000U /* CONFIG_COMPAT_VDSO address */
/* 1GB for 64bit, 8MB for 32bit */
#define STACK_RND_MASK (test_thread_flag(TIF_ADDR32) ? 0x7ff : 0x3fffff)
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index fddb53d..fe3cef9 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -2,6 +2,9 @@
#define _ASM_X86_VDSO_H
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
+
+#include <asm/vdso32.h>
+
extern const char VDSO32_PRELINK[];
/*
diff --git a/arch/x86/include/asm/vdso32.h b/arch/x86/include/asm/vdso32.h
new file mode 100644
index 0000000..7dd2eb8
--- /dev/null
+++ b/arch/x86/include/asm/vdso32.h
@@ -0,0 +1,10 @@
+#ifndef _ASM_X86_VDSO32_H
+#define _ASM_X86_VDSO32_H
+
+#define VDSO_BASE_PAGE 0
+#define VDSO_VVAR_PAGE 1
+#define VDSO_HPET_PAGE 2
+#define VDSO_PAGES 3
+#define VDSO_OFFSET(x) ((x) * PAGE_SIZE)
+
+#endif
diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
index fd14be1..1ff5b0a 100644
--- a/arch/x86/vdso/Makefile
+++ b/arch/x86/vdso/Makefile
@@ -145,8 +145,15 @@ KBUILD_AFLAGS_32 := $(filter-out -m64,$(KBUILD_AFLAGS))
$(vdso32-images:%=$(obj)/%.dbg): KBUILD_AFLAGS = $(KBUILD_AFLAGS_32)
$(vdso32-images:%=$(obj)/%.dbg): asflags-$(CONFIG_X86_64) += -m32
+KBUILD_CFLAGS_32 := $(filter-out -m64,$(KBUILD_CFLAGS))
+KBUILD_CFLAGS_32 := $(filter-out -mcmodel=kernel,$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 := $(filter-out -fno-pic,$(KBUILD_CFLAGS_32))
+KBUILD_CFLAGS_32 += -m32 -msoft-float -mregparm=3 -freg-struct-return -fpic
+$(vdso32-images:%=$(obj)/%.dbg): KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
+
$(vdso32-images:%=$(obj)/%.dbg): $(obj)/vdso32-%.so.dbg: FORCE \
$(obj)/vdso32/vdso32.lds \
+ $(obj)/vdso32/vclock_gettime.o \
$(obj)/vdso32/note.o \
$(obj)/vdso32/%.o
$(call if_changed,vdso)
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index eb5d7a5..d163bb5 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -4,6 +4,9 @@
*
* Fast user context implementation of clock_gettime, gettimeofday, and time.
*
+ * 32 Bit compat layer by Stefani Seibold <[email protected]>
+ * sponsored by Rohde & Schwarz GmbH & Co. KG Munich/Germany
+ *
* The code should have no internal unresolved relocations.
* Check with readelf after changing.
*/
@@ -24,45 +27,78 @@
#include <asm/io.h>
#include <asm/pvclock.h>
+#ifndef BUILD_VDSO32
+
#define gtod (&VVAR(vsyscall_gtod_data))
-notrace static cycle_t vread_tsc(void)
+static notrace cycle_t vread_hpet(void)
{
- cycle_t ret;
- u64 last;
-
- /*
- * Empirically, a fence (of type that depends on the CPU)
- * before rdtsc is enough to ensure that rdtsc is ordered
- * with respect to loads. The various CPU manuals are unclear
- * as to whether rdtsc can be reordered with later loads,
- * but no one has ever seen it happen.
- */
- rdtsc_barrier();
- ret = (cycle_t)vget_cycles();
+ return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + HPET_COUNTER);
+}
- last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+ long ret;
+ asm("syscall" : "=a" (ret) :
+ "0" (__NR_clock_gettime), "D" (clock), "S" (ts) : "memory");
+ return ret;
+}
- if (likely(ret >= last))
- return ret;
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+ long ret;
- /*
- * GCC likes to generate cmov here, but this branch is extremely
- * predictable (it's just a funciton of time and the likely is
- * very likely) and there's a data dependence, so force GCC
- * to generate a branch instead. I don't barrier() because
- * we don't actually need a barrier, and if this function
- * ever gets inlined it will generate worse code.
- */
- asm volatile ("");
- return last;
+ asm("syscall" : "=a" (ret) :
+ "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
+ return ret;
}
+#else
+
+struct vsyscall_gtod_data vvar_vsyscall_gtod_data
+ __attribute__((visibility("hidden")));
+
+u32 hpet_counter
+ __attribute__((visibility("hidden")));
+
+#define gtod (&vvar_vsyscall_gtod_data)
static notrace cycle_t vread_hpet(void)
{
- return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + HPET_COUNTER);
+ return readl(&hpet_counter);
+}
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+ long ret;
+
+ asm(
+ "push %%ebx \n"
+ "mov %2,%%ebx \n"
+ "call VDSO32_vsyscall \n"
+ "pop %%ebx \n"
+ : "=a" (ret)
+ : "0" (__NR_clock_gettime), "d" (clock), "c" (ts)
+ : "memory");
+ return ret;
}
+notrace static long vdso_fallback_gtod(struct timeval *tv,
+ struct timezone *tz)
+{
+ long ret;
+
+ asm(
+ "push %%ebx \n"
+ "mov %2,%%ebx \n"
+ "call VDSO32_vsyscall \n"
+ "pop %%ebx \n"
+ : "=a" (ret)
+ : "0" (__NR_gettimeofday), "d" (tv), "c" (tz)
+ : "memory");
+ return ret;
+}
+#endif
+
#ifdef CONFIG_PARAVIRT_CLOCK
static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
@@ -124,7 +160,7 @@ static notrace cycle_t vread_pvclock(int *mode)
*mode = VCLOCK_NONE;
/* refer to tsc.c read_tsc() comment for rationale */
- last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+ last = gtod->clock.cycle_last;
if (likely(ret >= last))
return ret;
@@ -133,27 +169,41 @@ static notrace cycle_t vread_pvclock(int *mode)
}
#endif
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+notrace static cycle_t vread_tsc(void)
{
- long ret;
- asm("syscall" : "=a" (ret) :
- "0" (__NR_clock_gettime),"D" (clock), "S" (ts) : "memory");
- return ret;
-}
+ cycle_t ret;
+ u64 last;
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
- long ret;
+ /*
+ * Empirically, a fence (of type that depends on the CPU)
+ * before rdtsc is enough to ensure that rdtsc is ordered
+ * with respect to loads. The various CPU manuals are unclear
+ * as to whether rdtsc can be reordered with later loads,
+ * but no one has ever seen it happen.
+ */
+ rdtsc_barrier();
+ ret = (cycle_t)vget_cycles();
- asm("syscall" : "=a" (ret) :
- "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
- return ret;
-}
+ last = gtod->clock.cycle_last;
+ if (likely(ret >= last))
+ return ret;
+
+ /*
+ * GCC likes to generate cmov here, but this branch is extremely
+ * predictable (it's just a funciton of time and the likely is
+ * very likely) and there's a data dependence, so force GCC
+ * to generate a branch instead. I don't barrier() because
+ * we don't actually need a barrier, and if this function
+ * ever gets inlined it will generate worse code.
+ */
+ asm volatile ("");
+ return last;
+}
notrace static inline u64 vgetsns(int *mode)
{
- long v;
+ u64 v;
cycles_t cycles;
if (gtod->clock.vclock_mode == VCLOCK_TSC)
cycles = vread_tsc();
@@ -210,7 +260,7 @@ notrace static int do_monotonic(struct timespec *ts)
return mode;
}
-notrace static int do_realtime_coarse(struct timespec *ts)
+notrace static void do_realtime_coarse(struct timespec *ts)
{
unsigned long seq;
do {
@@ -218,10 +268,9 @@ notrace static int do_realtime_coarse(struct timespec *ts)
ts->tv_sec = gtod->wall_time_coarse.tv_sec;
ts->tv_nsec = gtod->wall_time_coarse.tv_nsec;
} while (unlikely(read_seqcount_retry(>od->seq, seq)));
- return 0;
}
-notrace static int do_monotonic_coarse(struct timespec *ts)
+notrace static void do_monotonic_coarse(struct timespec *ts)
{
unsigned long seq;
do {
@@ -229,30 +278,32 @@ notrace static int do_monotonic_coarse(struct timespec *ts)
ts->tv_sec = gtod->monotonic_time_coarse.tv_sec;
ts->tv_nsec = gtod->monotonic_time_coarse.tv_nsec;
} while (unlikely(read_seqcount_retry(>od->seq, seq)));
-
- return 0;
}
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
- int ret = VCLOCK_NONE;
-
switch (clock) {
case CLOCK_REALTIME:
- ret = do_realtime(ts);
+ if (do_realtime(ts) == VCLOCK_NONE)
+ goto fallback;
break;
case CLOCK_MONOTONIC:
- ret = do_monotonic(ts);
+ if (do_monotonic(ts) == VCLOCK_NONE)
+ goto fallback;
break;
case CLOCK_REALTIME_COARSE:
- return do_realtime_coarse(ts);
+ do_realtime_coarse(ts);
+ break;
case CLOCK_MONOTONIC_COARSE:
- return do_monotonic_coarse(ts);
+ do_monotonic_coarse(ts);
+ break;
+ default:
+ goto fallback;
}
- if (ret == VCLOCK_NONE)
- return vdso_fallback_gettime(clock, ts);
return 0;
+fallback:
+ return vdso_fallback_gettime(clock, ts);
}
int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
@@ -287,8 +338,8 @@ int gettimeofday(struct timeval *, struct timezone *)
*/
notrace time_t __vdso_time(time_t *t)
{
- /* This is atomic on x86_64 so we don't need any locks. */
- time_t result = ACCESS_ONCE(VVAR(vsyscall_gtod_data).wall_time_sec);
+ /* This is atomic on x86 so we don't need any locks. */
+ time_t result = ACCESS_ONCE(gtod->wall_time_sec);
if (t)
*t = result;
diff --git a/arch/x86/vdso/vdso-layout.lds.S b/arch/x86/vdso/vdso-layout.lds.S
index 634a2cf..109a26b 100644
--- a/arch/x86/vdso/vdso-layout.lds.S
+++ b/arch/x86/vdso/vdso-layout.lds.S
@@ -44,6 +44,27 @@ SECTIONS
. = ALIGN(0x100);
.text : { *(.text*) } :text =0x90909090
+
+#ifdef BUILD_VDSO32
+ . = ALIGN(PAGE_SIZE);
+
+ .vvar_sect : {
+ vvar = .;
+
+ /* Place all vvars at the offsets in asm/vvar.h. */
+#define EMIT_VVAR(name, offset) vvar_ ## name = vvar + offset;
+#define __VVAR_KERNEL_LDS
+#include <asm/vvar.h>
+#undef __VVAR_KERNEL_LDS
+#undef EMIT_VVAR
+ } :text :vvar_sect
+
+ . += PAGE_SIZE;
+
+ .hpet_sect : {
+ hpet_counter = . + 0xf0;
+ } :text :hpet_sect
+#endif
}
/*
@@ -61,4 +82,8 @@ PHDRS
dynamic PT_DYNAMIC FLAGS(4); /* PF_R */
note PT_NOTE FLAGS(4); /* PF_R */
eh_frame_hdr PT_GNU_EH_FRAME;
+#ifdef BUILD_VDSO32
+ vvar_sect PT_NULL FLAGS(4); /* PF_R */
+ hpet_sect PT_NULL FLAGS(4); /* PF_R */
+#endif
}
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index d6bfb87..eb2050c 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -25,6 +25,7 @@
#include <asm/tlbflush.h>
#include <asm/vdso.h>
#include <asm/proto.h>
+#include <asm/fixmap.h>
enum {
VDSO_DISABLED = 0,
@@ -193,7 +194,7 @@ static __init void relocate_vdso(Elf32_Ehdr *ehdr)
}
}
-static struct page *vdso32_pages[1];
+static struct page *vdso32_pages[VDSO_PAGES];
#ifdef CONFIG_X86_64
@@ -310,6 +311,11 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
unsigned long addr;
int ret = 0;
bool compat;
+ struct vm_area_struct *vma;
+ extern char __vvar_page;
+#ifdef CONFIG_HPET_TIMER
+ extern unsigned long hpet_address;
+#endif
#ifdef CONFIG_X86_X32_ABI
if (test_thread_flag(TIF_X32))
@@ -330,7 +336,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
if (compat)
addr = VDSO_HIGH_BASE;
else {
- addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
+ addr = get_unmapped_area(NULL, 0, VDSO_OFFSET(VDSO_PAGES), 0, 0);
if (IS_ERR_VALUE(addr)) {
ret = addr;
goto up_fail;
@@ -340,16 +346,39 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
current->mm->context.vdso = (void *)addr;
if (compat_uses_vma || !compat) {
- /*
- * MAYWRITE to allow gdb to COW and set breakpoints
- */
- ret = install_special_mapping(mm, addr, PAGE_SIZE,
- VM_READ|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- vdso32_pages);
+
+ vma = _install_special_mapping(mm,
+ addr,
+ VDSO_OFFSET(VDSO_PAGES),
+ VM_READ|VM_EXEC,
+ vdso32_pages);
+
+ if (IS_ERR(vma)) {
+ ret = PTR_ERR(vma);
+ goto up_fail;
+ }
+
+ ret = remap_pfn_range(vma,
+ vma->vm_start + VDSO_OFFSET(VDSO_VVAR_PAGE),
+ __pa_symbol(&__vvar_page) >> PAGE_SHIFT,
+ PAGE_SIZE,
+ PAGE_READONLY);
if (ret)
goto up_fail;
+
+#ifdef CONFIG_HPET_TIMER
+ if (hpet_address) {
+ ret = io_remap_pfn_range(vma,
+ vma->vm_start + VDSO_OFFSET(VDSO_HPET_PAGE),
+ hpet_address >> PAGE_SHIFT,
+ PAGE_SIZE,
+ pgprot_noncached(PAGE_READONLY));
+
+ if (ret)
+ goto up_fail;
+ }
+#endif
}
current_thread_info()->sysenter_return =
diff --git a/arch/x86/vdso/vdso32/vclock_gettime.c b/arch/x86/vdso/vdso32/vclock_gettime.c
new file mode 100644
index 0000000..fab4ec6
--- /dev/null
+++ b/arch/x86/vdso/vdso32/vclock_gettime.c
@@ -0,0 +1,16 @@
+#define BUILD_VDSO32
+
+#ifdef CONFIG_X86_64
+
+#define _ASM_X86_PAGE_H
+
+#define __pa(x) 0
+#define __va(x) 0
+
+#undef CONFIG_ILLEGAL_POINTER_VALUE
+#define CONFIG_ILLEGAL_POINTER_VALUE 0
+
+#endif
+
+#include "../vclock_gettime.c"
+
diff --git a/arch/x86/vdso/vdso32/vdso32.lds.S b/arch/x86/vdso/vdso32/vdso32.lds.S
index 976124b..66e73b2 100644
--- a/arch/x86/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/vdso/vdso32/vdso32.lds.S
@@ -8,6 +8,9 @@
* values visible using the asm-x86/vdso.h macros from the kernel proper.
*/
+#include <asm/page.h>
+
+#define BUILD_VDSO32
#define VDSO_PRELINK 0
#include "../vdso-layout.lds.S"
@@ -19,11 +22,14 @@ ENTRY(__kernel_vsyscall);
*/
VERSION
{
- LINUX_2.5 {
+ LINUX_2.6 {
global:
__kernel_vsyscall;
__kernel_sigreturn;
__kernel_rt_sigreturn;
+ __vdso_clock_gettime;
+ __vdso_gettimeofday;
+ __vdso_time;
local: *;
};
}
@@ -35,3 +41,6 @@ VDSO32_PRELINK = VDSO_PRELINK;
VDSO32_vsyscall = __kernel_vsyscall;
VDSO32_sigreturn = __kernel_sigreturn;
VDSO32_rt_sigreturn = __kernel_rt_sigreturn;
+VDSO32_clock_gettime = clock_gettime;
+VDSO32_gettimeofday = gettimeofday;
+VDSO32_time = time;
--
1.8.5.3
From: Stefani Seibold <[email protected]>
This patch move the vsyscall_gtod_data handling out of vsyscall_64.c
into an additonal file vsyscall_gtod.c to make the functionality
available for x86 32 bit kernel.
It also adds a new vsyscall_32.c which setup the VVAR page.
Signed-off-by: Stefani Seibold <[email protected]>
---
arch/x86/Kconfig | 4 +--
arch/x86/include/asm/clocksource.h | 4 ---
arch/x86/include/asm/fixmap.h | 2 ++
arch/x86/include/asm/vvar.h | 4 +++
arch/x86/kernel/Makefile | 3 +-
arch/x86/kernel/hpet.c | 4 ---
arch/x86/kernel/setup.c | 2 --
arch/x86/kernel/tsc.c | 2 --
arch/x86/kernel/vmlinux.lds.S | 3 --
arch/x86/kernel/vsyscall_32.c | 24 +++++++++++++++
arch/x86/kernel/vsyscall_64.c | 44 ----------------------------
arch/x86/kernel/vsyscall_gtod.c | 60 ++++++++++++++++++++++++++++++++++++++
12 files changed, 94 insertions(+), 62 deletions(-)
create mode 100644 arch/x86/kernel/vsyscall_32.c
create mode 100644 arch/x86/kernel/vsyscall_gtod.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 940e50e..b556f00 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -107,9 +107,9 @@ config X86
select HAVE_ARCH_SOFT_DIRTY
select CLOCKSOURCE_WATCHDOG
select GENERIC_CLOCKEVENTS
- select ARCH_CLOCKSOURCE_DATA if X86_64
+ select ARCH_CLOCKSOURCE_DATA
select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
- select GENERIC_TIME_VSYSCALL if X86_64
+ select GENERIC_TIME_VSYSCALL
select KTIME_SCALAR if X86_32
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h
index 16a57f4..eda81dc 100644
--- a/arch/x86/include/asm/clocksource.h
+++ b/arch/x86/include/asm/clocksource.h
@@ -3,8 +3,6 @@
#ifndef _ASM_X86_CLOCKSOURCE_H
#define _ASM_X86_CLOCKSOURCE_H
-#ifdef CONFIG_X86_64
-
#define VCLOCK_NONE 0 /* No vDSO clock available. */
#define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */
#define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */
@@ -14,6 +12,4 @@ struct arch_clocksource_data {
int vclock_mode;
};
-#endif /* CONFIG_X86_64 */
-
#endif /* _ASM_X86_CLOCKSOURCE_H */
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 7252cd3..504c04a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -74,6 +74,8 @@ extern unsigned long __FIXADDR_TOP;
enum fixed_addresses {
#ifdef CONFIG_X86_32
FIX_HOLE,
+ VSYSCALL_HPET,
+ VVAR_PAGE,
FIX_VDSO,
#else
VSYSCALL_LAST_PAGE,
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index d76ac40..c442782 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -17,7 +17,11 @@
*/
/* Base address of vvars. This is not ABI. */
+#ifdef CONFIG_X86_64
#define VVAR_ADDRESS (-10*1024*1024 - 4096)
+#else
+#define VVAR_ADDRESS 0xffffd000
+#endif
#if defined(__VVAR_KERNEL_LDS)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cb648c8..3282eda 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -26,7 +26,8 @@ obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-y += probe_roms.o
obj-$(CONFIG_X86_32) += i386_ksyms_32.o
obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o
-obj-y += syscall_$(BITS).o
+obj-y += syscall_$(BITS).o vsyscall_gtod.o
+obj-$(CONFIG_X86_32) += vsyscall_32.o
obj-$(CONFIG_X86_64) += vsyscall_64.o
obj-$(CONFIG_X86_64) += vsyscall_emu_64.o
obj-$(CONFIG_SYSFS) += ksysfs.o
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index da85a8e..54263f0 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -74,9 +74,7 @@ static inline void hpet_writel(unsigned int d, unsigned int a)
static inline void hpet_set_mapping(void)
{
hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
-#ifdef CONFIG_X86_64
__set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VVAR_NOCACHE);
-#endif
}
static inline void hpet_clear_mapping(void)
@@ -752,9 +750,7 @@ static struct clocksource clocksource_hpet = {
.mask = HPET_MASK,
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
.resume = hpet_resume_counter,
-#ifdef CONFIG_X86_64
.archdata = { .vclock_mode = VCLOCK_HPET },
-#endif
};
static int hpet_clocksource_register(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 06853e6..56ff330 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1182,9 +1182,7 @@ void __init setup_arch(char **cmdline_p)
tboot_probe();
-#ifdef CONFIG_X86_64
map_vsyscall();
-#endif
generic_apic_probe();
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 19e5adb..f49d1a2 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -988,9 +988,7 @@ static struct clocksource clocksource_tsc = {
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_MUST_VERIFY,
-#ifdef CONFIG_X86_64
.archdata = { .vclock_mode = VCLOCK_TSC },
-#endif
};
void mark_tsc_unstable(char *reason)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index da6b35a..1d4897b 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -147,7 +147,6 @@ SECTIONS
_edata = .;
} :data
-#ifdef CONFIG_X86_64
. = ALIGN(PAGE_SIZE);
__vvar_page = .;
@@ -169,8 +168,6 @@ SECTIONS
. = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
-#endif /* CONFIG_X86_64 */
-
/* Init code and data - will be freed after init */
. = ALIGN(PAGE_SIZE);
.init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
diff --git a/arch/x86/kernel/vsyscall_32.c b/arch/x86/kernel/vsyscall_32.c
new file mode 100644
index 0000000..f541dfa
--- /dev/null
+++ b/arch/x86/kernel/vsyscall_32.c
@@ -0,0 +1,24 @@
+/*
+ * Copyright (C) 2001 Andrea Arcangeli <[email protected]> SuSE
+ * Copyright 2003 Andi Kleen, SuSE Labs.
+ *
+ * Modified for x86 32 bit arch by Stefani Seibold <[email protected]>
+ *
+ * Thanks to [email protected] for some useful hint.
+ * Special thanks to Ingo Molnar for his early experience with
+ * a different vsyscall implementation for Linux/IA32 and for the name.
+ *
+ */
+
+#include <asm/vsyscall.h>
+#include <asm/pgtable.h>
+#include <asm/fixmap.h>
+
+void __init map_vsyscall(void)
+{
+ extern char __vvar_page;
+ unsigned long physaddr_vvar_page = __pa_symbol(&__vvar_page);
+
+ __set_fixmap(VVAR_PAGE, physaddr_vvar_page, PAGE_KERNEL_VVAR);
+}
+
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 1f96f93..50622ed 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -47,14 +47,12 @@
#include <asm/segment.h>
#include <asm/desc.h>
#include <asm/topology.h>
-#include <asm/vgtod.h>
#include <asm/traps.h>
#define CREATE_TRACE_POINTS
#include "vsyscall_trace.h"
DEFINE_VVAR(int, vgetcpu_mode);
-DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data);
static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
@@ -77,48 +75,6 @@ static int __init vsyscall_setup(char *str)
}
early_param("vsyscall", vsyscall_setup);
-void update_vsyscall_tz(void)
-{
- vsyscall_gtod_data.sys_tz = sys_tz;
-}
-
-void update_vsyscall(struct timekeeper *tk)
-{
- struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
-
- write_seqcount_begin(&vdata->seq);
-
- /* copy vsyscall data */
- vdata->clock.vclock_mode = tk->clock->archdata.vclock_mode;
- vdata->clock.cycle_last = tk->clock->cycle_last;
- vdata->clock.mask = tk->clock->mask;
- vdata->clock.mult = tk->mult;
- vdata->clock.shift = tk->shift;
-
- vdata->wall_time_sec = tk->xtime_sec;
- vdata->wall_time_snsec = tk->xtime_nsec;
-
- vdata->monotonic_time_sec = tk->xtime_sec
- + tk->wall_to_monotonic.tv_sec;
- vdata->monotonic_time_snsec = tk->xtime_nsec
- + (tk->wall_to_monotonic.tv_nsec
- << tk->shift);
- while (vdata->monotonic_time_snsec >=
- (((u64)NSEC_PER_SEC) << tk->shift)) {
- vdata->monotonic_time_snsec -=
- ((u64)NSEC_PER_SEC) << tk->shift;
- vdata->monotonic_time_sec++;
- }
-
- vdata->wall_time_coarse.tv_sec = tk->xtime_sec;
- vdata->wall_time_coarse.tv_nsec = (long)(tk->xtime_nsec >> tk->shift);
-
- vdata->monotonic_time_coarse = timespec_add(vdata->wall_time_coarse,
- tk->wall_to_monotonic);
-
- write_seqcount_end(&vdata->seq);
-}
-
static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
const char *message)
{
diff --git a/arch/x86/kernel/vsyscall_gtod.c b/arch/x86/kernel/vsyscall_gtod.c
new file mode 100644
index 0000000..91862a4
--- /dev/null
+++ b/arch/x86/kernel/vsyscall_gtod.c
@@ -0,0 +1,60 @@
+/*
+ * Copyright (C) 2001 Andrea Arcangeli <[email protected]> SuSE
+ * Copyright 2003 Andi Kleen, SuSE Labs.
+ *
+ * Modified for x86 32 bit architecture by
+ * Stefani Seibold <[email protected]>
+ *
+ * Thanks to [email protected] for some useful hint.
+ * Special thanks to Ingo Molnar for his early experience with
+ * a different vsyscall implementation for Linux/IA32 and for the name.
+ *
+ */
+
+#include <linux/timekeeper_internal.h>
+#include <asm/vgtod.h>
+
+DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data);
+
+void update_vsyscall_tz(void)
+{
+ vsyscall_gtod_data.sys_tz = sys_tz;
+}
+
+void update_vsyscall(struct timekeeper *tk)
+{
+ struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+
+ write_seqcount_begin(&vdata->seq);
+
+ /* copy vsyscall data */
+ vdata->clock.vclock_mode = tk->clock->archdata.vclock_mode;
+ vdata->clock.cycle_last = tk->clock->cycle_last;
+ vdata->clock.mask = tk->clock->mask;
+ vdata->clock.mult = tk->mult;
+ vdata->clock.shift = tk->shift;
+
+ vdata->wall_time_sec = tk->xtime_sec;
+ vdata->wall_time_snsec = tk->xtime_nsec;
+
+ vdata->monotonic_time_sec = tk->xtime_sec
+ + tk->wall_to_monotonic.tv_sec;
+ vdata->monotonic_time_snsec = tk->xtime_nsec
+ + (tk->wall_to_monotonic.tv_nsec
+ << tk->shift);
+ while (vdata->monotonic_time_snsec >=
+ (((u64)NSEC_PER_SEC) << tk->shift)) {
+ vdata->monotonic_time_snsec -=
+ ((u64)NSEC_PER_SEC) << tk->shift;
+ vdata->monotonic_time_sec++;
+ }
+
+ vdata->wall_time_coarse.tv_sec = tk->xtime_sec;
+ vdata->wall_time_coarse.tv_nsec = (long)(tk->xtime_nsec >> tk->shift);
+
+ vdata->monotonic_time_coarse = timespec_add(vdata->wall_time_coarse,
+ tk->wall_to_monotonic);
+
+ write_seqcount_end(&vdata->seq);
+}
+
--
1.8.5.3
On 02/01/2014 07:32 AM, [email protected] wrote:
>
> This kind of type hacking could be prevent in the future by doing a call to the
> 64 bit code by the following sequence:
>
> - Compile the arch/x86/vdso/vclock_gettime.c as 64 bit, but only generate
> the assembler output.
> - Next compile a 32 bit object by including the 64 bit vclock_gettime.s
> prefixed with .code64
> - At least we need a trampolin code which invokes the 64 bit code and do
> the API conversation (64 bit longs to 32 bit longs), like the
> followig snipped:
>
Honestly, I think the overhead of the mode switch would ruin a lot of
the advantages with the vdso.
-hpa
On Sat, Feb 1, 2014 at 7:32 AM, <[email protected]> wrote:
> From: Stefani Seibold <[email protected]>
>
> This patch move the vsyscall_gtod_data handling out of vsyscall_64.c
> into an additonal file vsyscall_gtod.c to make the functionality
> available for x86 32 bit kernel.
>
> It also adds a new vsyscall_32.c which setup the VVAR page.
>
> Signed-off-by: Stefani Seibold <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
...with the caveat that it would be even better if you moved vgetcpu
over and called this thing vvar.c instead of vsyscall_gtod.c.
> ---
> arch/x86/Kconfig | 4 +--
> arch/x86/include/asm/clocksource.h | 4 ---
> arch/x86/include/asm/fixmap.h | 2 ++
> arch/x86/include/asm/vvar.h | 4 +++
> arch/x86/kernel/Makefile | 3 +-
> arch/x86/kernel/hpet.c | 4 ---
> arch/x86/kernel/setup.c | 2 --
> arch/x86/kernel/tsc.c | 2 --
> arch/x86/kernel/vmlinux.lds.S | 3 --
> arch/x86/kernel/vsyscall_32.c | 24 +++++++++++++++
> arch/x86/kernel/vsyscall_64.c | 44 ----------------------------
> arch/x86/kernel/vsyscall_gtod.c | 60 ++++++++++++++++++++++++++++++++++++++
> 12 files changed, 94 insertions(+), 62 deletions(-)
> create mode 100644 arch/x86/kernel/vsyscall_32.c
> create mode 100644 arch/x86/kernel/vsyscall_gtod.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 940e50e..b556f00 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -107,9 +107,9 @@ config X86
> select HAVE_ARCH_SOFT_DIRTY
> select CLOCKSOURCE_WATCHDOG
> select GENERIC_CLOCKEVENTS
> - select ARCH_CLOCKSOURCE_DATA if X86_64
> + select ARCH_CLOCKSOURCE_DATA
> select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
> - select GENERIC_TIME_VSYSCALL if X86_64
> + select GENERIC_TIME_VSYSCALL
> select KTIME_SCALAR if X86_32
> select GENERIC_STRNCPY_FROM_USER
> select GENERIC_STRNLEN_USER
> diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h
> index 16a57f4..eda81dc 100644
> --- a/arch/x86/include/asm/clocksource.h
> +++ b/arch/x86/include/asm/clocksource.h
> @@ -3,8 +3,6 @@
> #ifndef _ASM_X86_CLOCKSOURCE_H
> #define _ASM_X86_CLOCKSOURCE_H
>
> -#ifdef CONFIG_X86_64
> -
> #define VCLOCK_NONE 0 /* No vDSO clock available. */
> #define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */
> #define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */
> @@ -14,6 +12,4 @@ struct arch_clocksource_data {
> int vclock_mode;
> };
>
> -#endif /* CONFIG_X86_64 */
> -
> #endif /* _ASM_X86_CLOCKSOURCE_H */
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 7252cd3..504c04a 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -74,6 +74,8 @@ extern unsigned long __FIXADDR_TOP;
> enum fixed_addresses {
> #ifdef CONFIG_X86_32
> FIX_HOLE,
> + VSYSCALL_HPET,
> + VVAR_PAGE,
> FIX_VDSO,
> #else
> VSYSCALL_LAST_PAGE,
> diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
> index d76ac40..c442782 100644
> --- a/arch/x86/include/asm/vvar.h
> +++ b/arch/x86/include/asm/vvar.h
> @@ -17,7 +17,11 @@
> */
>
> /* Base address of vvars. This is not ABI. */
> +#ifdef CONFIG_X86_64
> #define VVAR_ADDRESS (-10*1024*1024 - 4096)
> +#else
> +#define VVAR_ADDRESS 0xffffd000
> +#endif
>
> #if defined(__VVAR_KERNEL_LDS)
>
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index cb648c8..3282eda 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -26,7 +26,8 @@ obj-$(CONFIG_IRQ_WORK) += irq_work.o
> obj-y += probe_roms.o
> obj-$(CONFIG_X86_32) += i386_ksyms_32.o
> obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o
> -obj-y += syscall_$(BITS).o
> +obj-y += syscall_$(BITS).o vsyscall_gtod.o
> +obj-$(CONFIG_X86_32) += vsyscall_32.o
> obj-$(CONFIG_X86_64) += vsyscall_64.o
> obj-$(CONFIG_X86_64) += vsyscall_emu_64.o
> obj-$(CONFIG_SYSFS) += ksysfs.o
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index da85a8e..54263f0 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -74,9 +74,7 @@ static inline void hpet_writel(unsigned int d, unsigned int a)
> static inline void hpet_set_mapping(void)
> {
> hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
> -#ifdef CONFIG_X86_64
> __set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VVAR_NOCACHE);
> -#endif
> }
>
> static inline void hpet_clear_mapping(void)
> @@ -752,9 +750,7 @@ static struct clocksource clocksource_hpet = {
> .mask = HPET_MASK,
> .flags = CLOCK_SOURCE_IS_CONTINUOUS,
> .resume = hpet_resume_counter,
> -#ifdef CONFIG_X86_64
> .archdata = { .vclock_mode = VCLOCK_HPET },
> -#endif
> };
>
> static int hpet_clocksource_register(void)
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 06853e6..56ff330 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1182,9 +1182,7 @@ void __init setup_arch(char **cmdline_p)
>
> tboot_probe();
>
> -#ifdef CONFIG_X86_64
> map_vsyscall();
> -#endif
>
> generic_apic_probe();
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 19e5adb..f49d1a2 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -988,9 +988,7 @@ static struct clocksource clocksource_tsc = {
> .mask = CLOCKSOURCE_MASK(64),
> .flags = CLOCK_SOURCE_IS_CONTINUOUS |
> CLOCK_SOURCE_MUST_VERIFY,
> -#ifdef CONFIG_X86_64
> .archdata = { .vclock_mode = VCLOCK_TSC },
> -#endif
> };
>
> void mark_tsc_unstable(char *reason)
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index da6b35a..1d4897b 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -147,7 +147,6 @@ SECTIONS
> _edata = .;
> } :data
>
> -#ifdef CONFIG_X86_64
>
> . = ALIGN(PAGE_SIZE);
> __vvar_page = .;
> @@ -169,8 +168,6 @@ SECTIONS
>
> . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
>
> -#endif /* CONFIG_X86_64 */
> -
> /* Init code and data - will be freed after init */
> . = ALIGN(PAGE_SIZE);
> .init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
> diff --git a/arch/x86/kernel/vsyscall_32.c b/arch/x86/kernel/vsyscall_32.c
> new file mode 100644
> index 0000000..f541dfa
> --- /dev/null
> +++ b/arch/x86/kernel/vsyscall_32.c
> @@ -0,0 +1,24 @@
> +/*
> + * Copyright (C) 2001 Andrea Arcangeli <[email protected]> SuSE
> + * Copyright 2003 Andi Kleen, SuSE Labs.
> + *
> + * Modified for x86 32 bit arch by Stefani Seibold <[email protected]>
> + *
> + * Thanks to [email protected] for some useful hint.
> + * Special thanks to Ingo Molnar for his early experience with
> + * a different vsyscall implementation for Linux/IA32 and for the name.
> + *
> + */
> +
> +#include <asm/vsyscall.h>
> +#include <asm/pgtable.h>
> +#include <asm/fixmap.h>
> +
> +void __init map_vsyscall(void)
> +{
> + extern char __vvar_page;
> + unsigned long physaddr_vvar_page = __pa_symbol(&__vvar_page);
> +
> + __set_fixmap(VVAR_PAGE, physaddr_vvar_page, PAGE_KERNEL_VVAR);
> +}
> +
> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
> index 1f96f93..50622ed 100644
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -47,14 +47,12 @@
> #include <asm/segment.h>
> #include <asm/desc.h>
> #include <asm/topology.h>
> -#include <asm/vgtod.h>
> #include <asm/traps.h>
>
> #define CREATE_TRACE_POINTS
> #include "vsyscall_trace.h"
>
> DEFINE_VVAR(int, vgetcpu_mode);
> -DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data);
>
> static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
>
> @@ -77,48 +75,6 @@ static int __init vsyscall_setup(char *str)
> }
> early_param("vsyscall", vsyscall_setup);
>
> -void update_vsyscall_tz(void)
> -{
> - vsyscall_gtod_data.sys_tz = sys_tz;
> -}
> -
> -void update_vsyscall(struct timekeeper *tk)
> -{
> - struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
> -
> - write_seqcount_begin(&vdata->seq);
> -
> - /* copy vsyscall data */
> - vdata->clock.vclock_mode = tk->clock->archdata.vclock_mode;
> - vdata->clock.cycle_last = tk->clock->cycle_last;
> - vdata->clock.mask = tk->clock->mask;
> - vdata->clock.mult = tk->mult;
> - vdata->clock.shift = tk->shift;
> -
> - vdata->wall_time_sec = tk->xtime_sec;
> - vdata->wall_time_snsec = tk->xtime_nsec;
> -
> - vdata->monotonic_time_sec = tk->xtime_sec
> - + tk->wall_to_monotonic.tv_sec;
> - vdata->monotonic_time_snsec = tk->xtime_nsec
> - + (tk->wall_to_monotonic.tv_nsec
> - << tk->shift);
> - while (vdata->monotonic_time_snsec >=
> - (((u64)NSEC_PER_SEC) << tk->shift)) {
> - vdata->monotonic_time_snsec -=
> - ((u64)NSEC_PER_SEC) << tk->shift;
> - vdata->monotonic_time_sec++;
> - }
> -
> - vdata->wall_time_coarse.tv_sec = tk->xtime_sec;
> - vdata->wall_time_coarse.tv_nsec = (long)(tk->xtime_nsec >> tk->shift);
> -
> - vdata->monotonic_time_coarse = timespec_add(vdata->wall_time_coarse,
> - tk->wall_to_monotonic);
> -
> - write_seqcount_end(&vdata->seq);
> -}
> -
> static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
> const char *message)
> {
> diff --git a/arch/x86/kernel/vsyscall_gtod.c b/arch/x86/kernel/vsyscall_gtod.c
> new file mode 100644
> index 0000000..91862a4
> --- /dev/null
> +++ b/arch/x86/kernel/vsyscall_gtod.c
> @@ -0,0 +1,60 @@
> +/*
> + * Copyright (C) 2001 Andrea Arcangeli <[email protected]> SuSE
> + * Copyright 2003 Andi Kleen, SuSE Labs.
> + *
> + * Modified for x86 32 bit architecture by
> + * Stefani Seibold <[email protected]>
> + *
> + * Thanks to [email protected] for some useful hint.
> + * Special thanks to Ingo Molnar for his early experience with
> + * a different vsyscall implementation for Linux/IA32 and for the name.
> + *
> + */
> +
> +#include <linux/timekeeper_internal.h>
> +#include <asm/vgtod.h>
> +
> +DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data);
> +
> +void update_vsyscall_tz(void)
> +{
> + vsyscall_gtod_data.sys_tz = sys_tz;
> +}
> +
> +void update_vsyscall(struct timekeeper *tk)
> +{
> + struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
> +
> + write_seqcount_begin(&vdata->seq);
> +
> + /* copy vsyscall data */
> + vdata->clock.vclock_mode = tk->clock->archdata.vclock_mode;
> + vdata->clock.cycle_last = tk->clock->cycle_last;
> + vdata->clock.mask = tk->clock->mask;
> + vdata->clock.mult = tk->mult;
> + vdata->clock.shift = tk->shift;
> +
> + vdata->wall_time_sec = tk->xtime_sec;
> + vdata->wall_time_snsec = tk->xtime_nsec;
> +
> + vdata->monotonic_time_sec = tk->xtime_sec
> + + tk->wall_to_monotonic.tv_sec;
> + vdata->monotonic_time_snsec = tk->xtime_nsec
> + + (tk->wall_to_monotonic.tv_nsec
> + << tk->shift);
> + while (vdata->monotonic_time_snsec >=
> + (((u64)NSEC_PER_SEC) << tk->shift)) {
> + vdata->monotonic_time_snsec -=
> + ((u64)NSEC_PER_SEC) << tk->shift;
> + vdata->monotonic_time_sec++;
> + }
> +
> + vdata->wall_time_coarse.tv_sec = tk->xtime_sec;
> + vdata->wall_time_coarse.tv_nsec = (long)(tk->xtime_nsec >> tk->shift);
> +
> + vdata->monotonic_time_coarse = timespec_add(vdata->wall_time_coarse,
> + tk->wall_to_monotonic);
> +
> + write_seqcount_end(&vdata->seq);
> +}
> +
> --
> 1.8.5.3
>
--
Andy Lutomirski
AMA Capital Management, LLC
On Sat, Feb 1, 2014 at 7:32 AM, <[email protected]> wrote:
> From: Stefani Seibold <[email protected]>
>
> This patch add the time support for 32 bit a VDSO to a 32 bit kernel.
>
> For 32 bit programs running on a 32 bit kernel, the same mechanism is
> used as for 64 bit programs running on a 64 bit kernel.
>
> Signed-off-by: Stefani Seibold <[email protected]>
I have multiple issues with this patch.
- It's very hard to review the vclock_gettime.c changes. You're
doing at least three things in there: reworking the return types of
all the helpers, moving code, and adding 32-bit code. Can you split
the patch so that the resulting diff is readable? (e.g. one patch
that purely reorders things, one patch that reworks the return types,
and a third patch with the meat? if splitting into just two results
in a comprehensible diff, that's fine, too.)
- There are multiple vvar mappings. I still don't understand why.
You've stuck the vvar page into the fixmap *and* into a special
mapping. The former works on native 32 bit only, but you don't seem
to *use* the mapping, which confuses me. The latter may or may not
confuse things (criu?) that try to parse /proc/self/maps.
If it is, indeed, okay to use non-fixed maps on 32-bit, it might
also be okay on 64-bit. If so, it could be useful to implement that,
which would remove a bit of a wart and allow PR_SET_TSC to work
usefully for 64-bit userspace. (This would remove the need for the
VVAR macro and would allow shorter rip-relative address modes.)
(Note that those fixmaps are a security problem on native 32-bit if
NX is not available. We may not care.)
- The VVAR macro is all screwed up now. Please either continue to
use it (and fix the definition) or stop using it everywhere and figure
out a different way. The current approach is just going to fester if
anyone ports the rest of the vdso to work in 32-bit environments.
- You've hardcoded the address of the counter in the HPET page into
the linker script. Please don't. If you really need to have the HPET
mapping in the linker script, please just reference the base address
and add the offset in C code using the appropriate macro.
> ---
> arch/x86/include/asm/elf.h | 2 +-
> arch/x86/include/asm/vdso.h | 3 +
> arch/x86/include/asm/vdso32.h | 10 +++
> arch/x86/vdso/Makefile | 7 ++
> arch/x86/vdso/vclock_gettime.c | 165 ++++++++++++++++++++++------------
> arch/x86/vdso/vdso-layout.lds.S | 25 ++++++
> arch/x86/vdso/vdso32-setup.c | 47 ++++++++--
> arch/x86/vdso/vdso32/vclock_gettime.c | 16 ++++
> arch/x86/vdso/vdso32/vdso32.lds.S | 11 ++-
> 9 files changed, 218 insertions(+), 68 deletions(-)
> create mode 100644 arch/x86/include/asm/vdso32.h
> create mode 100644 arch/x86/vdso/vdso32/vclock_gettime.c
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index 9c999c1..21bae90 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -289,7 +289,7 @@ do { \
>
> #else /* CONFIG_X86_32 */
>
> -#define VDSO_HIGH_BASE 0xffffe000U /* CONFIG_COMPAT_VDSO address */
> +#define VDSO_HIGH_BASE 0xffffc000U /* CONFIG_COMPAT_VDSO address */
This is odd. Can you explain it?
--Andy
On 02/01/2014 03:59 PM, Andy Lutomirski wrote:
>
> If it is, indeed, okay to use non-fixed maps on 32-bit, it might
> also be okay on 64-bit. If so, it could be useful to implement that,
> which would remove a bit of a wart and allow PR_SET_TSC to work
> usefully for 64-bit userspace. (This would remove the need for the
> VVAR macro and would allow shorter rip-relative address modes.)
>
We can't really move the 64-bit legacy vsyscall area, though, as it is
an ABI. It can be disabled with vsyscall=none, but Linus has vehemently
vetoed removing them.
> (Note that those fixmaps are a security problem on native 32-bit if
> NX is not available. We may not care.)
Not only on native 32 bit... although the amount of 64-bit hardware
without NX is quite small, the same is true for anywhere near modern
32-bit hardware.
>>
>> -#define VDSO_HIGH_BASE 0xffffe000U /* CONFIG_COMPAT_VDSO address */
>> +#define VDSO_HIGH_BASE 0xffffc000U /* CONFIG_COMPAT_VDSO address */
>
> This is odd. Can you explain it?
>
He needs 3 pages instead of 1 after his changes.
-hpa
On Sat, Feb 1, 2014 at 4:26 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/01/2014 03:59 PM, Andy Lutomirski wrote:
>>
>> If it is, indeed, okay to use non-fixed maps on 32-bit, it might
>> also be okay on 64-bit. If so, it could be useful to implement that,
>> which would remove a bit of a wart and allow PR_SET_TSC to work
>> usefully for 64-bit userspace. (This would remove the need for the
>> VVAR macro and would allow shorter rip-relative address modes.)
>>
>
> We can't really move the 64-bit legacy vsyscall area, though, as it is
> an ABI. It can be disabled with vsyscall=none, but Linus has vehemently
> vetoed removing them.
VVAR != vsyscall. They've been different pages since I wrote the
vsyscall emulation stuff. Any userspace code that relies on any of
the contents of the VVAR page is totally screwed already, since the
layout changes semi-regularly and depends on whether lockdep is
enabled.
>
>> (Note that those fixmaps are a security problem on native 32-bit if
>> NX is not available. We may not care.)
>
> Not only on native 32 bit... although the amount of 64-bit hardware
> without NX is quite small, the same is true for anywhere near modern
> 32-bit hardware.
<irrelevant>It can't be a problem for 32-bit compat mode, though,
since userspace can't address the fixmap anyway.</irrelevant>
>
>>>
>>> -#define VDSO_HIGH_BASE 0xffffe000U /* CONFIG_COMPAT_VDSO address */
>>> +#define VDSO_HIGH_BASE 0xffffc000U /* CONFIG_COMPAT_VDSO address */
>>
>> This is odd. Can you explain it?
>>
>
> He needs 3 pages instead of 1 after his changes.
Right. But there's some obscure ABI reason for CONFIG_COMPAT_VDSO,
and if this breaks it, then it's no good. From extremely vague
memory, there's some version of SuSE that breaks if the 32-bit vdso
moves. I have no idea what the bug is, but moving a "compat" address
seems suspect.
--Andy
Yes, that we can move, of course.
On February 1, 2014 4:30:17 PM PST, Andy Lutomirski <[email protected]> wrote:
>On Sat, Feb 1, 2014 at 4:26 PM, H. Peter Anvin <[email protected]> wrote:
>> On 02/01/2014 03:59 PM, Andy Lutomirski wrote:
>>>
>>> If it is, indeed, okay to use non-fixed maps on 32-bit, it might
>>> also be okay on 64-bit. If so, it could be useful to implement
>that,
>>> which would remove a bit of a wart and allow PR_SET_TSC to work
>>> usefully for 64-bit userspace. (This would remove the need for the
>>> VVAR macro and would allow shorter rip-relative address modes.)
>>>
>>
>> We can't really move the 64-bit legacy vsyscall area, though, as it
>is
>> an ABI. It can be disabled with vsyscall=none, but Linus has
>vehemently
>> vetoed removing them.
>
>VVAR != vsyscall. They've been different pages since I wrote the
>vsyscall emulation stuff. Any userspace code that relies on any of
>the contents of the VVAR page is totally screwed already, since the
>layout changes semi-regularly and depends on whether lockdep is
>enabled.
>
>>
>>> (Note that those fixmaps are a security problem on native 32-bit if
>>> NX is not available. We may not care.)
>>
>> Not only on native 32 bit... although the amount of 64-bit hardware
>> without NX is quite small, the same is true for anywhere near modern
>> 32-bit hardware.
>
><irrelevant>It can't be a problem for 32-bit compat mode, though,
>since userspace can't address the fixmap anyway.</irrelevant>
>
>>
>>>>
>>>> -#define VDSO_HIGH_BASE 0xffffe000U /* CONFIG_COMPAT_VDSO
>address */
>>>> +#define VDSO_HIGH_BASE 0xffffc000U /* CONFIG_COMPAT_VDSO
>address */
>>>
>>> This is odd. Can you explain it?
>>>
>>
>> He needs 3 pages instead of 1 after his changes.
>
>Right. But there's some obscure ABI reason for CONFIG_COMPAT_VDSO,
>and if this breaks it, then it's no good. From extremely vague
>memory, there's some version of SuSE that breaks if the 32-bit vdso
>moves. I have no idea what the bug is, but moving a "compat" address
>seems suspect.
>
>--Andy
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
On 02/01/2014 04:41 PM, H. Peter Anvin wrote:
>>
>> Right. But there's some obscure ABI reason for CONFIG_COMPAT_VDSO,
>> and if this breaks it, then it's no good. From extremely vague
>> memory, there's some version of SuSE that breaks if the 32-bit vdso
>> moves. I have no idea what the bug is, but moving a "compat" address
>> seems suspect.
>>
Sure enough:
> config COMPAT_VDSO
> def_bool y
> prompt "Compat VDSO support"
> depends on X86_32 || IA32_EMULATION
> ---help---
> Map the 32-bit VDSO to the predictable old-style address too.
>
> Say N here if you are running a sufficiently recent glibc
> version (2.3.3 or later), to remove the high-mapped
> VDSO mapping and to exclusively use the randomized VDSO.
>
> If unsure, say Y.
So we need this for 32-bit glibc < 2.3.3, and we effecively have the
same problem as on 64 bits. Next question is if those old glibcs rely
on the entry point alone or if they also expect the vdso header at that
address.
I looked at the glibc diffs from 2.3.2 to 2.3.3, but it isn't really
obvious to me what assumptions the 2.3.2 glibc made. Perhaps Roland has
any idea?
The safest thing for that might be to have the compat vdso be a
completely separate object from the real vdso, and let the former be an
object as similar to the current one as possible.
-hpa
> >>
> >> -#define VDSO_HIGH_BASE 0xffffe000U /* CONFIG_COMPAT_VDSO address */
> >> +#define VDSO_HIGH_BASE 0xffffc000U /* CONFIG_COMPAT_VDSO address */
> >
> > This is odd. Can you explain it?
> >
>
> He needs 3 pages instead of 1 after his changes.
>
Not every kernel hackers a male. By definition i am femal, so please
replace he and his with she and her.
- Stefani
Sorry, I did not know/realize, and unfortunately "he" is the grammatical default in English, although the singular they is fortunately making a comeback.
I stand corrected, through. Please acer my apologies.
-hpa
On February 2, 2014 3:20:25 AM PST, Stefani Seibold <[email protected]> wrote:
>
>> >>
>> >> -#define VDSO_HIGH_BASE 0xffffe000U /* CONFIG_COMPAT_VDSO
>address */
>> >> +#define VDSO_HIGH_BASE 0xffffc000U /* CONFIG_COMPAT_VDSO
>address */
>> >
>> > This is odd. Can you explain it?
>> >
>>
>> He needs 3 pages instead of 1 after his changes.
>>
>
>Not every kernel hackers a male. By definition i am femal, so please
>replace he and his with she and her.
>
>- Stefani
--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
> > config COMPAT_VDSO
> > def_bool y
> > prompt "Compat VDSO support"
> > depends on X86_32 || IA32_EMULATION
> > ---help---
> > Map the 32-bit VDSO to the predictable old-style address too.
> >
> > Say N here if you are running a sufficiently recent glibc
> > version (2.3.3 or later), to remove the high-mapped
> > VDSO mapping and to exclusively use the randomized VDSO.
> >
> > If unsure, say Y.
>
> So we need this for 32-bit glibc < 2.3.3, and we effecively have the
> same problem as on 64 bits. Next question is if those old glibcs rely
> on the entry point alone or if they also expect the vdso header at that
> address.
>
> I looked at the glibc diffs from 2.3.2 to 2.3.3, but it isn't really
> obvious to me what assumptions the 2.3.2 glibc made. Perhaps Roland has
> any idea?
Jakub often has more reliable memories of these things than I do.
>From looking at the old states of the code, AFAICT 2.3.3 was the first
version that actually looked at AT_SYSINFO_EHDR or cared about the vDSO per
se; 2.3.2 just uses AT_SYSINFO.
I have a vague recollection that there was a period wherein ld.so would
crash (trying to modify part of the read-only vDSO image in place) if the
vDSO was loaded somewhere other than its prelinked location. But I don't
see any evidence in the code that there was actually a release made of code
with that issue.
I'm fairly sure there are some relevant issues that I've forgotten and am
overlooking now.
> The safest thing for that might be to have the compat vdso be a
> completely separate object from the real vdso, and let the former be an
> object as similar to the current one as possible.
I'm not at all clear on what particular dangers that avoids.
Thanks,
Roland