2007-10-31 22:02:41

by Glauber Costa

Subject: [PATCH 0/7] (Re-)introducing pvops for x86_64 - Real pvops work part


Hey folks,

This is the part-of-pvops-implementation-that-is-not-exactly-a-merge. Neat,
huh? It is the majority of the work.

The first patch in the series does not really belong here. It was already
sent to lkml separately before, but I'm including it again, for a very
simple reason: try to test the paravirt patches without it, and you'll fail
miserably ;-) (and it has not been merged yet).

Other than that, I thank you all in advance for the review.

Have fun!


2007-10-31 22:02:26

by Glauber Costa

Subject: [PATCH 1/16] Wipe out traditional opt from x86_64 Makefile

Among other things, using -traditional as a gcc option stops us from
using macro token pasting, which is a feature we heavily rely on.

There was still a use of -traditional in arch/x86/kernel/Makefile_64,
which this patch removes.

I don't see any problems building kernels on my x86_64 box without
-traditional.
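
For reference, the construct -traditional breaks is ordinary cpp token
pasting, which the paravirt patch-site macros depend on. A minimal
illustration (not from this patch; DEF_STUB is a made-up name):

        #define DEF_STUB(name)  void native_##name(void)
        DEF_STUB(clts);         /* expands to: void native_clts(void); */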

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
---
arch/x86/kernel/Makefile_64 | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/Makefile_64 b/arch/x86/kernel/Makefile_64
index 24671c3..74d3727 100644
--- a/arch/x86/kernel/Makefile_64
+++ b/arch/x86/kernel/Makefile_64
@@ -3,7 +3,6 @@
#

extra-y := head_64.o head64.o init_task.o vmlinux.lds
-EXTRA_AFLAGS := -traditional
obj-y := process_64.o signal_64.o entry_64.o traps_64.o irq_64.o \
ptrace_64.o time_64.o ioport_64.o ldt_64.o setup_64.o i8259_64.o sys_x86_64.o \
x8664_ksyms_64.o i387_64.o syscall_64.o vsyscall_64.o \
--
1.4.4.2

2007-10-31 22:03:00

by Glauber Costa

Subject: [PATCH 5/16] report the ring the kernel is running at without paravirt

When paravirtualization is disabled, the kernel always runs at ring 0,
so report that through the appropriate macro.
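
The point of get_kernel_rpl() is to let privilege-level checks keep
working when a hypervisor runs the kernel above ring 0. A hypothetical
use, assuming a struct pt_regs *regs in scope (sketch only, not part of
this patch):

        /* sketch: decide whether a trap came from kernel mode */
        int kernel_mode = ((regs->cs & 3) == get_kernel_rpl());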

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/segment_64.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86/segment_64.h b/include/asm-x86/segment_64.h
index 04b8ab2..240c1bf 100644
--- a/include/asm-x86/segment_64.h
+++ b/include/asm-x86/segment_64.h
@@ -50,4 +50,8 @@
#define GDT_SIZE (GDT_ENTRIES * 8)
#define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES * 8)

+#ifndef CONFIG_PARAVIRT
+#define get_kernel_rpl() 0
+#endif
+
#endif
--
1.4.4.2

2007-10-31 22:03:25

by Glauber Costa

Subject: [PATCH 6/16] export math_state_restore

Export the math_state_restore symbol so it can be used by hypervisors,
which are commonly loaded as modules (lguest being an example).
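
As an illustration only (a sketch, not lguest's actual code), a
hypervisor module that intercepts the guest's device-not-available
exception could hand the FPU back along these lines:

        switch (trapnum) {
        case 7: /* device not available: give the FPU back to the task */
                math_state_restore();
                break;
        }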

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/traps_64.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/traps_64.c b/arch/x86/kernel/traps_64.c
index d0c2bc7..a533ecd 100644
--- a/arch/x86/kernel/traps_64.c
+++ b/arch/x86/kernel/traps_64.c
@@ -1077,6 +1077,7 @@ asmlinkage void math_state_restore(void)
task_thread_info(me)->status |= TS_USEDFPU;
me->fpu_counter++;
}
+EXPORT_SYMBOL_GPL(math_state_restore);

void __init trap_init(void)
{
--
1.4.4.2

2007-10-31 22:03:44

by Glauber Costa

Subject: [PATCH 2/16] paravirt hooks at entry functions.

These are the hooks needed for paravirt in entry_64.S.
In general, they follow the i386 approach.
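
For readers unfamiliar with the i386 scheme: the macros used below
expand to the plain instruction when CONFIG_PARAVIRT is off, and to a
patchable indirect call when it is on. Roughly (a sketch, not the
literal paravirt.h definitions; PARAVIRT_CALL stands in for the real
patch-site machinery):

        #ifdef CONFIG_PARAVIRT
        #define DISABLE_INTERRUPTS(clobbers)  PARAVIRT_CALL(pv_irq_ops.irq_disable, clobbers)
        #define SWAPGS                        PARAVIRT_CALL(pv_cpu_ops.swapgs, CLBR_NONE)
        #else
        #define DISABLE_INTERRUPTS(x)         cli
        #define SWAPGS                        swapgs
        #endif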

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/entry_64.S | 108 +++++++++++++++++++++++++++-----------------
1 files changed, 66 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3a058bb..b6d7008 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -50,6 +50,7 @@
#include <asm/hw_irq.h>
#include <asm/page.h>
#include <asm/irqflags.h>
+#include <asm/paravirt.h>

.code64

@@ -57,6 +58,20 @@
#define retint_kernel retint_restore_args
#endif

+#ifdef CONFIG_PARAVIRT
+ENTRY(native_irq_enable_syscall_ret)
+ movq %gs:pda_oldrsp,%rsp
+ swapgs
+ sysretq
+/*
+ * This could well be defined as a C function, but as it is only used here,
+ * let it be locally defined
+ */
+ENTRY(native_swapgs)
+ swapgs
+ retq
+#endif /* CONFIG_PARAVIRT */
+

.macro TRACE_IRQS_IRETQ offset=ARGOFFSET
#ifdef CONFIG_TRACE_IRQFLAGS
@@ -216,14 +231,21 @@ ENTRY(system_call)
CFI_DEF_CFA rsp,PDA_STACKOFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
- swapgs
+ SWAPGS_UNSAFE_STACK
+ /*
+ * A hypervisor implementation might want to use a label
+ * after the swapgs, so that it can do the swapgs
+ * for the guest and jump here on syscall.
+ */
+ENTRY(system_call_after_swapgs)
+
movq %rsp,%gs:pda_oldrsp
movq %gs:pda_kernelstack,%rsp
/*
* No need to follow this irqs off/on section - it's straight
* and short:
*/
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_ARGS 8,1
movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
movq %rcx,RIP-ARGOFFSET(%rsp)
@@ -246,7 +268,7 @@ ret_from_sys_call:
sysret_check:
LOCKDEP_SYS_EXIT
GET_THREAD_INFO(%rcx)
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
movl threadinfo_flags(%rcx),%edx
andl %edi,%edx
@@ -260,9 +282,7 @@ sysret_check:
CFI_REGISTER rip,rcx
RESTORE_ARGS 0,-ARG_SKIP,1
/*CFI_REGISTER rflags,r11*/
- movq %gs:pda_oldrsp,%rsp
- swapgs
- sysretq
+ ENABLE_INTERRUPTS_SYSCALL_RET

CFI_RESTORE_STATE
/* Handle reschedules */
@@ -271,7 +291,7 @@ sysret_careful:
bt $TIF_NEED_RESCHED,%edx
jnc sysret_signal
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
call schedule
@@ -282,7 +302,7 @@ sysret_careful:
/* Handle a signal */
sysret_signal:
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
testl $(_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz 1f

@@ -295,7 +315,7 @@ sysret_signal:
1: movl $_TIF_NEED_RESCHED,%edi
/* Use IRET because user could have changed frame. This
works because ptregscall_common has called FIXUP_TOP_OF_STACK. */
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check

@@ -327,7 +347,7 @@ tracesys:
*/
.globl int_ret_from_sys_call
int_ret_from_sys_call:
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl $3,CS-ARGOFFSET(%rsp)
je retint_restore_args
@@ -349,20 +369,20 @@ int_careful:
bt $TIF_NEED_RESCHED,%edx
jnc int_very_careful
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
call schedule
popq %rdi
CFI_ADJUST_CFA_OFFSET -8
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check

/* handle signals and tracing -- both require a full stack frame */
int_very_careful:
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_REST
/* Check for syscall exit trace */
testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP),%edx
@@ -385,7 +405,7 @@ int_signal:
1: movl $_TIF_NEED_RESCHED,%edi
int_restore_rest:
RESTORE_REST
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check
CFI_ENDPROC
@@ -506,7 +526,7 @@ END(stub_rt_sigreturn)
CFI_DEF_CFA_REGISTER rbp
testl $3,CS(%rdi)
je 1f
- swapgs
+ SWAPGS
/* irqcount is used to check if a CPU is already on an interrupt
stack or not. While this is essentially redundant with preempt_count
it is a little cheaper to use a separate counter in the PDA
@@ -527,7 +547,7 @@ ENTRY(common_interrupt)
interrupt do_IRQ
/* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
decl %gs:pda_irqcount
leaveq
@@ -556,13 +576,13 @@ retint_swapgs: /* return to user-space */
/*
* The iretq could re-enable interrupts:
*/
- cli
+ DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_IRETQ
- swapgs
+ SWAPGS
jmp restore_args

retint_restore_args: /* return to kernel space */
- cli
+ DISABLE_INTERRUPTS(CLBR_ANY)
/*
* The iretq could re-enable interrupts:
*/
@@ -570,10 +590,14 @@ retint_restore_args: /* return to kernel space */
restore_args:
RESTORE_ARGS 0,8,0
iret_label:
+#ifdef CONFIG_PARAVIRT
+ INTERRUPT_RETURN
+#endif
+ENTRY(native_iret)
iretq

.section __ex_table,"a"
- .quad iret_label,bad_iret
+ .quad native_iret, bad_iret
.previous
.section .fixup,"ax"
/* force a signal here? this matches i386 behaviour */
@@ -581,24 +605,24 @@ iret_label:
bad_iret:
movq $11,%rdi /* SIGSEGV */
TRACE_IRQS_ON
- sti
- jmp do_exit
- .previous
-
+ ENABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
+ jmp do_exit
+ .previous
+
/* edi: workmask, edx: work */
retint_careful:
CFI_RESTORE_STATE
bt $TIF_NEED_RESCHED,%edx
jnc retint_signal
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
call schedule
popq %rdi
CFI_ADJUST_CFA_OFFSET -8
GET_THREAD_INFO(%rcx)
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp retint_check

@@ -606,14 +630,14 @@ retint_signal:
testl $(_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz retint_swapgs
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_REST
movq $-1,ORIG_RAX(%rsp)
xorl %esi,%esi # oldset
movq %rsp,%rdi # &pt_regs
call do_notify_resume
RESTORE_REST
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
movl $_TIF_NEED_RESCHED,%edi
GET_THREAD_INFO(%rcx)
@@ -731,7 +755,7 @@ END(spurious_interrupt)
rdmsr
testl %edx,%edx
js 1f
- swapgs
+ SWAPGS
xorl %ebx,%ebx
1:
.if \ist
@@ -747,7 +771,7 @@ END(spurious_interrupt)
.if \ist
addq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
.endif
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
.if \irqtrace
TRACE_IRQS_OFF
.endif
@@ -776,10 +800,10 @@ paranoid_swapgs\trace:
.if \trace
TRACE_IRQS_IRETQ 0
.endif
- swapgs
+ SWAPGS_UNSAFE_STACK
paranoid_restore\trace:
RESTORE_ALL 8
- iretq
+ INTERRUPT_RETURN
paranoid_userspace\trace:
GET_THREAD_INFO(%rcx)
movl threadinfo_flags(%rcx),%ebx
@@ -794,11 +818,11 @@ paranoid_userspace\trace:
.if \trace
TRACE_IRQS_ON
.endif
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
xorl %esi,%esi /* arg2: oldset */
movq %rsp,%rdi /* arg1: &pt_regs */
call do_notify_resume
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
.if \trace
TRACE_IRQS_OFF
.endif
@@ -807,9 +831,9 @@ paranoid_schedule\trace:
.if \trace
TRACE_IRQS_ON
.endif
- sti
+ ENABLE_INTERRUPTS(CLBR_ANY)
call schedule
- cli
+ DISABLE_INTERRUPTS(CLBR_ANY)
.if \trace
TRACE_IRQS_OFF
.endif
@@ -862,7 +886,7 @@ KPROBE_ENTRY(error_entry)
testl $3,CS(%rsp)
je error_kernelspace
error_swapgs:
- swapgs
+ SWAPGS
error_sti:
movq %rdi,RDI(%rsp)
CFI_REL_OFFSET rdi,RDI
@@ -874,7 +898,7 @@ error_sti:
error_exit:
movl %ebx,%eax
RESTORE_REST
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
GET_THREAD_INFO(%rcx)
testl %eax,%eax
@@ -911,12 +935,12 @@ ENTRY(load_gs_index)
CFI_STARTPROC
pushf
CFI_ADJUST_CFA_OFFSET 8
- cli
- swapgs
+ DISABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
+ SWAPGS
gs_change:
movl %edi,%gs
2: mfence /* workaround */
- swapgs
+ SWAPGS
popf
CFI_ADJUST_CFA_OFFSET -8
ret
@@ -930,7 +954,7 @@ ENDPROC(load_gs_index)
.section .fixup,"ax"
/* running with kernelgs */
bad_gs:
- swapgs /* switch back to user gs */
+ SWAPGS /* switch back to user gs */
xorl %eax,%eax
movl %eax,%gs
jmp 2b
--
1.4.4.2

2007-10-31 22:04:02

by Glauber Costa

Subject: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

This patch introduces native versions of the read/write_crX functions,
clts and wbinvd, and patches callers where needed.
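
With CONFIG_PARAVIRT enabled, the same names resolve through the pv ops
structures instead of the native_ inlines; conceptually it is (sketch
only, the real wrappers in paravirt.h also handle patching and clobber
annotations):

        static inline unsigned long read_cr2(void)
        {
                return pv_mmu_ops.read_cr2();
        }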

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/pageattr_64.c | 3 +-
include/asm-x86/system_64.h | 60 ++++++++++++++++++++++++++++++------------
2 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pageattr_64.c b/arch/x86/mm/pageattr_64.c
index c40afba..59a52b0 100644
--- a/arch/x86/mm/pageattr_64.c
+++ b/arch/x86/mm/pageattr_64.c
@@ -12,6 +12,7 @@
#include <asm/processor.h>
#include <asm/tlbflush.h>
#include <asm/io.h>
+#include <asm/paravirt.h>

pte_t *lookup_address(unsigned long address)
{
@@ -77,7 +78,7 @@ static void flush_kernel_map(void *arg)
much cheaper than WBINVD. */
/* clflush is still broken. Disable for now. */
if (1 || !cpu_has_clflush)
- asm volatile("wbinvd" ::: "memory");
+ wbinvd();
else list_for_each_entry(pg, l, lru) {
void *adr = page_address(pg);
clflush_cache_range(adr, PAGE_SIZE);
diff --git a/include/asm-x86/system_64.h b/include/asm-x86/system_64.h
index 4cb2384..b558cb2 100644
--- a/include/asm-x86/system_64.h
+++ b/include/asm-x86/system_64.h
@@ -65,53 +65,62 @@ extern void load_gs_index(unsigned);
/*
* Clear and set 'TS' bit respectively
*/
-#define clts() __asm__ __volatile__ ("clts")
+static inline void native_clts(void)
+{
+ asm volatile ("clts");
+}

-static inline unsigned long read_cr0(void)
-{
+static inline unsigned long native_read_cr0(void)
+{
unsigned long cr0;
asm volatile("movq %%cr0,%0" : "=r" (cr0));
return cr0;
}

-static inline void write_cr0(unsigned long val)
-{
+static inline void native_write_cr0(unsigned long val)
+{
asm volatile("movq %0,%%cr0" :: "r" (val));
}

-static inline unsigned long read_cr2(void)
+static inline unsigned long native_read_cr2(void)
{
unsigned long cr2;
asm volatile("movq %%cr2,%0" : "=r" (cr2));
return cr2;
}

-static inline void write_cr2(unsigned long val)
+static inline void native_write_cr2(unsigned long val)
{
asm volatile("movq %0,%%cr2" :: "r" (val));
}

-static inline unsigned long read_cr3(void)
-{
+static inline unsigned long native_read_cr3(void)
+{
unsigned long cr3;
asm volatile("movq %%cr3,%0" : "=r" (cr3));
return cr3;
}

-static inline void write_cr3(unsigned long val)
+static inline void native_write_cr3(unsigned long val)
{
asm volatile("movq %0,%%cr3" :: "r" (val) : "memory");
}

-static inline unsigned long read_cr4(void)
-{
+static inline unsigned long native_read_cr4(void)
+{
unsigned long cr4;
asm volatile("movq %%cr4,%0" : "=r" (cr4));
return cr4;
}

-static inline void write_cr4(unsigned long val)
-{
+static inline unsigned long native_read_cr4_safe(void)
+{
+ /* CR4 always exist */
+ return native_read_cr4();
+}
+
+static inline void native_write_cr4(unsigned long val)
+{
asm volatile("movq %0,%%cr4" :: "r" (val) : "memory");
}

@@ -127,10 +136,27 @@ static inline void write_cr8(unsigned long val)
asm volatile("movq %0,%%cr8" :: "r" (val) : "memory");
}

-#define stts() write_cr0(8 | read_cr0())
+static inline void native_wbinvd(void)
+{
+ asm volatile("wbinvd" ::: "memory");
+}

-#define wbinvd() \
- __asm__ __volatile__ ("wbinvd": : :"memory")
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define clts native_clts
+#define wbinvd native_wbinvd
+#define read_cr0 native_read_cr0
+#define read_cr2 native_read_cr2
+#define read_cr3 native_read_cr3
+#define read_cr4 native_read_cr4
+#define write_cr0 native_write_cr0
+#define write_cr2 native_write_cr2
+#define write_cr3 native_write_cr3
+#define write_cr4 native_write_cr4
+#endif
+
+#define stts() write_cr0(8 | read_cr0())

#endif /* __KERNEL__ */

--
1.4.4.2

2007-10-31 22:04:48

by Glauber Costa

Subject: [PATCH 11/16] turn privileged operation into a macro in head_64.S

Under paravirt, reading cr2 can no longer be done directly.
So wrap it in a macro, defined as the plain instruction when paravirt
is off, and as something else when we have paravirt in the game.
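
The paravirt flavour of the macro would then go through
pv_mmu_ops.read_cr2, using the PV_MMU_read_cr2 asm offset added later
in this series; roughly (sketch only):

        #define GET_CR2_INTO_RCX                        \
                call *pv_mmu_ops+PV_MMU_read_cr2;       \
                movq %rax, %rcx;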

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/head_64.S | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b6167fe..c31b1c9 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -19,6 +19,13 @@
#include <asm/msr.h>
#include <asm/cache.h>

+#ifdef CONFIG_PARAVIRT
+#include <asm/asm-offsets.h>
+#include <asm/paravirt.h>
+#else
+#define GET_CR2_INTO_RCX movq %cr2, %rcx
+#endif
+
/* we are not able to switch in one step to the final KERNEL ADRESS SPACE
* because we need identity-mapped pages.
*
@@ -267,7 +274,7 @@ ENTRY(early_idt_handler)
xorl %eax,%eax
movq 8(%rsp),%rsi # get rip
movq (%rsp),%rdx
- movq %cr2,%rcx
+ GET_CR2_INTO_RCX
leaq early_idt_msg(%rip),%rdi
call early_printk
cmpl $2,early_recursion_flag(%rip)
--
1.4.4.2

2007-10-31 22:06:46

by Glauber Costa

Subject: [PATCH 14/16] prepare x86_64 architecture initialization for paravirt

This patch prepares the x86_64 architecture initialization for
paravirt. It requires a memory initialization step, which is done by
implementing a 64-bit version of machine_specific_memory_setup and
adding an ARCH_SETUP hook for guest-dependent initialization. This
last step is done akin to i386.
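
A guest would then hook the memory setup by pointing
pv_init_ops.memory_setup at its own routine; a sketch (hv_mem_size and
the my_hv_* names are hypothetical):

        static char * __init my_hv_memory_setup(void)
        {
                /* build the map from hypervisor data instead of the BIOS e820 */
                add_memory_region(0, hv_mem_size, E820_RAM);
                return "HV-provided";
        }

        /* during early guest setup: */
        pv_init_ops.memory_setup = my_hv_memory_setup;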

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/e820_64.c | 9 +++++++--
arch/x86/kernel/setup_64.c | 28 +++++++++++++++++++++++++++-
include/asm-x86/setup.h | 11 ++++++++---
3 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index 04698e0..c67526c 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -593,8 +593,10 @@ void early_panic(char *msg)
panic(msg);
}

-void __init setup_memory_region(void)
+/* We're not void only for x86 32-bit compat */
+char * __init machine_specific_memory_setup(void)
{
+ char *who = "BIOS-e820";
/*
* Try to copy the BIOS-supplied E820-map.
*
@@ -605,7 +607,10 @@ void __init setup_memory_region(void)
if (copy_e820_map(boot_params.e820_map, boot_params.e820_entries) < 0)
early_panic("Cannot find a valid memory map");
printk(KERN_INFO "BIOS-provided physical RAM map:\n");
- e820_print_map("BIOS-e820");
+ e820_print_map(who);
+
+ /* In case someone cares... */
+ return who;
}

static int __init parse_memopt(char *p)
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 238633d..44a11e3 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -39,6 +39,7 @@
#include <linux/dmi.h>
#include <linux/dma-mapping.h>
#include <linux/ctype.h>
+#include <linux/uaccess.h>

#include <asm/mtrr.h>
#include <asm/uaccess.h>
@@ -60,6 +61,12 @@
#include <asm/dmi.h>
#include <asm/cacheflush.h>

+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define ARCH_SETUP
+#endif
+
/*
* Machine setup..
*/
@@ -241,6 +248,16 @@ static void discover_ebda(void)
* 4K EBDA area at 0x40E
*/
ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
+ /*
+ * There can be some situations, like paravirtualized guests,
+ * in which there is no available ebda information. In such
+ * case, just skip it
+ */
+ if (!ebda_addr) {
+ ebda_size = 0;
+ return;
+ }
+
ebda_addr <<= 4;

ebda_size = *(unsigned short *)__va(ebda_addr);
@@ -254,6 +271,12 @@ static void discover_ebda(void)
ebda_size = 64*1024;
}

+/* Overridden in paravirt.c if CONFIG_PARAVIRT */
+void __attribute__((weak)) memory_setup(void)
+{
+ machine_specific_memory_setup();
+}
+
void __init setup_arch(char **cmdline_p)
{
printk(KERN_INFO "Command line: %s\n", boot_command_line);
@@ -269,7 +292,10 @@ void __init setup_arch(char **cmdline_p)
rd_prompt = ((boot_params.hdr.ram_size & RAMDISK_PROMPT_FLAG) != 0);
rd_doload = ((boot_params.hdr.ram_size & RAMDISK_LOAD_FLAG) != 0);
#endif
- setup_memory_region();
+
+ ARCH_SETUP
+
+ memory_setup();
copy_edd();

if (!boot_params.hdr.root_flags)
diff --git a/include/asm-x86/setup.h b/include/asm-x86/setup.h
index 24d786e..071e054 100644
--- a/include/asm-x86/setup.h
+++ b/include/asm-x86/setup.h
@@ -3,6 +3,13 @@

#define COMMAND_LINE_SIZE 2048

+#ifndef __ASSEMBLY__
+char *machine_specific_memory_setup(void);
+#ifndef CONFIG_PARAVIRT
+#define paravirt_post_allocator_init() do {} while (0)
+#endif
+#endif /* __ASSEMBLY__ */
+
#ifdef __KERNEL__

#ifdef __i386__
@@ -51,9 +58,7 @@ void __init add_memory_region(unsigned long long start,

extern unsigned long init_pg_tables_end;

-#ifndef CONFIG_PARAVIRT
-#define paravirt_post_allocator_init() do {} while (0)
-#endif
+

#endif /* __i386__ */
#endif /* _SETUP */
--
1.4.4.2

2007-10-31 22:07:10

by Glauber Costa

Subject: [PATCH 13/16] native versions for page table entries values

This patch turns the page table entry operations (value extraction and
entry construction) into native_ versions. The operations themselves
will later be overridden by paravirt.

It uses unsigned long long for consistency with 32-bit, so we have to
fix fault_64.c to get rid of warnings.
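
When CONFIG_PARAVIRT is enabled, the macros below are instead routed
through pv_mmu_ops (see the consolidation patch later in this series);
conceptually (sketch, ignoring the patching machinery):

        #define pte_val(x)      pv_mmu_ops.pte_val(x)
        #define __pte(x)        pv_mmu_ops.make_pte(x)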

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/fault_64.c | 8 +++---
include/asm-x86/page_64.h | 56 +++++++++++++++++++++++++++++++++++++++++----
2 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault_64.c b/arch/x86/mm/fault_64.c
index 644b4f7..725c574 100644
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -158,22 +158,22 @@ void dump_pagetable(unsigned long address)
pgd = __va((unsigned long)pgd & PHYSICAL_PAGE_MASK);
pgd += pgd_index(address);
if (bad_address(pgd)) goto bad;
- printk("PGD %lx ", pgd_val(*pgd));
+ printk("PGD %llx ", pgd_val(*pgd));
if (!pgd_present(*pgd)) goto ret;

pud = pud_offset(pgd, address);
if (bad_address(pud)) goto bad;
- printk("PUD %lx ", pud_val(*pud));
+ printk("PUD %llx ", pud_val(*pud));
if (!pud_present(*pud)) goto ret;

pmd = pmd_offset(pud, address);
if (bad_address(pmd)) goto bad;
- printk("PMD %lx ", pmd_val(*pmd));
+ printk("PMD %llx ", pmd_val(*pmd));
if (!pmd_present(*pmd) || pmd_large(*pmd)) goto ret;

pte = pte_offset_kernel(pmd, address);
if (bad_address(pte)) goto bad;
- printk("PTE %lx", pte_val(*pte));
+ printk("PTE %llx", pte_val(*pte));
ret:
printk("\n");
return;
diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
index c3b52bc..71d5c27 100644
--- a/include/asm-x86/page_64.h
+++ b/include/asm-x86/page_64.h
@@ -64,16 +64,62 @@ typedef struct { unsigned long pgprot; } pgprot_t;

extern unsigned long phys_base;

-#define pte_val(x) ((x).pte)
-#define pmd_val(x) ((x).pmd)
-#define pud_val(x) ((x).pud)
-#define pgd_val(x) ((x).pgd)
-#define pgprot_val(x) ((x).pgprot)
+static inline unsigned long long native_pte_val(pte_t pte)
+{
+ return pte.pte;
+}
+
+static inline unsigned long long native_pud_val(pud_t pud)
+{
+ return pud.pud;
+}
+
+
+static inline unsigned long long native_pmd_val(pmd_t pmd)
+{
+ return pmd.pmd;
+}
+
+static inline unsigned long long native_pgd_val(pgd_t pgd)
+{
+ return pgd.pgd;
+}
+
+static inline pte_t native_make_pte(unsigned long long pte)
+{
+ return (pte_t){ pte };
+}
+
+static inline pud_t native_make_pud(unsigned long long pud)
+{
+ return (pud_t){ pud };
+}
+
+static inline pmd_t native_make_pmd(unsigned long long pmd)
+{
+ return (pmd_t){ pmd };
+}
+
+static inline pgd_t native_make_pgd(unsigned long long pgd)
+{
+ return (pgd_t){ pgd };
+}
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define pte_val(x) native_pte_val(x)
+#define pmd_val(x) native_pmd_val(x)
+#define pud_val(x) native_pud_val(x)
+#define pgd_val(x) native_pgd_val(x)

#define __pte(x) ((pte_t) { (x) } )
#define __pmd(x) ((pmd_t) { (x) } )
#define __pud(x) ((pud_t) { (x) } )
#define __pgd(x) ((pgd_t) { (x) } )
+#endif /* CONFIG_PARAVIRT */
+
+#define pgprot_val(x) ((x).pgprot)
#define __pgprot(x) ((pgprot_t) { (x) } )

#endif /* !__ASSEMBLY__ */
--
1.4.4.2

2007-10-31 22:07:39

by Glauber Costa

Subject: [PATCH 15/16] consolidation of paravirt for 32 and 64 bits

This patch unifies the paravirt ops structures for use on both the
x86_64 and i386 architectures. Some new fields had to be created to
accommodate the differences between the architectures.
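
For a sense of how the unified structures get used, a hypervisor guest
would overwrite selected entries at early boot; a sketch (the my_hv_*
names are hypothetical):

        static void __init my_hv_init(void)
        {
                pv_info.name = "myhv";
                pv_info.paravirt_enabled = 1;

                pv_irq_ops.irq_disable = my_hv_irq_disable;
                pv_irq_ops.irq_enable  = my_hv_irq_enable;
                pv_mmu_ops.set_pte     = my_hv_set_pte;
        }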

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/Kconfig.x86_64 | 11 +
arch/x86/kernel/Makefile | 3 +
arch/x86/kernel/Makefile_32 | 1 -
arch/x86/kernel/asm-offsets.c | 8 +
arch/x86/kernel/asm-offsets_32.c | 8 -
arch/x86/kernel/asm-offsets_64.c | 22 +-
arch/x86/kernel/paravirt.c | 445 +++++++++++++++++++++++++++++++++
arch/x86/kernel/paravirt_32.c | 472 -----------------------------------
arch/x86/kernel/paravirt_patch_32.c | 52 ++++
arch/x86/kernel/paravirt_patch_64.c | 56 ++++
arch/x86/kernel/vmlinux_64.lds.S | 6 +
include/asm-x86/paravirt.h | 458 ++++++++++++++++++++++++++++------
12 files changed, 971 insertions(+), 571 deletions(-)

diff --git a/arch/x86/Kconfig.x86_64 b/arch/x86/Kconfig.x86_64
index cc468ea..04734dd 100644
--- a/arch/x86/Kconfig.x86_64
+++ b/arch/x86/Kconfig.x86_64
@@ -372,6 +372,17 @@ config NODES_SHIFT

# Dummy CONFIG option to select ACPI_NUMA from drivers/acpi/Kconfig.

+config PARAVIRT
+ bool
+ depends on EXPERIMENTAL
+ help
+ Paravirtualization is a way of running multiple instances of
+ Linux on the same machine, under a hypervisor. This option
+ changes the kernel so it can modify itself when it is run
+ under a hypervisor, improving performance significantly.
+ However, when run without a hypervisor the kernel is
+ theoretically slower. If in doubt, say N.
+
config X86_64_ACPI_NUMA
bool "ACPI NUMA detection"
depends on NUMA
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3857334..f444d0e 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -1,8 +1,11 @@
ifeq ($(CONFIG_X86_32),y)
include ${srctree}/arch/x86/kernel/Makefile_32
+obj-$(CONFIG_PARAVIRT) += paravirt_patch_32.o
else
include ${srctree}/arch/x86/kernel/Makefile_64
+obj-$(CONFIG_PARAVIRT) += paravirt_patch_64.o
endif
+obj-$(CONFIG_PARAVIRT) += paravirt.o

# Workaround to delete .lds files with make clean
# The problem is that we do not enter Makefile_32 with make clean.
diff --git a/arch/x86/kernel/Makefile_32 b/arch/x86/kernel/Makefile_32
index b9d6798..f19e0d4 100644
--- a/arch/x86/kernel/Makefile_32
+++ b/arch/x86/kernel/Makefile_32
@@ -43,7 +43,6 @@ obj-$(CONFIG_K8_NB) += k8.o
obj-$(CONFIG_MGEODE_LX) += geode_32.o mfgpt_32.o

obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o
-obj-$(CONFIG_PARAVIRT) += paravirt_32.o
obj-y += pcspeaker.o

obj-$(CONFIG_SCx200) += scx200_32.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index cfa82c8..25530d5 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -1,3 +1,11 @@
+#define DEFINE(sym, val) \
+ asm volatile("\n->" #sym " %0 " #val : : "i" (val))
+
+#define BLANK() asm volatile("\n->" : : )
+
+#define OFFSET(sym, str, mem) \
+ DEFINE(sym, offsetof(struct str, mem));
+
#ifdef CONFIG_X86_32
# include "asm-offsets_32.c"
#else
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index c1ccfab..f320b2d 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -25,14 +25,6 @@
#include "../../../drivers/lguest/lg.h"
#endif

-#define DEFINE(sym, val) \
- asm volatile("\n->" #sym " %0 " #val : : "i" (val))
-
-#define BLANK() asm volatile("\n->" : : )
-
-#define OFFSET(sym, str, mem) \
- DEFINE(sym, offsetof(struct str, mem));
-
/* workaround for a warning with -Wmissing-prototypes */
void foo(void);

diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index d1b6ed9..3ef77dd 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -16,14 +16,7 @@
#include <asm/thread_info.h>
#include <asm/ia32.h>
#include <asm/bootparam.h>
-
-#define DEFINE(sym, val) \
- asm volatile("\n->" #sym " %0 " #val : : "i" (val))
-
-#define BLANK() asm volatile("\n->" : : )
-
-#define OFFSET(sym, str, mem) \
- DEFINE(sym, offsetof(struct str, mem))
+#include <asm/paravirt.h>

#define __NO_STUBS 1
#undef __SYSCALL
@@ -76,6 +69,19 @@ int main(void)
offsetof (struct rt_sigframe32, uc.uc_mcontext));
BLANK();
#endif
+#ifdef CONFIG_PARAVIRT
+ OFFSET(PARAVIRT_enabled, pv_info, paravirt_enabled);
+ OFFSET(PARAVIRT_PATCH_pv_cpu_ops, paravirt_patch_template, pv_cpu_ops);
+ OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops);
+ OFFSET(PARAVIRT_PATCH_pv_mmu_ops, paravirt_patch_template, pv_mmu_ops);
+ OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
+ OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
+ OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+ OFFSET(PV_CPU_irq_enable_syscall_ret, pv_cpu_ops, irq_enable_syscall_ret);
+ OFFSET(PV_MMU_read_cr2, pv_mmu_ops, read_cr2);
+ OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
+ BLANK();
+#endif
DEFINE(pbe_address, offsetof(struct pbe, address));
DEFINE(pbe_orig_address, offsetof(struct pbe, orig_address));
DEFINE(pbe_next, offsetof(struct pbe, next));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
new file mode 100644
index 0000000..c4c1e51
--- /dev/null
+++ b/arch/x86/kernel/paravirt.c
@@ -0,0 +1,445 @@
+/* Paravirtualization interfaces
+ Copyright (C) 2006 Rusty Russell IBM Corporation
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+
+ 2007 - x86_64 support added by Glauber de Oliveira Costa
+*/
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/efi.h>
+#include <linux/bcd.h>
+#include <linux/highmem.h>
+#include <linux/pci_regs.h>
+#include <linux/pci_ids.h>
+
+#include <asm/bug.h>
+#include <asm/paravirt.h>
+#include <asm/desc.h>
+#include <asm/setup.h>
+#include <asm/arch_hooks.h>
+#include <asm/time.h>
+#include <asm/irq.h>
+#include <asm/delay.h>
+#include <asm/fixmap.h>
+#include <asm/apic.h>
+#include <asm/tlbflush.h>
+#include <asm/timer.h>
+#include <asm/io.h>
+#include <asm/pci-direct.h>
+
+
+/* nop stub */
+void _paravirt_nop(void)
+{
+}
+
+static void __init default_banner(void)
+{
+ printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
+ pv_info.name);
+}
+
+char *memory_setup(void)
+{
+ return pv_init_ops.memory_setup();
+}
+
+unsigned paravirt_patch_nop(void)
+{
+ return 0;
+}
+
+unsigned paravirt_patch_ignore(unsigned len)
+{
+ return len;
+}
+
+struct branch {
+ unsigned char opcode;
+ u32 delta;
+} __attribute__((packed));
+
+unsigned paravirt_patch_call(void *insnbuf,
+ const void *target, u16 tgt_clobbers,
+ unsigned long addr, u16 site_clobbers,
+ unsigned len)
+{
+ struct branch *b = insnbuf;
+ unsigned long delta = (unsigned long)target - (addr+5);
+
+ if (tgt_clobbers & ~site_clobbers)
+ return len; /* target would clobber too much for this site */
+ if (len < 5)
+ return len; /* call too long for patch site */
+
+ b->opcode = 0xe8; /* call */
+ b->delta = delta;
+ BUILD_BUG_ON(sizeof(*b) != 5);
+
+ return 5;
+}
+
+unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
+ unsigned long addr, unsigned len)
+{
+ struct branch *b = insnbuf;
+ unsigned long delta = (unsigned long)target - (addr+5);
+
+ if (len < 5)
+ return len; /* call too long for patch site */
+
+ b->opcode = 0xe9; /* jmp */
+ b->delta = delta;
+
+ return 5;
+}
+
+/* Undefined instruction for dealing with missing ops pointers. */
+static const unsigned char ud2a[] = { 0x0f, 0x0b };
+
+/* Neat trick to map patch type back to the call within the
+ * corresponding structure. */
+static void *get_call_destination(u8 type)
+{
+ struct paravirt_patch_template tmpl = {
+ .pv_init_ops = pv_init_ops,
+ .pv_time_ops = pv_time_ops,
+ .pv_cpu_ops = pv_cpu_ops,
+ .pv_irq_ops = pv_irq_ops,
+ .pv_apic_ops = pv_apic_ops,
+ .pv_mmu_ops = pv_mmu_ops,
+ };
+ return *((void **)&tmpl + type);
+}
+
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
+ unsigned long addr, unsigned len)
+{
+ void *opfunc = get_call_destination(type);
+ unsigned ret;
+
+ if (opfunc == NULL)
+ /* If there's no function, patch it with a ud2a (BUG) */
+ ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a));
+ else if (opfunc == paravirt_nop)
+ /* If the operation is a nop, then nop the callsite */
+ ret = paravirt_patch_nop();
+ else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
+ type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret))
+ /* If operation requires a jmp, then jmp */
+ ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
+ else
+ /* Otherwise call the function; assume target could
+ clobber any caller-save reg */
+ ret = paravirt_patch_call(insnbuf, opfunc, CLBR_ANY,
+ addr, clobbers, len);
+
+ return ret;
+}
+
+unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
+ const char *start, const char *end)
+{
+ unsigned insn_len = end - start;
+
+ if (insn_len > len || start == NULL)
+ insn_len = len;
+ else
+ memcpy(insnbuf, start, insn_len);
+
+ return insn_len;
+}
+
+void init_IRQ(void)
+{
+ pv_irq_ops.init_IRQ();
+}
+
+static void native_flush_tlb(void)
+{
+ __native_flush_tlb();
+}
+
+/*
+ * Global pages have to be flushed a bit differently. Not a real
+ * performance problem because this does not happen often.
+ */
+static void native_flush_tlb_global(void)
+{
+ __native_flush_tlb_global();
+}
+
+static void native_flush_tlb_single(unsigned long addr)
+{
+ __native_flush_tlb_single(addr);
+}
+
+/* These are in entry.S */
+extern void native_iret(void);
+extern void native_irq_enable_syscall_ret(void);
+
+static int __init print_banner(void)
+{
+ pv_init_ops.banner();
+ return 0;
+}
+core_initcall(print_banner);
+
+static struct resource reserve_ioports = {
+ .start = 0,
+ .end = IO_SPACE_LIMIT,
+ .name = "paravirt-ioport",
+ .flags = IORESOURCE_IO | IORESOURCE_BUSY,
+};
+
+static struct resource reserve_iomem = {
+ .start = 0,
+ .end = -1,
+ .name = "paravirt-iomem",
+ .flags = IORESOURCE_MEM | IORESOURCE_BUSY,
+};
+
+/*
+ * Reserve the whole legacy IO space to prevent any legacy drivers
+ * from wasting time probing for their hardware. This is a fairly
+ * brute-force approach to disabling all non-virtual drivers.
+ *
+ * Note that this must be called very early to have any effect.
+ */
+int paravirt_disable_iospace(void)
+{
+ int ret;
+
+ ret = request_resource(&ioport_resource, &reserve_ioports);
+ if (ret == 0) {
+ ret = request_resource(&iomem_resource, &reserve_iomem);
+ if (ret)
+ release_resource(&reserve_ioports);
+ }
+
+ return ret;
+}
+
+static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
+
+static inline void enter_lazy(enum paravirt_lazy_mode mode)
+{
+ BUG_ON(__get_cpu_var(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
+ BUG_ON(preemptible());
+
+ __get_cpu_var(paravirt_lazy_mode) = mode;
+}
+
+void paravirt_leave_lazy(enum paravirt_lazy_mode mode)
+{
+ BUG_ON(__get_cpu_var(paravirt_lazy_mode) != mode);
+ BUG_ON(preemptible());
+
+ __get_cpu_var(paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
+}
+
+void paravirt_enter_lazy_mmu(void)
+{
+ enter_lazy(PARAVIRT_LAZY_MMU);
+}
+
+void paravirt_leave_lazy_mmu(void)
+{
+ paravirt_leave_lazy(PARAVIRT_LAZY_MMU);
+}
+
+void paravirt_enter_lazy_cpu(void)
+{
+ enter_lazy(PARAVIRT_LAZY_CPU);
+}
+
+void paravirt_leave_lazy_cpu(void)
+{
+ paravirt_leave_lazy(PARAVIRT_LAZY_CPU);
+}
+
+enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
+{
+ return __get_cpu_var(paravirt_lazy_mode);
+}
+
+struct pv_info pv_info = {
+ .name = "bare hardware",
+ .paravirt_enabled = 0,
+ .kernel_rpl = 0,
+ .shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */
+};
+
+struct pv_init_ops pv_init_ops = {
+ .patch = native_patch,
+ .banner = default_banner,
+ .arch_setup = paravirt_nop,
+ .memory_setup = machine_specific_memory_setup,
+};
+
+struct pv_time_ops pv_time_ops = {
+ .time_init = hpet_time_init,
+ .get_wallclock = native_get_wallclock,
+ .set_wallclock = native_set_wallclock,
+ .sched_clock = native_sched_clock,
+ .get_cpu_khz = native_calculate_cpu_khz,
+};
+
+struct pv_irq_ops pv_irq_ops = {
+ .init_IRQ = native_init_IRQ,
+ .save_fl = native_save_fl,
+ .restore_fl = native_restore_fl,
+ .irq_disable = native_irq_disable,
+ .irq_enable = native_irq_enable,
+ .safe_halt = native_safe_halt,
+ .halt = native_halt,
+};
+
+struct pv_cpu_ops pv_cpu_ops = {
+ .cpuid = native_cpuid,
+ .get_debugreg = native_get_debugreg,
+ .set_debugreg = native_set_debugreg,
+ .clts = native_clts,
+ .read_cr0 = native_read_cr0,
+ .write_cr0 = native_write_cr0,
+ .read_cr4 = native_read_cr4,
+ .read_cr4_safe = native_read_cr4_safe,
+ .write_cr4 = native_write_cr4,
+ .wbinvd = native_wbinvd,
+ .read_msr = native_read_msr_safe,
+ .write_msr = native_write_msr_safe,
+ .read_tsc = native_read_tsc,
+ .read_pmc = native_read_pmc,
+ .read_tscp = native_read_tscp,
+ .load_tr_desc = native_load_tr_desc,
+ .set_ldt = native_set_ldt,
+ .load_gdt = native_load_gdt,
+ .load_idt = native_load_idt,
+ .store_gdt = native_store_gdt,
+ .store_idt = native_store_idt,
+ .store_tr = native_store_tr,
+ .load_tls = native_load_tls,
+ .write_ldt_entry = write_dt_entry,
+#ifdef CONFIG_X86_32
+ .write_gdt_entry = write_dt_entry,
+ .write_idt_entry = write_dt_entry,
+#else
+ .write_gdt_entry = native_write_gdt_entry,
+ .write_idt_entry = native_write_idt_entry,
+#endif
+ .load_esp0 = native_load_esp0,
+
+ .irq_enable_syscall_ret = native_irq_enable_syscall_ret,
+ .iret = native_iret,
+ .swapgs = native_swapgs,
+
+ .set_iopl_mask = native_set_iopl_mask,
+ .io_delay = native_io_delay,
+
+ .lazy_mode = {
+ .enter = paravirt_nop,
+ .leave = paravirt_nop,
+ },
+};
+
+struct pv_apic_ops pv_apic_ops = {
+#ifdef CONFIG_X86_LOCAL_APIC
+ .apic_write = native_apic_write,
+ .apic_write_atomic = native_apic_write_atomic,
+ .apic_read = native_apic_read,
+ .setup_boot_clock = setup_boot_APIC_clock,
+ .setup_secondary_clock = setup_secondary_APIC_clock,
+ .startup_ipi_hook = paravirt_nop,
+#endif
+};
+
+struct pv_mmu_ops pv_mmu_ops = {
+#ifdef CONFIG_X86_32
+ .pagetable_setup_start = native_pagetable_setup_start,
+ .pagetable_setup_done = native_pagetable_setup_done,
+#else
+ .pagetable_setup_start = paravirt_nop,
+ .pagetable_setup_done = paravirt_nop,
+#endif
+
+ .read_cr2 = native_read_cr2,
+ .write_cr2 = native_write_cr2,
+ .read_cr3 = native_read_cr3,
+ .write_cr3 = native_write_cr3,
+
+ .flush_tlb_user = native_flush_tlb,
+ .flush_tlb_kernel = native_flush_tlb_global,
+ .flush_tlb_single = native_flush_tlb_single,
+ .flush_tlb_others = native_flush_tlb_others,
+
+ .alloc_pt = paravirt_nop,
+ .alloc_pd = paravirt_nop,
+ .alloc_pd_clone = paravirt_nop,
+ .release_pt = paravirt_nop,
+ .release_pd = paravirt_nop,
+
+ .set_pte = native_set_pte,
+ .set_pte_at = native_set_pte_at,
+ .set_pmd = native_set_pmd,
+ .pte_update = paravirt_nop,
+ .pte_update_defer = paravirt_nop,
+
+#ifdef CONFIG_HIGHPTE
+ .kmap_atomic_pte = kmap_atomic,
+#endif
+
+#ifdef CONFIG_X86_PAE
+ .set_pte_atomic = native_set_pte_atomic,
+ .set_pte_present = native_set_pte_present,
+#endif
+#if defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
+ .set_pud = native_set_pud,
+ .pte_clear = native_pte_clear,
+ .pmd_clear = native_pmd_clear,
+ .pmd_val = native_pmd_val,
+ .make_pmd = native_make_pmd,
+#endif
+ .pte_val = native_pte_val,
+ .pgd_val = native_pgd_val,
+
+ .make_pte = native_make_pte,
+ .make_pgd = native_make_pgd,
+#ifdef CONFIG_X86_64
+ .set_pgd = native_set_pgd,
+
+ .pud_clear = native_pud_clear,
+ .pgd_clear = native_pgd_clear,
+
+ .pud_val = native_pud_val,
+
+ .make_pud = native_make_pud,
+#endif
+ .dup_mmap = paravirt_nop,
+ .exit_mmap = paravirt_nop,
+ .activate_mm = paravirt_nop,
+
+ .lazy_mode = {
+ .enter = paravirt_nop,
+ .leave = paravirt_nop,
+ },
+};
+
+EXPORT_SYMBOL_GPL(pv_time_ops);
+EXPORT_SYMBOL_GPL(pv_cpu_ops);
+EXPORT_SYMBOL_GPL(pv_mmu_ops);
+EXPORT_SYMBOL_GPL(pv_apic_ops);
+EXPORT_SYMBOL_GPL(pv_info);
+EXPORT_SYMBOL (pv_irq_ops);
diff --git a/arch/x86/kernel/paravirt_32.c b/arch/x86/kernel/paravirt_32.c
deleted file mode 100644
index 04f51d0..0000000
--- a/arch/x86/kernel/paravirt_32.c
+++ /dev/null
@@ -1,472 +0,0 @@
-/* Paravirtualization interfaces
- Copyright (C) 2006 Rusty Russell IBM Corporation
-
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 2 of the License, or
- (at your option) any later version.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
-*/
-#include <linux/errno.h>
-#include <linux/module.h>
-#include <linux/efi.h>
-#include <linux/bcd.h>
-#include <linux/highmem.h>
-
-#include <asm/bug.h>
-#include <asm/paravirt.h>
-#include <asm/desc.h>
-#include <asm/setup.h>
-#include <asm/arch_hooks.h>
-#include <asm/time.h>
-#include <asm/irq.h>
-#include <asm/delay.h>
-#include <asm/fixmap.h>
-#include <asm/apic.h>
-#include <asm/tlbflush.h>
-#include <asm/timer.h>
-
-/* nop stub */
-void _paravirt_nop(void)
-{
-}
-
-static void __init default_banner(void)
-{
- printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
- pv_info.name);
-}
-
-char *memory_setup(void)
-{
- return pv_init_ops.memory_setup();
-}
-
-/* Simple instruction patching code. */
-#define DEF_NATIVE(ops, name, code) \
- extern const char start_##ops##_##name[], end_##ops##_##name[]; \
- asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
-
-DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
-DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
-DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
-DEF_NATIVE(pv_cpu_ops, iret, "iret");
-DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit");
-DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
-DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
-DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
-DEF_NATIVE(pv_cpu_ops, clts, "clts");
-DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
-
-/* Undefined instruction for dealing with missing ops pointers. */
-static const unsigned char ud2a[] = { 0x0f, 0x0b };
-
-static unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
- unsigned long addr, unsigned len)
-{
- const unsigned char *start, *end;
- unsigned ret;
-
- switch(type) {
-#define SITE(ops, x) \
- case PARAVIRT_PATCH(ops.x): \
- start = start_##ops##_##x; \
- end = end_##ops##_##x; \
- goto patch_site
-
- SITE(pv_irq_ops, irq_disable);
- SITE(pv_irq_ops, irq_enable);
- SITE(pv_irq_ops, restore_fl);
- SITE(pv_irq_ops, save_fl);
- SITE(pv_cpu_ops, iret);
- SITE(pv_cpu_ops, irq_enable_syscall_ret);
- SITE(pv_mmu_ops, read_cr2);
- SITE(pv_mmu_ops, read_cr3);
- SITE(pv_mmu_ops, write_cr3);
- SITE(pv_cpu_ops, clts);
- SITE(pv_cpu_ops, read_tsc);
-#undef SITE
-
- patch_site:
- ret = paravirt_patch_insns(ibuf, len, start, end);
- break;
-
- default:
- ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
- break;
- }
-
- return ret;
-}
-
-unsigned paravirt_patch_nop(void)
-{
- return 0;
-}
-
-unsigned paravirt_patch_ignore(unsigned len)
-{
- return len;
-}
-
-struct branch {
- unsigned char opcode;
- u32 delta;
-} __attribute__((packed));
-
-unsigned paravirt_patch_call(void *insnbuf,
- const void *target, u16 tgt_clobbers,
- unsigned long addr, u16 site_clobbers,
- unsigned len)
-{
- struct branch *b = insnbuf;
- unsigned long delta = (unsigned long)target - (addr+5);
-
- if (tgt_clobbers & ~site_clobbers)
- return len; /* target would clobber too much for this site */
- if (len < 5)
- return len; /* call too long for patch site */
-
- b->opcode = 0xe8; /* call */
- b->delta = delta;
- BUILD_BUG_ON(sizeof(*b) != 5);
-
- return 5;
-}
-
-unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
- unsigned long addr, unsigned len)
-{
- struct branch *b = insnbuf;
- unsigned long delta = (unsigned long)target - (addr+5);
-
- if (len < 5)
- return len; /* call too long for patch site */
-
- b->opcode = 0xe9; /* jmp */
- b->delta = delta;
-
- return 5;
-}
-
-/* Neat trick to map patch type back to the call within the
- * corresponding structure. */
-static void *get_call_destination(u8 type)
-{
- struct paravirt_patch_template tmpl = {
- .pv_init_ops = pv_init_ops,
- .pv_time_ops = pv_time_ops,
- .pv_cpu_ops = pv_cpu_ops,
- .pv_irq_ops = pv_irq_ops,
- .pv_apic_ops = pv_apic_ops,
- .pv_mmu_ops = pv_mmu_ops,
- };
- return *((void **)&tmpl + type);
-}
-
-unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
- unsigned long addr, unsigned len)
-{
- void *opfunc = get_call_destination(type);
- unsigned ret;
-
- if (opfunc == NULL)
- /* If there's no function, patch it with a ud2a (BUG) */
- ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a));
- else if (opfunc == paravirt_nop)
- /* If the operation is a nop, then nop the callsite */
- ret = paravirt_patch_nop();
- else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
- type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret))
- /* If operation requires a jmp, then jmp */
- ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
- else
- /* Otherwise call the function; assume target could
- clobber any caller-save reg */
- ret = paravirt_patch_call(insnbuf, opfunc, CLBR_ANY,
- addr, clobbers, len);
-
- return ret;
-}
-
-unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
- const char *start, const char *end)
-{
- unsigned insn_len = end - start;
-
- if (insn_len > len || start == NULL)
- insn_len = len;
- else
- memcpy(insnbuf, start, insn_len);
-
- return insn_len;
-}
-
-void init_IRQ(void)
-{
- pv_irq_ops.init_IRQ();
-}
-
-static void native_flush_tlb(void)
-{
- __native_flush_tlb();
-}
-
-/*
- * Global pages have to be flushed a bit differently. Not a real
- * performance problem because this does not happen often.
- */
-static void native_flush_tlb_global(void)
-{
- __native_flush_tlb_global();
-}
-
-static void native_flush_tlb_single(unsigned long addr)
-{
- __native_flush_tlb_single(addr);
-}
-
-/* These are in entry.S */
-extern void native_iret(void);
-extern void native_irq_enable_syscall_ret(void);
-
-static int __init print_banner(void)
-{
- pv_init_ops.banner();
- return 0;
-}
-core_initcall(print_banner);
-
-static struct resource reserve_ioports = {
- .start = 0,
- .end = IO_SPACE_LIMIT,
- .name = "paravirt-ioport",
- .flags = IORESOURCE_IO | IORESOURCE_BUSY,
-};
-
-static struct resource reserve_iomem = {
- .start = 0,
- .end = -1,
- .name = "paravirt-iomem",
- .flags = IORESOURCE_MEM | IORESOURCE_BUSY,
-};
-
-/*
- * Reserve the whole legacy IO space to prevent any legacy drivers
- * from wasting time probing for their hardware. This is a fairly
- * brute-force approach to disabling all non-virtual drivers.
- *
- * Note that this must be called very early to have any effect.
- */
-int paravirt_disable_iospace(void)
-{
- int ret;
-
- ret = request_resource(&ioport_resource, &reserve_ioports);
- if (ret == 0) {
- ret = request_resource(&iomem_resource, &reserve_iomem);
- if (ret)
- release_resource(&reserve_ioports);
- }
-
- return ret;
-}
-
-static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
-
-static inline void enter_lazy(enum paravirt_lazy_mode mode)
-{
- BUG_ON(x86_read_percpu(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
- BUG_ON(preemptible());
-
- x86_write_percpu(paravirt_lazy_mode, mode);
-}
-
-void paravirt_leave_lazy(enum paravirt_lazy_mode mode)
-{
- BUG_ON(x86_read_percpu(paravirt_lazy_mode) != mode);
- BUG_ON(preemptible());
-
- x86_write_percpu(paravirt_lazy_mode, PARAVIRT_LAZY_NONE);
-}
-
-void paravirt_enter_lazy_mmu(void)
-{
- enter_lazy(PARAVIRT_LAZY_MMU);
-}
-
-void paravirt_leave_lazy_mmu(void)
-{
- paravirt_leave_lazy(PARAVIRT_LAZY_MMU);
-}
-
-void paravirt_enter_lazy_cpu(void)
-{
- enter_lazy(PARAVIRT_LAZY_CPU);
-}
-
-void paravirt_leave_lazy_cpu(void)
-{
- paravirt_leave_lazy(PARAVIRT_LAZY_CPU);
-}
-
-enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
-{
- return x86_read_percpu(paravirt_lazy_mode);
-}
-
-struct pv_info pv_info = {
- .name = "bare hardware",
- .paravirt_enabled = 0,
- .kernel_rpl = 0,
- .shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */
-};
-
-struct pv_init_ops pv_init_ops = {
- .patch = native_patch,
- .banner = default_banner,
- .arch_setup = paravirt_nop,
- .memory_setup = machine_specific_memory_setup,
-};
-
-struct pv_time_ops pv_time_ops = {
- .time_init = hpet_time_init,
- .get_wallclock = native_get_wallclock,
- .set_wallclock = native_set_wallclock,
- .sched_clock = native_sched_clock,
- .get_cpu_khz = native_calculate_cpu_khz,
-};
-
-struct pv_irq_ops pv_irq_ops = {
- .init_IRQ = native_init_IRQ,
- .save_fl = native_save_fl,
- .restore_fl = native_restore_fl,
- .irq_disable = native_irq_disable,
- .irq_enable = native_irq_enable,
- .safe_halt = native_safe_halt,
- .halt = native_halt,
-};
-
-struct pv_cpu_ops pv_cpu_ops = {
- .cpuid = native_cpuid,
- .get_debugreg = native_get_debugreg,
- .set_debugreg = native_set_debugreg,
- .clts = native_clts,
- .read_cr0 = native_read_cr0,
- .write_cr0 = native_write_cr0,
- .read_cr4 = native_read_cr4,
- .read_cr4_safe = native_read_cr4_safe,
- .write_cr4 = native_write_cr4,
- .wbinvd = native_wbinvd,
- .read_msr = native_read_msr_safe,
- .write_msr = native_write_msr_safe,
- .read_tsc = native_read_tsc,
- .read_pmc = native_read_pmc,
- .load_tr_desc = native_load_tr_desc,
- .set_ldt = native_set_ldt,
- .load_gdt = native_load_gdt,
- .load_idt = native_load_idt,
- .store_gdt = native_store_gdt,
- .store_idt = native_store_idt,
- .store_tr = native_store_tr,
- .load_tls = native_load_tls,
- .write_ldt_entry = write_dt_entry,
- .write_gdt_entry = write_dt_entry,
- .write_idt_entry = write_dt_entry,
- .load_esp0 = native_load_esp0,
-
- .irq_enable_syscall_ret = native_irq_enable_syscall_ret,
- .iret = native_iret,
-
- .set_iopl_mask = native_set_iopl_mask,
- .io_delay = native_io_delay,
-
- .lazy_mode = {
- .enter = paravirt_nop,
- .leave = paravirt_nop,
- },
-};
-
-struct pv_apic_ops pv_apic_ops = {
-#ifdef CONFIG_X86_LOCAL_APIC
- .apic_write = native_apic_write,
- .apic_write_atomic = native_apic_write_atomic,
- .apic_read = native_apic_read,
- .setup_boot_clock = setup_boot_APIC_clock,
- .setup_secondary_clock = setup_secondary_APIC_clock,
- .startup_ipi_hook = paravirt_nop,
-#endif
-};
-
-struct pv_mmu_ops pv_mmu_ops = {
- .pagetable_setup_start = native_pagetable_setup_start,
- .pagetable_setup_done = native_pagetable_setup_done,
-
- .read_cr2 = native_read_cr2,
- .write_cr2 = native_write_cr2,
- .read_cr3 = native_read_cr3,
- .write_cr3 = native_write_cr3,
-
- .flush_tlb_user = native_flush_tlb,
- .flush_tlb_kernel = native_flush_tlb_global,
- .flush_tlb_single = native_flush_tlb_single,
- .flush_tlb_others = native_flush_tlb_others,
-
- .alloc_pt = paravirt_nop,
- .alloc_pd = paravirt_nop,
- .alloc_pd_clone = paravirt_nop,
- .release_pt = paravirt_nop,
- .release_pd = paravirt_nop,
-
- .set_pte = native_set_pte,
- .set_pte_at = native_set_pte_at,
- .set_pmd = native_set_pmd,
- .pte_update = paravirt_nop,
- .pte_update_defer = paravirt_nop,
-
-#ifdef CONFIG_HIGHPTE
- .kmap_atomic_pte = kmap_atomic,
-#endif
-
-#ifdef CONFIG_X86_PAE
- .set_pte_atomic = native_set_pte_atomic,
- .set_pte_present = native_set_pte_present,
- .set_pud = native_set_pud,
- .pte_clear = native_pte_clear,
- .pmd_clear = native_pmd_clear,
-
- .pmd_val = native_pmd_val,
- .make_pmd = native_make_pmd,
-#endif
-
- .pte_val = native_pte_val,
- .pgd_val = native_pgd_val,
-
- .make_pte = native_make_pte,
- .make_pgd = native_make_pgd,
-
- .dup_mmap = paravirt_nop,
- .exit_mmap = paravirt_nop,
- .activate_mm = paravirt_nop,
-
- .lazy_mode = {
- .enter = paravirt_nop,
- .leave = paravirt_nop,
- },
-};
-
-EXPORT_SYMBOL_GPL(pv_time_ops);
-EXPORT_SYMBOL_GPL(pv_cpu_ops);
-EXPORT_SYMBOL_GPL(pv_mmu_ops);
-EXPORT_SYMBOL_GPL(pv_apic_ops);
-EXPORT_SYMBOL_GPL(pv_info);
-EXPORT_SYMBOL (pv_irq_ops);
diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
new file mode 100644
index 0000000..46ae585
--- /dev/null
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -0,0 +1,52 @@
+#include <asm/paravirt.h>
+
+DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
+DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
+DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
+DEF_NATIVE(pv_cpu_ops, iret, "iret");
+DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit");
+DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
+DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
+DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
+DEF_NATIVE(pv_cpu_ops, clts, "clts");
+DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
+
+/* Undefined instruction for dealing with missing ops pointers. */
+static const unsigned char ud2a[] = { 0x0f, 0x0b };
+
+unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+ unsigned long addr, unsigned len)
+{
+ const unsigned char *start, *end;
+ unsigned ret;
+
+#define PATCH_SITE(ops, x) \
+ case PARAVIRT_PATCH(ops.x): \
+ start = start_##ops##_##x; \
+ end = end_##ops##_##x; \
+ goto patch_site
+ switch(type) {
+ PATCH_SITE(pv_irq_ops, irq_disable);
+ PATCH_SITE(pv_irq_ops, irq_enable);
+ PATCH_SITE(pv_irq_ops, restore_fl);
+ PATCH_SITE(pv_irq_ops, save_fl);
+ PATCH_SITE(pv_cpu_ops, iret);
+ PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
+ PATCH_SITE(pv_mmu_ops, read_cr2);
+ PATCH_SITE(pv_mmu_ops, read_cr3);
+ PATCH_SITE(pv_mmu_ops, write_cr3);
+ PATCH_SITE(pv_cpu_ops, clts);
+ PATCH_SITE(pv_cpu_ops, read_tsc);
+
+ patch_site:
+ ret = paravirt_patch_insns(ibuf, len, start, end);
+ break;
+
+ default:
+ ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
+ break;
+ }
+#undef PATCH_SITE
+ return ret;
+}
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
new file mode 100644
index 0000000..cbfc4f3
--- /dev/null
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -0,0 +1,56 @@
+#include <asm/paravirt.h>
+#include <asm/asm-offsets.h>
+
+DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
+DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
+DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
+DEF_NATIVE(pv_cpu_ops, iret, "iretq");
+DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
+DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
+DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
+DEF_NATIVE(pv_mmu_ops, flush_tlb_single, "invlpg (%rdi)");
+DEF_NATIVE(pv_cpu_ops, clts, "clts");
+DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
+
+/* the three commands give us more control to how to return from a syscall */
+DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "movq %gs:" __stringify(pda_oldrsp) ", %rsp; swapgs; sysretq;");
+DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
+
+unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+ unsigned long addr, unsigned len)
+{
+ const unsigned char *start, *end;
+ unsigned ret;
+
+#define PATCH_SITE(ops, x) \
+ case PARAVIRT_PATCH(ops.x): \
+ start = start_##ops##_##x; \
+ end = end_##ops##_##x; \
+ goto patch_site
+ switch(type) {
+ PATCH_SITE(pv_irq_ops, restore_fl);
+ PATCH_SITE(pv_irq_ops, save_fl);
+ PATCH_SITE(pv_irq_ops, irq_enable);
+ PATCH_SITE(pv_irq_ops, irq_disable);
+ PATCH_SITE(pv_cpu_ops, iret);
+ PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
+ PATCH_SITE(pv_cpu_ops, swapgs);
+ PATCH_SITE(pv_mmu_ops, read_cr2);
+ PATCH_SITE(pv_mmu_ops, read_cr3);
+ PATCH_SITE(pv_mmu_ops, write_cr3);
+ PATCH_SITE(pv_cpu_ops, clts);
+ PATCH_SITE(pv_mmu_ops, flush_tlb_single);
+ PATCH_SITE(pv_cpu_ops, wbinvd);
+
+ patch_site:
+ ret = paravirt_patch_insns(ibuf, len, start, end);
+ break;
+
+ default:
+ ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
+ break;
+ }
+#undef PATCH_SITE
+ return ret;
+}
diff --git a/arch/x86/kernel/vmlinux_64.lds.S b/arch/x86/kernel/vmlinux_64.lds.S
index ba8ea97..c3fce85 100644
--- a/arch/x86/kernel/vmlinux_64.lds.S
+++ b/arch/x86/kernel/vmlinux_64.lds.S
@@ -185,6 +185,12 @@ SECTIONS
.altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
*(.altinstr_replacement)
}
+ . = ALIGN(8);
+ .parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
+ __parainstructions = .;
+ *(.parainstructions)
+ __parainstructions_end = .;
+ }
/* .exit.text is discard at runtime, not link time, to deal with references
from .altinstructions and .eh_frame */
.exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { *(.exit.text) }
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
index d81a361..4ca335a 100644
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -7,23 +7,42 @@
#include <asm/page.h>

/* Bitmask of what can be clobbered: usually at least eax. */
-#define CLBR_NONE 0x0
-#define CLBR_EAX 0x1
-#define CLBR_ECX 0x2
-#define CLBR_EDX 0x4
-#define CLBR_ANY 0x7
+#define CLBR_NONE 0
+#define CLBR_EAX (1 << 0)
+#define CLBR_ECX (1 << 1)
+#define CLBR_EDX (1 << 2)
+
+#ifdef CONFIG_X86_64
+#define CLBR_RSI (1 << 3)
+#define CLBR_RDI (1 << 4)
+#define CLBR_R8 (1 << 5)
+#define CLBR_R9 (1 << 6)
+#define CLBR_R10 (1 << 7)
+#define CLBR_R11 (1 << 8)
+#define CLBR_ANY ((1 << 9) - 1)
+#include <asm/desc_defs.h>
+#else
+/* CLBR_ANY should match all the regs the platform has. For i386, that's just these three */
+#define CLBR_ANY ((1 << 3) - 1)
+#endif /* X86_64 */

#ifndef __ASSEMBLY__
#include <linux/types.h>
#include <linux/cpumask.h>
#include <asm/kmap_types.h>
+#include <linux/stringify.h>

struct page;
struct thread_struct;
-struct Xgt_desc_struct;
struct tss_struct;
struct mm_struct;
struct desc_struct;
+/* FIXME: Ideally, the two arches would use the same data structure */
+#ifdef CONFIG_X86_64
+typedef struct desc_ptr x86_descr_ptr;
+#else
+typedef struct Xgt_desc_struct x86_descr_ptr;
+#endif

/* general info */
struct pv_info {
@@ -54,7 +73,6 @@ struct pv_init_ops {
void (*banner)(void);
};

-
struct pv_lazy_ops {
/* Set deferred update mode, used for batching operations. */
void (*enter)(void);
@@ -88,19 +106,26 @@ struct pv_cpu_ops {

/* Segment descriptor handling */
void (*load_tr_desc)(void);
- void (*load_gdt)(const struct Xgt_desc_struct *);
- void (*load_idt)(const struct Xgt_desc_struct *);
- void (*store_gdt)(struct Xgt_desc_struct *);
- void (*store_idt)(struct Xgt_desc_struct *);
+ void (*load_gdt)(const x86_descr_ptr *);
+ void (*load_idt)(const x86_descr_ptr *);
+ void (*store_gdt)(x86_descr_ptr *);
+ void (*store_idt)(x86_descr_ptr *);
void (*set_ldt)(const void *desc, unsigned entries);
unsigned long (*store_tr)(void);
void (*load_tls)(struct thread_struct *t, unsigned int cpu);
void (*write_ldt_entry)(struct desc_struct *,
int entrynum, u32 low, u32 high);
+#ifdef CONFIG_X86_32
void (*write_gdt_entry)(struct desc_struct *,
int entrynum, u32 low, u32 high);
void (*write_idt_entry)(struct desc_struct *,
int entrynum, u32 low, u32 high);
+#else
+ void (*write_gdt_entry)(void *ptr, void *entry, unsigned type,
+ unsigned size);
+ void (*write_idt_entry)(void *adr, struct gate_struct *s);
+#endif
+
void (*load_esp0)(struct tss_struct *tss, struct thread_struct *t);

void (*set_iopl_mask)(unsigned mask);
@@ -115,15 +140,18 @@ struct pv_cpu_ops {
/* MSR, PMC and TSR operations.
err = 0/-EFAULT. wrmsr returns 0/-EFAULT. */
u64 (*read_msr)(unsigned int msr, int *err);
- int (*write_msr)(unsigned int msr, u64 val);
+ int (*write_msr)(unsigned int msr, unsigned int low, unsigned int high);

u64 (*read_tsc)(void);
- u64 (*read_pmc)(void);
+ u64 (*read_pmc)(int counter);
+ u64 (*read_tscp)(int *aux);

/* These two are jmp to, not actually called. */
void (*irq_enable_syscall_ret)(void);
void (*iret)(void);

+ void (*swapgs)(void);
+
struct pv_lazy_ops lazy_mode;
};

@@ -150,9 +178,9 @@ struct pv_apic_ops {
* Direct APIC operations, principally for VMI. Ideally
* these shouldn't be in this interface.
*/
- void (*apic_write)(unsigned long reg, unsigned long v);
- void (*apic_write_atomic)(unsigned long reg, unsigned long v);
- unsigned long (*apic_read)(unsigned long reg);
+ void (*apic_write)(unsigned long reg, u32 v);
+ void (*apic_write_atomic)(unsigned long reg, u32 v);
+ u32 (*apic_read)(unsigned long reg);
void (*setup_boot_clock)(void);
void (*setup_secondary_clock)(void);

@@ -216,6 +244,8 @@ struct pv_mmu_ops {
void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
void (*set_pte_present)(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte);
+#endif
+#if defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
void (*set_pud)(pud_t *pudp, pud_t pudval);
void (*pte_clear)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
void (*pmd_clear)(pmd_t *pmdp);
@@ -227,6 +257,16 @@ struct pv_mmu_ops {
pte_t (*make_pte)(unsigned long long pte);
pmd_t (*make_pmd)(unsigned long long pmd);
pgd_t (*make_pgd)(unsigned long long pgd);
+ #ifdef CONFIG_X86_64
+ void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval);
+
+ void (*pud_clear)(pud_t *pudp);
+ void (*pgd_clear)(pgd_t *pgdp);
+
+ unsigned long long (*pud_val)(pud_t);
+
+ pud_t (*make_pud)(unsigned long long pud);
+ #endif
#else
unsigned long (*pte_val)(pte_t);
unsigned long (*pgd_val)(pgd_t);
@@ -255,6 +295,12 @@ struct paravirt_patch_template
struct pv_mmu_ops pv_mmu_ops;
};

+#ifdef CONFIG_X86_64
+#define WORDSIZE_STR " .quad"
+#else
+#define WORDSIZE_STR " .long"
+#endif
+
extern struct pv_info pv_info;
extern struct pv_init_ops pv_init_ops;
extern struct pv_time_ops pv_time_ops;
@@ -279,7 +325,8 @@ extern struct pv_mmu_ops pv_mmu_ops;
#define _paravirt_alt(insn_string, type, clobber) \
"771:\n\t" insn_string "\n" "772:\n" \
".pushsection .parainstructions,\"a\"\n" \
- " .long 771b\n" \
+ ".align 8\n" \
+ WORDSIZE_STR " 771b\n" \
" .byte " type "\n" \
" .byte 772b-771b\n" \
" .short " clobber "\n" \
@@ -289,6 +336,11 @@ extern struct pv_mmu_ops pv_mmu_ops;
#define paravirt_alt(insn_string) \
_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")

+/* Simple instruction patching code. */
+#define DEF_NATIVE(ops, name, code) \
+ extern const char start_##ops##_##name[], end_##ops##_##name[]; \
+ asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
+
unsigned paravirt_patch_nop(void);
unsigned paravirt_patch_ignore(unsigned len);
unsigned paravirt_patch_call(void *insnbuf,
@@ -303,6 +355,9 @@ unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
const char *start, const char *end);

+unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+ unsigned long addr, unsigned len);
+
int paravirt_disable_iospace(void);

/*
@@ -319,22 +374,29 @@ int paravirt_disable_iospace(void);
* runtime.
*
* Normally, a call to a pv_op function is a simple indirect call:
- * (paravirt_ops.operations)(args...).
+ * (pv_op_struct.operations)(args...).
*
* Unfortunately, this is a relatively slow operation for modern CPUs,
* because it cannot necessarily determine what the destination
- * address is. In this case, the address is a runtime constant, so at
- * the very least we can patch the call to e a simple direct call, or
+ * address is. In this case, the address is a runtime constant, so at
+ * the very least we can patch the call to be a simple direct call, or
* ideally, patch an inline implementation into the callsite. (Direct
* calls are essentially free, because the call and return addresses
* are completely predictable.)
*
- * These macros rely on the standard gcc "regparm(3)" calling
+ * For i386, these macros rely on the standard gcc "regparm(3)" calling
* convention, in which the first three arguments are placed in %eax,
* %edx, %ecx (in that order), and the remaining arguments are placed
* on the stack. All caller-save registers (eax,edx,ecx) are expected
* to be modified (either clobbered or used for return values).
*
+ * x86_64, on the other hand, already specifies a register-based calling
+ * convention, returning in %rax, with parameters going in %rdi, %rsi,
+ * %rdx, and %rcx. Note that for this reason, x86_64 does not need any
+ * special handling for dealing with 4 arguments, unlike i386.
+ * However, x86_64 also has to clobber all caller-saved registers, which
+ * unfortunately amounts to quite a few of them (r8 - r11).
+ *
* The call instruction itself is marked by placing its start address
* and size into the .parainstructions section, so that
* apply_paravirt() in arch/i386/kernel/alternative.c can do the
@@ -356,9 +418,10 @@ int paravirt_disable_iospace(void);
* the return type. The macro then uses sizeof() on that type to
* determine whether its a 32 or 64 bit value, and places the return
* in the right register(s) (just %eax for 32-bit, and %edx:%eax for
- * 64-bit).
+ * 64-bit). For x86_64 machines, the return is simply in %rax regardless of
+ * the return value size.
*
- * 64-bit arguments are passed as a pair of adjacent 32-bit arguments
+ * i386 also passes 64-bit arguments as a pair of adjacent 32-bit arguments
* in low,high order.
*
* Small structures are passed and returned in registers. The macro
@@ -369,46 +432,67 @@ int paravirt_disable_iospace(void);
* means that all uses must be wrapped in inline functions. This also
* makes sure the incoming and outgoing types are always correct.
*/
+#ifdef CONFIG_X86_32
+#define PVOP_VCALL_ARGS unsigned long __eax,__edx,__ecx
+#define PVOP_CALL_ARGS PVOP_VCALL_ARGS
+#define PVOP_VCALL_CLOBBERS "=a" (__eax), "=d" (__edx), \
+ "=c" (__ecx)
+#define PVOP_CALL_CLOBBERS PVOP_VCALL_CLOBBERS
+#define EXTRA_CLOBBERS
+#define VEXTRA_CLOBBERS
+#else
+#define PVOP_VCALL_ARGS unsigned long __edi,__esi,__edx,__ecx
+#define PVOP_CALL_ARGS PVOP_VCALL_ARGS, __eax
+#define PVOP_VCALL_CLOBBERS "=D" (__edi), \
+ "=S" (__esi), "=d" (__edx), \
+ "=c" (__ecx)
+
+#define PVOP_CALL_CLOBBERS PVOP_VCALL_CLOBBERS, "=a" (__eax)
+
+#define EXTRA_CLOBBERS , "r8", "r9", "r10", "r11"
+#define VEXTRA_CLOBBERS , "rax", "r8", "r9", "r10", "r11"
+#endif
+
#define __PVOP_CALL(rettype, op, pre, post, ...) \
({ \
rettype __ret; \
- unsigned long __eax, __edx, __ecx; \
+ PVOP_CALL_ARGS; \
+ /* This is 32-bit specific, but is okay in 64-bit */ \
+ /* since this condition will never hold */ \
if (sizeof(rettype) > sizeof(unsigned long)) { \
asm volatile(pre \
paravirt_alt(PARAVIRT_CALL) \
post \
- : "=a" (__eax), "=d" (__edx), \
- "=c" (__ecx) \
+ : PVOP_CALL_CLOBBERS \
: paravirt_type(op), \
paravirt_clobber(CLBR_ANY), \
##__VA_ARGS__ \
- : "memory", "cc"); \
+ : "memory", "cc" EXTRA_CLOBBERS); \
__ret = (rettype)((((u64)__edx) << 32) | __eax); \
} else { \
asm volatile(pre \
paravirt_alt(PARAVIRT_CALL) \
post \
- : "=a" (__eax), "=d" (__edx), \
- "=c" (__ecx) \
+ : PVOP_CALL_CLOBBERS \
: paravirt_type(op), \
paravirt_clobber(CLBR_ANY), \
##__VA_ARGS__ \
- : "memory", "cc"); \
+ : "memory", "cc" EXTRA_CLOBBERS); \
__ret = (rettype)__eax; \
} \
__ret; \
})
#define __PVOP_VCALL(op, pre, post, ...) \
({ \
- unsigned long __eax, __edx, __ecx; \
+ PVOP_VCALL_ARGS; \
asm volatile(pre \
paravirt_alt(PARAVIRT_CALL) \
post \
- : "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+ : PVOP_VCALL_CLOBBERS \
: paravirt_type(op), \
paravirt_clobber(CLBR_ANY), \
##__VA_ARGS__ \
- : "memory", "cc"); \
+ : "memory", "cc" VEXTRA_CLOBBERS); \
})

#define PVOP_CALL0(rettype, op) \
@@ -417,22 +501,27 @@ int paravirt_disable_iospace(void);
__PVOP_VCALL(op, "", "")

#define PVOP_CALL1(rettype, op, arg1) \
- __PVOP_CALL(rettype, op, "", "", "0" ((u32)(arg1)))
+ __PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)))
#define PVOP_VCALL1(op, arg1) \
- __PVOP_VCALL(op, "", "", "0" ((u32)(arg1)))
+ __PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)))

#define PVOP_CALL2(rettype, op, arg1, arg2) \
- __PVOP_CALL(rettype, op, "", "", "0" ((u32)(arg1)), "1" ((u32)(arg2)))
+ __PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)), \
+ "1" ((unsigned long)(arg2)))
+
#define PVOP_VCALL2(op, arg1, arg2) \
- __PVOP_VCALL(op, "", "", "0" ((u32)(arg1)), "1" ((u32)(arg2)))
+ __PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)), \
+ "1" ((unsigned long)(arg2)))

#define PVOP_CALL3(rettype, op, arg1, arg2, arg3) \
- __PVOP_CALL(rettype, op, "", "", "0" ((u32)(arg1)), \
- "1"((u32)(arg2)), "2"((u32)(arg3)))
+ __PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)), \
+ "1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)))
#define PVOP_VCALL3(op, arg1, arg2, arg3) \
- __PVOP_VCALL(op, "", "", "0" ((u32)(arg1)), "1"((u32)(arg2)), \
- "2"((u32)(arg3)))
+ __PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)), \
+ "1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)))

+/* This is the only case that differs on x86_64, and we can make it much simpler */
+#ifdef CONFIG_X86_32
#define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4) \
__PVOP_CALL(rettype, op, \
"push %[_arg4];", "lea 4(%%esp),%%esp;", \
@@ -443,6 +532,16 @@ int paravirt_disable_iospace(void);
"push %[_arg4];", "lea 4(%%esp),%%esp;", \
"0" ((u32)(arg1)), "1" ((u32)(arg2)), \
"2" ((u32)(arg3)), [_arg4] "mr" ((u32)(arg4)))
+#else
+#define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4) \
+ __PVOP_CALL(rettype, op, "", "", "0" ((unsigned long)(arg1)), \
+ "1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)), \
+ "3"((unsigned long)(arg4)))
+#define PVOP_VCALL4(op, arg1, arg2, arg3, arg4) \
+ __PVOP_VCALL(op, "", "", "0" ((unsigned long)(arg1)), \
+ "1"((unsigned long)(arg2)), "2"((unsigned long)(arg3)), \
+ "3"((unsigned long)(arg4)))
+#endif

static inline int paravirt_enabled(void)
{
@@ -561,6 +660,7 @@ static inline u64 paravirt_read_msr(unsigned msr, int *err)
{
return PVOP_CALL2(u64, pv_cpu_ops.read_msr, msr, err);
}
+
static inline int paravirt_write_msr(unsigned msr, unsigned low, unsigned high)
{
return PVOP_CALL3(int, pv_cpu_ops.write_msr, msr, low, high);
@@ -613,8 +713,6 @@ static inline unsigned long long paravirt_sched_clock(void)
}
#define calculate_cpu_khz() (pv_time_ops.get_cpu_khz())

-#define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
-
static inline unsigned long long paravirt_read_pmc(int counter)
{
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
@@ -626,15 +724,36 @@ static inline unsigned long long paravirt_read_pmc(int counter)
high = _l >> 32; \
} while(0)

+static inline unsigned long paravirt_rdtscp(int *aux)
+{
+ return PVOP_CALL1(u64, pv_cpu_ops.read_tscp, aux);
+}
+
+#define rdtscp(low, high, aux) \
+do { \
+ int __aux; \
+ unsigned long __val = paravirt_rdtscp(&__aux); \
+ (low) = (u32)__val; \
+ (high) = (u32)(__val >> 32); \
+ (aux) = __aux; \
+} while (0)
+
+#define rdtscpll(val, aux) \
+do { \
+ unsigned long __aux; \
+ val = paravirt_rdtscp(&__aux); \
+ (aux) = __aux; \
+} while (0)
+
static inline void load_TR_desc(void)
{
PVOP_VCALL0(pv_cpu_ops.load_tr_desc);
}
-static inline void load_gdt(const struct Xgt_desc_struct *dtr)
+static inline void load_gdt(const x86_descr_ptr *dtr)
{
PVOP_VCALL1(pv_cpu_ops.load_gdt, dtr);
}
-static inline void load_idt(const struct Xgt_desc_struct *dtr)
+static inline void load_idt(const x86_descr_ptr *dtr)
{
PVOP_VCALL1(pv_cpu_ops.load_idt, dtr);
}
@@ -642,11 +761,11 @@ static inline void set_ldt(const void *addr, unsigned entries)
{
PVOP_VCALL2(pv_cpu_ops.set_ldt, addr, entries);
}
-static inline void store_gdt(struct Xgt_desc_struct *dtr)
+static inline void store_gdt(x86_descr_ptr *dtr)
{
PVOP_VCALL1(pv_cpu_ops.store_gdt, dtr);
}
-static inline void store_idt(struct Xgt_desc_struct *dtr)
+static inline void store_idt(x86_descr_ptr *dtr)
{
PVOP_VCALL1(pv_cpu_ops.store_idt, dtr);
}
@@ -663,6 +782,8 @@ static inline void write_ldt_entry(void *dt, int entry, u32 low, u32 high)
{
PVOP_VCALL4(pv_cpu_ops.write_ldt_entry, dt, entry, low, high);
}
+
+#ifdef CONFIG_X86_32
static inline void write_gdt_entry(void *dt, int entry, u32 low, u32 high)
{
PVOP_VCALL4(pv_cpu_ops.write_gdt_entry, dt, entry, low, high);
@@ -671,6 +792,19 @@ static inline void write_idt_entry(void *dt, int entry, u32 low, u32 high)
{
PVOP_VCALL4(pv_cpu_ops.write_idt_entry, dt, entry, low, high);
}
+#else
+static inline void write_gdt_entry(void *ptr, void *entry,
+ unsigned type, unsigned size)
+{
+ PVOP_VCALL4(pv_cpu_ops.write_gdt_entry, ptr, entry, type, size);
+}
+
+static inline void write_idt_entry(void *adr, struct gate_struct *s)
+{
+ PVOP_VCALL2(pv_cpu_ops.write_idt_entry, adr, s);
+}
+#endif
+
static inline void set_iopl_mask(unsigned mask)
{
PVOP_VCALL1(pv_cpu_ops.set_iopl_mask, mask);
@@ -690,19 +824,19 @@ static inline void slow_down_io(void) {
/*
* Basic functions accessing APICs.
*/
-static inline void apic_write(unsigned long reg, unsigned long v)
+static inline void apic_write(unsigned long reg, u32 v)
{
PVOP_VCALL2(pv_apic_ops.apic_write, reg, v);
}

-static inline void apic_write_atomic(unsigned long reg, unsigned long v)
+static inline void apic_write_atomic(unsigned long reg, u32 v)
{
PVOP_VCALL2(pv_apic_ops.apic_write_atomic, reg, v);
}

-static inline unsigned long apic_read(unsigned long reg)
+static inline u32 apic_read(unsigned long reg)
{
- return PVOP_CALL1(unsigned long, pv_apic_ops.apic_read, reg);
+ return PVOP_CALL1(u32, pv_apic_ops.apic_read, reg);
}

static inline void setup_boot_clock(void)
@@ -762,10 +896,12 @@ static inline void __flush_tlb(void)
{
PVOP_VCALL0(pv_mmu_ops.flush_tlb_user);
}
+
static inline void __flush_tlb_global(void)
{
PVOP_VCALL0(pv_mmu_ops.flush_tlb_kernel);
}
+
static inline void __flush_tlb_single(unsigned long addr)
{
PVOP_VCALL1(pv_mmu_ops.flush_tlb_single, addr);
@@ -908,7 +1044,103 @@ static inline void pmd_clear(pmd_t *pmdp)
PVOP_VCALL1(pv_mmu_ops.pmd_clear, pmdp);
}

-#else /* !CONFIG_X86_PAE */
+#elif defined(CONFIG_X86_64)
+/* FIXME: There ought to be a way to do this that duplicates less code */
+static inline pte_t __pte(unsigned long long val)
+{
+ unsigned long long ret;
+ ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.make_pte, val);
+ return (pte_t) { ret };
+}
+
+static inline pmd_t __pmd(unsigned long long val)
+{
+ unsigned long long ret;
+ ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.make_pmd, val);
+ return (pmd_t) { ret };
+}
+
+static inline pud_t __pud(unsigned long long val)
+{
+ unsigned long long ret;
+ ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.make_pud, val);
+ return (pud_t) { ret };
+}
+
+static inline pgd_t __pgd(unsigned long long val)
+{
+ unsigned long long ret;
+ ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.make_pgd, val);
+ return (pgd_t) { ret };
+}
+
+static inline unsigned long long pte_val(pte_t x)
+{
+ return PVOP_CALL1(unsigned long long, pv_mmu_ops.pte_val, x.pte);
+}
+
+static inline unsigned long long pmd_val(pmd_t x)
+{
+ return PVOP_CALL1(unsigned long long, pv_mmu_ops.pmd_val, x.pmd);
+}
+
+static inline unsigned long long pud_val(pud_t x)
+{
+ return PVOP_CALL1(unsigned long long, pv_mmu_ops.pud_val, x.pud);
+}
+
+static inline unsigned long long pgd_val(pgd_t x)
+{
+ return PVOP_CALL1(unsigned long long, pv_mmu_ops.pgd_val, x.pgd);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pteval)
+{
+ PVOP_VCALL2(pv_mmu_ops.set_pte, ptep, pteval.pte);
+}
+
+static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pteval)
+{
+ PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pteval.pte);
+}
+
+static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
+{
+ PVOP_VCALL2(pv_mmu_ops.set_pmd, pmdp, pmdval.pmd);
+}
+
+static inline void set_pud(pud_t *pudp, pud_t pudval)
+{
+ PVOP_VCALL2(pv_mmu_ops.set_pud, pudp, pudval.pud);
+}
+
+static inline void set_pgd(pgd_t *pgdp, pgd_t pgdval)
+{
+ PVOP_VCALL2(pv_mmu_ops.set_pgd, pgdp, pgdval.pgd);
+}
+
+static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+ PVOP_VCALL3(pv_mmu_ops.pte_clear, mm, addr, ptep);
+}
+
+static inline void pmd_clear(pmd_t *pmdp)
+{
+ PVOP_VCALL1(pv_mmu_ops.pmd_clear, pmdp);
+}
+
+static inline void pud_clear(pud_t *pudp)
+{
+ PVOP_VCALL1(pv_mmu_ops.pud_clear, pudp);
+}
+
+static inline void pgd_clear(pgd_t *pgdp)
+{
+ PVOP_VCALL1(pv_mmu_ops.pgd_clear, pgdp);
+}
+
+#else /* !CONFIG_X86_PAE && !CONFIG_X86_64*/

static inline pte_t __pte(unsigned long val)
{
@@ -1014,52 +1246,68 @@ struct paravirt_patch_site {
extern struct paravirt_patch_site __parainstructions[],
__parainstructions_end[];

+#ifdef CONFIG_X86_32
+#define PV_SAVE_REGS "pushl %%ecx; pushl %%edx;"
+#define PV_RESTORE_REGS "popl %%edx; popl %%ecx"
+#define PV_FLAGS_ARG "0"
+#define PV_EXTRA_CLOBBERS
+#define PV_VEXTRA_CLOBBERS
+#else
+/* We could save all registers, but that's too much. We save only the argument
+ * register and mark all other caller-saved registers as clobbered */
+#define PV_SAVE_REGS "pushq %%rdi;"
+#define PV_RESTORE_REGS "popq %%rdi;"
+#define PV_EXTRA_CLOBBERS EXTRA_CLOBBERS, "rcx" , "rdx"
+#define PV_VEXTRA_CLOBBERS EXTRA_CLOBBERS, "rdi", "rcx" , "rdx"
+#define PV_FLAGS_ARG "D"
+#endif
+
static inline unsigned long __raw_local_save_flags(void)
{
unsigned long f;

- asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+ asm volatile(paravirt_alt(PV_SAVE_REGS
PARAVIRT_CALL
- "popl %%edx; popl %%ecx")
+ PV_RESTORE_REGS)
: "=a"(f)
: paravirt_type(pv_irq_ops.save_fl),
paravirt_clobber(CLBR_EAX)
- : "memory", "cc");
+ : "memory", "cc" PV_VEXTRA_CLOBBERS);
return f;
}

static inline void raw_local_irq_restore(unsigned long f)
{
- asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+ asm volatile(paravirt_alt(PV_SAVE_REGS
PARAVIRT_CALL
- "popl %%edx; popl %%ecx")
+ PV_RESTORE_REGS)
: "=a"(f)
- : "0"(f),
+ : PV_FLAGS_ARG (f),
paravirt_type(pv_irq_ops.restore_fl),
paravirt_clobber(CLBR_EAX)
- : "memory", "cc");
+ : "memory", "cc" PV_EXTRA_CLOBBERS);
}

static inline void raw_local_irq_disable(void)
{
- asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+ asm volatile(paravirt_alt(PV_SAVE_REGS
PARAVIRT_CALL
- "popl %%edx; popl %%ecx")
+ PV_RESTORE_REGS)
:
: paravirt_type(pv_irq_ops.irq_disable),
paravirt_clobber(CLBR_EAX)
- : "memory", "eax", "cc");
+ : "memory", "eax", "cc" PV_VEXTRA_CLOBBERS);
}

static inline void raw_local_irq_enable(void)
{
- asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+ asm volatile(paravirt_alt(PV_SAVE_REGS
PARAVIRT_CALL
- "popl %%edx; popl %%ecx")
+ PV_RESTORE_REGS)
:
: paravirt_type(pv_irq_ops.irq_enable),
paravirt_clobber(CLBR_EAX)
- : "memory", "eax", "cc");
+ : "memory", "eax", "cc" PV_VEXTRA_CLOBBERS);
}

static inline unsigned long __raw_local_irq_save(void)
@@ -1071,27 +1319,41 @@ static inline unsigned long __raw_local_irq_save(void)
return f;
}

+#ifdef CONFIG_X86_32
+#define SAVE_REGS "pushl %%ecx; pushl %%edx;"
+#define RESTORE_REGS "popl %%edx; popl %%ecx"
+#define CLI_STI_CLOBBERS , "%eax"
+#else /* !X86_32 */
+#define SAVE_REGS "pushq %%rcx; pushq %%rdx;"
+#define RESTORE_REGS "popq %%rdx; popq %%rcx"
+#define CLI_STI_CLOBBERS , "%rax", "%rdi", "%rsi", "%r8", "%r9", "%r10",\
+ "%r11", "%r12", "%r13", "%r14", "%r15"
+#endif /* X86_32 */
+
#define CLI_STRING \
- _paravirt_alt("pushl %%ecx; pushl %%edx;" \
+ _paravirt_alt(SAVE_REGS \
"call *%[paravirt_cli_opptr];" \
- "popl %%edx; popl %%ecx", \
+ RESTORE_REGS, \
"%c[paravirt_cli_type]", "%c[paravirt_clobber]")

+
#define STI_STRING \
- _paravirt_alt("pushl %%ecx; pushl %%edx;" \
+ _paravirt_alt(SAVE_REGS \
"call *%[paravirt_sti_opptr];" \
- "popl %%edx; popl %%ecx", \
+ RESTORE_REGS, \
"%c[paravirt_sti_type]", "%c[paravirt_clobber]")

-#define CLI_STI_CLOBBERS , "%eax"
-#define CLI_STI_INPUT_ARGS \
- , \
- [paravirt_cli_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_disable)), \
+
+#define CLI_STI_INPUT_ARGS \
+ , \
+ [paravirt_cli_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_disable)),\
[paravirt_cli_opptr] "m" (pv_irq_ops.irq_disable), \
- [paravirt_sti_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_enable)), \
+ [paravirt_sti_type] "i" (PARAVIRT_PATCH(pv_irq_ops.irq_enable)),\
[paravirt_sti_opptr] "m" (pv_irq_ops.irq_enable), \
paravirt_clobber(CLBR_EAX)

+
+
/* Make sure as little as possible of this mess escapes. */
#undef PARAVIRT_CALL
#undef __PVOP_CALL
@@ -1106,48 +1368,80 @@ static inline unsigned long __raw_local_irq_save(void)
#undef PVOP_CALL3
#undef PVOP_VCALL4
#undef PVOP_CALL4
+#undef PV_SAVE_REGS
+#undef PV_RESTORE_REGS

#else /* __ASSEMBLY__ */

-#define PARA_PATCH(struct, off) ((PARAVIRT_PATCH_##struct + (off)) / 4)
-
-#define PARA_SITE(ptype, clobbers, ops) \
+#define _PARA_SITE(ptype, clobbers, ops, word) \
771:; \
ops; \
772:; \
.pushsection .parainstructions,"a"; \
- .long 771b; \
+ .align 8; \
+ word 771b; \
.byte ptype; \
.byte 772b-771b; \
.short clobbers; \
.popsection

+#ifdef CONFIG_X86_64
+#define PV_SAVE_REGS pushq %rax; pushq %rdi; pushq %rcx; pushq %rdx
+#define PV_RESTORE_REGS popq %rdx; popq %rcx; popq %rdi; popq %rax
+#define PARA_PATCH(struct, off) ((PARAVIRT_PATCH_##struct + (off)) / 8)
+#define PARA_SITE(ptype, clobbers, ops) _PARA_SITE(ptype, clobbers, ops, .quad)
+#else
+#define PV_SAVE_REGS pushl %eax; pushl %edi; pushl %ecx; pushl %edx
+#define PV_RESTORE_REGS popl %edx; popl %ecx; popl %edi; popl %eax
+#define PARA_PATCH(struct, off) ((PARAVIRT_PATCH_##struct + (off)) / 4)
+#define PARA_SITE(ptype, clobbers, ops) _PARA_SITE(ptype, clobbers, ops, .long)
+#endif
+
#define INTERRUPT_RETURN \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE, \
jmp *%cs:pv_cpu_ops+PV_CPU_iret)

#define DISABLE_INTERRUPTS(clobbers) \
PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
- pushl %eax; pushl %ecx; pushl %edx; \
+ PV_SAVE_REGS; \
call *%cs:pv_irq_ops+PV_IRQ_irq_disable; \
- popl %edx; popl %ecx; popl %eax) \
+ PV_RESTORE_REGS;)

#define ENABLE_INTERRUPTS(clobbers) \
PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_enable), clobbers, \
- pushl %eax; pushl %ecx; pushl %edx; \
+ PV_SAVE_REGS; \
call *%cs:pv_irq_ops+PV_IRQ_irq_enable; \
- popl %edx; popl %ecx; popl %eax)
+ PV_RESTORE_REGS;)

#define ENABLE_INTERRUPTS_SYSCALL_RET \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_syscall_ret),\
CLBR_NONE, \
jmp *%cs:pv_cpu_ops+PV_CPU_irq_enable_syscall_ret)

+#ifdef CONFIG_X86_32
#define GET_CR0_INTO_EAX \
push %ecx; push %edx; \
call *pv_cpu_ops+PV_CPU_read_cr0; \
pop %edx; pop %ecx

+#else
+/* These are x86_64-specific */
+#define SWAPGS \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE, \
+ pushq %rax; pushq %rdi; pushq %rcx; pushq %rdx; \
+ call *pv_cpu_ops+PV_CPU_swapgs; \
+ popq %rdx; popq %rcx; popq %rdi; popq %rax; \
+ )
+
+#define GET_CR2_INTO_RCX \
+ call *pv_mmu_ops+PV_MMU_read_cr2; \
+ movq %rax, %rcx; \
+ xorq %rax, %rax;
+
+#endif
+
+#undef WSIZE
+
#endif /* __ASSEMBLY__ */
#endif /* CONFIG_PARAVIRT */
#endif /* __ASM_PARAVIRT_H */
--
1.4.4.2

2007-10-31 22:08:03

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 16/16] make vsmp a paravirt client

This patch makes vsmp a paravirt client. It now uses the whole
infrastructure provided by pvops. When we detect we're running on
a vsmp box, we change the irq-related paravirt operations (and so,
this has to happen quite early) and the patching function.
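
For readers unfamiliar with the platform: vSMP advertises "interrupts
logically disabled" through the EFLAGS.AC bit in addition to IF, which is
why the save_fl/restore_fl replacements below translate between the two
bits. A minimal sketch of that mapping (illustration only -- the real
functions are in the vsmp_64.c hunk further down, and the exact flag
protocol is how I read the hardware interface):

	unsigned long flags = native_save_fl();

	if (flags & X86_EFLAGS_IF)	/* enabled: keep AC clear */
		flags &= ~X86_EFLAGS_AC;
	else				/* disabled: advertise it via AC too */
		flags |= X86_EFLAGS_AC;
	native_restore_fl(flags);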

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/Kconfig.x86_64 | 3 +-
arch/x86/kernel/setup_64.c | 3 ++
arch/x86/kernel/vsmp_64.c | 72 +++++++++++++++++++++++++++++++++++++++----
include/asm-x86/setup.h | 3 +-
4 files changed, 71 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig.x86_64 b/arch/x86/Kconfig.x86_64
index 04734dd..544bad5 100644
--- a/arch/x86/Kconfig.x86_64
+++ b/arch/x86/Kconfig.x86_64
@@ -148,15 +148,14 @@ config X86_PC
bool "PC-compatible"
help
Choose this option if your computer is a standard PC or compatible.
-
config X86_VSMP
bool "Support for ScaleMP vSMP"
depends on PCI
+ select PARAVIRT
help
Support for ScaleMP vSMP systems. Say 'Y' here if this kernel is
supposed to run on these EM64T-based machines. Only choose this option
if you have one of these machines.
-
endchoice

choice
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 44a11e3..c522549 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -335,6 +335,9 @@ void __init setup_arch(char **cmdline_p)

init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));

+#ifdef CONFIG_VSMP
+ vsmp_init();
+#endif
dmi_scan_machine();

#ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/vsmp_64.c b/arch/x86/kernel/vsmp_64.c
index 414caf0..547d3b3 100644
--- a/arch/x86/kernel/vsmp_64.c
+++ b/arch/x86/kernel/vsmp_64.c
@@ -8,18 +8,70 @@
*
* Ravikiran Thirumalai <[email protected]>,
* Shai Fultheim <[email protected]>
+ * Paravirt ops integration: Glauber de Oliveira Costa <[email protected]>
*/
-
#include <linux/init.h>
#include <linux/pci_ids.h>
#include <linux/pci_regs.h>
#include <asm/pci-direct.h>
#include <asm/io.h>
+#include <asm/paravirt.h>
+
+/*
+ * Interrupt control for the VSMP architecture:
+ */
+
+static inline unsigned long vsmp_save_fl(void)
+{
+ unsigned long flags = native_save_fl();
+
+ if (flags & X86_EFLAGS_IF)
+ return X86_EFLAGS_IF;
+ return 0;
+}

-static int __init vsmp_init(void)
+static inline void vsmp_restore_fl(unsigned long flags)
+{
+ if (flags & X86_EFLAGS_IF)
+ flags &= ~X86_EFLAGS_AC;
+ if (!(flags & X86_EFLAGS_IF))
+ flags |= X86_EFLAGS_AC;
+ native_restore_fl(flags);
+}
+
+static inline void vsmp_irq_disable(void)
+{
+ unsigned long flags = native_save_fl();
+
+ vsmp_restore_fl((flags & ~X86_EFLAGS_IF));
+}
+
+static inline void vsmp_irq_enable(void)
+{
+ unsigned long flags = native_save_fl();
+
+ vsmp_restore_fl((flags | X86_EFLAGS_IF));
+}
+
+static unsigned __init vsmp_patch(u8 type, u16 clobbers, void *ibuf,
+ unsigned long addr, unsigned len)
+{
+ switch (type) {
+ case PARAVIRT_PATCH(pv_irq_ops.irq_enable):
+ case PARAVIRT_PATCH(pv_irq_ops.irq_disable):
+ case PARAVIRT_PATCH(pv_irq_ops.save_fl):
+ case PARAVIRT_PATCH(pv_irq_ops.restore_fl):
+ return paravirt_patch_default(type, clobbers, ibuf, addr, len);
+ default:
+ return native_patch(type, clobbers, ibuf, addr, len);
+ }
+
+}
+
+int __init vsmp_init(void)
{
void *address;
- unsigned int cap, ctl;
+ unsigned int cap, ctl, cfg;

if (!early_pci_allowed())
return 0;
@@ -29,8 +81,16 @@ static int __init vsmp_init(void)
(read_pci_config_16(0, 0x1f, 0, PCI_DEVICE_ID) != PCI_DEVICE_ID_SCALEMP_VSMP_CTL))
return 0;

+ /* If we are, use the vSMP-specific irq functions */
+ pv_irq_ops.irq_disable = vsmp_irq_disable;
+ pv_irq_ops.irq_enable = vsmp_irq_enable;
+ pv_irq_ops.save_fl = vsmp_save_fl;
+ pv_irq_ops.restore_fl = vsmp_restore_fl;
+ pv_init_ops.patch = vsmp_patch;
+
/* set vSMP magic bits to indicate vSMP capable kernel */
- address = ioremap(read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0), 8);
+ cfg = read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0);
+ address = early_ioremap(cfg, 8);
cap = readl(address);
ctl = readl(address + 4);
printk("vSMP CTL: capabilities:0x%08x control:0x%08x\n", cap, ctl);
@@ -42,8 +102,6 @@ static int __init vsmp_init(void)
printk("vSMP CTL: control set to:0x%08x\n", ctl);
}

- iounmap(address);
+ early_iounmap(address, 8);
return 0;
}
-
-core_initcall(vsmp_init);
diff --git a/include/asm-x86/setup.h b/include/asm-x86/setup.h
index 071e054..dd7996c 100644
--- a/include/asm-x86/setup.h
+++ b/include/asm-x86/setup.h
@@ -58,7 +58,8 @@ void __init add_memory_region(unsigned long long start,

extern unsigned long init_pg_tables_end;

-
+/* For EM64T-based VSMP machines */
+int vsmp_init(void);

#endif /* __i386__ */
#endif /* _SETUP */
--
1.4.4.2

2007-10-31 22:18:00

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 9/16] This patch adds provisions for time-related functions so they

can be later replaced by paravirt versions.

It basically encloses {g,s}et_wallclock inside the
already existing functions update_persistent_clock and
read_persistent_clock, and defines {g,s}et_wallclock
to be the cores of those functions.

It also allows for later-in-the-game time initialization, as done
by i386. Paravirt guests can set a function to do their own
initialization this way.
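
A paravirt guest then only has to point the wallclock hooks at its own
implementations. Roughly (the pv_time_ops member names below are an
assumption based on the i386 pvops layout, and both the hypercall and
lguest_time_init are purely hypothetical placeholders):

	static unsigned long lguest_get_wallclock(void)
	{
		/* ask the host for the time instead of touching the CMOS */
		return hcall_get_wallclock();		/* hypothetical hypercall */
	}

	/* at guest setup time: */
	pv_time_ops.get_wallclock = lguest_get_wallclock;
	pv_time_ops.time_init = lguest_time_init;	/* runs as late_time_init */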

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/time_64.c | 42 ++++++++++++++++++++++++++++--------------
include/asm-x86/time.h | 26 ++++++++++++++++++++++----
2 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/time_64.c b/arch/x86/kernel/time_64.c
index c821edc..85ec401 100644
--- a/arch/x86/kernel/time_64.c
+++ b/arch/x86/kernel/time_64.c
@@ -33,9 +33,12 @@
#include <acpi/acpi_bus.h>
#endif
#include <asm/i8253.h>
+#include <asm/arch_hooks.h>
#include <asm/pgtable.h>
#include <asm/vsyscall.h>
+#include <asm/time.h>
#include <asm/timex.h>
+#include <asm/timer.h>
#include <asm/proto.h>
#include <asm/hpet.h>
#include <asm/sections.h>
@@ -77,18 +80,12 @@ EXPORT_SYMBOL(profile_pc);
* sheet for details.
*/

-static int set_rtc_mmss(unsigned long nowtime)
+int do_set_rtc_mmss(unsigned long nowtime)
{
int retval = 0;
int real_seconds, real_minutes, cmos_minutes;
unsigned char control, freq_select;

-/*
- * IRQs are disabled when we're called from the timer interrupt,
- * no need for spin_lock_irqsave()
- */
-
- spin_lock(&rtc_lock);

/*
* Tell the clock it's being set and stop it.
@@ -138,14 +135,22 @@ static int set_rtc_mmss(unsigned long nowtime)
CMOS_WRITE(control, RTC_CONTROL);
CMOS_WRITE(freq_select, RTC_FREQ_SELECT);

- spin_unlock(&rtc_lock);
-
return retval;
}

int update_persistent_clock(struct timespec now)
{
- return set_rtc_mmss(now.tv_sec);
+ int retval;
+
+/*
+ * IRQs are disabled when we're called from the timer interrupt,
+ * no need for spin_lock_irqsave()
+ */
+ spin_lock(&rtc_lock);
+ retval = set_wallclock(now.tv_sec);
+ spin_unlock(&rtc_lock);
+
+ return retval;
}

static irqreturn_t timer_event_interrupt(int irq, void *dev_id)
@@ -157,7 +162,7 @@ static irqreturn_t timer_event_interrupt(int irq, void *dev_id)
return IRQ_HANDLED;
}

-unsigned long read_persistent_clock(void)
+unsigned long do_get_cmos_time(void)
{
unsigned int year, mon, day, hour, min, sec;
unsigned long flags;
@@ -208,10 +213,15 @@ unsigned long read_persistent_clock(void)
return mktime(year, mon, day, hour, min, sec);
}

+unsigned long read_persistent_clock(void)
+{
+ return get_wallclock();
+}
+
/* calibrate_cpu is used on systems with fixed rate TSCs to determine
* processor frequency */
#define TICK_COUNT 100000000
-static unsigned int __init tsc_calibrate_cpu_khz(void)
+unsigned long __init native_calculate_cpu_khz(void)
{
int tsc_start, tsc_now;
int i, no_ctr_free;
@@ -261,20 +271,23 @@ static struct irqaction irq0 = {
.name = "timer"
};

-void __init time_init(void)
+void __init hpet_time_init(void)
{
if (!hpet_enable())
setup_pit_timer();

setup_irq(0, &irq0);
+}

+void __init time_init(void)
+{
tsc_calibrate();

cpu_khz = tsc_khz;
if (cpu_has(&boot_cpu_data, X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
boot_cpu_data.x86 == 16)
- cpu_khz = tsc_calibrate_cpu_khz();
+ cpu_khz = calculate_cpu_khz();

if (unsynchronized_tsc())
mark_tsc_unstable("TSCs unsynchronized");
@@ -287,4 +300,5 @@ void __init time_init(void)
printk(KERN_INFO "time.c: Detected %d.%03d MHz processor.\n",
cpu_khz / 1000, cpu_khz % 1000);
init_tsc_clocksource();
+ late_time_init = choose_time_init();
}
diff --git a/include/asm-x86/time.h b/include/asm-x86/time.h
index eac0113..0d18c78 100644
--- a/include/asm-x86/time.h
+++ b/include/asm-x86/time.h
@@ -1,6 +1,10 @@
-#ifndef _ASMi386_TIME_H
-#define _ASMi386_TIME_H
+#ifndef _ASMX86_TIME_H
+#define _ASMX86_TIME_H

+extern void (*late_time_init)(void);
+extern void hpet_time_init(void);
+
+#ifndef CONFIG_X86_64
#include <linux/efi.h>
#include "mach_time.h"

@@ -28,8 +32,22 @@ static inline int native_set_wallclock(unsigned long nowtime)
return retval;
}

-extern void (*late_time_init)(void);
-extern void hpet_time_init(void);
+#else
+extern unsigned long do_get_cmos_time(void);
+extern int do_set_rtc_mmss(unsigned long nowtime);
+extern void native_time_init_hook(void);
+
+static inline unsigned long native_get_wallclock(void)
+{
+ return do_get_cmos_time();
+}
+
+static inline int native_set_wallclock(unsigned long nowtime)
+{
+ return do_set_rtc_mmss(nowtime);
+}
+
+#endif

#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
--
1.4.4.2

2007-10-31 22:21:50

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 12/16] tweak io_64.h for paravirt.

We need something here because we can't call the in and out instructions
directly. However, we have to be careful: no indirections are allowed in
misc_64.c, and paravirt_ops is exactly that. So we just call the native
version directly there.
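
With this in place, an out*_p helper ends up expanding to roughly the
following (hand-expanded sketch of the macros for the outb case; the real
definitions stay macro-generated):

	static inline void outb_p(unsigned char value, unsigned short port)
	{
		asm volatile("outb %b0,%w1" : : "a" (value), "Nd" (port));
		slow_down_io();	/* pvops hook, or native_io_delay() without paravirt */
	}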

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/boot/compressed/misc_64.c | 6 +++++
include/asm-x86/io_64.h | 37 +++++++++++++++++++++++++++++------
2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/misc_64.c b/arch/x86/boot/compressed/misc_64.c
index 6ea015a..6640a17 100644
--- a/arch/x86/boot/compressed/misc_64.c
+++ b/arch/x86/boot/compressed/misc_64.c
@@ -9,6 +9,12 @@
* High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996
*/

+/*
+ * We have to be careful here, because no indirections are allowed, and
+ * paravirt_ops is exactly that. As this code only runs on bare metal
+ * anyway, we simply keep the indirection from happening.
+ */
+#undef CONFIG_PARAVIRT
#define _LINUX_STRING_H_ 1
#define __LINUX_BITMAP_H 1

diff --git a/include/asm-x86/io_64.h b/include/asm-x86/io_64.h
index a037b07..57fcdd9 100644
--- a/include/asm-x86/io_64.h
+++ b/include/asm-x86/io_64.h
@@ -35,12 +35,24 @@
* - Arnaldo Carvalho de Melo <[email protected]>
*/

-#define __SLOW_DOWN_IO "\noutb %%al,$0x80"
+static inline void native_io_delay(void)
+{
+ asm volatile("outb %%al,$0x80" : : : "memory");
+}

-#ifdef REALLY_SLOW_IO
-#define __FULL_SLOW_DOWN_IO __SLOW_DOWN_IO __SLOW_DOWN_IO __SLOW_DOWN_IO __SLOW_DOWN_IO
+#if defined(CONFIG_PARAVIRT)
+#include <asm/paravirt.h>
#else
-#define __FULL_SLOW_DOWN_IO __SLOW_DOWN_IO
+
+static inline void slow_down_io(void)
+{
+ native_io_delay();
+#ifdef REALLY_SLOW_IO
+ native_io_delay();
+ native_io_delay();
+ native_io_delay();
+#endif
+}
#endif

/*
@@ -52,9 +64,15 @@ static inline void out##s(unsigned x value, unsigned short port) {
#define __OUT2(s,s1,s2) \
__asm__ __volatile__ ("out" #s " %" s1 "0,%" s2 "1"

+#ifndef REALLY_SLOW_IO
+#define REALLY_SLOW_IO
+#define UNSET_REALLY_SLOW_IO
+#endif
+
#define __OUT(s,s1,x) \
__OUT1(s,x) __OUT2(s,s1,"w") : : "a" (value), "Nd" (port)); } \
-__OUT1(s##_p,x) __OUT2(s,s1,"w") __FULL_SLOW_DOWN_IO : : "a" (value), "Nd" (port));} \
+__OUT1(s##_p, x) __OUT2(s, s1, "w") : : "a" (value), "Nd" (port)); \
+ slow_down_io(); }

#define __IN1(s) \
static inline RETURN_TYPE in##s(unsigned short port) { RETURN_TYPE _v;
@@ -63,8 +81,13 @@ static inline RETURN_TYPE in##s(unsigned short port) { RETURN_TYPE _v;
__asm__ __volatile__ ("in" #s " %" s2 "1,%" s1 "0"

#define __IN(s,s1,i...) \
-__IN1(s) __IN2(s,s1,"w") : "=a" (_v) : "Nd" (port) ,##i ); return _v; } \
-__IN1(s##_p) __IN2(s,s1,"w") __FULL_SLOW_DOWN_IO : "=a" (_v) : "Nd" (port) ,##i ); return _v; } \
+__IN1(s) __IN2(s, s1, "w") : "=a" (_v) : "Nd" (port), ##i); return _v; } \
+__IN1(s##_p) __IN2(s, s1, "w") : "=a" (_v) : "Nd" (port), ##i); \
+ slow_down_io(); return _v; }
+
+#ifdef UNSET_REALLY_SLOW_IO
+#undef REALLY_SLOW_IO
+#endif

#define __INS(s) \
static inline void ins##s(unsigned short port, void * addr, unsigned long count) \
--
1.4.4.2

2007-10-31 23:00:43

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 4/16] provide native irq initialization function

The interrupt initialization routine becomes native_init_IRQ and will
be overridden later in case paravirt is on. The interrupt array is made
visible for guests such as lguest, which will need their own initialization
mechanism (though using most of the same irq lines) later on.
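
The weak alias means the native routine is used unless something links in
a strong init_IRQ of its own; a guest override would look roughly like the
sketch below (simplified -- the real thing also has to skip reserved
vectors, and lguest's actual code may differ):

	/* in guest code: a strong definition simply wins over the weak alias */
	void __init init_IRQ(void)
	{
		unsigned int i;

		/* reuse the stub array now exported by i8259_64.c */
		for (i = FIRST_EXTERNAL_VECTOR; i < NR_VECTORS; i++)
			set_intr_gate(i, interrupt[i - FIRST_EXTERNAL_VECTOR]);
	}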

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/i8259_64.c | 7 +++++--
include/asm-x86/irq_64.h | 3 +++
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/i8259_64.c b/arch/x86/kernel/i8259_64.c
index 3f27ea0..892eab8 100644
--- a/arch/x86/kernel/i8259_64.c
+++ b/arch/x86/kernel/i8259_64.c
@@ -76,7 +76,7 @@ BUILD_16_IRQS(0xc) BUILD_16_IRQS(0xd) BUILD_16_IRQS(0xe) BUILD_16_IRQS(0xf)
IRQ(x,c), IRQ(x,d), IRQ(x,e), IRQ(x,f)

/* for the irq vectors */
-static void (*interrupt[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = {
+void (*interrupt[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = {
IRQLIST_16(0x2), IRQLIST_16(0x3),
IRQLIST_16(0x4), IRQLIST_16(0x5), IRQLIST_16(0x6), IRQLIST_16(0x7),
IRQLIST_16(0x8), IRQLIST_16(0x9), IRQLIST_16(0xa), IRQLIST_16(0xb),
@@ -448,7 +448,10 @@ void __init init_ISA_irqs (void)
}
}

-void __init init_IRQ(void)
+/* Overridden in paravirt.c */
+void init_IRQ(void) __attribute__((weak, alias("native_init_IRQ")));
+
+void __init native_init_IRQ(void)
{
int i;

diff --git a/include/asm-x86/irq_64.h b/include/asm-x86/irq_64.h
index 5006c6e..4f02446 100644
--- a/include/asm-x86/irq_64.h
+++ b/include/asm-x86/irq_64.h
@@ -46,6 +46,9 @@ static __inline__ int irq_canonicalize(int irq)
extern void fixup_irqs(cpumask_t map);
#endif

+#include <linux/init.h>
+void native_init_IRQ(void);
+
#define __ARCH_HAS_DO_SOFTIRQ 1

#endif /* _ASM_IRQ_H */
--
1.4.4.2

2007-10-31 23:01:02

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 7/16] native versions for set pagetables

This patch turns the set_p{te,md,ud,gd} functions into their
native_ versions. There is no need to patch any caller.

Also, it adds pte_update() and pte_update_defer() calls whenever
we modify a page table entry. This last part was coded to match
i386 as closely as possible.

Pieces of the header are moved to below the #ifdef CONFIG_PARAVIRT
site, as they are users of the newly defined set_* macros.
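
The pte_update()/pte_update_defer() hooks are no-ops on bare metal, but
they give a hypervisor a chance to see (and possibly batch) every page
table modification. A guest would wire them up roughly like this (the
pv_mmu_ops member names are assumed from the i386 layout, and the queueing
helper is hypothetical):

	static void myhv_pte_update(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep)
	{
		/* queue the changed pte for the hypervisor to validate */
		myhv_queue_update(__pa(ptep), pte_val(*ptep));	/* hypothetical */
	}

	pv_mmu_ops.pte_update = myhv_pte_update;
	pv_mmu_ops.pte_update_defer = myhv_pte_update;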

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/pgtable_64.h | 192 ++++++++++++++++++++++++++++--------------
1 files changed, 128 insertions(+), 64 deletions(-)

diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
index 9b0ff47..592d613 100644
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -57,56 +57,107 @@ extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
*/
#define PTRS_PER_PTE 512

-#ifndef __ASSEMBLY__
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+
+#define set_pte native_set_pte
+#define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)
+#define set_pmd native_set_pmd
+#define set_pud native_set_pud
+#define set_pgd native_set_pgd
+#define pte_clear(mm, addr, xp) \
+do { \
+ set_pte_at(mm, addr, xp, __pte(0)); \
+} while (0)

-#define pte_ERROR(e) \
- printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
-#define pmd_ERROR(e) \
- printk("%s:%d: bad pmd %p(%016lx).\n", __FILE__, __LINE__, &(e), pmd_val(e))
-#define pud_ERROR(e) \
- printk("%s:%d: bad pud %p(%016lx).\n", __FILE__, __LINE__, &(e), pud_val(e))
-#define pgd_ERROR(e) \
- printk("%s:%d: bad pgd %p(%016lx).\n", __FILE__, __LINE__, &(e), pgd_val(e))
+#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
+#define pud_clear native_pud_clear
+#define pgd_clear native_pgd_clear
+#define pte_update(mm, addr, ptep) do { } while (0)
+#define pte_update_defer(mm, addr, ptep) do { } while (0)

-#define pgd_none(x) (!pgd_val(x))
-#define pud_none(x) (!pud_val(x))
+#endif

-static inline void set_pte(pte_t *dst, pte_t val)
+#ifndef __ASSEMBLY__
+
+static inline void native_set_pte(pte_t *dst, pte_t val)
{
- pte_val(*dst) = pte_val(val);
+ dst->pte = pte_val(val);
}
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)

-static inline void set_pmd(pmd_t *dst, pmd_t val)
+static inline void native_set_pmd(pmd_t *dst, pmd_t val)
{
- pmd_val(*dst) = pmd_val(val);
+ dst->pmd = pmd_val(val);
}

-static inline void set_pud(pud_t *dst, pud_t val)
+static inline void native_set_pud(pud_t *dst, pud_t val)
{
- pud_val(*dst) = pud_val(val);
+ dst->pud = pud_val(val);
}

-static inline void pud_clear (pud_t *pud)
+static inline void native_set_pgd(pgd_t *dst, pgd_t val)
{
- set_pud(pud, __pud(0));
+ dst->pgd = pgd_val(val);
}

-static inline void set_pgd(pgd_t *dst, pgd_t val)
+static inline void native_pud_clear(pud_t *pud)
{
- pgd_val(*dst) = pgd_val(val);
-}
+ set_pud(pud, __pud(0));
+}

-static inline void pgd_clear (pgd_t * pgd)
+static inline void native_pgd_clear(pgd_t *pgd)
{
set_pgd(pgd, __pgd(0));
}

-#define ptep_get_and_clear(mm,addr,xp) __pte(xchg(&(xp)->pte, 0))
+static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pteval)
+{
+ native_set_pte(ptep, pteval);
+}
+
+static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep)
+{
+ native_set_pte_at(mm, addr, ptep, __pte(0));
+}
+
+static inline void native_pmd_clear(pmd_t *pmd)
+{
+ native_set_pmd(pmd, __pmd(0));
+}
+
+
+#define pte_ERROR(e) \
+ printk("%s:%d: bad pte %p(%016llx).\n", \
+ __FILE__, __LINE__, &(e), (u64)pte_val(e))
+#define pmd_ERROR(e) \
+ printk("%s:%d: bad pmd %p(%016llx).\n", \
+ __FILE__, __LINE__, &(e), (u64)pmd_val(e))
+#define pud_ERROR(e) \
+ printk("%s:%d: bad pud %p(%016llx).\n", \
+ __FILE__, __LINE__, &(e), (u64)pud_val(e))
+#define pgd_ERROR(e) \
+ printk("%s:%d: bad pgd %p(%016llx).\n", \
+ __FILE__, __LINE__, &(e), (u64)pgd_val(e))
+
+#define pgd_none(x) (!pgd_val(x))
+#define pud_none(x) (!pud_val(x))

struct mm_struct;

-static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, unsigned long addr, pte_t *ptep, int full)
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ pte_t pte = __pte(xchg(&ptep->pte, 0));
+ pte_update(mm, addr, ptep);
+ return pte;
+}
+
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep,
+ int full)
{
pte_t pte;
if (full) {
@@ -246,7 +297,6 @@ static inline unsigned long pmd_bad(pmd_t pmd)

#define pte_none(x) (!pte_val(x))
#define pte_present(x) (pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE))
-#define pte_clear(mm,addr,xp) do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)

#define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT)) /* FIXME: is this
right? */
@@ -255,11 +305,11 @@ static inline unsigned long pmd_bad(pmd_t pmd)

static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
{
- pte_t pte;
- pte_val(pte) = (page_nr << PAGE_SHIFT);
- pte_val(pte) |= pgprot_val(pgprot);
- pte_val(pte) &= __supported_pte_mask;
- return pte;
+ unsigned long pte;
+ pte = (page_nr << PAGE_SHIFT);
+ pte |= pgprot_val(pgprot);
+ pte &= __supported_pte_mask;
+ return __pte(pte);
}

/*
@@ -283,30 +333,6 @@ static inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) |
static inline pte_t pte_mkhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_PSE)); return pte; }
static inline pte_t pte_clrhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_PSE)); return pte; }

-struct vm_area_struct;
-
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
- if (!pte_young(*ptep))
- return 0;
- return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte);
-}
-
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
- clear_bit(_PAGE_BIT_RW, &ptep->pte);
-}
-
-/*
- * Macro to mark a page protection value as "uncacheable".
- */
-#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))
-
-static inline int pmd_large(pmd_t pte) {
- return (pmd_val(pte) & __LARGE_PTE) == __LARGE_PTE;
-}
-
-
/*
* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
@@ -340,7 +366,6 @@ static inline int pmd_large(pmd_t pte) {
pmd_index(address))
#define pmd_none(x) (!pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
-#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
#define pfn_pmd(nr,prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val(prot)))
#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)

@@ -352,15 +377,53 @@ static inline int pmd_large(pmd_t pte) {

/* page, protection -> pte */
#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot))
-#define mk_pte_huge(entry) (pte_val(entry) |= _PAGE_PRESENT | _PAGE_PSE)
-
+
+static inline pte_t __mk_pte_huge(pte_t entry)
+{
+ unsigned long pte;
+ pte = pte_val(entry);
+ pte |= _PAGE_PRESENT | _PAGE_PSE;
+ return __pte(pte);
+}
+#define mk_pte_huge(entry) ((entry) = __mk_pte_huge(entry))
+
+#include <linux/mm_types.h>
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ int ret = 0;
+ if (!pte_young(*ptep))
+ return 0;
+ ret = test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte);
+ pte_update(vma->vm_mm, addr, ptep);
+ return ret;
+}
+
+static inline void ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ clear_bit(_PAGE_BIT_RW, &ptep->pte);
+ pte_update(mm, addr, ptep);
+}
+
+/*
+ * Macro to mark a page protection value as "uncacheable".
+ */
+#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))
+
+static inline int pmd_large(pmd_t pte)
+{
+ return (pmd_val(pte) & __LARGE_PTE) == __LARGE_PTE;
+}
+
/* Change flags of a PTE */
-static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+static inline pte_t pte_modify(pte_t pte_old, pgprot_t newprot)
{
- pte_val(pte) &= _PAGE_CHG_MASK;
- pte_val(pte) |= pgprot_val(newprot);
- pte_val(pte) &= __supported_pte_mask;
- return pte;
+ unsigned long pte = pte_val(pte_old);
+ pte &= _PAGE_CHG_MASK;
+ pte |= pgprot_val(newprot);
+ pte &= __supported_pte_mask;
+ return __pte(pte);
}

#define pte_index(address) \
@@ -387,6 +450,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
int __changed = !pte_same(*(__ptep), __entry); \
if (__changed && __dirty) { \
set_pte(__ptep, __entry); \
+ pte_update_defer((__vma)->vm_mm, (__address), (__ptep)); \
flush_tlb_page(__vma, __address); \
} \
__changed; \
--
1.4.4.2

2007-10-31 23:01:26

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 8/16] add native functions for descriptors handling

This patch turns the basic descriptor handling into native_
functions. It is basically write_idt, load_idt, write_gdt,
load_gdt, set_ldt, store_tr, load_tls, and the ones
for updating a single entry.

In the process of doing that, we change the definition of
load_LDT_nolock, and its callers have to be patched. We
also patch call sites that now need a typecast.
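
The point of funnelling everything through write_{gdt,idt,ldt}_entry is
that a hypervisor can intercept descriptor updates instead of the kernel
poking the tables directly. Roughly (member names as introduced in the
paravirt.h changes of this series; the validation helper is hypothetical):

	static void myhv_write_gdt_entry(void *ptr, void *entry,
					 unsigned type, unsigned size)
	{
		/* forward the new descriptor to the host instead of a memcpy */
		myhv_update_descriptor(__pa(ptr), entry, size);	/* hypothetical */
	}

	pv_cpu_ops.write_gdt_entry = myhv_write_gdt_entry;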

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/ldt_64.c | 7 +-
include/asm-x86/desc_64.h | 172 +++++++++++++++++++++++++-------------
include/asm-x86/mmu_context_64.h | 23 ++++--
3 files changed, 132 insertions(+), 70 deletions(-)

diff --git a/arch/x86/kernel/ldt_64.c b/arch/x86/kernel/ldt_64.c
index 60e57ab..ab1ff6e 100644
--- a/arch/x86/kernel/ldt_64.c
+++ b/arch/x86/kernel/ldt_64.c
@@ -171,7 +171,7 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)
{
struct task_struct *me = current;
struct mm_struct * mm = me->mm;
- __u32 entry_1, entry_2, *lp;
+ __u32 entry_1, entry_2;
int error;
struct user_desc ldt_info;

@@ -200,7 +200,6 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)
goto out_unlock;
}

- lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.ldt);

/* Allow LDTs to be cleared by the user. */
if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
@@ -218,8 +217,8 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)

/* Install the new entry ... */
install:
- *lp = entry_1;
- *(lp+1) = entry_2;
+ write_ldt_entry(mm->context.ldt, ldt_info.entry_number,
+ entry_1, entry_2);
error = 0;

out_unlock:
diff --git a/include/asm-x86/desc_64.h b/include/asm-x86/desc_64.h
index 7d9c938..f7008bb 100644
--- a/include/asm-x86/desc_64.h
+++ b/include/asm-x86/desc_64.h
@@ -16,20 +16,18 @@

extern struct desc_struct cpu_gdt_table[GDT_ENTRIES];

-#define load_TR_desc() asm volatile("ltr %w0"::"r" (GDT_ENTRY_TSS*8))
-#define load_LDT_desc() asm volatile("lldt %w0"::"r" (GDT_ENTRY_LDT*8))
-#define clear_LDT() asm volatile("lldt %w0"::"r" (0))
+static inline void native_load_tr_desc(void)
+{
+ asm volatile("ltr %w0"::"r" (GDT_ENTRY_TSS*8));
+}

-static inline unsigned long __store_tr(void)
+static inline unsigned long native_store_tr(void)
{
unsigned long tr;
-
asm volatile ("str %w0":"=r" (tr));
return tr;
}

-#define store_tr(tr) (tr) = __store_tr()
-
/*
* This is the ldt that every process will get unless we need
* something other than this.
@@ -41,62 +39,51 @@ extern struct desc_ptr cpu_gdt_descr[];
/* the cpu gdt accessor */
#define cpu_gdt(_cpu) ((struct desc_struct *)cpu_gdt_descr[_cpu].address)

-static inline void load_gdt(const struct desc_ptr *ptr)
+static inline void native_load_gdt(const struct desc_ptr *ptr)
{
asm volatile("lgdt %w0"::"m" (*ptr));
}

-static inline void store_gdt(struct desc_ptr *ptr)
+static inline void native_store_gdt(struct desc_ptr *ptr)
{
- asm("sgdt %w0":"=m" (*ptr));
+ asm volatile ("sgdt %w0":"=m" (*ptr));
}

-static inline void _set_gate(void *adr, unsigned type, unsigned long func, unsigned dpl, unsigned ist)
+static inline void native_write_idt_entry(void *adr, struct gate_struct *s)
{
- struct gate_struct s;
- s.offset_low = PTR_LOW(func);
- s.segment = __KERNEL_CS;
- s.ist = ist;
- s.p = 1;
- s.dpl = dpl;
- s.zero0 = 0;
- s.zero1 = 0;
- s.type = type;
- s.offset_middle = PTR_MIDDLE(func);
- s.offset_high = PTR_HIGH(func);
- /* does not need to be atomic because it is only done once at setup time */
- memcpy(adr, &s, 16);
-}
-
-static inline void set_intr_gate(int nr, void *func)
-{
- BUG_ON((unsigned)nr > 0xFF);
- _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, 0);
-}
+ /* does not need to be atomic because
+ * it is only done once at setup time */
+ memcpy(adr, s, 16);
+}

-static inline void set_intr_gate_ist(int nr, void *func, unsigned ist)
-{
- BUG_ON((unsigned)nr > 0xFF);
- _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, ist);
-}
+static inline void native_write_gdt_entry(void *ptr, void *entry,
+ unsigned type, unsigned size)
+{
+ memcpy(ptr, entry, size);
+}

-static inline void set_system_gate(int nr, void *func)
-{
- BUG_ON((unsigned)nr > 0xFF);
- _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, 0);
-}
+/*
+ * This one unfortunately can't go with the others below, because it has
+ * a user anxious for its definition: set_tssldt_descriptor
+ */
+#ifndef CONFIG_PARAVIRT
+#define write_gdt_entry(_ptr, _e, _type, _size) \
+ native_write_gdt_entry((_ptr), (_e), (_type), (_size))
+#endif

-static inline void set_system_gate_ist(int nr, void *func, unsigned ist)
+static inline void write_dt_entry(struct desc_struct *dt,
+ int entry, u32 entry_low, u32 entry_high)
{
- _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, ist);
+ ((struct n_desc_struct *)dt)[entry].a = entry_low;
+ ((struct n_desc_struct *)dt)[entry].b = entry_high;
}

-static inline void load_idt(const struct desc_ptr *ptr)
+static inline void native_load_idt(const struct desc_ptr *ptr)
{
asm volatile("lidt %w0"::"m" (*ptr));
}

-static inline void store_idt(struct desc_ptr *dtr)
+static inline void native_store_idt(struct desc_ptr *dtr)
{
asm("sidt %w0":"=m" (*dtr));
}
@@ -114,7 +101,7 @@ static inline void set_tssldt_descriptor(void *ptr, unsigned long tss, unsigned
d.limit1 = (size >> 16) & 0xF;
d.base2 = (PTR_MIDDLE(tss) >> 8) & 0xFF;
d.base3 = PTR_HIGH(tss);
- memcpy(ptr, &d, 16);
+ write_gdt_entry(ptr, &d, type, 16);
}

static inline void set_tss_desc(unsigned cpu, void *addr)
@@ -165,7 +152,7 @@ static inline void set_ldt_desc(unsigned cpu, void *addr, int size)
(info)->useable == 0 && \
(info)->lm == 0)

-static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
+static inline void native_load_tls(struct thread_struct *t, unsigned int cpu)
{
unsigned int i;
u64 *gdt = (u64 *)(cpu_gdt(cpu) + GDT_ENTRY_TLS_MIN);
@@ -173,28 +160,93 @@ static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
gdt[i] = t->tls_array[i];
}
+static inline void native_set_ldt(const void *addr,
+ unsigned int entries)
+{
+ if (likely(entries == 0))
+ __asm__ __volatile__ ("lldt %w0" :: "r" (0));
+ else {
+ unsigned cpu = smp_processor_id();
+
+ set_tssldt_descriptor(&cpu_gdt(cpu)[GDT_ENTRY_LDT],
+ (unsigned long)addr, DESC_LDT,
+ entries * 8 - 1);
+ __asm__ __volatile__ ("lldt %w0"::"r" (GDT_ENTRY_LDT*8));
+ }
+}
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define load_TR_desc() native_load_tr_desc()
+#define load_gdt(ptr) native_load_gdt(ptr)
+#define load_idt(ptr) native_load_idt(ptr)
+#define load_TLS(t, cpu) native_load_tls(t, cpu)
+#define set_ldt(addr, entries) native_set_ldt(addr, entries)
+#define store_tr(tr) (tr) = native_store_tr()
+#define store_gdt(ptr) native_store_gdt(ptr)
+#define store_idt(ptr) native_store_idt(ptr)
+
+#define write_idt_entry(_adr, _s) native_write_idt_entry((_adr), (_s))
+#define write_ldt_entry(_ldt, _number, _entry1, _entry2) \
+ write_dt_entry((_ldt), (_number), (_entry1), (_entry2))
+#endif
+
+static inline void _set_gate(void *adr, unsigned type, unsigned long func,
+ unsigned dpl, unsigned ist)
+{
+ struct gate_struct s;
+ s.offset_low = PTR_LOW(func);
+ s.segment = __KERNEL_CS;
+ s.ist = ist;
+ s.p = 1;
+ s.dpl = dpl;
+ s.zero0 = 0;
+ s.zero1 = 0;
+ s.type = type;
+ s.offset_middle = PTR_MIDDLE(func);
+ s.offset_high = PTR_HIGH(func);
+ write_idt_entry(adr, &s);
+}
+
+static inline void set_intr_gate(int nr, void *func)
+{
+ BUG_ON((unsigned)nr > 0xFF);
+ _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, 0);
+}
+
+static inline void set_intr_gate_ist(int nr, void *func, unsigned ist)
+{
+ BUG_ON((unsigned)nr > 0xFF);
+ _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, ist);
+}
+
+static inline void set_system_gate(int nr, void *func)
+{
+ BUG_ON((unsigned)nr > 0xFF);
+ _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, 0);
+}
+
+static inline void set_system_gate_ist(int nr, void *func, unsigned ist)
+{
+ _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, ist);
+}
+
+#define clear_LDT() set_ldt(NULL, 0)

/*
* load one particular LDT into the current CPU
*/
-static inline void load_LDT_nolock (mm_context_t *pc, int cpu)
+static inline void load_LDT_nolock(mm_context_t *pc)
{
- int count = pc->size;
-
- if (likely(!count)) {
- clear_LDT();
- return;
- }
-
- set_ldt_desc(cpu, pc->ldt, count);
- load_LDT_desc();
+ set_ldt(pc->ldt, pc->size);
}

static inline void load_LDT(mm_context_t *pc)
{
- int cpu = get_cpu();
- load_LDT_nolock(pc, cpu);
- put_cpu();
+ preempt_disable();
+ load_LDT_nolock(pc);
+ preempt_enable();
}

extern struct desc_ptr idt_descr;
diff --git a/include/asm-x86/mmu_context_64.h b/include/asm-x86/mmu_context_64.h
index 0cce83a..841ee33 100644
--- a/include/asm-x86/mmu_context_64.h
+++ b/include/asm-x86/mmu_context_64.h
@@ -7,7 +7,16 @@
#include <asm/pda.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
#include <asm-generic/mm_hooks.h>
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+ struct mm_struct *next)
+{
+}
+#endif /* CONFIG_PARAVIRT */

/*
* possibly do the LDT unload here?
@@ -25,7 +34,7 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)

static inline void load_cr3(pgd_t *pgd)
{
- asm volatile("movq %0,%%cr3" :: "r" (__pa(pgd)) : "memory");
+ write_cr3(__pa(pgd));
}

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
@@ -43,7 +52,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
load_cr3(next->pgd);

if (unlikely(next->context.ldt != prev->context.ldt))
- load_LDT_nolock(&next->context, cpu);
+ load_LDT_nolock(&next->context);
}
#ifdef CONFIG_SMP
else {
@@ -56,7 +65,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
* to make sure to use no freed page tables.
*/
load_cr3(next->pgd);
- load_LDT_nolock(&next->context, cpu);
+ load_LDT_nolock(&next->context);
}
}
#endif
@@ -67,8 +76,10 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
asm volatile("movl %0,%%fs"::"r"(0)); \
} while(0)

-#define activate_mm(prev, next) \
- switch_mm((prev),(next),NULL)
-
+#define activate_mm(prev, next) \
+do { \
+ paravirt_activate_mm(prev, next); \
+ switch_mm((prev), (next), NULL); \
+} while (0)

#endif
--
1.4.4.2

2007-10-31 23:01:41

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 10/16] export cpu_gdt_descr

With paravirtualization, hypervisors need to handle the GDT, which up
to this point was only used in very early initialization code.
Hypervisors (lguest being the current case) are commonly built as
modules, so export the symbol.
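
For illustration only (not part of this patch), a GPL'd hypervisor
module could then locate a CPU's GDT through the exported descriptor
roughly like this; the struct desc_ptr field names are assumed from the
current x86_64 headers, so treat it as a sketch:

#include <linux/kernel.h>
#include <linux/smp.h>
#include <asm/desc.h>

static void dump_host_gdt(void)
{
	int cpu = get_cpu();
	struct desc_ptr *gdtr = &cpu_gdt_descr[cpu];

	/* address/size describe the per-cpu GDT loaded with lgdt */
	printk(KERN_DEBUG "cpu%d: GDT at 0x%lx, limit 0x%x\n",
	       cpu, gdtr->address, (unsigned)gdtr->size);
	put_cpu();
}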

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/x8664_ksyms_64.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index 77c25b3..a005d67 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -8,6 +8,7 @@
#include <asm/processor.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
+#include <asm/desc.h>

EXPORT_SYMBOL(kernel_thread);

@@ -60,3 +61,8 @@ EXPORT_SYMBOL(init_level4_pgt);
EXPORT_SYMBOL(load_gs_index);

EXPORT_SYMBOL(_proxy_pda);
+
+#ifdef CONFIG_PARAVIRT
+/* Virtualized guests may want to use it */
+EXPORT_SYMBOL_GPL(cpu_gdt_descr);
+#endif
--
1.4.4.2

2007-11-01 04:38:47

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 16/16] make vsmp a paravirt client

Glauber de Oliveira Costa wrote:
> This patch makes vsmp a paravirt client. It now uses the whole
> infrastructure provided by pvops. When we detect we're running
> a vsmp box, we change the irq-related paravirt operations (and so,
> it has to happen quite early), and the patching function
>
> Signed-off-by: Glauber de Oliveira Costa <[email protected]>
> Signed-off-by: Steven Rostedt <[email protected]>
> Acked-by: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/Kconfig.x86_64 | 3 +-
> arch/x86/kernel/setup_64.c | 3 ++
> arch/x86/kernel/vsmp_64.c | 72 +++++++++++++++++++++++++++++++++++++++----
> include/asm-x86/setup.h | 3 +-
> 4 files changed, 71 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/Kconfig.x86_64 b/arch/x86/Kconfig.x86_64
> index 04734dd..544bad5 100644
> --- a/arch/x86/Kconfig.x86_64
> +++ b/arch/x86/Kconfig.x86_64
> @@ -148,15 +148,14 @@ config X86_PC
> bool "PC-compatible"
> help
> Choose this option if your computer is a standard PC or compatible.
> -
> config X86_VSMP
> bool "Support for ScaleMP vSMP"
> depends on PCI
> + select PARAVIRT
> help
> Support for ScaleMP vSMP systems. Say 'Y' here if this kernel is
> supposed to run on these EM64T-based machines. Only choose this option
> if you have one of these machines.
> -
> endchoice
>
> choice
> diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
> index 44a11e3..c522549 100644
> --- a/arch/x86/kernel/setup_64.c
> +++ b/arch/x86/kernel/setup_64.c
> @@ -335,6 +335,9 @@ void __init setup_arch(char **cmdline_p)
>
> init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
>
> +#ifdef CONFIG_VSMP
> + vsmp_init();
> +#endif
> dmi_scan_machine();
>
> #ifdef CONFIG_SMP
> diff --git a/arch/x86/kernel/vsmp_64.c b/arch/x86/kernel/vsmp_64.c
> index 414caf0..547d3b3 100644
> --- a/arch/x86/kernel/vsmp_64.c
> +++ b/arch/x86/kernel/vsmp_64.c
> @@ -8,18 +8,70 @@
> *
> * Ravikiran Thirumalai <[email protected]>,
> * Shai Fultheim <[email protected]>
> + * Paravirt ops integration: Glauber de Oliveira Costa <[email protected]>
> */
> -
> #include <linux/init.h>
> #include <linux/pci_ids.h>
> #include <linux/pci_regs.h>
> #include <asm/pci-direct.h>
> #include <asm/io.h>
> +#include <asm/paravirt.h>
> +
> +/*
> + * Interrupt control for the VSMP architecture:
> + */
> +
> +static inline unsigned long vsmp_save_fl(void)
>

No point being inline.

> +{
> + unsigned long flags = native_save_fl();
> +
> + if (flags & X86_EFLAGS_IF)
> + return X86_EFLAGS_IF;
>

Is this right, or should the if be testing _AC? Otherwise, why not just
"return flags & X86_EFLAGS_IF"?

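If the plain IF bit really is all that matters here, the whole helper
collapses to something like this (untested sketch):

static unsigned long vsmp_save_fl(void)
{
	return native_save_fl() & X86_EFLAGS_IF;
}
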
> + return 0;
> +}
>
> -static int __init vsmp_init(void)
> +static inline void vsmp_restore_fl(unsigned long flags)
> +{
> + if (flags & X86_EFLAGS_IF)
> + flags &= ~X86_EFLAGS_AC;
> + if (!(flags & X86_EFLAGS_IF))
> + flags &= X86_EFLAGS_AC;
>

Just use "else"?
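
i.e. something like this untested sketch, keeping the AC handling
exactly as in the patch:

static void vsmp_restore_fl(unsigned long flags)
{
	if (flags & X86_EFLAGS_IF)
		flags &= ~X86_EFLAGS_AC;
	else
		flags &= X86_EFLAGS_AC;
	native_restore_fl(flags);
}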

> + native_restore_fl(flags);
> +}
> +
> +static inline void vsmp_irq_disable(void)
> +{
> + unsigned long flags = native_save_fl();
> +
> + vsmp_restore_fl((flags & ~X86_EFLAGS_IF));
>

((double paren?))

> +}
> +
> +static inline void vsmp_irq_enable(void)
> +{
> + unsigned long flags = native_save_fl();
> +
> + vsmp_restore_fl((flags | X86_EFLAGS_IF));
> +}
> +
> +static unsigned __init vsmp_patch(u8 type, u16 clobbers, void *ibuf,
> + unsigned long addr, unsigned len)
> +{
> + switch (type) {
> + case PARAVIRT_PATCH(pv_irq_ops.irq_enable):
> + case PARAVIRT_PATCH(pv_irq_ops.irq_disable):
> + case PARAVIRT_PATCH(pv_irq_ops.save_fl):
> + case PARAVIRT_PATCH(pv_irq_ops.restore_fl):
> + return paravirt_patch_default(type, clobbers, ibuf, addr, len);
> + default:
> + return native_patch(type, clobbers, ibuf, addr, len);
> + }
> +
> +}
> +
> +int __init vsmp_init(void)
> {
> void *address;
> - unsigned int cap, ctl;
> + unsigned int cap, ctl, cfg;
>
> if (!early_pci_allowed())
> return 0;
> @@ -29,8 +81,16 @@ static int __init vsmp_init(void)
> (read_pci_config_16(0, 0x1f, 0, PCI_DEVICE_ID) != PCI_DEVICE_ID_SCALEMP_VSMP_CTL))
> return 0;
>
> + /* If we are, use the distinguished irq functions */
> + pv_irq_ops.irq_disable = vsmp_irq_disable;
> + pv_irq_ops.irq_enable = vsmp_irq_enable;
> + pv_irq_ops.save_fl = vsmp_save_fl;
> + pv_irq_ops.restore_fl = vsmp_restore_fl;
> + pv_init_ops.patch = vsmp_patch;
> +
> /* set vSMP magic bits to indicate vSMP capable kernel */
> - address = ioremap(read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0), 8);
> + cfg = read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0);
> + address = early_ioremap(cfg, 8);
> cap = readl(address);
> ctl = readl(address + 4);
> printk("vSMP CTL: capabilities:0x%08x control:0x%08x\n", cap, ctl);
> @@ -42,8 +102,6 @@ static int __init vsmp_init(void)
> printk("vSMP CTL: control set to:0x%08x\n", ctl);
> }
>
> - iounmap(address);
> + early_iounmap(address, 8);
> return 0;
> }
> -
> -core_initcall(vsmp_init);
> diff --git a/include/asm-x86/setup.h b/include/asm-x86/setup.h
> index 071e054..dd7996c 100644
> --- a/include/asm-x86/setup.h
> +++ b/include/asm-x86/setup.h
> @@ -58,7 +58,8 @@ void __init add_memory_region(unsigned long long start,
>
> extern unsigned long init_pg_tables_end;
>
> -
> +/* For EM64T-based VSMP machines */
> +int vsmp_init(void);
>
> #endif /* __i386__ */
> #endif /* _SETUP */
>

2007-11-01 04:48:53

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

Glauber de Oliveira Costa wrote:
> This patch introduces, and patch callers when needed, native
> versions for read/write_crX functions, clts and wbinvd.
>
> Signed-off-by: Glauber de Oliveira Costa <[email protected]>
> Signed-off-by: Steven Rostedt <[email protected]>
> Acked-by: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/mm/pageattr_64.c | 3 +-
> include/asm-x86/system_64.h | 60 ++++++++++++++++++++++++++++++------------
> 2 files changed, 45 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/mm/pageattr_64.c b/arch/x86/mm/pageattr_64.c
> index c40afba..59a52b0 100644
> --- a/arch/x86/mm/pageattr_64.c
> +++ b/arch/x86/mm/pageattr_64.c
> @@ -12,6 +12,7 @@
> #include <asm/processor.h>
> #include <asm/tlbflush.h>
> #include <asm/io.h>
> +#include <asm/paravirt.h>
>
> pte_t *lookup_address(unsigned long address)
> {
> @@ -77,7 +78,7 @@ static void flush_kernel_map(void *arg)
> much cheaper than WBINVD. */
> /* clflush is still broken. Disable for now. */
> if (1 || !cpu_has_clflush)
> - asm volatile("wbinvd" ::: "memory");
> + wbinvd();
> else list_for_each_entry(pg, l, lru) {
> void *adr = page_address(pg);
> clflush_cache_range(adr, PAGE_SIZE);
> diff --git a/include/asm-x86/system_64.h b/include/asm-x86/system_64.h
> index 4cb2384..b558cb2 100644
> --- a/include/asm-x86/system_64.h
> +++ b/include/asm-x86/system_64.h
> @@ -65,53 +65,62 @@ extern void load_gs_index(unsigned);
> /*
> * Clear and set 'TS' bit respectively
> */
> -#define clts() __asm__ __volatile__ ("clts")
> +static inline void native_clts(void)
> +{
> + asm volatile ("clts");
> +}
>
> -static inline unsigned long read_cr0(void)
> -{
> +static inline unsigned long native_read_cr0(void)
> +{
> unsigned long cr0;
> asm volatile("movq %%cr0,%0" : "=r" (cr0));
> return cr0;
> }
>

This is a pre-existing bug, but it seems to me that these read/write crX
asms should have a constraint to stop the compiler from reordering them
with respect to each other. The brute-force approach would be to add
"memory" clobbers, but the subtle fix would be to add a variable which
is only used to sequence:

static int __cr_seq;

static inline unsigned long native_read_cr0(void)
{
	unsigned long cr0;
	asm volatile("mov %%cr0, %0" : "=r" (cr0), "=m" (__cr_seq));
	return cr0;
}

static inline void native_write_cr0(unsigned long val)
{
	asm volatile("mov %1, %%cr0" : "+m" (__cr_seq) : "r" (val));
}


J

2007-11-01 04:51:10

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 11/16] turn privileged operation into a macro in head_64.S

Glauber de Oliveira Costa wrote:
> under paravirt, read cr2 cannot be issued directly anymore.
> So wrap it in a macro, defined to the operation itself in case
> paravirt is off, but to something else if we have paravirt
> in the game
>

Is this actually needed? It's only used in the early fault handler in
head_64.S. Will we be taking that path in the paravirt case? If so,
should we disable the fault handler altogether, since the hypervisor can
probably provide better diagnostics.

J
> Signed-off-by: Glauber de Oliveira Costa <[email protected]>
> Signed-off-by: Steven Rostedt <[email protected]>
> Acked-by: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/kernel/head_64.S | 9 ++++++++-
> 1 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index b6167fe..c31b1c9 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -19,6 +19,13 @@
> #include <asm/msr.h>
> #include <asm/cache.h>
>
> +#ifdef CONFIG_PARAVIRT
> +#include <asm/asm-offsets.h>
> +#include <asm/paravirt.h>
> +#else
> +#define GET_CR2_INTO_RCX movq %cr2, %rcx
> +#endif
> +
> /* we are not able to switch in one step to the final KERNEL ADRESS SPACE
> * because we need identity-mapped pages.
> *
> @@ -267,7 +274,7 @@ ENTRY(early_idt_handler)
> xorl %eax,%eax
> movq 8(%rsp),%rsi # get rip
> movq (%rsp),%rdx
> - movq %cr2,%rcx
> + GET_CR2_INTO_RCX
> leaq early_idt_msg(%rip),%rdi
> call early_printk
> cmpl $2,early_recursion_flag(%rip)
>

2007-11-01 13:48:16

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

Jeremy Fitzhardinge wrote:
> Glauber de Oliveira Costa wrote:
>> This patch introduces, and patch callers when needed, native
>> versions for read/write_crX functions, clts and wbinvd.
>>
>> Signed-off-by: Glauber de Oliveira Costa <[email protected]>
>> Signed-off-by: Steven Rostedt <[email protected]>
>> Acked-by: Jeremy Fitzhardinge <[email protected]>
>> ---
>> arch/x86/mm/pageattr_64.c | 3 +-
>> include/asm-x86/system_64.h | 60 ++++++++++++++++++++++++++++++------------
>> 2 files changed, 45 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/x86/mm/pageattr_64.c b/arch/x86/mm/pageattr_64.c
>> index c40afba..59a52b0 100644
>> --- a/arch/x86/mm/pageattr_64.c
>> +++ b/arch/x86/mm/pageattr_64.c
>> @@ -12,6 +12,7 @@
>> #include <asm/processor.h>
>> #include <asm/tlbflush.h>
>> #include <asm/io.h>
>> +#include <asm/paravirt.h>
>>
>> pte_t *lookup_address(unsigned long address)
>> {
>> @@ -77,7 +78,7 @@ static void flush_kernel_map(void *arg)
>> much cheaper than WBINVD. */
>> /* clflush is still broken. Disable for now. */
>> if (1 || !cpu_has_clflush)
>> - asm volatile("wbinvd" ::: "memory");
>> + wbinvd();
>> else list_for_each_entry(pg, l, lru) {
>> void *adr = page_address(pg);
>> clflush_cache_range(adr, PAGE_SIZE);
>> diff --git a/include/asm-x86/system_64.h b/include/asm-x86/system_64.h
>> index 4cb2384..b558cb2 100644
>> --- a/include/asm-x86/system_64.h
>> +++ b/include/asm-x86/system_64.h
>> @@ -65,53 +65,62 @@ extern void load_gs_index(unsigned);
>> /*
>> * Clear and set 'TS' bit respectively
>> */
>> -#define clts() __asm__ __volatile__ ("clts")
>> +static inline void native_clts(void)
>> +{
>> + asm volatile ("clts");
>> +}
>>
>> -static inline unsigned long read_cr0(void)
>> -{
>> +static inline unsigned long native_read_cr0(void)
>> +{
>> unsigned long cr0;
>> asm volatile("movq %%cr0,%0" : "=r" (cr0));
>> return cr0;
>> }
>>
>
> This is a pre-existing bug, but it seems to me that these read/write crX
> asms should have a constraint to stop the compiler from reordering them
> with respect to each other. The brute-force approach would be to add
> "memory" clobbers, but the subtle fix would be to add a variable which
> is only used to sequence:
>
I have in fact seen bugs with mixed reads and writes to the same cr
(cr4), but adding the volatile flag to the read function seemed to fix
it. Still, I agree with you that the theoretical reordering problem
exists, and your proposed fix seems fine (although if we're really
desperate about memory usage, we can use a char instead of an int and
save 3 bytes!)

It also just occurred to me that this part of the patch can go into
the consolidation part as well. So I'll respin it.

Thanks for the comment

2007-11-01 13:49:50

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 11/16] turn privileged operation into a macro in head_64.S

Jeremy Fitzhardinge wrote:
> Glauber de Oliveira Costa wrote:
>> under paravirt, read cr2 cannot be issued directly anymore.
>> So wrap it in a macro, defined to the operation itself in case
>> paravirt is off, but to something else if we have paravirt
>> in the game
>>
>
> Is this actually needed? It's only used in the early fault handler in
> head_64.S. Will we be taking that path in the paravirt case? If so,
> should we disable the fault handler altogether, since the hypervisor can
> probably provide better diagnositcs.
>

Well, as you told me earlier, xen won't use it. Neither does lguest.
None of us goes through the normal boot process anyway. But maybe some
other technology uses it?

Zach, how does it work for vmware?



2007-11-01 15:31:31

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

Glauber de Oliveira Costa wrote:
> I in fact have seen bugs with mixed reads and writes to the same cr,
> (cr4), but adding the volatile
> flag to the read function seemed to fix it.

Well, volatile will make a read be repeated rather than caching the
previous value, but it has no effect on ordering.

> Yet, I agree with you that
> the theorectical problem exists for the reorder, and your proposed fix
> seems fine (although if we're really desperate about memory usage, we
> can use a char instead a int and save 3 bytes!)

Sure. Ideally the compiler would never even generate a reference to it,
and it could just be extern, but in practice the compiler will generate
references sometimes.

J

2007-11-01 16:14:00

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

Keir Fraser wrote:
> On 1/11/07 15:30, "Jeremy Fitzhardinge" <[email protected]> wrote:
>
>> Glauber de Oliveira Costa wrote:
>>> I in fact have seen bugs with mixed reads and writes to the same cr,
>>> (cr4), but adding the volatile
>>> flag to the read function seemed to fix it.
>> Well, volatile will make a read be repeated rather than caching the
>> previous value, but it has no effect on ordering.
>
> volatile prevents the asm from being 'moved significantly', according to the
> gcc manual. I take that to mean that reordering is not allowed.
>
According to a gcc developer to whom I asked this question, volatile
prevents the code from being removed, but does not prevent it from
being moved (pun intended). In practice, it should force a re-read,
but not influence the compiler's ordering decisions. Besides,
'significantly' sounds like a significantly imprecise word, whose
specific meaning may be implementation dependent.

So I agree that adding a memory location reference is probably the best
alternative.


2007-11-01 16:56:18

by Keir Fraser

[permalink] [raw]
Subject: Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

On 1/11/07 15:30, "Jeremy Fitzhardinge" <[email protected]> wrote:

> Glauber de Oliveira Costa wrote:
>> I in fact have seen bugs with mixed reads and writes to the same cr,
>> (cr4), but adding the volatile
>> flag to the read function seemed to fix it.
>
> Well, volatile will make a read be repeated rather than caching the
> previous value, but it has no effect on ordering.

volatile prevents the asm from being 'moved significantly', according to the
gcc manual. I take that to mean that reordering is not allowed.

-- Keir


2007-11-01 17:41:45

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

Keir Fraser wrote:
> volatile prevents the asm from being 'moved significantly', according to the
> gcc manual. I take that to mean that reordering is not allowed.
>

That phrase doesn't appear in the gcc manual; in fact, it specifically
says that reordering can happen:

The `volatile' keyword indicates that the instruction has important
side-effects. GCC will not delete a volatile `asm' if it is reachable.
(The instruction can still be deleted if GCC can prove that
control-flow will never reach the location of the instruction.) Note
that even a volatile `asm' instruction can be moved relative to other
code, including across jump instructions. For example, on many targets
there is a system register which can be set to control the rounding
mode of floating point operations. You might try setting it with a
volatile `asm', like this PowerPC example:

asm volatile("mtfsf 255,%0" : : "f" (fpenv));
sum = x + y;

This will not work reliably, as the compiler may move the addition back
before the volatile `asm'. To make it work you need to add an
artificial dependency to the `asm' referencing a variable in the code
you don't want moved, for example:

asm volatile ("mtfsf 255,%1" : "=X"(sum): "f"(fpenv));
sum = x + y;

I take from this that it is not a good idea to assume "asm volatile" has
any ordering effects at all.

J

2007-11-02 01:02:20

by Zachary Amsden

[permalink] [raw]
Subject: Re: [Lguest] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

On Thu, 2007-11-01 at 10:41 -0700, Jeremy Fitzhardinge wrote:
> Keir Fraser wrote:
> > volatile prevents the asm from being 'moved significantly', according to the
> > gcc manual. I take that to mean that reordering is not allowed.
> >

I understood it as reordering was permitted, but no re-ordering across
another volatile load, store, or asm was permitted. And of course, as
long as input and output constraints are written properly, the
re-ordering should not be vulnerable to pathological movement causing
the code to malfunction.

It seems that CPU state side effects which can't be expressed in C need
special care - FPU is certainly one example.

Also, a memory clobber on a volatile asm should stop invalid movement
across TLB flushes and other problem areas. Even memory fences should
have a memory clobber in order to stop the compiler from moving loads
and stores across the fence.

Zach

2007-11-02 01:22:28

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Lguest] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

Zachary Amsden wrote:
> I understood it as reordering was permitted, but no re-ordering across
> another volatile load, store, or asm was permitted.

It doesn't say that, so I wouldn't assume it. Certainly we had problems
with the pda code; until I added the _proxy_pda dependency variable, the
only fix Andi could find was adding both "volatile" and a memory clobber.

> And of course, as
> long as input and output constraints are written properly, the
> re-ordering should not be vulnerable to pathological movement causing
> the code to malfunction.
>

Yes. I think constraints are the only way to control ordering (even if
it's as heavy-handed as a memory clobber). It would be nice if gcc had
a constraint which was only used for ordering, and never generated a
reference. Then you could make up pseudo-variables in order to express
dependencies without having the risk that the compiler would generate
references.

> It seems that CPU state side effects which can't be expressed in C need
> special care - FPU is certainly one example.
>

Not an immediate problem, fortunately.

> Also, memory clobber on a volatile asm should stop invalid movement
> across TLB flushes and other problems areas.

Yes. Any asm which has global effects on how addresses are interpreted
(like tlbflush, reloading the pagetable base, changing modes, etc) needs
to have a memory clobber.
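
For instance, a native cr3-reload style TLB flush would look roughly
like this (illustrative sketch, not the exact tree code):

static inline void my_flush_tlb(void)
{
	unsigned long cr3;

	asm volatile("movq %%cr3, %0" : "=r" (cr3));
	/* the "memory" clobber keeps the compiler from moving memory
	 * accesses across the flush */
	asm volatile("movq %0, %%cr3" : : "r" (cr3) : "memory");
}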

> Even memory fences should
> have memory clobber in order to stop movement of loads and stores across
> the fence by the compiler.
>

Pretty sure they do. A normal compiler barrier is *just* a memory clobber.
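That is, the usual definition is just an empty asm with a memory
clobber:

#define barrier() __asm__ __volatile__("" : : : "memory")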

J