2007-08-08 07:08:30

by Glauber Costa

Subject: Introducing paravirt_ops for x86_64

Hi folks,

After some time away from it, and a big rebase as a consequence, here is
the updated version of paravirt_ops for x86_64, heading to inclusion.

Your criticism is, of course, very welcome.

Have fun
--
arch/x86_64/Kconfig | 11
arch/x86_64/ia32/syscall32.c | 2
arch/x86_64/kernel/Makefile | 1
arch/x86_64/kernel/apic.c | 2
arch/x86_64/kernel/asm-offsets.c | 14
arch/x86_64/kernel/entry.S | 125 +++--
arch/x86_64/kernel/head.S | 10
arch/x86_64/kernel/head64.c | 2
arch/x86_64/kernel/i8259.c | 15
arch/x86_64/kernel/ldt.c | 6
arch/x86_64/kernel/paravirt.c | 455 +++++++++++++++++++
arch/x86_64/kernel/process.c | 2
arch/x86_64/kernel/reboot.c | 3
arch/x86_64/kernel/setup.c | 41 +
arch/x86_64/kernel/setup64.c | 18
arch/x86_64/kernel/smp.c | 10
arch/x86_64/kernel/smpboot.c | 10
arch/x86_64/kernel/suspend.c | 11
arch/x86_64/kernel/tce.c | 2
arch/x86_64/kernel/time.c | 37 +
arch/x86_64/kernel/traps.c | 1
arch/x86_64/kernel/tsc.c | 42 +
arch/x86_64/kernel/vmlinux.lds.S | 6
arch/x86_64/kernel/vsyscall.c | 4
arch/x86_64/kernel/x8664_ksyms.c | 6
arch/x86_64/mm/pageattr.c | 2
arch/x86_64/vdso/vgetcpu.c | 4
include/asm-x86_64/alternative.h | 8
include/asm-x86_64/apic.h | 13
include/asm-x86_64/desc.h | 183 +++++--
include/asm-x86_64/e820.h | 6
include/asm-x86_64/irq.h | 2
include/asm-x86_64/irqflags.h | 32 +
include/asm-x86_64/mmu_context.h | 23
include/asm-x86_64/msr.h | 284 +++++++-----
include/asm-x86_64/page.h | 36 +
include/asm-x86_64/paravirt.h | 901 +++++++++++++++++++++++++++++++++++++++
include/asm-x86_64/pgalloc.h | 7
include/asm-x86_64/pgtable.h | 152 +++---
include/asm-x86_64/processor.h | 71 ++-
include/asm-x86_64/proto.h | 3
include/asm-x86_64/segment.h | 4
include/asm-x86_64/smp.h | 8
include/asm-x86_64/spinlock.h | 16
include/asm-x86_64/tlbflush.h | 22
include/linux/mm.h | 14
46 files changed, 2271 insertions(+), 356 deletions(-)


2007-08-08 07:07:44

by Glauber Costa

Subject: [PATCH 3/25] [PATCH] irq_flags / halt routines

This patch turns the irq_flags and halt routines into the
native versions.
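
To make the split explicit, here is a minimal user-space sketch (not kernel
code) of the pattern this series uses: the native_* implementation always
exists, and the generic name either maps straight to it or bounces through an
ops table a hypervisor can fill in. The pv_ops structure below is hypothetical
and only mirrors the shape of paravirt_ops:

#include <stdio.h>

static void native_halt(void)            /* stands in for "hlt" */
{
        puts("native halt");
}

#ifdef CONFIG_PARAVIRT
/* hypothetical ops table, mirroring the shape of paravirt_ops */
struct pv_ops {
        void (*halt)(void);
};
static struct pv_ops pv_ops = { .halt = native_halt };
#define halt()  pv_ops.halt()
#else
#define halt()  native_halt()
#endif

int main(void)
{
        halt();         /* callers never care which variant they run */
        return 0;
}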

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/irqflags.h | 32 +++++++++++++++++++++++++++-----
1 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/asm-x86_64/irqflags.h b/include/asm-x86_64/irqflags.h
index 86e70fe..4ba5241 100644
--- a/include/asm-x86_64/irqflags.h
+++ b/include/asm-x86_64/irqflags.h
@@ -16,6 +16,22 @@
* Interrupt control:
*/

+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+# ifdef CONFIG_X86_VSMP
+static inline int raw_irqs_disabled_flags(unsigned long flags)
+{
+ return !(flags & X86_EFLAGS_IF) || (flags & X86_EFLAGS_AC);
+}
+# else
+static inline int raw_irqs_disabled_flags(unsigned long flags)
+{
+ return !(flags & X86_EFLAGS_IF);
+}
+# endif
+
+#else /* PARAVIRT */
+
static inline unsigned long __raw_local_save_flags(void)
{
unsigned long flags;
@@ -31,9 +47,6 @@ static inline unsigned long __raw_local_save_flags(void)
return flags;
}

-#define raw_local_save_flags(flags) \
- do { (flags) = __raw_local_save_flags(); } while (0)
-
static inline void raw_local_irq_restore(unsigned long flags)
{
__asm__ __volatile__(
@@ -88,6 +101,10 @@ static inline int raw_irqs_disabled_flags(unsigned long flags)

#endif

+#endif /* CONFIG_PARAVIRT */
+
+#define raw_local_save_flags(flags) \
+ do { (flags) = __raw_local_save_flags(); } while (0)
/*
* For spinlocks, etc.:
*/
@@ -115,7 +132,7 @@ static inline int raw_irqs_disabled(void)
* Used in the idle loop; sti takes one instruction cycle
* to complete:
*/
-static inline void raw_safe_halt(void)
+static inline void native_raw_safe_halt(void)
{
__asm__ __volatile__("sti; hlt" : : : "memory");
}
@@ -124,11 +141,16 @@ static inline void raw_safe_halt(void)
* Used when interrupts are already enabled or to
* shutdown the processor:
*/
-static inline void halt(void)
+static inline void native_halt(void)
{
__asm__ __volatile__("hlt": : :"memory");
}

+#ifndef CONFIG_PARAVIRT
+#define raw_safe_halt native_raw_safe_halt
+#define halt native_halt
+#endif /* ! CONFIG_PARAVIRT */
+
#else /* __ASSEMBLY__: */
# ifdef CONFIG_TRACE_IRQFLAGS
# define TRACE_IRQS_ON call trace_hardirqs_on_thunk
--
1.4.4.2

2007-08-08 07:08:53

by Glauber Costa

Subject: [PATCH 2/25] [PATCH] tlb flushing routines

This patch turns the flush_tlb routines into native versions.
In case paravirt is not defined, the generic names are defined
to the native versions. flush_tlb_others() goes in smp.c, unless
SMP is not in the game.
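
The flush_tlb_others() fallback relies on weak linkage; here is a minimal
user-space sketch of the mechanism (names are illustrative, not from the
patch):

#include <stdio.h>

static void native_flush(void)
{
        puts("native flush");
}

/* weak default, analogous to the flush_tlb_others() fallback in smp.c;
 * a strong definition in another object file replaces it at link time */
void __attribute__((weak)) flush(void)
{
        native_flush();
}

int main(void)
{
        flush();        /* runs the weak default unless overridden */
        return 0;
}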

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/smp.c | 10 +++++++++-
include/asm-x86_64/smp.h | 8 ++++++++
include/asm-x86_64/tlbflush.h | 22 ++++++++++++++++++----
3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/arch/x86_64/kernel/smp.c b/arch/x86_64/kernel/smp.c
index 673a300..39f5f6b 100644
--- a/arch/x86_64/kernel/smp.c
+++ b/arch/x86_64/kernel/smp.c
@@ -165,7 +165,7 @@ out:
cpu_clear(cpu, f->flush_cpumask);
}

-static void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
+void native_flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
unsigned long va)
{
int sender;
@@ -198,6 +198,14 @@ static void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
spin_unlock(&f->tlbstate_lock);
}

+/* Overridden in paravirt.c if CONFIG_PARAVIRT */
+void __attribute__((weak)) flush_tlb_others(cpumask_t cpumask,
+ struct mm_struct *mm,
+ unsigned long va)
+{
+ native_flush_tlb_others(cpumask, mm, va);
+}
+
int __cpuinit init_smp_flush(void)
{
int i;
diff --git a/include/asm-x86_64/smp.h b/include/asm-x86_64/smp.h
index 3f303d2..6b11114 100644
--- a/include/asm-x86_64/smp.h
+++ b/include/asm-x86_64/smp.h
@@ -19,6 +19,14 @@ extern int disable_apic;

#include <asm/pda.h>

+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+void native_flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
+ unsigned long va);
+#else
+#define startup_ipi_hook(apicid, rip, rsp) do { } while (0)
+#endif
+
struct pt_regs;

extern cpumask_t cpu_present_mask;
diff --git a/include/asm-x86_64/tlbflush.h b/include/asm-x86_64/tlbflush.h
index 888eb4a..1c68cc8 100644
--- a/include/asm-x86_64/tlbflush.h
+++ b/include/asm-x86_64/tlbflush.h
@@ -6,21 +6,30 @@
#include <asm/processor.h>
#include <asm/system.h>

-static inline void __flush_tlb(void)
+static inline void native_flush_tlb(void)
{
write_cr3(read_cr3());
}

-static inline void __flush_tlb_all(void)
+static inline void native_flush_tlb_all(void)
{
unsigned long cr4 = read_cr4();
write_cr4(cr4 & ~X86_CR4_PGE); /* clear PGE */
write_cr4(cr4); /* write old PGE again and flush TLBs */
}

-#define __flush_tlb_one(addr) \
- __asm__ __volatile__("invlpg (%0)" :: "r" (addr) : "memory")
+static inline void native_flush_tlb_one(unsigned long addr)
+{
+ asm volatile ("invlpg (%0)" :: "r" (addr) : "memory");
+}

+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define __flush_tlb() native_flush_tlb()
+#define __flush_tlb_all() native_flush_tlb_all()
+#define __flush_tlb_one(addr) native_flush_tlb_one(addr)
+#endif /* CONFIG_PARAVIRT */

/*
* TLB flushing:
@@ -64,6 +73,11 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
__flush_tlb();
}

+static inline void native_flush_tlb_others(cpumask_t *cpumask,
+ struct mm_struct *mm, unsigned long va)
+{
+}
+
#else

#include <asm/smp.h>
--
1.4.4.2

2007-08-08 07:09:25

by Glauber Costa

Subject: [PATCH 1/25] [PATCH] header file move

Later on, the paravirt_ops patch will dereference the vm_area_struct
in asm/pgtable.h. This means the include must come after the struct
definition.
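
A compressed illustration of the ordering constraint (the inline below is
hypothetical, just to show why a forward declaration is not enough once
pgtable.h starts dereferencing the struct):

/* stand-in for struct vm_area_struct in linux/mm.h */
struct vm_area_struct {
        unsigned long vm_start;
};

/* imagine this inline living in asm/pgtable.h: it only compiles if the
 * full struct definition above has already been seen */
static inline unsigned long vma_start(struct vm_area_struct *vma)
{
        return vma->vm_start;
}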

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/linux/mm.h | 14 +++++++++-----
1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 655094d..c3f8561 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -35,11 +35,6 @@ extern int sysctl_legacy_va_layout;
#define sysctl_legacy_va_layout 0
#endif

-#include <asm/page.h>
-#include <asm/pgtable.h>
-#include <asm/processor.h>
-
-#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))

/*
* Linux kernel virtual memory manager primitives.
@@ -113,6 +108,15 @@ struct vm_area_struct {
#endif
};

+#include <asm/page.h>
+/*
+ * pgtable.h must be included after the definition of vm_area_struct.
+ * x86_64 pgtable.h is one of the dereferencers of this struct
+ */
+#include <asm/pgtable.h>
+#include <asm/processor.h>
+
+#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
extern struct kmem_cache *vm_area_cachep;

/*
--
1.4.4.2

2007-08-08 07:09:42

by Glauber Costa

Subject: [PATCH 5/25] [PATCH] native versions for system.h functions

This patch adds the native hooks for the functions in system.h.
They are the read/write_crX, clts and wbinvd. The latter also
gets its call sites patched.
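
The system.h half of this change is not in this mail; the call sites above
assume a mapping that presumably looks like the header-fragment sketch below,
mirroring the other native_* hooks in the series:

static inline void native_wbinvd(void)
{
        asm volatile("wbinvd" ::: "memory");
}

#ifndef CONFIG_PARAVIRT
#define wbinvd()        native_wbinvd()
#endif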

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
---
arch/x86_64/kernel/tce.c | 2 +-
arch/x86_64/mm/pageattr.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/tce.c b/arch/x86_64/kernel/tce.c
index e3f2569..587f0c2 100644
--- a/arch/x86_64/kernel/tce.c
+++ b/arch/x86_64/kernel/tce.c
@@ -42,7 +42,7 @@ static inline void flush_tce(void* tceaddr)
if (cpu_has_clflush)
asm volatile("clflush (%0)" :: "r" (tceaddr));
else
- asm volatile("wbinvd":::"memory");
+ wbinvd();
}

void tce_build(struct iommu_table *tbl, unsigned long index,
diff --git a/arch/x86_64/mm/pageattr.c b/arch/x86_64/mm/pageattr.c
index 7e161c6..b497afd 100644
--- a/arch/x86_64/mm/pageattr.c
+++ b/arch/x86_64/mm/pageattr.c
@@ -76,7 +76,7 @@ static void flush_kernel_map(void *arg)
/* When clflush is available always use it because it is
much cheaper than WBINVD. */
if (!cpu_has_clflush)
- asm volatile("wbinvd" ::: "memory");
+ wbinvd();
else list_for_each_entry(pg, l, lru) {
void *adr = page_address(pg);
cache_flush_page(adr);
--
1.4.4.2

2007-08-08 07:08:06

by Glauber Costa

Subject: [PATCH 4/25] [PATCH] Add debugreg/load_rsp native hooks

This patch adds native hooks for the debugreg handling functions,
and for the native load_rsp0 function. The latter also has its
call sites patched.
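
Callers keep the same two-argument interface; for illustration (a hypothetical
helper, not part of the patch), enabling breakpoint 0 in %db7 would look
roughly like this:

static void enable_bp0(void)
{
        unsigned long dr7;

        get_debugreg(dr7, 7);           /* variable first, register number second */
        set_debugreg(dr7 | 1, 7);       /* value first, register number second */
}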

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/process.c | 2 +-
arch/x86_64/kernel/smpboot.c | 2 +-
include/asm-x86_64/processor.h | 71 ++++++++++++++++++++++++++++++++++++----
3 files changed, 66 insertions(+), 9 deletions(-)

diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c
index 2842f50..33046f1 100644
--- a/arch/x86_64/kernel/process.c
+++ b/arch/x86_64/kernel/process.c
@@ -595,7 +595,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/*
* Reload esp0, LDT and the page table pointer:
*/
- tss->rsp0 = next->rsp0;
+ load_rsp0(tss, next);

/*
* Switch DS and ES.
diff --git a/arch/x86_64/kernel/smpboot.c b/arch/x86_64/kernel/smpboot.c
index 32f5078..be9c7eb 100644
--- a/arch/x86_64/kernel/smpboot.c
+++ b/arch/x86_64/kernel/smpboot.c
@@ -620,7 +620,7 @@ do_rest:
start_rip = setup_trampoline();

init_rsp = c_idle.idle->thread.rsp;
- per_cpu(init_tss,cpu).rsp0 = init_rsp;
+ load_rsp0(&per_cpu(init_tss,cpu), &c_idle.idle->thread);
initial_code = start_secondary;
clear_tsk_thread_flag(c_idle.idle, TIF_FORK);

diff --git a/include/asm-x86_64/processor.h b/include/asm-x86_64/processor.h
index 1952517..65f689b 100644
--- a/include/asm-x86_64/processor.h
+++ b/include/asm-x86_64/processor.h
@@ -249,6 +249,12 @@ struct thread_struct {
.rsp0 = (unsigned long)&init_stack + sizeof(init_stack) \
}

+static inline void native_load_rsp0(struct tss_struct *tss,
+ struct thread_struct *thread)
+{
+ tss->rsp0 = thread->rsp0;
+}
+
#define INIT_MMAP \
{ &init_mm, 0, 0, NULL, PAGE_SHARED, VM_READ | VM_WRITE | VM_EXEC, 1, NULL, NULL }

@@ -264,13 +270,64 @@ struct thread_struct {
set_fs(USER_DS); \
} while(0)

-#define get_debugreg(var, register) \
- __asm__("movq %%db" #register ", %0" \
- :"=r" (var))
-#define set_debugreg(value, register) \
- __asm__("movq %0,%%db" #register \
- : /* no output */ \
- :"r" (value))
+static inline unsigned long native_get_debugreg(int regno)
+{
+ unsigned long val;
+
+ switch (regno) {
+ case 0:
+ asm("movq %%db0, %0" :"=r" (val)); break;
+ case 1:
+ asm("movq %%db1, %0" :"=r" (val)); break;
+ case 2:
+ asm("movq %%db2, %0" :"=r" (val)); break;
+ case 3:
+ asm("movq %%db3, %0" :"=r" (val)); break;
+ case 6:
+ asm("movq %%db6, %0" :"=r" (val)); break;
+ case 7:
+ asm("movq %%db7, %0" :"=r" (val)); break;
+ default:
+ val = 0; /* assign it to keep gcc quiet */
+ WARN_ON(1);
+ }
+ return val;
+}
+
+static inline void native_set_debugreg(unsigned long value, int regno)
+{
+ switch (regno) {
+ case 0:
+ asm("movq %0,%%db0" : /* no output */ :"r" (value));
+ break;
+ case 1:
+ asm("movq %0,%%db1" : /* no output */ :"r" (value));
+ break;
+ case 2:
+ asm("movq %0,%%db2" : /* no output */ :"r" (value));
+ break;
+ case 3:
+ asm("movq %0,%%db3" : /* no output */ :"r" (value));
+ break;
+ case 6:
+ asm("movq %0,%%db6" : /* no output */ :"r" (value));
+ break;
+ case 7:
+ asm("movq %0,%%db7" : /* no output */ :"r" (value));
+ break;
+ default:
+ BUG();
+ }
+}
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define paravirt_enabled() 0
+#define load_rsp0 native_load_rsp0
+#define set_debugreg(val, reg) native_set_debugreg(val, reg)
+#define get_debugreg(var, reg) (var) = native_get_debugreg(reg)
+#endif

struct task_struct;
struct mm_struct;
--
1.4.4.2

2007-08-08 07:09:59

by Glauber Costa

Subject: [PATCH 6/25] [PATCH] add native_apic read and write functions, as well as boot clocks ones

Time for the apic handling functions to get their native counterparts.
Also, put the native hooks for the boot clock functions in the apic.h header.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/apic.c | 2 +-
arch/x86_64/kernel/smpboot.c | 8 +++++++-
include/asm-x86_64/apic.h | 13 +++++++++++--
3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 900ff38..2d233ef 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -1193,7 +1193,7 @@ int __init APIC_init_uniprocessor (void)
setup_IO_APIC();
else
nr_ioapics = 0;
- setup_boot_APIC_clock();
+ setup_boot_clock();
check_nmi_watchdog();
return 0;
}
diff --git a/arch/x86_64/kernel/smpboot.c b/arch/x86_64/kernel/smpboot.c
index be9c7eb..5c7eb60 100644
--- a/arch/x86_64/kernel/smpboot.c
+++ b/arch/x86_64/kernel/smpboot.c
@@ -338,7 +338,7 @@ void __cpuinit start_secondary(void)
check_tsc_sync_target();

Dprintk("cpu %d: setting up apic clock\n", smp_processor_id());
- setup_secondary_APIC_clock();
+ setup_secondary_clock();

Dprintk("cpu %d: enabling apic timer\n", smp_processor_id());

@@ -468,6 +468,12 @@ static int __cpuinit wakeup_secondary_via_INIT(int phys_apicid, unsigned int sta
num_starts = 2;

/*
+ * Paravirt wants a startup IPI hook here to set up the
+ * target processor state.
+ */
+ startup_ipi_hook(phys_apicid, (unsigned long) start_rip,
+ (unsigned long) init_rsp);
+ /*
* Run STARTUP IPI loop.
*/
Dprintk("#startup loops: %d.\n", num_starts);
diff --git a/include/asm-x86_64/apic.h b/include/asm-x86_64/apic.h
index 85125ef..de17908 100644
--- a/include/asm-x86_64/apic.h
+++ b/include/asm-x86_64/apic.h
@@ -38,16 +38,25 @@ struct pt_regs;
* Basic functions accessing APICs.
*/

-static __inline void apic_write(unsigned long reg, unsigned int v)
+static __inline void native_apic_write(unsigned long reg, unsigned int v)
{
*((volatile unsigned int *)(APIC_BASE+reg)) = v;
}

-static __inline unsigned int apic_read(unsigned long reg)
+static __inline unsigned int native_apic_read(unsigned long reg)
{
return *((volatile unsigned int *)(APIC_BASE+reg));
}

+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define apic_write(reg, v) native_apic_write(reg, v)
+#define apic_read(reg) native_apic_read(reg)
+#define setup_boot_clock setup_boot_APIC_clock
+#define setup_secondary_clock setup_secondary_APIC_clock
+#endif
+
extern void apic_wait_icr_idle(void);
extern unsigned int safe_apic_wait_icr_idle(void);

--
1.4.4.2

2007-08-08 07:10:32

by Glauber Costa

Subject: [PATCH 8/25] [PATCH] use macro for sti/cli in spinlock definitions

This patch switches the cli and sti instructions into macros.
In this header, they're just defined to the instructions they
refer to. Later on, when paravirt is defined, they will be
defined to something with paravirt abilities.
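
When CONFIG_PARAVIRT is enabled, these macros are expected to expand to an
indirect call through paravirt_ops instead of the bare instruction.
Conceptually, the paravirt definitions would look something like the sketch
below (the PARAVIRT_* offsets come from asm-offsets.c; the exact operand and
clobber bookkeeping lives in asm/paravirt.h and may differ):

#define CLI_STRING      "call *paravirt_ops+PARAVIRT_irq_disable(%%rip)"
#define STI_STRING      "call *paravirt_ops+PARAVIRT_irq_enable(%%rip)"
#define CLI_STI_INPUT_ARGS              /* no extra asm inputs in this sketch */
#define CLI_STI_CLOBBERS , "rax"        /* an indirect call may clobber rax */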

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/spinlock.h | 16 +++++++++++++---
1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/asm-x86_64/spinlock.h b/include/asm-x86_64/spinlock.h
index 88bf981..a81e27a 100644
--- a/include/asm-x86_64/spinlock.h
+++ b/include/asm-x86_64/spinlock.h
@@ -5,6 +5,14 @@
#include <asm/rwlock.h>
#include <asm/page.h>
#include <asm/processor.h>
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define CLI_STI_INPUT_ARGS
+#define CLI_STI_CLOBBERS
+#define CLI_STRING "cli"
+#define STI_STRING "sti"
+#endif

/*
* Your basic SMP spinlocks, allowing only a single CPU anywhere
@@ -48,12 +56,12 @@ static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long fla
"jns 5f\n"
"testl $0x200, %1\n\t" /* interrupts were disabled? */
"jz 4f\n\t"
- "sti\n"
+ STI_STRING "\n\t"
"3:\t"
"rep;nop\n\t"
"cmpl $0, %0\n\t"
"jle 3b\n\t"
- "cli\n\t"
+ CLI_STRING "\n\t"
"jmp 1b\n"
"4:\t"
"rep;nop\n\t"
@@ -61,7 +69,9 @@ static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long fla
"jg 1b\n\t"
"jmp 4b\n"
"5:\n\t"
- : "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+ : "+m" (lock->slock)
+ : "r" ((unsigned)flags), CLI_STI_INPUT_ARGS
+ : "memory", CLI_STI_CLOBBERS);
}
#endif

--
1.4.4.2

2007-08-08 07:11:00

by Glauber Costa

Subject: [PATCH 16/25] [PATCH] introducing paravirt_activate_mm

This function/macro will allow a paravirt guest to be notified that we changed
the current task's cr3, and to act upon it. What to do is up to the guest.
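
For illustration, a hypervisor backend could use the hook roughly like this
(hypothetical code, not part of this patch; hypervisor_notify_cr3() is a
made-up name):

static void my_hv_activate_mm(struct mm_struct *prev, struct mm_struct *next)
{
        /* tell the hypervisor which page table the incoming mm uses;
         * hypervisor_notify_cr3() is purely illustrative */
        hypervisor_notify_cr3(__pa(next->pgd));
}

/* ...and in its paravirt_ops:   .activate_mm = my_hv_activate_mm, */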

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/mmu_context.h | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/asm-x86_64/mmu_context.h b/include/asm-x86_64/mmu_context.h
index 9592698..77ce047 100644
--- a/include/asm-x86_64/mmu_context.h
+++ b/include/asm-x86_64/mmu_context.h
@@ -7,7 +7,16 @@
#include <asm/pda.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
#include <asm-generic/mm_hooks.h>
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+ struct mm_struct *next)
+{
+}
+#endif /* CONFIG_PARAVIRT */

/*
* possibly do the LDT unload here?
@@ -67,8 +76,10 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
asm volatile("movl %0,%%fs"::"r"(0)); \
} while(0)

-#define activate_mm(prev, next) \
- switch_mm((prev),(next),NULL)
-
+#define activate_mm(prev, next) \
+do { \
+ paravirt_activate_mm(prev, next); \
+ switch_mm((prev),(next),NULL); \
+} while (0)

#endif
--
1.4.4.2

2007-08-08 07:11:31

by Glauber Costa

Subject: [PATCH 10/25] [PATCH] export math_state_restore

Export the math_state_restore symbol, so it can be used by hypervisors,
which are commonly loaded as modules.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/traps.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86_64/kernel/traps.c b/arch/x86_64/kernel/traps.c
index 0388842..aacbe12 100644
--- a/arch/x86_64/kernel/traps.c
+++ b/arch/x86_64/kernel/traps.c
@@ -1081,6 +1081,7 @@ asmlinkage void math_state_restore(void)
task_thread_info(me)->status |= TS_USEDFPU;
me->fpu_counter++;
}
+EXPORT_SYMBOL_GPL(math_state_restore);

void __init trap_init(void)
{
--
1.4.4.2

2007-08-08 07:11:49

by Glauber Costa

Subject: [PATCH 19/25] [PATCH] time-related functions paravirt provisions

This patch adds provisions for time-related functions so they
can later be replaced by paravirt versions.

It basically encloses {g,s}et_wallclock inside the
already existing functions update_persistent_clock and
read_persistent_clock, and defines {s,g}et_wallclock
to the core of those functions.

The timer interrupt setup also has to be replaceable.
That job is done by time_init_hook().
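
The asm/time.h glue itself is not part of this mail; for the native case it
presumably reduces to something like the sketch below, matching the
paravirt_ops defaults set later in the series:

#ifndef CONFIG_PARAVIRT
#define get_wallclock()         do_get_cmos_time()
#define set_wallclock(t)        do_set_rtc_mmss(t)
#define do_time_init()          time_init_hook()
#endif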

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/time.c | 37 +++++++++++++++++++++++++------------
1 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/x86_64/kernel/time.c b/arch/x86_64/kernel/time.c
index 6d48a4e..29fcd91 100644
--- a/arch/x86_64/kernel/time.c
+++ b/arch/x86_64/kernel/time.c
@@ -42,6 +42,7 @@
#include <asm/sections.h>
#include <linux/hpet.h>
#include <asm/apic.h>
+#include <asm/time.h>
#include <asm/hpet.h>
#include <asm/mpspec.h>
#include <asm/nmi.h>
@@ -82,18 +83,12 @@ EXPORT_SYMBOL(profile_pc);
* sheet for details.
*/

-static int set_rtc_mmss(unsigned long nowtime)
+int do_set_rtc_mmss(unsigned long nowtime)
{
int retval = 0;
int real_seconds, real_minutes, cmos_minutes;
unsigned char control, freq_select;

-/*
- * IRQs are disabled when we're called from the timer interrupt,
- * no need for spin_lock_irqsave()
- */
-
- spin_lock(&rtc_lock);

/*
* Tell the clock it's being set and stop it.
@@ -143,14 +138,22 @@ static int set_rtc_mmss(unsigned long nowtime)
CMOS_WRITE(control, RTC_CONTROL);
CMOS_WRITE(freq_select, RTC_FREQ_SELECT);

- spin_unlock(&rtc_lock);
-
return retval;
}

int update_persistent_clock(struct timespec now)
{
- return set_rtc_mmss(now.tv_sec);
+ int retval;
+
+/*
+ * IRQs are disabled when we're called from the timer interrupt,
+ * no need for spin_lock_irqsave()
+ */
+ spin_lock(&rtc_lock);
+ retval = set_wallclock(now.tv_sec);
+ spin_unlock(&rtc_lock);
+
+ return retval;
}

void main_timer_handler(void)
@@ -195,7 +198,7 @@ static irqreturn_t timer_interrupt(int irq, void *dev_id)
return IRQ_HANDLED;
}

-unsigned long read_persistent_clock(void)
+unsigned long do_get_cmos_time(void)
{
unsigned int year, mon, day, hour, min, sec;
unsigned long flags;
@@ -246,6 +249,11 @@ unsigned long read_persistent_clock(void)
return mktime(year, mon, day, hour, min, sec);
}

+unsigned long read_persistent_clock(void)
+{
+ return get_wallclock();
+}
+
/* calibrate_cpu is used on systems with fixed rate TSCs to determine
* processor frequency */
#define TICK_COUNT 100000000
@@ -365,6 +373,11 @@ static struct irqaction irq0 = {
.name = "timer"
};

+inline void time_init_hook()
+{
+ setup_irq(0, &irq0);
+}
+
void __init time_init(void)
{
if (nohpet)
@@ -403,7 +416,7 @@ void __init time_init(void)
cpu_khz / 1000, cpu_khz % 1000);
init_tsc_clocksource();

- setup_irq(0, &irq0);
+ do_time_init();
}

/*
--
1.4.4.2

2007-08-08 07:12:21

by Glauber Costa

Subject: [PATCH 20/25] [PATCH] replace syscall_init

This patch replaces syscall_init with x86_64_syscall_init.
The former will later be overridden by a paravirt replacement
in case paravirt is on.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/setup64.c | 8 +++++++-
include/asm-x86_64/proto.h | 3 +++
2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/arch/x86_64/kernel/setup64.c b/arch/x86_64/kernel/setup64.c
index 49f7342..723822c 100644
--- a/arch/x86_64/kernel/setup64.c
+++ b/arch/x86_64/kernel/setup64.c
@@ -153,7 +153,7 @@ __attribute__((section(".bss.page_aligned")));
extern asmlinkage void ignore_sysret(void);

/* May not be marked __init: used by software suspend */
-void syscall_init(void)
+void x86_64_syscall_init(void)
{
/*
* LSTAR and STAR live in a bit strange symbiosis.
@@ -172,6 +172,12 @@ void syscall_init(void)
wrmsrl(MSR_SYSCALL_MASK, EF_TF|EF_DF|EF_IE|0x3000);
}

+/* Overridden in paravirt.c if CONFIG_PARAVIRT */
+void __attribute__((weak)) syscall_init(void)
+{
+ x86_64_syscall_init();
+}
+
void __cpuinit check_efer(void)
{
unsigned long efer;
diff --git a/include/asm-x86_64/proto.h b/include/asm-x86_64/proto.h
index 31f20ad..77ed2de 100644
--- a/include/asm-x86_64/proto.h
+++ b/include/asm-x86_64/proto.h
@@ -18,6 +18,9 @@ extern void init_memory_mapping(unsigned long start, unsigned long end);

extern void system_call(void);
extern int kernel_syscall(void);
+#ifdef CONFIG_PARAVIRT
+extern void x86_64_syscall_init(void);
+#endif
extern void syscall_init(void);

extern void ia32_syscall(void);
--
1.4.4.2

2007-08-08 07:12:43

by Glauber Costa

Subject: [PATCH 18/25] [PATCH] turn privileged operations into macros in entry.S

With paravirt on, we cannot issue operations like swapgs, sysretq,
iretq, cli, and sti directly. So they have to be changed into macros,
which will later be properly replaced for the paravirt case.

The sysretq is a little more complicated, and is replaced by
a sequence of three instructions. This is basically because if
we had already issued a swapgs, we would be on a user stack
at this point. So we do it all in one.

The clobber list follows the idea of the i386 version closely,
and represents which registers are safe to modify at the
point the function is called.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/entry.S | 125 ++++++++++++++++++++++++++++---------------
1 files changed, 81 insertions(+), 44 deletions(-)

diff --git a/arch/x86_64/kernel/entry.S b/arch/x86_64/kernel/entry.S
index 1d232e5..48e953b 100644
--- a/arch/x86_64/kernel/entry.S
+++ b/arch/x86_64/kernel/entry.S
@@ -51,8 +51,31 @@
#include <asm/page.h>
#include <asm/irqflags.h>

+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define ENABLE_INTERRUPTS(x) sti
+#define DISABLE_INTERRUPTS(x) cli
+#define INTERRUPT_RETURN iretq
+#define SWAPGS swapgs
+#define SYSRETQ \
+ movq %gs:pda_oldrsp,%rsp; \
+ swapgs; \
+ sysretq;
+#endif
+
.code64

+/* Currently paravirt can't handle swapgs nicely when we
+ * don't have a stack. So we either find a way around these
+ * or just fault and emulate if a guest tries to call swapgs
+ * directly.
+ *
+ * Either way, this is a good way to document that we don't
+ * have a reliable stack.
+ */
+#define SWAPGS_NOSTACK swapgs
+
#ifndef CONFIG_PREEMPT
#define retint_kernel retint_restore_args
#endif
@@ -216,14 +239,14 @@ ENTRY(system_call)
CFI_DEF_CFA rsp,PDA_STACKOFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
- swapgs
+ SWAPGS_NOSTACK
movq %rsp,%gs:pda_oldrsp
movq %gs:pda_kernelstack,%rsp
/*
* No need to follow this irqs off/on section - it's straight
* and short:
*/
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_ARGS 8,1
movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
movq %rcx,RIP-ARGOFFSET(%rsp)
@@ -245,7 +268,7 @@ ret_from_sys_call:
/* edi: flagmask */
sysret_check:
GET_THREAD_INFO(%rcx)
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
movl threadinfo_flags(%rcx),%edx
andl %edi,%edx
@@ -259,9 +282,7 @@ sysret_check:
CFI_REGISTER rip,rcx
RESTORE_ARGS 0,-ARG_SKIP,1
/*CFI_REGISTER rflags,r11*/
- movq %gs:pda_oldrsp,%rsp
- swapgs
- sysretq
+ SYSRETQ

CFI_RESTORE_STATE
/* Handle reschedules */
@@ -270,7 +291,7 @@ sysret_careful:
bt $TIF_NEED_RESCHED,%edx
jnc sysret_signal
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
call schedule
@@ -281,7 +302,7 @@ sysret_careful:
/* Handle a signal */
sysret_signal:
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
testl $(_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz 1f

@@ -294,7 +315,7 @@ sysret_signal:
1: movl $_TIF_NEED_RESCHED,%edi
/* Use IRET because user could have changed frame. This
works because ptregscall_common has called FIXUP_TOP_OF_STACK. */
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check

@@ -326,7 +347,7 @@ tracesys:
*/
.globl int_ret_from_sys_call
int_ret_from_sys_call:
- cli
+ DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF
testl $3,CS-ARGOFFSET(%rsp)
je retint_restore_args
@@ -347,20 +368,20 @@ int_careful:
bt $TIF_NEED_RESCHED,%edx
jnc int_very_careful
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
call schedule
popq %rdi
CFI_ADJUST_CFA_OFFSET -8
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check

/* handle signals and tracing -- both require a full stack frame */
int_very_careful:
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_REST
/* Check for syscall exit trace */
testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP),%edx
@@ -383,7 +404,7 @@ int_signal:
1: movl $_TIF_NEED_RESCHED,%edi
int_restore_rest:
RESTORE_REST
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check
CFI_ENDPROC
@@ -504,7 +525,7 @@ END(stub_rt_sigreturn)
CFI_DEF_CFA_REGISTER rbp
testl $3,CS(%rdi)
je 1f
- swapgs
+ SWAPGS
/* irqcount is used to check if a CPU is already on an interrupt
stack or not. While this is essentially redundant with preempt_count
it is a little cheaper to use a separate counter in the PDA
@@ -525,7 +546,7 @@ ENTRY(common_interrupt)
interrupt do_IRQ
/* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
decl %gs:pda_irqcount
leaveq
@@ -552,13 +573,13 @@ retint_swapgs:
/*
* The iretq could re-enable interrupts:
*/
- cli
+ DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_IRETQ
- swapgs
+ SWAPGS
jmp restore_args

retint_restore_args:
- cli
+ DISABLE_INTERRUPTS(CLBR_ANY)
/*
* The iretq could re-enable interrupts:
*/
@@ -566,10 +587,14 @@ retint_restore_args:
restore_args:
RESTORE_ARGS 0,8,0
iret_label:
- iretq
+#ifdef CONFIG_PARAVIRT
+ INTERRUPT_RETURN
+ENTRY(native_iret)
+#endif
+1: iretq

.section __ex_table,"a"
- .quad iret_label,bad_iret
+ .quad 1b, bad_iret
.previous
.section .fixup,"ax"
/* force a signal here? this matches i386 behaviour */
@@ -577,24 +602,27 @@ iret_label:
bad_iret:
movq $11,%rdi /* SIGSEGV */
TRACE_IRQS_ON
- sti
- jmp do_exit
- .previous
-
+ ENABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
+ jmp do_exit
+ .previous
+#ifdef CONFIG_PARAVIRT
+ENDPROC(native_iret)
+#endif
+
/* edi: workmask, edx: work */
retint_careful:
CFI_RESTORE_STATE
bt $TIF_NEED_RESCHED,%edx
jnc retint_signal
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
call schedule
popq %rdi
CFI_ADJUST_CFA_OFFSET -8
GET_THREAD_INFO(%rcx)
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp retint_check

@@ -602,14 +630,14 @@ retint_signal:
testl $(_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz retint_swapgs
TRACE_IRQS_ON
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_REST
movq $-1,ORIG_RAX(%rsp)
xorl %esi,%esi # oldset
movq %rsp,%rdi # &pt_regs
call do_notify_resume
RESTORE_REST
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
movl $_TIF_NEED_RESCHED,%edi
GET_THREAD_INFO(%rcx)
@@ -727,7 +755,7 @@ END(spurious_interrupt)
rdmsr
testl %edx,%edx
js 1f
- swapgs
+ SWAPGS
xorl %ebx,%ebx
1:
.if \ist
@@ -743,7 +771,7 @@ END(spurious_interrupt)
.if \ist
addq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
.endif
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
.if \irqtrace
TRACE_IRQS_OFF
.endif
@@ -772,10 +800,10 @@ paranoid_swapgs\trace:
.if \trace
TRACE_IRQS_IRETQ 0
.endif
- swapgs
+ SWAPGS_NOSTACK
paranoid_restore\trace:
RESTORE_ALL 8
- iretq
+ INTERRUPT_RETURN
paranoid_userspace\trace:
GET_THREAD_INFO(%rcx)
movl threadinfo_flags(%rcx),%ebx
@@ -790,11 +818,11 @@ paranoid_userspace\trace:
.if \trace
TRACE_IRQS_ON
.endif
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
xorl %esi,%esi /* arg2: oldset */
movq %rsp,%rdi /* arg1: &pt_regs */
call do_notify_resume
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
.if \trace
TRACE_IRQS_OFF
.endif
@@ -803,9 +831,9 @@ paranoid_schedule\trace:
.if \trace
TRACE_IRQS_ON
.endif
- sti
+ ENABLE_INTERRUPTS(CLBR_ANY)
call schedule
- cli
+ DISABLE_INTERRUPTS(CLBR_ANY)
.if \trace
TRACE_IRQS_OFF
.endif
@@ -858,7 +886,7 @@ KPROBE_ENTRY(error_entry)
testl $3,CS(%rsp)
je error_kernelspace
error_swapgs:
- swapgs
+ SWAPGS
error_sti:
movq %rdi,RDI(%rsp)
CFI_REL_OFFSET rdi,RDI
@@ -870,7 +898,7 @@ error_sti:
error_exit:
movl %ebx,%eax
RESTORE_REST
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
GET_THREAD_INFO(%rcx)
testl %eax,%eax
@@ -883,7 +911,7 @@ error_exit:
* The iret might restore flags:
*/
TRACE_IRQS_IRETQ
- swapgs
+ SWAPGS
RESTORE_ARGS 0,8,0
jmp iret_label
CFI_ENDPROC
@@ -912,12 +940,12 @@ ENTRY(load_gs_index)
CFI_STARTPROC
pushf
CFI_ADJUST_CFA_OFFSET 8
- cli
- swapgs
+ DISABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
+ SWAPGS
gs_change:
movl %edi,%gs
2: mfence /* workaround */
- swapgs
+ SWAPGS
popf
CFI_ADJUST_CFA_OFFSET -8
ret
@@ -931,7 +959,7 @@ ENDPROC(load_gs_index)
.section .fixup,"ax"
/* running with kernelgs */
bad_gs:
- swapgs /* switch back to user gs */
+ SWAPGS /* switch back to user gs */
xorl %eax,%eax
movl %eax,%gs
jmp 2b
@@ -1072,6 +1100,15 @@ KPROBE_ENTRY(int3)
CFI_ENDPROC
KPROBE_END(int3)

+#ifdef CONFIG_PARAVIRT
+ENTRY(native_sysret)
+ movq %gs:pda_oldrsp,%rsp
+ swapgs
+ sysretq
+ENDPROC(native_sysret)
+
+#endif /* CONFIG_PARAVIRT */
+
ENTRY(overflow)
zeroentry do_overflow
END(overflow)
--
1.4.4.2

2007-08-08 07:13:07

by Glauber Costa

Subject: [PATCH 21/25] [PATCH] export cpu_gdt_descr

With paravirtualization, hypervisors need to handle the gdt,
which up to this point was only used in very early
initialization code. Hypervisors are commonly modules, so make
it an export.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/x8664_ksyms.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86_64/kernel/x8664_ksyms.c b/arch/x86_64/kernel/x8664_ksyms.c
index 77c25b3..8f10698 100644
--- a/arch/x86_64/kernel/x8664_ksyms.c
+++ b/arch/x86_64/kernel/x8664_ksyms.c
@@ -60,3 +60,9 @@ EXPORT_SYMBOL(init_level4_pgt);
EXPORT_SYMBOL(load_gs_index);

EXPORT_SYMBOL(_proxy_pda);
+
+#ifdef CONFIG_PARAVIRT
+extern unsigned long *cpu_gdt_descr;
+/* Virtualized guests may want to use it */
+EXPORT_SYMBOL(cpu_gdt_descr);
+#endif
--
1.4.4.2

2007-08-08 07:13:30

by Glauber Costa

Subject: [PATCH 7/25] [PATCH] interrupt related native paravirt functions.

The interrupt initialization routine becomes native_init_IRQ and will
be overridden later in case paravirt is on.

The interrupt vector is made global, so paravirt guests can reference
it in their initializations. However, "interrupt" is such a common
name, and could lead to clashes, so it is renamed.
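
The override uses a weak alias rather than a weak wrapper; here is a minimal
user-space sketch of the trick (names are illustrative, not from the patch):

#include <stdio.h>

void native_setup(void)
{
        puts("native setup");
}

/* weak alias: setup() is native_setup() unless a strong definition
 * elsewhere (in the kernel case, paravirt.c) takes over at link time */
void setup(void) __attribute__((weak, alias("native_setup")));

int main(void)
{
        setup();
        return 0;
}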

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/i8259.c | 15 +++++++++++----
include/asm-x86_64/irq.h | 2 ++
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86_64/kernel/i8259.c b/arch/x86_64/kernel/i8259.c
index 948cae6..8dda872 100644
--- a/arch/x86_64/kernel/i8259.c
+++ b/arch/x86_64/kernel/i8259.c
@@ -75,8 +75,12 @@ BUILD_16_IRQS(0xc) BUILD_16_IRQS(0xd) BUILD_16_IRQS(0xe) BUILD_16_IRQS(0xf)
IRQ(x,8), IRQ(x,9), IRQ(x,a), IRQ(x,b), \
IRQ(x,c), IRQ(x,d), IRQ(x,e), IRQ(x,f)

-/* for the irq vectors */
-static void (*interrupt[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = {
+/*
+ * For the irq vectors. It is global rather than static to allow for
+ * paravirtualized guests to use it in their own interrupt initialization
+ * routines
+ */
+void (*interrupt_vector[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = {
IRQLIST_16(0x2), IRQLIST_16(0x3),
IRQLIST_16(0x4), IRQLIST_16(0x5), IRQLIST_16(0x6), IRQLIST_16(0x7),
IRQLIST_16(0x8), IRQLIST_16(0x9), IRQLIST_16(0xa), IRQLIST_16(0xb),
@@ -484,7 +488,10 @@ static int __init init_timer_sysfs(void)

device_initcall(init_timer_sysfs);

-void __init init_IRQ(void)
+/* Overridden in paravirt.c */
+void init_IRQ(void) __attribute__((weak, alias("native_init_IRQ")));
+
+void __init native_init_IRQ(void)
{
int i;

@@ -497,7 +504,7 @@ void __init init_IRQ(void)
for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) {
int vector = FIRST_EXTERNAL_VECTOR + i;
if (vector != IA32_SYSCALL_VECTOR)
- set_intr_gate(vector, interrupt[i]);
+ set_intr_gate(vector, interrupt_vector[i]);
}

#ifdef CONFIG_SMP
diff --git a/include/asm-x86_64/irq.h b/include/asm-x86_64/irq.h
index 5006c6e..be55299 100644
--- a/include/asm-x86_64/irq.h
+++ b/include/asm-x86_64/irq.h
@@ -46,6 +46,8 @@ static __inline__ int irq_canonicalize(int irq)
extern void fixup_irqs(cpumask_t map);
#endif

+void native_init_IRQ(void);
+
#define __ARCH_HAS_DO_SOFTIRQ 1

#endif /* _ASM_IRQ_H */
--
1.4.4.2

2007-08-08 07:13:49

by Glauber Costa

Subject: [PATCH 22/25] [PATCH] turn privileged operation into a macro

Under paravirt, cr2 can no longer be read directly.
So wrap the read in a macro, defined to the operation itself in case
paravirt is off, but to something else if we have paravirt
in the game.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/head.S | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S
index e89abcd..1bb6c55 100644
--- a/arch/x86_64/kernel/head.S
+++ b/arch/x86_64/kernel/head.S
@@ -18,6 +18,12 @@
#include <asm/page.h>
#include <asm/msr.h>
#include <asm/cache.h>
+#ifdef CONFIG_PARAVIRT
+#include <asm/asm-offsets.h>
+#include <asm/paravirt.h>
+#else
+#define GET_CR2_INTO_RCX mov %cr2, %rcx
+#endif

/* we are not able to switch in one step to the final KERNEL ADRESS SPACE
* because we need identity-mapped pages.
@@ -267,7 +273,9 @@ ENTRY(early_idt_handler)
xorl %eax,%eax
movq 8(%rsp),%rsi # get rip
movq (%rsp),%rdx
- movq %cr2,%rcx
+ /* When PARAVIRT is on, this operation may clobber rax. It is
+ something safe to do, because we've just zeroed rax. */
+ GET_CR2_INTO_RCX
leaq early_idt_msg(%rip),%rdi
call early_printk
cmpl $2,early_recursion_flag(%rip)
--
1.4.4.2

2007-08-08 07:14:12

by Glauber Costa

Subject: [PATCH 9/25] [PATCH] report ring kernel is running without paravirt

When paravirtualization is disabled, the kernel is always
running at ring 0. So report it in the appropriate macro

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/segment.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86_64/segment.h b/include/asm-x86_64/segment.h
index 04b8ab2..240c1bf 100644
--- a/include/asm-x86_64/segment.h
+++ b/include/asm-x86_64/segment.h
@@ -50,4 +50,8 @@
#define GDT_SIZE (GDT_ENTRIES * 8)
#define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES * 8)

+#ifndef CONFIG_PARAVIRT
+#define get_kernel_rpl() 0
+#endif
+
#endif
--
1.4.4.2

2007-08-08 07:14:40

by Glauber Costa

Subject: [PATCH 11/25] [PATCH] introduce paravirt_release_pgd()

This patch introduces a new macro/function that informs a paravirt
guest when its page table is no longer in use, and can be released.
In case we're not paravirt, just do nothing.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/pgalloc.h | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86_64/pgalloc.h b/include/asm-x86_64/pgalloc.h
index b467be6..dbe1267 100644
--- a/include/asm-x86_64/pgalloc.h
+++ b/include/asm-x86_64/pgalloc.h
@@ -9,6 +9,12 @@
#define QUICK_PGD 0 /* We preserve special mappings over free */
#define QUICK_PT 1 /* Other page table pages that are zero on free */

+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define paravirt_release_pgd(pgd) do { } while (0)
+#endif
+
#define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
#define pud_populate(mm, pud, pmd) \
@@ -100,6 +106,7 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
static inline void pgd_free(pgd_t *pgd)
{
BUG_ON((unsigned long)pgd & (PAGE_SIZE-1));
+ paravirt_release_pgd(pgd);
quicklist_free(QUICK_PGD, pgd_dtor, pgd);
}

--
1.4.4.2

2007-08-08 07:15:01

by Glauber Costa

Subject: [PATCH 24/25] [PATCH] provide paravirt patching function

This patch introduces apply_paravirt(), a function that shall
be called by i386/alternative.c to apply replacements to
paravirt functions. It is defined to a do-nothing function
if paravirt is not enabled.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/alternative.h | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/asm-x86_64/alternative.h b/include/asm-x86_64/alternative.h
index ab161e8..e69a141 100644
--- a/include/asm-x86_64/alternative.h
+++ b/include/asm-x86_64/alternative.h
@@ -143,12 +143,14 @@ static inline void alternatives_smp_switch(int smp) {}
*/
#define ASM_OUTPUT2(a, b) a, b

-struct paravirt_patch;
+struct paravirt_patch_site;
#ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end);
+void apply_paravirt(struct paravirt_patch_site *start,
+ struct paravirt_patch_site *end);
#else
static inline void
-apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
+apply_paravirt(struct paravirt_patch_site *start,
+ struct paravirt_patch_site *end)
{}
#define __parainstructions NULL
#define __parainstructions_end NULL
--
1.4.4.2

2007-08-08 07:15:41

by Glauber Costa

Subject: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

This is, finally, the patch we were all looking for. This
patch adds a paravirt.h header with the definition of the paravirt_ops
struct. Also, it defines a bunch of inline functions that will
replace, or hook, the other calls. Every one of those functions
adds an entry in the parainstructions section (see vmlinux.lds.S).
Those entries can then be used to runtime-patch the paravirt_ops
functions.

paravirt.c contains implementations of paravirt functions that
are used natively, such as native_patch. It also fills the
paravirt_ops structure with the whole lot of functions that
were (re)defined throughout this patch set.

There are also changes in asm-offsets.c. paravirt.h needs them
to find out the offsets into the structure of functions
such as irq_enable, which are used in assembly files.

The text in Kconfig is the same as the i386 one.
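
To give an idea of the mechanism, here is a very small user-space model of
the parainstructions idea (illustrative only; in the kernel the table comes
from the .parainstructions section and the patcher emits real instructions):

#include <stdio.h>

struct patch_site {
        void *site;             /* address of the insn sequence to rewrite */
        unsigned char type;     /* which paravirt_ops member it refers to */
        unsigned char len;      /* room available at the call site */
};

/* in the kernel these records come from the .parainstructions section */
static char site_a[8], site_b[8];
static struct patch_site sites[] = {
        { site_a, 0, sizeof(site_a) },
        { site_b, 1, sizeof(site_b) },
};

static unsigned int patch_one(unsigned char type, void *insns, unsigned int len)
{
        /* a real patcher copies native instructions or emits a call into
         * paravirt_ops, then nop-pads the remaining room */
        printf("patching site %p: type %u, %u bytes available\n",
               insns, (unsigned int)type, len);
        return len;
}

int main(void)
{
        unsigned int i;

        for (i = 0; i < sizeof(sites) / sizeof(sites[0]); i++)
                patch_one(sites[i].type, sites[i].site, sites[i].len);
        return 0;
}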

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/Kconfig | 11 +
arch/x86_64/kernel/Makefile | 1 +
arch/x86_64/kernel/asm-offsets.c | 14 +
arch/x86_64/kernel/paravirt.c | 455 +++++++++++++++++++
arch/x86_64/kernel/vmlinux.lds.S | 6 +
include/asm-x86_64/paravirt.h | 901 ++++++++++++++++++++++++++++++++++++++
6 files changed, 1388 insertions(+), 0 deletions(-)

diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index ffa0364..bfea34c 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -373,6 +373,17 @@ config NODES_SHIFT

# Dummy CONFIG option to select ACPI_NUMA from drivers/acpi/Kconfig.

+config PARAVIRT
+ bool "Paravirtualization support (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ Paravirtualization is a way of running multiple instances of
+ Linux on the same machine, under a hypervisor. This option
+ changes the kernel so it can modify itself when it is run
+ under a hypervisor, improving performance significantly.
+ However, when run without a hypervisor the kernel is
+ theoretically slower. If in doubt, say N.
+
config X86_64_ACPI_NUMA
bool "ACPI NUMA detection"
depends on NUMA
diff --git a/arch/x86_64/kernel/Makefile b/arch/x86_64/kernel/Makefile
index ff5d8c9..120467f 100644
--- a/arch/x86_64/kernel/Makefile
+++ b/arch/x86_64/kernel/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_X86_VSMP) += vsmp.o
obj-$(CONFIG_K8_NB) += k8.o
obj-$(CONFIG_AUDIT) += audit.o

+obj-$(CONFIG_PARAVIRT) += paravirt.o
obj-$(CONFIG_MODULES) += module.o
obj-$(CONFIG_PCI) += early-quirks.o

diff --git a/arch/x86_64/kernel/asm-offsets.c b/arch/x86_64/kernel/asm-offsets.c
index 778953b..a8ffc95 100644
--- a/arch/x86_64/kernel/asm-offsets.c
+++ b/arch/x86_64/kernel/asm-offsets.c
@@ -15,6 +15,9 @@
#include <asm/segment.h>
#include <asm/thread_info.h>
#include <asm/ia32.h>
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#endif

#define DEFINE(sym, val) \
asm volatile("\n->" #sym " %0 " #val : : "i" (val))
@@ -72,6 +75,17 @@ int main(void)
offsetof (struct rt_sigframe32, uc.uc_mcontext));
BLANK();
#endif
+#ifdef CONFIG_PARAVIRT
+#define ENTRY(entry) DEFINE(PARAVIRT_ ## entry, offsetof(struct paravirt_ops, entry))
+ ENTRY(paravirt_enabled);
+ ENTRY(irq_disable);
+ ENTRY(irq_enable);
+ ENTRY(sysret);
+ ENTRY(iret);
+ ENTRY(read_cr2);
+ ENTRY(swapgs);
+ BLANK();
+#endif
DEFINE(pbe_address, offsetof(struct pbe, address));
DEFINE(pbe_orig_address, offsetof(struct pbe, orig_address));
DEFINE(pbe_next, offsetof(struct pbe, next));
diff --git a/arch/x86_64/kernel/paravirt.c b/arch/x86_64/kernel/paravirt.c
new file mode 100644
index 0000000..a41c1c0
--- /dev/null
+++ b/arch/x86_64/kernel/paravirt.c
@@ -0,0 +1,455 @@
+/* Paravirtualization interfaces
+ Copyright (C) 2007 Glauber de Oliveira Costa and Steven Rostedt,
+ Red Hat Inc.
+ Based on i386 work by Rusty Russell.
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+*/
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/efi.h>
+#include <linux/bcd.h>
+#include <linux/start_kernel.h>
+
+#include <asm/bug.h>
+#include <asm/paravirt.h>
+#include <asm/desc.h>
+#include <asm/setup.h>
+#include <asm/irq.h>
+#include <asm/delay.h>
+#include <asm/fixmap.h>
+#include <asm/apic.h>
+#include <asm/tlbflush.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/proto.h>
+#include <asm/e820.h>
+#include <asm/time.h>
+#include <asm/asm-offsets.h>
+#include <asm/smp.h>
+#include <asm/irqflags.h>
+
+/* nop stub */
+void _paravirt_nop(void)
+{
+}
+
+/* natively, we do normal setup, but we still need to return something */
+static int native_arch_setup(void)
+{
+ return 0;
+}
+
+static void __init default_banner(void)
+{
+ printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
+ paravirt_ops.name);
+}
+
+void memory_setup(void)
+{
+ paravirt_ops.memory_setup();
+}
+
+void syscall_init(void)
+{
+ paravirt_ops.syscall_init();
+}
+
+/* Simple instruction patching code. */
+#define DEF_NATIVE(name, code) \
+ extern const char start_##name[], end_##name[]; \
+ asm("start_" #name ": " code "; end_" #name ":")
+
+DEF_NATIVE(irq_disable, "cli");
+DEF_NATIVE(irq_enable, "sti");
+DEF_NATIVE(restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(save_fl, "pushfq; popq %rax");
+DEF_NATIVE(iret, "iretq");
+DEF_NATIVE(read_cr2, "movq %cr2, %rax");
+DEF_NATIVE(read_cr3, "movq %cr3, %rax");
+DEF_NATIVE(write_cr3, "movq %rdi, %cr3");
+DEF_NATIVE(flush_tlb_single, "invlpg (%rdi)");
+DEF_NATIVE(clts, "clts");
+DEF_NATIVE(wbinvd, "wbinvd");
+
+/* the three instructions give us more control over how to do a sysret */
+DEF_NATIVE(sysret, "movq %gs:" __stringify(pda_oldrsp) ", %rsp; swapgs; sysretq;");
+DEF_NATIVE(swapgs, "swapgs");
+
+DEF_NATIVE(ud2a, "ud2a");
+
+static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
+{
+ const unsigned char *start, *end;
+ unsigned ret;
+
+ switch(type) {
+#define SITE(x) case PARAVIRT_PATCH(x): start = start_##x; end = end_##x; goto patch_site
+ SITE(irq_disable);
+ SITE(irq_enable);
+ SITE(restore_fl);
+ SITE(save_fl);
+ SITE(iret);
+ SITE(sysret);
+ SITE(swapgs);
+ SITE(read_cr2);
+ SITE(read_cr3);
+ SITE(write_cr3);
+ SITE(clts);
+ SITE(flush_tlb_single);
+ SITE(wbinvd);
+#undef SITE
+
+ patch_site:
+ ret = paravirt_patch_insns(insns, len, start, end);
+ break;
+
+ case PARAVIRT_PATCH(make_pgd):
+ case PARAVIRT_PATCH(pgd_val):
+ case PARAVIRT_PATCH(make_pte):
+ case PARAVIRT_PATCH(pte_val):
+ case PARAVIRT_PATCH(make_pmd):
+ case PARAVIRT_PATCH(pmd_val):
+ case PARAVIRT_PATCH(make_pud):
+ case PARAVIRT_PATCH(pud_val):
+ /* These functions end up returning what
+ they're passed in the first argument */
+ ret = paravirt_patch_copy_reg(insns, len);
+ break;
+
+ case PARAVIRT_PATCH(set_pte):
+ case PARAVIRT_PATCH(set_pmd):
+ case PARAVIRT_PATCH(set_pud):
+ case PARAVIRT_PATCH(set_pgd):
+ /* These functions end up storing the second
+ * argument in the location pointed by the first */
+ ret = paravirt_patch_store_reg(insns, len);
+ break;
+
+ default:
+ ret = paravirt_patch_default(type, clobbers, insns, len);
+ break;
+ }
+
+ return ret;
+}
+
+unsigned paravirt_patch_nop(void)
+{
+ return 0;
+}
+
+unsigned paravirt_patch_ignore(unsigned len)
+{
+ return len;
+}
+
+unsigned paravirt_patch_call(void *target, u16 tgt_clobbers,
+ void *site, u16 site_clobbers,
+ unsigned len)
+{
+ unsigned char *call = site;
+ unsigned long delta = (unsigned long)target - (unsigned long)(call+5);
+
+ if (tgt_clobbers & ~site_clobbers)
+ return len; /* target would clobber too much for this site */
+ if (len < 5)
+ return len; /* call too long for patch site */
+
+ *call++ = 0xe8; /* call */
+ *(unsigned int *)call = delta;
+
+ return 5;
+}
+
+unsigned paravirt_patch_copy_reg(void *site, unsigned len)
+{
+ unsigned char *mov = site;
+ if (len < 3)
+ return len;
+
+ /* This is mov %rdi, %rax */
+ *mov++ = 0x48;
+ *mov++ = 0x89;
+ *mov = 0xf8;
+ return 3;
+}
+
+unsigned paravirt_patch_store_reg(void *site, unsigned len)
+{
+ unsigned char *mov = site;
+ if (len < 3)
+ return len;
+
+ /* This is mov %rsi, (%rdi) */
+ *mov++ = 0x48;
+ *mov++ = 0x89;
+ *mov = 0x37;
+ return 3;
+}
+
+unsigned paravirt_patch_jmp(void *target, void *site, unsigned len)
+{
+ unsigned char *jmp = site;
+ unsigned long delta = (unsigned long)target - (unsigned long)(jmp+5);
+
+ if (len < 5)
+ return len; /* call too long for patch site */
+
+ *jmp++ = 0xe9; /* jmp */
+ *(unsigned int *)jmp = delta;
+
+ return 5;
+}
+
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *site, unsigned len)
+{
+ void *opfunc = *((void **)&paravirt_ops + type);
+ unsigned ret;
+
+ if (opfunc == NULL)
+ /* If there's no function, patch it with a ud2a (BUG) */
+ ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);
+ else if (opfunc == paravirt_nop)
+ /* If the operation is a nop, then nop the callsite */
+ ret = paravirt_patch_nop();
+ else if (type == PARAVIRT_PATCH(iret) ||
+ type == PARAVIRT_PATCH(sysret))
+ /* If operation requires a jmp, then jmp */
+ ret = paravirt_patch_jmp(opfunc, site, len);
+ else
+ /* Otherwise call the function; assume target could
+ clobber any caller-save reg */
+ ret = paravirt_patch_call(opfunc, CLBR_ANY,
+ site, clobbers, len);
+
+ return ret;
+}
+
+unsigned paravirt_patch_insns(void *site, unsigned len,
+ const char *start, const char *end)
+{
+ unsigned insn_len = end - start;
+
+ if (insn_len > len || start == NULL)
+ insn_len = len;
+ else
+ memcpy(site, start, insn_len);
+
+ return insn_len;
+}
+
+void init_IRQ(void)
+{
+ paravirt_ops.init_IRQ();
+}
+
+static unsigned long native_save_fl(void)
+{
+ unsigned long f;
+ asm volatile("pushfq ; popq %0":"=g" (f): /* no input */);
+ return f;
+}
+
+static void native_restore_fl(unsigned long f)
+{
+ asm volatile("pushq %0 ; popfq": /* no output */
+ :"g" (f)
+ :"memory", "cc");
+}
+
+static void native_irq_disable(void)
+{
+ asm volatile("cli": : :"memory");
+}
+
+static void native_irq_enable(void)
+{
+ asm volatile("sti": : :"memory");
+}
+
+static inline void native_write_dt_entry(void *dt, int entry, u32 entry_low, u32 entry_high)
+{
+ u32 *lp = (u32 *)((char *)dt + entry*8);
+ lp[0] = entry_low;
+ lp[1] = entry_high;
+}
+
+static void native_io_delay(void)
+{
+ asm volatile("outb %al,$0x80");
+}
+
+pte_t native_make_pte(unsigned long pte)
+{
+ return (pte_t){ pte };
+}
+
+pud_t native_make_pud(unsigned long pud)
+{
+ return (pud_t){ pud };
+}
+
+pmd_t native_make_pmd(unsigned long pmd)
+{
+ return (pmd_t){ pmd };
+}
+
+pgd_t native_make_pgd(unsigned long pgd)
+{
+ return (pgd_t){ pgd };
+}
+
+void native_set_pte_at(struct mm_struct *mm, u64 addr, pte_t *ptep,
+ pte_t pteval)
+{
+ native_set_pte(ptep,pteval);
+}
+
+void native_pte_clear(struct mm_struct *mm, u64 addr, pte_t *ptep)
+{
+ native_set_pte_at(mm,addr,ptep,__pte(0));
+}
+
+void native_pmd_clear(pmd_t *pmd)
+{
+ native_set_pmd(pmd,__pmd(0));
+}
+
+void native_swapgs(void)
+{
+ asm volatile ("swapgs" :: :"memory" );
+}
+
+/* These are in entry.S */
+extern void native_iret(void);
+extern void native_sysret(void);
+
+static int __init print_banner(void)
+{
+ paravirt_ops.banner();
+ return 0;
+}
+core_initcall(print_banner);
+
+struct paravirt_ops paravirt_ops = {
+ .kernel_rpl = 0,
+ .paravirt_enabled = 0,
+ .name = "bare hardware",
+ .mem_type = "BIOS-e820",
+
+ .patch = native_patch,
+ .banner = default_banner,
+ .arch_setup = native_arch_setup,
+ .memory_setup = setup_memory_region,
+ .syscall_init = x86_64_syscall_init,
+ .get_wallclock = do_get_cmos_time,
+ .set_wallclock = do_set_rtc_mmss,
+ .time_init = time_init_hook,
+ .init_IRQ = native_init_IRQ,
+
+ .cpuid = native_cpuid,
+ .get_debugreg = native_get_debugreg,
+ .set_debugreg = native_set_debugreg,
+ .clts = native_clts,
+ .read_cr0 = native_read_cr0,
+ .write_cr0 = native_write_cr0,
+ .read_cr2 = native_read_cr2,
+ .write_cr2 = native_write_cr2,
+ .read_cr3 = native_read_cr3,
+ .write_cr3 = native_write_cr3,
+ .read_cr4 = native_read_cr4,
+ .write_cr4 = native_write_cr4,
+ .save_fl = native_save_fl,
+ .restore_fl = native_restore_fl,
+ .irq_disable = native_irq_disable,
+ .irq_enable = native_irq_enable,
+ .safe_halt = native_raw_safe_halt,
+ .halt = native_halt,
+ .wbinvd = native_wbinvd,
+ .read_msr = native_read_msr_safe,
+ .write_msr = native_write_msr_safe,
+ .read_tsc = native_read_tsc,
+ .read_tscp = native_read_tscp,
+ .read_pmc = native_read_pmc,
+ .load_tr_desc = native_load_tr_desc,
+ .set_ldt = native_set_ldt,
+ .load_gdt = native_load_gdt,
+ .load_idt = native_load_idt,
+ .store_gdt = native_store_gdt,
+ .store_idt = native_store_idt,
+ .store_tr = native_store_tr,
+ .load_tls = native_load_tls,
+ .write_ldt_entry = native_write_ldt_entry,
+ .write_gdt_entry = native_write_gdt_entry,
+ .write_idt_entry = native_write_idt_entry,
+ .load_rsp0 = native_load_rsp0,
+
+ .io_delay = native_io_delay,
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ .apic_write = native_apic_write,
+ .apic_read = native_apic_read,
+ .setup_boot_clock = setup_boot_APIC_clock,
+ .setup_secondary_clock = setup_secondary_APIC_clock,
+ .startup_ipi_hook = paravirt_nop,
+#endif
+ .set_lazy_mode = paravirt_nop,
+ .ebda_info = native_ebda_info,
+
+ .flush_tlb_user = native_flush_tlb,
+ .flush_tlb_kernel = native_flush_tlb_all,
+ .flush_tlb_single = native_flush_tlb_one,
+ .flush_tlb_others = native_flush_tlb_others,
+
+ .release_pgd = paravirt_nop,
+
+ .set_pte = native_set_pte,
+ .set_pte_at = native_set_pte_at,
+ .set_pmd = native_set_pmd,
+ .set_pud = native_set_pud,
+ .set_pgd = native_set_pgd,
+
+ .pte_update = paravirt_nop,
+ .pte_update_defer = paravirt_nop,
+
+ .pte_clear = native_pte_clear,
+ .pmd_clear = native_pmd_clear,
+ .pud_clear = native_pud_clear,
+ .pgd_clear = native_pgd_clear,
+
+ .pte_val = native_pte_val,
+ .pud_val = native_pud_val,
+ .pmd_val = native_pmd_val,
+ .pgd_val = native_pgd_val,
+
+ .make_pte = native_make_pte,
+ .make_pmd = native_make_pmd,
+ .make_pud = native_make_pud,
+ .make_pgd = native_make_pgd,
+
+ .swapgs = native_swapgs,
+ .sysret = native_sysret,
+ .iret = native_iret,
+
+ .dup_mmap = paravirt_nop,
+ .exit_mmap = paravirt_nop,
+ .activate_mm = paravirt_nop,
+
+};
+
+EXPORT_SYMBOL(paravirt_ops);
diff --git a/arch/x86_64/kernel/vmlinux.lds.S b/arch/x86_64/kernel/vmlinux.lds.S
index ba8ea97..c3fce85 100644
--- a/arch/x86_64/kernel/vmlinux.lds.S
+++ b/arch/x86_64/kernel/vmlinux.lds.S
@@ -185,6 +185,12 @@ SECTIONS
.altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
*(.altinstr_replacement)
}
+ . = ALIGN(8);
+ .parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
+ __parainstructions = .;
+ *(.parainstructions)
+ __parainstructions_end = .;
+ }
/* .exit.text is discard at runtime, not link time, to deal with references
from .altinstructions and .eh_frame */
.exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { *(.exit.text) }
diff --git a/include/asm-x86_64/paravirt.h b/include/asm-x86_64/paravirt.h
new file mode 100644
index 0000000..fb0347e
--- /dev/null
+++ b/include/asm-x86_64/paravirt.h
@@ -0,0 +1,901 @@
+#ifndef __ASM_PARAVIRT_H
+#define __ASM_PARAVIRT_H
+
+#ifdef CONFIG_PARAVIRT
+/* Various instructions on x86 need to be replaced for
+ * para-virtualization: those hooks are defined here. */
+#include <linux/linkage.h>
+#include <linux/stringify.h>
+#include <asm/desc_defs.h>
+#include <asm/page.h>
+#include <asm/types.h>
+#include <asm/pda.h>
+
+/* Bitmask of what can be clobbered: usually at least rax. */
+#define CLBR_NONE 0x000
+#define CLBR_RAX 0x001
+#define CLBR_RDI 0x002
+#define CLBR_RSI 0x004
+#define CLBR_RCX 0x008
+#define CLBR_RDX 0x010
+#define CLBR_R8 0x020
+#define CLBR_R9 0x040
+#define CLBR_R10 0x080
+#define CLBR_R11 0x100
+#define CLBR_ANY 0xfff
+
+
+#ifndef __ASSEMBLY__
+#include <linux/cpumask.h>
+#include <linux/types.h>
+
+void _paravirt_nop(void);
+#define paravirt_nop ((void *)_paravirt_nop)
+
+/* Lazy mode for batching updates / context switch */
+enum paravirt_lazy_mode {
+ PARAVIRT_LAZY_NONE = 0,
+ PARAVIRT_LAZY_MMU = 1,
+ PARAVIRT_LAZY_CPU = 2,
+ PARAVIRT_LAZY_FLUSH = 3,
+};
+
+struct thread_struct;
+struct desc_struct;
+struct desc_ptr;
+struct tss_struct;
+struct mm_struct;
+
+/*
+ * Integers must be used with care here. They can break the PARAVIRT_PATCH(x)
+ * macro, which divides the offset in the structure by 8 to get the number
+ * associated with a hook. Dividing by four instead would be a solution, but
+ * it would limit the future growth of the structure.
+ *
+ * The first two integers are okay because they are packed together and
+ * sum up to the size of a long.
+ */
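+/*
+ * For example, with the layout below kernel_rpl and paravirt_enabled
+ * together occupy the first 8 bytes, name and mem_type take the next 16,
+ * so the first hook satisfies
+ * PARAVIRT_PATCH(patch) == offsetof(struct paravirt_ops, patch) / 8 == 3.
+ */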
+struct paravirt_ops
+{
+ unsigned int kernel_rpl;
+ int paravirt_enabled;
+ const char *name;
+ char *mem_type;
+
+ /*
+ * Patch may replace one of the defined code sequences with arbitrary
+ * code, subject to the same register constraints. This generally
+ * means the code is not free to clobber any registers other than RAX.
+ * The patch function should return the number of bytes of code
+ * generated, as we nop pad the rest in generic code.
+ */
+ unsigned (*patch)(u8 type, u16 clobber, void *firstinsn, unsigned len);
+
+ int (*arch_setup)(void);
+ void (*memory_setup)(void);
+ void (*init_IRQ)(void);
+ void (*time_init)(void);
+
+ /* entry point for our hypervisor syscall handler */
+ void (*syscall_init)(void);
+
+ void (*banner)(void);
+
+ unsigned long (*get_wallclock)(void);
+ int (*set_wallclock)(unsigned long);
+
+ /* cpuid emulation, mostly so that caps bits can be disabled */
+ void (*cpuid)(unsigned int *eax, unsigned int *ebx,
+ unsigned int *ecx, unsigned int *edx);
+
+ unsigned long (*get_debugreg)(int regno);
+ void (*set_debugreg)(unsigned long value, int regno);
+
+ void (*clts)(void);
+
+ unsigned long (*read_cr0)(void);
+ void (*write_cr0)(unsigned long);
+
+ unsigned long (*read_cr2)(void);
+ void (*write_cr2)(unsigned long);
+
+ unsigned long (*read_cr3)(void);
+ void (*write_cr3)(unsigned long);
+
+ unsigned long (*read_cr4)(void);
+ void (*write_cr4)(unsigned long);
+
+ /*
+ * Get/set interrupt state. save_fl and restore_fl are only
+ * expected to use X86_EFLAGS_IF; all other bits
+ * returned from save_fl are undefined, and may be ignored by
+ * restore_fl.
+ */
+ unsigned long (*save_fl)(void);
+ void (*restore_fl)(unsigned long);
+ void (*irq_disable)(void);
+ void (*irq_enable)(void);
+ void (*safe_halt)(void);
+ void (*halt)(void);
+
+ void (*wbinvd)(void);
+
+	/* read_msr sets err to 0/-EIO; wrmsr returns 0/-EFAULT. */
+ unsigned long (*read_msr)(unsigned int msr, int *err);
+ long (*write_msr)(unsigned int msr, unsigned long val);
+
+ unsigned long (*read_tsc)(void);
+ unsigned long (*read_tscp)(int *aux);
+ unsigned long (*read_pmc)(int counter);
+
+ void (*load_tr_desc)(void);
+ void (*load_gdt)(const struct desc_ptr *);
+ void (*load_idt)(const struct desc_ptr *);
+ void (*store_gdt)(struct desc_ptr *);
+ void (*store_idt)(struct desc_ptr *);
+ void (*set_ldt)(const void *desc, unsigned entries);
+ unsigned long (*store_tr)(void);
+ void (*load_tls)(struct thread_struct *t, unsigned int cpu);
+ void (*write_ldt_entry)(struct desc_struct *,
+ int entrynum, u32 low, u32 high);
+ void (*write_gdt_entry)(void *ptr, void *entry, unsigned type,
+ unsigned size);
+ void (*write_idt_entry)(void *adr, struct gate_struct *s);
+
+ void (*load_rsp0)(struct tss_struct *tss,
+ struct thread_struct *thread);
+
+ void (*io_delay)(void);
+
+ /*
+ * Hooks for intercepting the creation/use/destruction of an
+ * mm_struct.
+ */
+ void (*activate_mm)(struct mm_struct *prev,
+ struct mm_struct *next);
+ void (*dup_mmap)(struct mm_struct *oldmm,
+ struct mm_struct *mm);
+ void (*exit_mmap)(struct mm_struct *mm);
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ void (*apic_write)(unsigned long reg, unsigned int v);
+ unsigned int (*apic_read)(unsigned long reg);
+ void (*setup_boot_clock)(void);
+ void (*setup_secondary_clock)(void);
+
+ void (*startup_ipi_hook)(int phys_apicid,
+ unsigned long start_rip,
+ unsigned long start_rsp);
+
+#endif
+
+ void (*flush_tlb_user)(void);
+ void (*flush_tlb_kernel)(void);
+ void (*flush_tlb_single)(unsigned long addr);
+ void (*flush_tlb_others)(cpumask_t cpus, struct mm_struct *mm,
+ unsigned long va);
+
+ void (*release_pgd)(pgd_t *pgd);
+
+ void (*set_pte)(pte_t *ptep, pte_t pteval);
+ void (*set_pte_at)(struct mm_struct *mm, u64 addr, pte_t *ptep, pte_t pteval);
+ void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
+ void (*set_pud)(pud_t *pudp, pud_t pudval);
+ void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval);
+
+ void (*pte_update)(struct mm_struct *mm, u64 addr, pte_t *ptep);
+ void (*pte_update_defer)(struct mm_struct *mm, u64 addr, pte_t *ptep);
+
+ void (*pte_clear)(struct mm_struct *mm, u64 addr, pte_t *ptep);
+ void (*pmd_clear)(pmd_t *pmdp);
+ void (*pud_clear)(pud_t *pudp);
+ void (*pgd_clear)(pgd_t *pgdp);
+
+ unsigned long (*pte_val)(pte_t);
+ unsigned long (*pud_val)(pud_t);
+ unsigned long (*pmd_val)(pmd_t);
+ unsigned long (*pgd_val)(pgd_t);
+
+ pte_t (*make_pte)(unsigned long pte);
+ pud_t (*make_pud)(unsigned long pud);
+ pmd_t (*make_pmd)(unsigned long pmd);
+ pgd_t (*make_pgd)(unsigned long pgd);
+
+ void (*swapgs)(void);
+ void (*ebda_info)(unsigned *addr, unsigned *size);
+ void (*set_lazy_mode)(int mode);
+
+ /* These two are jmp to, not actually called. */
+ void (*sysret)(void);
+ void (*iret)(void);
+};
+
+extern struct paravirt_ops paravirt_ops;
+
+/*
+ * The type number, computed by PARAVIRT_PATCH, is derived from the
+ * offset of a hook into the paravirt_ops structure, and can therefore be
+ * freely converted back into a structure offset. This induces a limitation
+ * on what can go in the paravirt_ops structure; for further information,
+ * see the comment at the top of the struct.
+ */
+#define PARAVIRT_PATCH(x) \
+ (offsetof(struct paravirt_ops, x) / sizeof(void *))
+
+#define paravirt_type(type) \
+ [paravirt_typenum] "i" (PARAVIRT_PATCH(type))
+#define paravirt_clobber(clobber) \
+ [paravirt_clobber] "i" (clobber)
+
+/*
+ * Generate some code, and mark it as patchable by the
+ * apply_paravirt() alternate instruction patcher.
+ */
+#define _paravirt_alt(insn_string, type, clobber) \
+ "771:\n\t" insn_string "\n" "772:\n" \
+ ".pushsection .parainstructions,\"a\"\n" \
+ ".align 8\n" \
+ " .quad 771b\n" \
+ " .byte " type "\n" \
+ " .byte 772b-771b\n" \
+ " .long " clobber "\n" \
+ ".popsection\n"
+
+/* Generate patchable code, with the default asm parameters. */
+#define paravirt_alt(insn_string) \
+ _paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
+
+unsigned paravirt_patch_nop(void);
+unsigned paravirt_patch_ignore(unsigned len);
+unsigned paravirt_patch_call(void *target, u16 tgt_clobbers,
+ void *site, u16 site_clobbers,
+ unsigned len);
+unsigned paravirt_patch_jmp(void *target, void *site, unsigned len);
+unsigned paravirt_patch_default(u8 type, u16 clobbers, void *site, unsigned len);
+unsigned paravirt_patch_copy_reg(void *site, unsigned len);
+unsigned paravirt_patch_store_reg(void *site, unsigned len);
+unsigned paravirt_patch_insns(void *site, unsigned len,
+ const char *start, const char *end);
+/*
+ * This generates an indirect call based on the operation type number.
+ * The type number, computed in PARAVIRT_PATCH, is derived from the
+ * offset into the paravirt_ops structure, and can therefore be freely
+ * converted back into a structure offset.
+ */
+#define PARAVIRT_CALL "call *(paravirt_ops+%c[paravirt_typenum]*8);"
+
+/*
+ * These macros are intended to wrap calls into a paravirt_ops
+ * operation, so that they can be later identified and patched at
+ * runtime.
+ *
+ * Normally, a call to a pv_op function is a simple indirect call:
+ * (paravirt_ops.operations)(args...).
+ *
+ * Unfortunately, this is a relatively slow operation for modern CPUs,
+ * because it cannot necessarily determine what the destination
+ * address is. In this case, the address is a runtime constant, so at
+ * the very least we can patch the call to be a simple direct call, or
+ * ideally, patch an inline implementation into the callsite. (Direct
+ * calls are essentially free, because the call and return addresses
+ * are completely predictable.)
+ *
+ * All caller-saved registers are expected to be modified
+ * (either clobbered or used for return values): the return
+ * value (rax), the argument registers potentially used by the functions
+ * (rdi, rsi, rdx, rcx), and the remaining caller-saved ones (r8-r11).
+ *
+ * The call instruction itself is marked by placing its start address
+ * and size into the .parainstructions section, so that
+ * apply_paravirt() in arch/i386/kernel/alternative.c can do the
+ * appropriate patching under the control of the backend paravirt_ops
+ * implementation.
+ *
+ * Unfortunately there's no way to get gcc to generate the args setup
+ * for the call, and then allow the call itself to be generated by an
+ * inline asm. Because of this, we must do the complete arg setup and
+ * return value handling from within these macros. This is fairly
+ * cumbersome.
+ *
+ * There are 5 sets of PVOP_* macros for dealing with 0-4 arguments.
+ * It could be extended to more arguments, but there would be little
+ * to be gained from that. For each number of arguments, there are
+ * the two VCALL and CALL variants for void and non-void functions.
+ * Small structures are passed and returned in registers. The macro
+ * calling convention can't directly deal with this, so the wrapper
+ * functions must do this.
+ *
+ * These PVOP_* macros are only defined within this header. This
+ * means that all uses must be wrapped in inline functions. This also
+ * makes sure the incoming and outgoing types are always correct.
+*/
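+/*
+ * As a rough illustration, a wrapper like
+ *
+ *	static inline unsigned long read_cr2(void)
+ *	{
+ *		return PVOP_CALL0(unsigned long, read_cr2);
+ *	}
+ *
+ * boils down to an indirect call of the form
+ *
+ *	call *(paravirt_ops + PARAVIRT_PATCH(read_cr2) * 8)
+ *
+ * with the call site recorded in .parainstructions, so apply_paravirt()
+ * can later turn it into a direct call or an inline sequence.
+ */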
+#define CALL_CLOBBERS "r8", "r9", "r10", "r11"
+
+#define __PVOP_CALL(rettype, op, pre, post, ...) \
+ ({ \
+ rettype __ret; \
+ unsigned long __rax, __rdi, __rsi, __rdx, __rcx; \
+ asm volatile(pre \
+ paravirt_alt(PARAVIRT_CALL) \
+ post \
+ : "=a" (__rax), "=D" (__rdi), \
+ "=S" (__rsi), "=d" (__rdx), \
+ "=c" (__rcx) \
+ : paravirt_type(op), \
+ paravirt_clobber(CLBR_ANY), \
+ ##__VA_ARGS__ \
+ : "memory", CALL_CLOBBERS, "cc"); \
+ __ret = (rettype)__rax; \
+ })
+
+#define __PVOP_VCALL(op, pre, post, ...) \
+ ({ \
+ unsigned long __rax, __rdi, __rsi, __rdx, __rcx; \
+ asm volatile(pre \
+ paravirt_alt(PARAVIRT_CALL) \
+ post \
+ : "=a" (__rax), "=D" (__rdi), \
+ "=S" (__rsi), "=d" (__rdx), \
+ "=c" (__rcx) \
+ : paravirt_type(op), \
+ paravirt_clobber(CLBR_ANY), \
+ ##__VA_ARGS__ \
+ : "memory", CALL_CLOBBERS, "cc"); \
+ })
+
+#define PVOP_CALL0(rettype, op) \
+ __PVOP_CALL(rettype, op, "", "")
+#define PVOP_VCALL0(op) \
+ __PVOP_VCALL(op, "", "")
+
+#define PVOP_CALL1(rettype, op, arg1) \
+ __PVOP_CALL(rettype, op, "", "", "D" ((u64)(arg1)))
+#define PVOP_VCALL1(op, arg1) \
+ __PVOP_VCALL(op, "", "", "D" ((u64)(arg1)))
+
+#define PVOP_CALL2(rettype, op, arg1, arg2) \
+ __PVOP_CALL(rettype, op, "", "", "D" ((u64)(arg1)), "S" ((u64)(arg2)))
+#define PVOP_VCALL2(op, arg1, arg2) \
+ __PVOP_VCALL(op, "", "", "D" ((u64)(arg1)), "S" ((u64)(arg2)))
+
+#define PVOP_CALL3(rettype, op, arg1, arg2, arg3) \
+ __PVOP_CALL(rettype, op, "", "", "D" ((u64)(arg1)), \
+ "S"((u64)(arg2)), "d"((u64)(arg3)))
+#define PVOP_VCALL3(op, arg1, arg2, arg3) \
+ __PVOP_VCALL(op, "", "", "D" ((u64)(arg1)), "S"((u64)(arg2)), \
+ "d"((u64)(arg3)))
+
+#define PVOP_CALL4(rettype, op, arg1, arg2, arg3, arg4) \
+ __PVOP_CALL(rettype, op, "", "", "D" ((u64)(arg1)), \
+ "S"((u64)(arg2)), "d"((u64)(arg3)), "c" ((u64)(arg4)))
+#define PVOP_VCALL4(op, arg1, arg2, arg3, arg4) \
+ __PVOP_VCALL(op, "", "", "D" ((u64)(arg1)), "S"((u64)(arg2)), \
+ "d"((u64)(arg3)), "c"((u64)(arg4)))
+
+#define paravirt_arch_setup() paravirt_ops.arch_setup()
+
+#define get_kernel_rpl (paravirt_ops.kernel_rpl)
+
+static inline int paravirt_enabled(void)
+{
+ return paravirt_ops.paravirt_enabled;
+}
+
+static inline void load_rsp0(struct tss_struct *tss,
+ struct thread_struct *thread)
+{
+ PVOP_VCALL2(load_rsp0, tss,thread);
+}
+
+static inline void clts(void)
+{
+ PVOP_VCALL0(clts);
+}
+
+static inline unsigned long read_cr0(void)
+{
+ return PVOP_CALL0(unsigned long, read_cr0);
+}
+
+static inline void write_cr0(unsigned long x)
+{
+ PVOP_VCALL1(write_cr0, x);
+}
+
+static inline unsigned long read_cr2(void)
+{
+ return PVOP_CALL0(unsigned long, read_cr2);
+}
+
+static inline void write_cr2(unsigned long x)
+{
+ PVOP_VCALL1(write_cr2, x);
+}
+
+static inline unsigned long read_cr3(void)
+{
+ return PVOP_CALL0(unsigned long, read_cr3);
+}
+static inline void write_cr3(unsigned long x)
+{
+ PVOP_VCALL1(write_cr3, x);
+}
+
+static inline unsigned long read_cr4(void)
+{
+ return PVOP_CALL0(unsigned long, read_cr4);
+}
+static inline void write_cr4(unsigned long x)
+{
+ PVOP_VCALL1(write_cr4, x);
+}
+
+static inline void wbinvd(void)
+{
+ PVOP_VCALL0(wbinvd);
+}
+
+#define get_debugreg(var, reg) var = paravirt_ops.get_debugreg(reg)
+#define set_debugreg(val, reg) paravirt_ops.set_debugreg(reg, val)
+
+
+static inline void raw_safe_halt(void)
+{
+ PVOP_VCALL0(safe_halt);
+}
+
+static inline void halt(void)
+{
+	PVOP_VCALL0(halt);
+}
+
+static inline unsigned long get_wallclock(void)
+{
+ return PVOP_CALL0(unsigned long, get_wallclock);
+}
+
+static inline int set_wallclock(unsigned long nowtime)
+{
+ return PVOP_CALL1(int, set_wallclock, nowtime);
+}
+
+static inline void do_time_init(void)
+{
+ PVOP_VCALL0(time_init);
+}
+
+/* The paravirtualized CPUID instruction. */
+static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
+ unsigned int *ecx, unsigned int *edx)
+{
+ PVOP_VCALL4(cpuid, eax, ebx, ecx, edx);
+}
+
+
+static inline unsigned long read_msr(unsigned int msr)
+{
+ int __err;
+ return PVOP_CALL2(unsigned long, read_msr, msr, &__err);
+}
+
+static inline unsigned long write_msr(unsigned int msr, unsigned long val)
+{
+ return PVOP_CALL2(unsigned long, write_msr, msr, val);
+}
+
+static inline unsigned long read_msr_safe(unsigned int msr, int *err)
+{
+ return PVOP_CALL2(unsigned long, read_msr, msr, err);
+}
+
+static inline unsigned int write_msr_safe(unsigned int msr, unsigned long val)
+{
+ return PVOP_CALL2(unsigned long, write_msr, msr, val);
+}
+
+static inline unsigned long read_pmc(int counter)
+{
+ return PVOP_CALL1(unsigned long, read_pmc, counter);
+}
+
+static inline unsigned long read_tsc_reg(void)
+{
+ return PVOP_CALL0(unsigned long, read_tsc);
+}
+static inline unsigned long read_tscp(int *aux)
+{
+ return PVOP_CALL1(unsigned long, read_tscp, aux);
+}
+
+static inline void load_TR_desc(void)
+{
+ PVOP_VCALL0(load_tr_desc);
+}
+
+static inline void load_gdt(const struct desc_ptr *dtr)
+{
+ PVOP_VCALL1(load_gdt, dtr);
+}
+
+static inline void load_idt(const struct desc_ptr *dtr)
+{
+ PVOP_VCALL1(load_idt, dtr);
+}
+
+static inline void set_ldt(void *addr, unsigned entries)
+{
+ PVOP_VCALL2(set_ldt, addr, entries);
+}
+
+static inline void store_gdt(struct desc_ptr *dtr)
+{
+ PVOP_VCALL1(store_gdt, dtr);
+}
+
+static inline void store_idt(struct desc_ptr *dtr)
+{
+ PVOP_VCALL1(store_idt, dtr);
+}
+
+static inline unsigned long paravirt_store_tr(void)
+{
+ return PVOP_CALL0(unsigned long, store_tr);
+}
+
+#define store_tr(tr) (tr) = paravirt_store_tr()
+
+static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
+{
+ PVOP_VCALL2(load_tls, t,cpu);
+}
+
+static inline void write_ldt_entry(struct desc_struct *desc,
+ int num, u32 entry1, u32 entry2)
+{
+ PVOP_VCALL4(write_ldt_entry, desc, num, entry1, entry2);
+}
+
+static inline void write_gdt_entry(void *ptr, void *entry,
+ unsigned type, unsigned size)
+{
+ PVOP_VCALL4(write_gdt_entry, ptr, entry, type, size);
+}
+
+static inline void write_idt_entry(void *adr, struct gate_struct *s)
+{
+ PVOP_VCALL2(write_idt_entry, adr, s);
+}
+
+static inline pte_t __pte(unsigned long pte)
+{
+ return (pte_t) {PVOP_CALL1(unsigned long, make_pte, pte)};
+}
+static inline unsigned long pte_val(pte_t pte)
+{
+ return PVOP_CALL1(unsigned long, pte_val, pte.pte);
+}
+
+static inline pgd_t __pgd(unsigned long pgd)
+{
+ return (pgd_t) {PVOP_CALL1(unsigned long, make_pgd, pgd)};
+}
+static inline unsigned long pgd_val(pgd_t pgd)
+{
+ return PVOP_CALL1(unsigned long, pgd_val, pgd.pgd);
+}
+
+static inline pud_t __pud(unsigned long pud)
+{
+ return (pud_t) {PVOP_CALL1(unsigned long, make_pud, pud)};
+}
+static inline unsigned long pud_val(pud_t pud)
+{
+ return PVOP_CALL1(unsigned long, pud_val, pud.pud);
+}
+
+static inline pmd_t __pmd(unsigned long pmd)
+{
+ return (pmd_t) {PVOP_CALL1(unsigned long, make_pmd, pmd)};
+}
+static inline unsigned long pmd_val(pmd_t pmd)
+{
+ return PVOP_CALL1(unsigned long, pmd_val, pmd.pmd);
+}
+
+/* The paravirtualized I/O functions */
+static inline void slow_down_io(void) {
+ paravirt_ops.io_delay();
+#ifdef REALLY_SLOW_IO
+ paravirt_ops.io_delay();
+ paravirt_ops.io_delay();
+ paravirt_ops.io_delay();
+#endif
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+/*
+ * Basic functions accessing APICs.
+ */
+static inline void apic_write(unsigned long reg, unsigned long v)
+{
+ PVOP_VCALL2(apic_write, reg,v);
+}
+
+static inline unsigned int apic_read(unsigned long reg)
+{
+ return PVOP_CALL1(unsigned long, apic_read, reg);
+}
+
+static inline void setup_boot_clock(void)
+{
+ PVOP_VCALL0(setup_boot_clock);
+}
+
+static inline void setup_secondary_clock(void)
+{
+ PVOP_VCALL0(setup_secondary_clock);
+}
+
+static inline void startup_ipi_hook(int phys_apicid, unsigned long start_rip,
+ unsigned long init_rsp)
+{
+ PVOP_VCALL3(startup_ipi_hook, phys_apicid, start_rip, init_rsp);
+}
+
+#endif
+
+void native_nop(void);
+
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+ struct mm_struct *next)
+{
+ PVOP_VCALL2(activate_mm, prev, next);
+}
+
+static inline void arch_dup_mmap(struct mm_struct *oldmm,
+ struct mm_struct *mm)
+{
+ PVOP_VCALL2(dup_mmap, oldmm, mm);
+}
+
+static inline void arch_exit_mmap(struct mm_struct *mm)
+{
+ PVOP_VCALL1(exit_mmap, mm);
+}
+
+static inline void __flush_tlb(void)
+{
+ PVOP_VCALL0(flush_tlb_user);
+}
+static inline void __flush_tlb_all(void)
+{
+ PVOP_VCALL0(flush_tlb_kernel);
+}
+static inline void __flush_tlb_one(unsigned long addr)
+{
+ PVOP_VCALL1(flush_tlb_single, addr);
+}
+
+static inline void paravirt_release_pgd(pgd_t *pgd)
+{
+ PVOP_VCALL1(release_pgd, pgd);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pteval)
+{
+ PVOP_VCALL2(set_pte, ptep, pteval.pte);
+}
+
+static inline void set_pte_at(struct mm_struct *mm, u64 addr, pte_t *ptep, pte_t pteval)
+{
+ PVOP_VCALL4(set_pte_at, mm, addr, ptep, pteval.pte);
+}
+
+static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
+{
+ PVOP_VCALL2(set_pmd, pmdp, pmdval.pmd);
+}
+
+static inline void pte_update(struct mm_struct *mm, u64 addr, pte_t *ptep)
+{
+ PVOP_VCALL3(pte_update, mm, addr, ptep);
+}
+
+static inline void pte_update_defer(struct mm_struct *mm, u64 addr, pte_t *ptep)
+{
+ PVOP_VCALL3(pte_update_defer, mm, addr, ptep);
+}
+
+
+static inline void set_pgd(pgd_t *pgdp, pgd_t pgdval)
+{
+ PVOP_VCALL2(set_pgd, pgdp, pgdval.pgd);
+}
+
+static inline void set_pud(pud_t *pudp, pud_t pudval)
+{
+ PVOP_VCALL2(set_pud, pudp, pudval.pud);
+}
+
+static inline void pte_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ PVOP_VCALL3(pte_clear, mm, addr, ptep);
+}
+
+static inline void pmd_clear(pmd_t *pmdp)
+{
+ PVOP_VCALL1(pmd_clear, pmdp);
+}
+
+static inline void pud_clear(pud_t *pudp)
+{
+ PVOP_VCALL1(pud_clear, pudp);
+}
+
+static inline void pgd_clear(pgd_t *pgdp)
+{
+ PVOP_VCALL1(pgd_clear, pgdp);
+}
+
+#define __HAVE_ARCH_ENTER_LAZY_CPU_MODE
+#define arch_enter_lazy_cpu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_CPU)
+#define arch_leave_lazy_cpu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
+
+#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#define arch_enter_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_MMU)
+#define arch_leave_lazy_mmu_mode() paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
+
+/* These functions tend to be very simple. So, if they touch any register,
+ * the callee-saved ones may already fulfill their needs, and hopefully we
+ * have no need to save any. */
+static inline unsigned long __raw_local_save_flags(void)
+{
+ unsigned long f;
+
+ __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)
+ : "=a"(f)
+ : paravirt_type(save_fl),
+ paravirt_clobber(CLBR_RAX)
+ : "memory", "cc");
+ return f;
+}
+
+static inline void raw_local_irq_restore(unsigned long f)
+{
+ __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)
+ :
+ : "D" (f),
+ paravirt_type(restore_fl),
+ paravirt_clobber(CLBR_RAX)
+ : "memory", "rax", "cc");
+}
+
+static inline void raw_local_irq_disable(void)
+{
+ __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)
+ :
+ : paravirt_type(irq_disable),
+ paravirt_clobber(CLBR_RAX)
+ : "memory", "rax", "cc");
+}
+
+static inline void raw_local_irq_enable(void)
+{
+ __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)
+ :
+ : paravirt_type(irq_enable),
+ paravirt_clobber(CLBR_RAX)
+ : "memory", "rax", "cc");
+}
+
+/* These all sit in the .parainstructions section to tell us what to patch. */
+struct paravirt_patch_site {
+ u8 *instr; /* original instructions */
+ u8 instrtype; /* type of this instruction */
+ u8 len; /* length of original instruction */
+ u32 clobbers; /* what registers you may clobber */
+} __attribute__((aligned(8)));
+
+extern struct paravirt_patch_site __parainstructions[],
+ __parainstructions_end[];
+
+#define CLI_STRING _paravirt_alt("call *paravirt_ops+%c[paravirt_cli_type];", \
+ "%c[paravirt_cli_type]", "%c[paravirt_clobber]")
+
+#define STI_STRING _paravirt_alt("call *paravirt_ops+%c[paravirt_sti_type];", \
+ "%c[paravirt_sti_type]", "%c[paravirt_clobber]")
+
+/* XXX: Should we clobber more? */
+#define CLI_STI_CLOBBERS "rax"
+#define CLI_STI_INPUT_ARGS \
+ [paravirt_cli_type] "i" (PARAVIRT_PATCH(irq_disable)), \
+ [paravirt_sti_type] "i" (PARAVIRT_PATCH(irq_enable)), \
+ paravirt_clobber(CLBR_RAX)
+
+#else /* __ASSEMBLY__ */
+
+/* Make sure as little as possible of this mess escapes. */
+#undef PARAVIRT_CALL
+#undef __PVOP_CALL
+#undef __PVOP_VCALL
+#undef PVOP_VCALL0
+#undef PVOP_CALL0
+#undef PVOP_VCALL1
+#undef PVOP_CALL1
+#undef PVOP_VCALL2
+#undef PVOP_CALL2
+#undef PVOP_VCALL3
+#undef PVOP_CALL3
+#undef PVOP_VCALL4
+#undef PVOP_CALL4
+
+#define PARA_PATCH(off) ((off) / 8)
+
+#define PARA_SITE(ptype, clobbers, ops) \
+771:; \
+ ops; \
+772:; \
+ .pushsection .parainstructions,"a"; \
+ .align 8; \
+ .quad 771b; \
+ .byte ptype; \
+ .byte 772b-771b; \
+ .long clobbers; \
+ .popsection
+
+/*
+ * For DISABLE/ENABLE_INTERRUPTS and SWAPGS
+ * we'll save some regs, but the callee needs to be careful
+ * not to touch others. We'll save the normal rax, rdi,
+ * rcx and rdx, but that's it!
+ */
+#define DISABLE_INTERRUPTS(clobbers) \
+ PARA_SITE(PARA_PATCH(PARAVIRT_irq_disable), clobbers, \
+ pushq %rax; pushq %rdi; pushq %rcx; pushq %rdx; \
+ call *paravirt_ops+PARAVIRT_irq_disable; \
+ popq %rdx; popq %rcx; popq %rdi; popq %rax; \
+ );
+
+#define ENABLE_INTERRUPTS(clobbers) \
+ PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable), clobbers, \
+ pushq %rax; pushq %rdi; pushq %rcx; pushq %rdx; \
+ call *%cs:paravirt_ops+PARAVIRT_irq_enable; \
+ popq %rdx; popq %rcx; popq %rdi; popq %rax; \
+ );
+
+#define SWAPGS \
+ PARA_SITE(PARA_PATCH(PARAVIRT_swapgs), CLBR_NONE, \
+ pushq %rax; pushq %rdi; pushq %rcx; pushq %rdx; \
+ call *paravirt_ops+PARAVIRT_swapgs; \
+ popq %rdx; popq %rcx; popq %rdi; popq %rax; \
+ );
+
+/*
+ * SYSRETQ and INTERRUPT_RETURN do not return; we jump to them rather
+ * than calling, so it is entirely up to the callee to preserve any
+ * registers it needs.
+ */
+#define SYSRETQ \
+ PARA_SITE(PARA_PATCH(PARAVIRT_sysret), CLBR_ANY, \
+ jmp *%cs:paravirt_ops+PARAVIRT_sysret)
+
+#define INTERRUPT_RETURN \
+ PARA_SITE(PARA_PATCH(PARAVIRT_iret), CLBR_NONE, \
+ jmp *%cs:paravirt_ops+PARAVIRT_iret)
+
+
+/* this is needed in early_idt_handler */
+#define GET_CR2_INTO_RCX \
+ call *paravirt_ops+PARAVIRT_read_cr2; \
+ movq %rax, %rcx; \
+ xorq %rax, %rax; \
+
+#endif /* __ASSEMBLY__ */
+
+#else
+# error "You should not include paravirt headers without paravirt support"
+#endif /* CONFIG_PARAVIRT */
+
+#endif /* __ASM_PARAVIRT_H */
--
1.4.4.2

2007-08-08 07:16:11

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 14/25] [PATCH] add native functions for descriptors handling

This patch turns the basic descriptor handling into native_
functions. It covers basically write_idt, load_idt, write_gdt,
load_gdt, set_ldt, store_tr, load_tls, and the ones
for updating a single entry.

In the process of doing that, we change the definition of
load_LDT_nolock, so its call sites have to be patched. We
also patch call sites that now need a typecast.
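
As an illustration of what this enables (the demo_ names below are made up
for this example and are not part of the patch), a paravirt backend can
now supply its own descriptor handling instead of the raw lidt/lgdt:

	/* hypothetical guest backend: trap table handed to the hypervisor */
	static void demo_load_idt(const struct desc_ptr *ptr)
	{
		demo_hypercall_set_trap_table(ptr);	/* instead of lidt */
	}

	paravirt_ops.load_idt = demo_load_idt;

On bare hardware the hooks keep pointing to the native_ versions added here.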

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/head64.c | 2 +-
arch/x86_64/kernel/ldt.c | 6 +-
arch/x86_64/kernel/reboot.c | 3 +-
arch/x86_64/kernel/setup64.c | 4 +-
arch/x86_64/kernel/suspend.c | 11 ++-
include/asm-x86_64/desc.h | 183 +++++++++++++++++++++++++++----------
include/asm-x86_64/mmu_context.h | 4 +-
7 files changed, 148 insertions(+), 65 deletions(-)

diff --git a/arch/x86_64/kernel/head64.c b/arch/x86_64/kernel/head64.c
index 6c34bdd..a0d05d7 100644
--- a/arch/x86_64/kernel/head64.c
+++ b/arch/x86_64/kernel/head64.c
@@ -70,7 +70,7 @@ void __init x86_64_start_kernel(char * real_mode_data)

for (i = 0; i < IDT_ENTRIES; i++)
set_intr_gate(i, early_idt_handler);
- asm volatile("lidt %0" :: "m" (idt_descr));
+ load_idt(&idt_descr);

early_printk("Kernel alive\n");

diff --git a/arch/x86_64/kernel/ldt.c b/arch/x86_64/kernel/ldt.c
index bc9ffd5..8e6fcc1 100644
--- a/arch/x86_64/kernel/ldt.c
+++ b/arch/x86_64/kernel/ldt.c
@@ -173,7 +173,7 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)
{
struct task_struct *me = current;
struct mm_struct * mm = me->mm;
- __u32 entry_1, entry_2, *lp;
+ __u32 entry_1, entry_2;
int error;
struct user_desc ldt_info;

@@ -202,7 +202,6 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)
goto out_unlock;
}

- lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.ldt);

/* Allow LDTs to be cleared by the user. */
if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
@@ -220,8 +219,7 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode)

/* Install the new entry ... */
install:
- *lp = entry_1;
- *(lp+1) = entry_2;
+ write_ldt_entry(mm->context.ldt, ldt_info.entry_number, entry_1, entry_2);
error = 0;

out_unlock:
diff --git a/arch/x86_64/kernel/reboot.c b/arch/x86_64/kernel/reboot.c
index 368db2b..ebc242c 100644
--- a/arch/x86_64/kernel/reboot.c
+++ b/arch/x86_64/kernel/reboot.c
@@ -11,6 +11,7 @@
#include <linux/sched.h>
#include <asm/io.h>
#include <asm/delay.h>
+#include <asm/desc.h>
#include <asm/hw_irq.h>
#include <asm/system.h>
#include <asm/pgtable.h>
@@ -136,7 +137,7 @@ void machine_emergency_restart(void)
}

case BOOT_TRIPLE:
- __asm__ __volatile__("lidt (%0)": :"r" (&no_idt));
+ load_idt((struct desc_ptr *)&no_idt);
__asm__ __volatile__("int3");

reboot_type = BOOT_KBD;
diff --git a/arch/x86_64/kernel/setup64.c b/arch/x86_64/kernel/setup64.c
index 395cf02..49f7342 100644
--- a/arch/x86_64/kernel/setup64.c
+++ b/arch/x86_64/kernel/setup64.c
@@ -224,8 +224,8 @@ void __cpuinit cpu_init (void)
memcpy(cpu_gdt(cpu), cpu_gdt_table, GDT_SIZE);

cpu_gdt_descr[cpu].size = GDT_SIZE;
- asm volatile("lgdt %0" :: "m" (cpu_gdt_descr[cpu]));
- asm volatile("lidt %0" :: "m" (idt_descr));
+ load_gdt(&cpu_gdt_descr[cpu]);
+ load_idt(&idt_descr);

memset(me->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
syscall_init();
diff --git a/arch/x86_64/kernel/suspend.c b/arch/x86_64/kernel/suspend.c
index 573c0a6..24055b6 100644
--- a/arch/x86_64/kernel/suspend.c
+++ b/arch/x86_64/kernel/suspend.c
@@ -32,9 +32,9 @@ void __save_processor_state(struct saved_context *ctxt)
/*
* descriptor tables
*/
- asm volatile ("sgdt %0" : "=m" (ctxt->gdt_limit));
- asm volatile ("sidt %0" : "=m" (ctxt->idt_limit));
- asm volatile ("str %0" : "=m" (ctxt->tr));
+ store_gdt((struct desc_ptr *)&ctxt->gdt_limit);
+ store_idt((struct desc_ptr *)&ctxt->idt_limit);
+ store_tr(ctxt->tr);

/* XMM0..XMM15 should be handled by kernel_fpu_begin(). */
/*
@@ -91,8 +91,9 @@ void __restore_processor_state(struct saved_context *ctxt)
* now restore the descriptor tables to their proper values
* ltr is done i fix_processor_context().
*/
- asm volatile ("lgdt %0" :: "m" (ctxt->gdt_limit));
- asm volatile ("lidt %0" :: "m" (ctxt->idt_limit));
+ load_gdt((struct desc_ptr *)&ctxt->gdt_limit);
+ load_idt((struct desc_ptr *)&ctxt->idt_limit);
+

/*
* segment registers
diff --git a/include/asm-x86_64/desc.h b/include/asm-x86_64/desc.h
index ac991b5..5710e52 100644
--- a/include/asm-x86_64/desc.h
+++ b/include/asm-x86_64/desc.h
@@ -16,9 +16,17 @@

extern struct desc_struct cpu_gdt_table[GDT_ENTRIES];

-#define load_TR_desc() asm volatile("ltr %w0"::"r" (GDT_ENTRY_TSS*8))
-#define load_LDT_desc() asm volatile("lldt %w0"::"r" (GDT_ENTRY_LDT*8))
-#define clear_LDT() asm volatile("lldt %w0"::"r" (0))
+static inline void native_load_tr_desc(void)
+{
+ asm volatile("ltr %w0"::"r" (GDT_ENTRY_TSS*8));
+}
+
+static inline unsigned long native_store_tr(void)
+{
+ unsigned long tr;
+ asm volatile ("str %w0":"=r" (tr));
+ return tr;
+}

/*
* This is the ldt that every process will get unless we need
@@ -31,44 +39,55 @@ extern struct desc_ptr cpu_gdt_descr[];
/* the cpu gdt accessor */
#define cpu_gdt(_cpu) ((struct desc_struct *)cpu_gdt_descr[_cpu].address)

-static inline void _set_gate(void *adr, unsigned type, unsigned long func, unsigned dpl, unsigned ist)
+static inline void native_load_gdt(const struct desc_ptr *ptr)
{
- struct gate_struct s;
- s.offset_low = PTR_LOW(func);
- s.segment = __KERNEL_CS;
- s.ist = ist;
- s.p = 1;
- s.dpl = dpl;
- s.zero0 = 0;
- s.zero1 = 0;
- s.type = type;
- s.offset_middle = PTR_MIDDLE(func);
- s.offset_high = PTR_HIGH(func);
- /* does not need to be atomic because it is only done once at setup time */
- memcpy(adr, &s, 16);
-}
+ asm volatile("lgdt %w0"::"m" (*ptr));
+}

-static inline void set_intr_gate(int nr, void *func)
-{
- BUG_ON((unsigned)nr > 0xFF);
- _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, 0);
-}
+static inline void native_store_gdt(struct desc_ptr *ptr)
+{
+ asm ("sgdt %w0":"=m" (*ptr));
+}

-static inline void set_intr_gate_ist(int nr, void *func, unsigned ist)
-{
- BUG_ON((unsigned)nr > 0xFF);
- _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, ist);
-}
+static inline void native_write_idt_entry(void *adr, struct gate_struct *s)
+{
+ /* does not need to be atomic because
+ * it is only done once at setup time */
+ memcpy(adr, s, 16);
+}

-static inline void set_system_gate(int nr, void *func)
-{
- BUG_ON((unsigned)nr > 0xFF);
- _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, 0);
-}
+static inline void native_write_gdt_entry(void *ptr, void *entry,
+ unsigned type, unsigned size)
+{
+ memcpy(ptr, entry, size);
+}

-static inline void set_system_gate_ist(int nr, void *func, unsigned ist)
+/*
+ * This one unfortunately can't go below with the others, because it already
+ * has a user anxious for its definition: set_tssldt_descriptor.
+ */
+#ifndef CONFIG_PARAVIRT
+#define write_gdt_entry(_ptr, _e, _type, _size) \
+ native_write_gdt_entry((_ptr),(_e), (_type), (_size))
+#endif
+
+static inline void native_write_ldt_entry(struct desc_struct *ldt,
+ int entrynum, u32 entry_1, u32 entry_2)
{
- _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, ist);
+ __u32 *lp;
+ lp = (__u32 *) ((entrynum << 3) + (char *) ldt);
+ *lp = entry_1;
+ *(lp+1) = entry_2;
+}
+
+static inline void native_load_idt(const struct desc_ptr *ptr)
+{
+ asm volatile("lidt %w0"::"m" (*ptr));
+}
+
+static inline void native_store_idt(struct desc_ptr *dtr)
+{
+ asm ("sidt %w0":"=m" (*dtr));
}

static inline void set_tssldt_descriptor(void *ptr, unsigned long tss, unsigned type,
@@ -84,7 +103,7 @@ static inline void set_tssldt_descriptor(void *ptr, unsigned long tss, unsigned
d.limit1 = (size >> 16) & 0xF;
d.base2 = (PTR_MIDDLE(tss) >> 8) & 0xFF;
d.base3 = PTR_HIGH(tss);
- memcpy(ptr, &d, 16);
+ write_gdt_entry(ptr, &d, type, 16);
}

static inline void set_tss_desc(unsigned cpu, void *addr)
@@ -135,7 +154,7 @@ static inline void set_ldt_desc(unsigned cpu, void *addr, int size)
(info)->useable == 0 && \
(info)->lm == 0)

-static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
+static inline void native_load_tls(struct thread_struct *t, unsigned int cpu)
{
unsigned int i;
u64 *gdt = (u64 *)(cpu_gdt(cpu) + GDT_ENTRY_TLS_MIN);
@@ -143,28 +162,92 @@ static inline void load_TLS(struct thread_struct *t, unsigned int cpu)
for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
gdt[i] = t->tls_array[i];
}
+static inline void native_set_ldt(const void *addr,
+ unsigned int entries)
+{
+ if (likely(entries == 0))
+ __asm__ __volatile__ ("lldt %w0" :: "r" (0));
+ else {
+ unsigned cpu = smp_processor_id();
+
+ set_tssldt_descriptor(&cpu_gdt(cpu)[GDT_ENTRY_LDT],
+ (unsigned long)addr, DESC_LDT,
+ entries * 8 - 1);
+ __asm__ __volatile__ ("lldt %w0"::"r" (GDT_ENTRY_LDT*8));
+ }
+}
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define load_TR_desc() native_load_tr_desc()
+#define load_gdt(ptr) native_load_gdt(ptr)
+#define load_idt(ptr) native_load_idt(ptr)
+#define load_TLS(t, cpu) native_load_tls(t, cpu)
+#define set_ldt(addr, entries) native_set_ldt(addr, entries)
+#define store_tr(tr) (tr) = native_store_tr()
+#define store_gdt(ptr) native_store_gdt(ptr)
+#define store_idt(ptr) native_store_idt(ptr)
+
+#define write_idt_entry(_adr, _s) native_write_idt_entry((_adr),(_s))
+#define write_ldt_entry(_ldt, _number, _entry1, _entry2) \
+ native_write_ldt_entry((_ldt), (_number), (_entry1), (_entry2))
+#endif
+
+static inline void _set_gate(void *adr, unsigned type, unsigned long func, unsigned dpl, unsigned ist)
+{
+ struct gate_struct s;
+ s.offset_low = PTR_LOW(func);
+ s.segment = __KERNEL_CS;
+ s.ist = ist;
+ s.p = 1;
+ s.dpl = dpl;
+ s.zero0 = 0;
+ s.zero1 = 0;
+ s.type = type;
+ s.offset_middle = PTR_MIDDLE(func);
+ s.offset_high = PTR_HIGH(func);
+ write_idt_entry(adr, &s);
+}
+
+static inline void set_intr_gate(int nr, void *func)
+{
+ BUG_ON((unsigned)nr > 0xFF);
+ _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, 0);
+}
+
+static inline void set_intr_gate_ist(int nr, void *func, unsigned ist)
+{
+ BUG_ON((unsigned)nr > 0xFF);
+ _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 0, ist);
+}
+
+static inline void set_system_gate(int nr, void *func)
+{
+ BUG_ON((unsigned)nr > 0xFF);
+ _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, 0);
+}
+
+static inline void set_system_gate_ist(int nr, void *func, unsigned ist)
+{
+ _set_gate(&idt_table[nr], GATE_INTERRUPT, (unsigned long) func, 3, ist);
+}
+
+#define clear_LDT() set_ldt(NULL,0)

/*
* load one particular LDT into the current CPU
*/
-static inline void load_LDT_nolock (mm_context_t *pc, int cpu)
+static inline void load_LDT_nolock (mm_context_t *pc)
{
- int count = pc->size;
-
- if (likely(!count)) {
- clear_LDT();
- return;
- }
-
- set_ldt_desc(cpu, pc->ldt, count);
- load_LDT_desc();
+ set_ldt(pc->ldt, pc->size);
}

static inline void load_LDT(mm_context_t *pc)
{
- int cpu = get_cpu();
- load_LDT_nolock(pc, cpu);
- put_cpu();
+ preempt_disable();
+ load_LDT_nolock(pc);
+ preempt_enable();
}

extern struct desc_ptr idt_descr;
diff --git a/include/asm-x86_64/mmu_context.h b/include/asm-x86_64/mmu_context.h
index 0cce83a..c8cdc1e 100644
--- a/include/asm-x86_64/mmu_context.h
+++ b/include/asm-x86_64/mmu_context.h
@@ -43,7 +43,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
load_cr3(next->pgd);

if (unlikely(next->context.ldt != prev->context.ldt))
- load_LDT_nolock(&next->context, cpu);
+ load_LDT_nolock(&next->context);
}
#ifdef CONFIG_SMP
else {
@@ -56,7 +56,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
* to make sure to use no freed page tables.
*/
load_cr3(next->pgd);
- load_LDT_nolock(&next->context, cpu);
+ load_LDT_nolock(&next->context);
}
}
#endif
--
1.4.4.2

2007-08-08 07:16:40

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 12/25] [PATCH] native versions for set pagetables

This patch turns the set_p{te,md,ud,gd} functions into their
native_ versions. There is no need to patch any caller.

Also, it adds pte_update() and pte_update_defer() calls whenever
we modify a page table entry. This last part was coded to match
i386 as closely as possible.

Pieces of the header are moved below the #ifdef CONFIG_PARAVIRT
site, as they are users of the newly defined set_* macros.
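
As a rough sketch of what the new hooks allow (the demo_ names are invented
for this example and are not part of the patch), a guest could use
pte_update() to tell the hypervisor about every modified entry:

	/* hypothetical guest backend: batch/forward PTE modifications */
	static void demo_pte_update(struct mm_struct *mm, u64 addr, pte_t *ptep)
	{
		demo_queue_mmu_update(ptep, pte_val(*ptep));
	}

	paravirt_ops.pte_update = demo_pte_update;

On native hardware both pte_update hooks remain paravirt_nop, so behaviour
is unchanged.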

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/pgtable.h | 152 ++++++++++++++++++++++++-----------------
1 files changed, 89 insertions(+), 63 deletions(-)

diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index c9d8764..dd572a2 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -57,55 +57,77 @@ extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
*/
#define PTRS_PER_PTE 512

-#ifndef __ASSEMBLY__
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else

-#define pte_ERROR(e) \
- printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
-#define pmd_ERROR(e) \
- printk("%s:%d: bad pmd %p(%016lx).\n", __FILE__, __LINE__, &(e), pmd_val(e))
-#define pud_ERROR(e) \
- printk("%s:%d: bad pud %p(%016lx).\n", __FILE__, __LINE__, &(e), pud_val(e))
-#define pgd_ERROR(e) \
- printk("%s:%d: bad pgd %p(%016lx).\n", __FILE__, __LINE__, &(e), pgd_val(e))
+#define set_pte native_set_pte
+#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
+#define set_pmd native_set_pmd
+#define set_pud native_set_pud
+#define set_pgd native_set_pgd
+#define pte_clear(mm,addr,xp) do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)
+#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
+#define pud_clear native_pud_clear
+#define pgd_clear native_pgd_clear
+#define pte_update(mm, addr, ptep) do { } while (0)
+#define pte_update_defer(mm, addr, ptep) do { } while (0)

-#define pgd_none(x) (!pgd_val(x))
-#define pud_none(x) (!pud_val(x))
+#endif
+
+#ifndef __ASSEMBLY__

-static inline void set_pte(pte_t *dst, pte_t val)
+static inline void native_set_pte(pte_t *dst, pte_t val)
{
- pte_val(*dst) = pte_val(val);
+ dst->pte = pte_val(val);
}
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)

-static inline void set_pmd(pmd_t *dst, pmd_t val)
+
+static inline void native_set_pmd(pmd_t *dst, pmd_t val)
{
- pmd_val(*dst) = pmd_val(val);
+ dst->pmd = pmd_val(val);
}

-static inline void set_pud(pud_t *dst, pud_t val)
+static inline void native_set_pud(pud_t *dst, pud_t val)
{
- pud_val(*dst) = pud_val(val);
+ dst->pud = pud_val(val);
}

-static inline void pud_clear (pud_t *pud)
+static inline void native_set_pgd(pgd_t *dst, pgd_t val)
{
- set_pud(pud, __pud(0));
+ dst->pgd = pgd_val(val);
}
-
-static inline void set_pgd(pgd_t *dst, pgd_t val)
+static inline void native_pud_clear (pud_t *pud)
{
- pgd_val(*dst) = pgd_val(val);
-}
+ set_pud(pud, __pud(0));
+}

-static inline void pgd_clear (pgd_t * pgd)
+static inline void native_pgd_clear (pgd_t * pgd)
{
set_pgd(pgd, __pgd(0));
}

-#define ptep_get_and_clear(mm,addr,xp) __pte(xchg(&(xp)->pte, 0))
+#define pte_ERROR(e) \
+ printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
+#define pmd_ERROR(e) \
+ printk("%s:%d: bad pmd %p(%016lx).\n", __FILE__, __LINE__, &(e), pmd_val(e))
+#define pud_ERROR(e) \
+ printk("%s:%d: bad pud %p(%016lx).\n", __FILE__, __LINE__, &(e), pud_val(e))
+#define pgd_ERROR(e) \
+ printk("%s:%d: bad pgd %p(%016lx).\n", __FILE__, __LINE__, &(e), pgd_val(e))
+
+#define pgd_none(x) (!pgd_val(x))
+#define pud_none(x) (!pud_val(x))

struct mm_struct;

+static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+ pte_t pte = __pte(xchg(&ptep->pte, 0));
+ pte_update(mm, addr, ptep);
+ return pte;
+}
+
static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, unsigned long addr, pte_t *ptep, int full)
{
pte_t pte;
@@ -245,7 +267,6 @@ static inline unsigned long pmd_bad(pmd_t pmd)

#define pte_none(x) (!pte_val(x))
#define pte_present(x) (pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE))
-#define pte_clear(mm,addr,xp) do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)

#define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT)) /* FIXME: is this
right? */
@@ -254,11 +275,11 @@ static inline unsigned long pmd_bad(pmd_t pmd)

static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
{
- pte_t pte;
- pte_val(pte) = (page_nr << PAGE_SHIFT);
- pte_val(pte) |= pgprot_val(pgprot);
- pte_val(pte) &= __supported_pte_mask;
- return pte;
+ unsigned long pte;
+ pte = (page_nr << PAGE_SHIFT);
+ pte |= pgprot_val(pgprot);
+ pte &= __supported_pte_mask;
+ return __pte(pte);
}

/*
@@ -282,30 +303,6 @@ static inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) |
static inline pte_t pte_mkhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_PSE)); return pte; }
static inline pte_t pte_clrhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_PSE)); return pte; }

-struct vm_area_struct;
-
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
- if (!pte_young(*ptep))
- return 0;
- return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte);
-}
-
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
- clear_bit(_PAGE_BIT_RW, &ptep->pte);
-}
-
-/*
- * Macro to mark a page protection value as "uncacheable".
- */
-#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))
-
-static inline int pmd_large(pmd_t pte) {
- return (pmd_val(pte) & __LARGE_PTE) == __LARGE_PTE;
-}
-
-
/*
* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
@@ -339,7 +336,6 @@ static inline int pmd_large(pmd_t pte) {
pmd_index(address))
#define pmd_none(x) (!pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
-#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
#define pfn_pmd(nr,prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val(prot)))
#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)

@@ -352,14 +348,43 @@ static inline int pmd_large(pmd_t pte) {
/* page, protection -> pte */
#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot))
#define mk_pte_huge(entry) (pte_val(entry) |= _PAGE_PRESENT | _PAGE_PSE)
-
+
+struct vm_area_struct;
+
+#include <linux/mm.h>
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+{
+ int ret = 0;
+ if (!pte_young(*ptep))
+ return 0;
+ ret = test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte);
+ pte_update(vma->vm_mm, addr, ptep);
+ return ret;
+}
+
+static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+ clear_bit(_PAGE_BIT_RW, &ptep->pte);
+ pte_update(mm, addr, ptep);
+}
+
+/*
+ * Macro to mark a page protection value as "uncacheable".
+ */
+#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))
+
+static inline int pmd_large(pmd_t pte) {
+ return (pmd_val(pte) & __LARGE_PTE) == __LARGE_PTE;
+}
+
/* Change flags of a PTE */
-static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+static inline pte_t pte_modify(pte_t pte_old, pgprot_t newprot)
{
- pte_val(pte) &= _PAGE_CHG_MASK;
- pte_val(pte) |= pgprot_val(newprot);
- pte_val(pte) &= __supported_pte_mask;
- return pte;
+ unsigned long pte = pte_val(pte_old);
+ pte &= _PAGE_CHG_MASK;
+ pte |= pgprot_val(newprot);
+ pte &= __supported_pte_mask;
+ return __pte(pte);
}

#define pte_index(address) \
@@ -386,6 +411,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
int __changed = !pte_same(*(__ptep), __entry); \
if (__changed && __dirty) { \
set_pte(__ptep, __entry); \
+ pte_update_defer((__vma)->vm_mm, (__address), (__ptep)); \
flush_tlb_page(__vma, __address); \
} \
__changed; \
--
1.4.4.2

2007-08-08 07:17:11

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 15/25] [PATCH] get rid of inline asm for load_cr3

Besides being inelegant, it is now forbidden, since it can
break paravirtualized guests. load_cr3() should call write_cr3()
instead.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/mmu_context.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/asm-x86_64/mmu_context.h b/include/asm-x86_64/mmu_context.h
index c8cdc1e..9592698 100644
--- a/include/asm-x86_64/mmu_context.h
+++ b/include/asm-x86_64/mmu_context.h
@@ -25,7 +25,7 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)

static inline void load_cr3(pgd_t *pgd)
{
- asm volatile("movq %0,%%cr3" :: "r" (__pa(pgd)) : "memory");
+ write_cr3(__pa(pgd));
}

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
--
1.4.4.2

2007-08-08 07:17:44

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 13/25] [PATCH] turn msr.h functions into native versions

This patch turns the basic operations in msr.h into their native_
versions. Those operations are: rdmsr, wrmsr, rdtsc, rdtscp, rdpmc, and
cpuid. After they are turned into functions, some call sites need
casts, and so we provide them.

There is also a fixup needed in the functions located in the vsyscall
area, as they cannot call any of them anymore (otherwise, the call
would go through a kernel address, which is invalid in the userspace mapping).

The solution is to call the now-provided native_ versions instead.
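
For illustration, the old macro interface stays available on top of the new
functions; something like

	unsigned long val;
	rdmsrl(MSR_GS_BASE, val);

now expands, roughly, to

	val = read_msr(MSR_GS_BASE);

which is native_read_msr() on bare hardware and goes through
paravirt_ops.read_msr() with CONFIG_PARAVIRT. The vsyscall/vdso code cannot
take that second path, hence the fixups below that call the native_
versions directly.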

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/ia32/syscall32.c | 2 +-
arch/x86_64/kernel/setup64.c | 6 +-
arch/x86_64/kernel/tsc.c | 42 ++++++-
arch/x86_64/kernel/vsyscall.c | 4 +-
arch/x86_64/vdso/vgetcpu.c | 4 +-
include/asm-x86_64/msr.h | 284 +++++++++++++++++++++++++---------------
6 files changed, 226 insertions(+), 116 deletions(-)

diff --git a/arch/x86_64/ia32/syscall32.c b/arch/x86_64/ia32/syscall32.c
index 15013ba..dd1b4a3 100644
--- a/arch/x86_64/ia32/syscall32.c
+++ b/arch/x86_64/ia32/syscall32.c
@@ -79,5 +79,5 @@ void syscall32_cpu_init(void)
checking_wrmsrl(MSR_IA32_SYSENTER_ESP, 0ULL);
checking_wrmsrl(MSR_IA32_SYSENTER_EIP, (u64)ia32_sysenter_target);

- wrmsrl(MSR_CSTAR, ia32_cstar_target);
+ wrmsrl(MSR_CSTAR, (u64)ia32_cstar_target);
}
diff --git a/arch/x86_64/kernel/setup64.c b/arch/x86_64/kernel/setup64.c
index 1200aaa..395cf02 100644
--- a/arch/x86_64/kernel/setup64.c
+++ b/arch/x86_64/kernel/setup64.c
@@ -122,7 +122,7 @@ void pda_init(int cpu)
asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0));
/* Memory clobbers used to order PDA accessed */
mb();
- wrmsrl(MSR_GS_BASE, pda);
+ wrmsrl(MSR_GS_BASE, (u64)pda);
mb();

pda->cpunumber = cpu;
@@ -161,8 +161,8 @@ void syscall_init(void)
* but only a 32bit target. LSTAR sets the 64bit rip.
*/
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
- wrmsrl(MSR_LSTAR, system_call);
- wrmsrl(MSR_CSTAR, ignore_sysret);
+ wrmsrl(MSR_LSTAR, (u64)system_call);
+ wrmsrl(MSR_CSTAR, (u64)ignore_sysret);

#ifdef CONFIG_IA32_EMULATION
syscall32_cpu_init ();
diff --git a/arch/x86_64/kernel/tsc.c b/arch/x86_64/kernel/tsc.c
index 2a59bde..0db0041 100644
--- a/arch/x86_64/kernel/tsc.c
+++ b/arch/x86_64/kernel/tsc.c
@@ -9,6 +9,46 @@

#include <asm/timex.h>

+#ifdef CONFIG_PARAVIRT
+/*
+ * When paravirt is on, some functionality is executed through function
+ * pointers in the paravirt_ops structure, for both the host and the guest.
+ * These function pointers live inside the kernel and cannot
+ * be accessed from user space. To avoid that, we make a copy of
+ * get_cycles_sync (the in-kernel version) but force the use of
+ * native_read_tsc. For the host, it will simply do the native rdtsc.
+ * The guest should set up its own clock and vread.
+ */
+static __always_inline long long vget_cycles_sync(void)
+{
+ unsigned long long ret;
+ unsigned eax, edx;
+
+ /*
+ * Use RDTSCP if possible; it is guaranteed to be synchronous
+ * and doesn't cause a VMEXIT on Hypervisors
+ */
+ alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
+ ASM_OUTPUT2("=a" (eax), "=d" (edx)),
+ "a" (0U), "d" (0U) : "ecx", "memory");
+ ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
+ if (ret)
+ return ret;
+
+ /*
+ * Don't do an additional sync on CPUs where we know
+ * RDTSC is already synchronous:
+ */
+ alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+ "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
+ ret = native_read_tsc();
+
+ return ret;
+}
+#else
+# define vget_cycles_sync() get_cycles_sync()
+#endif
+
static int notsc __initdata = 0;

unsigned int cpu_khz; /* TSC clocks / usec, not used here */
@@ -165,7 +205,7 @@ static cycle_t read_tsc(void)

static cycle_t __vsyscall_fn vread_tsc(void)
{
- cycle_t ret = (cycle_t)get_cycles_sync();
+ cycle_t ret = (cycle_t)vget_cycles_sync();
return ret;
}

diff --git a/arch/x86_64/kernel/vsyscall.c b/arch/x86_64/kernel/vsyscall.c
index 06c3494..22fc4c9 100644
--- a/arch/x86_64/kernel/vsyscall.c
+++ b/arch/x86_64/kernel/vsyscall.c
@@ -184,7 +184,7 @@ time_t __vsyscall(1) vtime(time_t *t)
long __vsyscall(2)
vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
{
- unsigned int dummy, p;
+ unsigned int p;
unsigned long j = 0;

/* Fast cache - only recompute value once per jiffies and avoid
@@ -199,7 +199,7 @@ vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
p = tcache->blob[1];
} else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
/* Load per CPU data from RDTSCP */
- rdtscp(dummy, dummy, p);
+ native_read_tscp(&p);
} else {
/* Load per CPU data from GDT */
asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
diff --git a/arch/x86_64/vdso/vgetcpu.c b/arch/x86_64/vdso/vgetcpu.c
index 91f6e85..61d0def 100644
--- a/arch/x86_64/vdso/vgetcpu.c
+++ b/arch/x86_64/vdso/vgetcpu.c
@@ -15,7 +15,7 @@

long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
{
- unsigned int dummy, p;
+ unsigned int p;
unsigned long j = 0;

/* Fast cache - only recompute value once per jiffies and avoid
@@ -30,7 +30,7 @@ long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
p = tcache->blob[1];
} else if (*vdso_vgetcpu_mode == VGETCPU_RDTSCP) {
/* Load per CPU data from RDTSCP */
- rdtscp(dummy, dummy, p);
+ native_read_tscp(&p);
} else {
/* Load per CPU data from GDT */
asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
diff --git a/include/asm-x86_64/msr.h b/include/asm-x86_64/msr.h
index d5c55b8..6ec254e 100644
--- a/include/asm-x86_64/msr.h
+++ b/include/asm-x86_64/msr.h
@@ -10,110 +10,193 @@
* Note: the rd* operations modify the parameters directly (without using
* pointer indirection), this allows gcc to optimize better
*/
+static inline unsigned long native_read_msr(unsigned int msr)
+{
+ unsigned long a, d;
+ asm volatile("rdmsr" : "=a" (a), "=d" (d) : "c" (msr));
+ return a | (d << 32);
+}

-#define rdmsr(msr,val1,val2) \
- __asm__ __volatile__("rdmsr" \
- : "=a" (val1), "=d" (val2) \
- : "c" (msr))
+static inline unsigned long native_read_msr_safe(unsigned int msr, int *err)
+{
+ unsigned long a, d;
+ asm volatile ( "1: rdmsr; xorl %0, %0\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl %4,%0\n"
+ " jmp 2b\n"
+ ".previous\n"
+ ".section __ex_table,\"a\"\n"
+ " .align 8\n"
+ " .quad 1b,3b\n"
+ ".previous":"=&bDS" (*err), "=a"((a)), "=d"((d))
+ :"c"(msr), "i"(-EIO), "0"(0));
+ return a | (d << 32);
+}

+static inline unsigned long native_write_msr(unsigned int msr,
+ unsigned long val)
+{
+ asm volatile( "wrmsr"
+ : /* no outputs */
+ : "c" (msr), "a" ((u32)val), "d" ((u32)(val>>32)));
+ return 0;
+}

-#define rdmsrl(msr,val) do { unsigned long a__,b__; \
- __asm__ __volatile__("rdmsr" \
- : "=a" (a__), "=d" (b__) \
- : "c" (msr)); \
- val = a__ | (b__<<32); \
-} while(0)
+/* wrmsr with exception handling */
+static inline long native_write_msr_safe(unsigned int msr, unsigned long val)
+{
+ unsigned long ret;
+ asm volatile("2: wrmsr ; xorq %0,%0\n"
+ "1:\n\t"
+ ".section .fixup,\"ax\"\n\t"
+ "3: movq %4,%0 ; jmp 1b\n\t"
+ ".previous\n\t"
+ ".section __ex_table,\"a\"\n"
+ " .align 8\n\t"
+ " .quad 2b,3b\n\t"
+ ".previous"
+ : "=r" (ret)
+ : "c" (msr), "a" ((u32)val), "d" ((u32)(val>>32)),
+ "i" (-EFAULT));
+ return ret;
+}

-#define wrmsr(msr,val1,val2) \
- __asm__ __volatile__("wrmsr" \
- : /* no outputs */ \
- : "c" (msr), "a" (val1), "d" (val2))
+static inline unsigned long native_read_tsc(void)
+{
+ unsigned long low, high;
+ asm volatile("rdtsc" : "=a" (low), "=d" (high));
+ return low | (high << 32);
+}

-#define wrmsrl(msr,val) wrmsr(msr,(__u32)((__u64)(val)),((__u64)(val))>>32)
+static inline unsigned long native_read_pmc(int counter)
+{
+ unsigned long low, high;
+ asm volatile ("rdpmc"
+ : "=a" (low), "=d" (high)
+ : "c" (counter));

-/* wrmsr with exception handling */
-#define wrmsr_safe(msr,a,b) ({ int ret__; \
- asm volatile("2: wrmsr ; xorl %0,%0\n" \
- "1:\n\t" \
- ".section .fixup,\"ax\"\n\t" \
- "3: movl %4,%0 ; jmp 1b\n\t" \
- ".previous\n\t" \
- ".section __ex_table,\"a\"\n" \
- " .align 8\n\t" \
- " .quad 2b,3b\n\t" \
- ".previous" \
- : "=a" (ret__) \
- : "c" (msr), "0" (a), "d" (b), "i" (-EFAULT)); \
- ret__; })
-
-#define checking_wrmsrl(msr,val) wrmsr_safe(msr,(u32)(val),(u32)((val)>>32))
-
-#define rdmsr_safe(msr,a,b) \
- ({ int ret__; \
- asm volatile ("1: rdmsr\n" \
- "2:\n" \
- ".section .fixup,\"ax\"\n" \
- "3: movl %4,%0\n" \
- " jmp 2b\n" \
- ".previous\n" \
- ".section __ex_table,\"a\"\n" \
- " .align 8\n" \
- " .quad 1b,3b\n" \
- ".previous":"=&bDS" (ret__), "=a"(*(a)), "=d"(*(b))\
- :"c"(msr), "i"(-EIO), "0"(0)); \
- ret__; })
-
-#define rdtsc(low,high) \
- __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))
-
-#define rdtscl(low) \
- __asm__ __volatile__ ("rdtsc" : "=a" (low) : : "edx")
-
-#define rdtscp(low,high,aux) \
- asm volatile (".byte 0x0f,0x01,0xf9" : "=a" (low), "=d" (high), "=c" (aux))
-
-#define rdtscll(val) do { \
- unsigned int __a,__d; \
- asm volatile("rdtsc" : "=a" (__a), "=d" (__d)); \
- (val) = ((unsigned long)__a) | (((unsigned long)__d)<<32); \
-} while(0)
-
-#define rdtscpll(val, aux) do { \
- unsigned long __a, __d; \
- asm volatile (".byte 0x0f,0x01,0xf9" : "=a" (__a), "=d" (__d), "=c" (aux)); \
- (val) = (__d << 32) | __a; \
+ return low | (high << 32);
+}
+
+static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
+ unsigned int *ecx, unsigned int *edx)
+{
+ asm volatile ("cpuid"
+ : "=a" (*eax),
+ "=b" (*ebx),
+ "=c" (*ecx),
+ "=d" (*edx)
+ : "0" (*eax), "2" (*ecx));
+}
+
+static inline unsigned long native_read_tscp(int *aux)
+{
+ unsigned long low, high;
+ asm volatile (".byte 0x0f,0x01,0xf9"
+ : "=a" (low), "=d" (high), "=c" (*aux));
+	return low | (high << 32);
+}
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define read_msr(msr) native_read_msr(msr)
+#define read_msr_safe(msr, err) native_read_msr_safe(msr, err)
+#define write_msr(msr, val) native_write_msr(msr, val)
+#define write_msr_safe(msr, val) native_write_msr_safe(msr, val)
+#define read_tsc_reg() native_read_tsc()
+#define read_tscp(aux) native_read_tscp(aux)
+#define read_pmc(counter) native_read_pmc(counter)
+#define __cpuid native_cpuid
+#endif
+
+#define rdmsr(msr, low, high) \
+do { \
+ unsigned long __val = read_msr(msr); \
+ low = (u32)__val; \
+ high = (u32)(__val >> 32); \
} while (0)

-#define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
+#define rdmsrl(msr, val) (val) = read_msr(msr)
+
+#define rdmsr_safe(msr, a, d) \
+({ \
+ int __ret; \
+ unsigned long __val = read_msr_safe(msr, &__ret); \
+ (*a) = (u32)(__val); \
+ (*d) = (u32)(__val >> 32); \
+ __ret; \
+})
+
+#define wrmsr(msr, val1, val2) \
+do { \
+ unsigned long __val; \
+ __val = (unsigned long)val2 << 32; \
+ __val |= val1; \
+ write_msr(msr, __val); \
+} while (0)
+
+#define wrmsrl(msr, val) write_msr(msr, val)
+
+#define wrmsr_safe(msr, a, d) \
+ write_msr_safe(msr, a | ( ((u64)d) << 32))
+
+#define checking_wrmsrl(msr, val) wrmsr_safe(msr, (u32)(val),(u32)((val)>>32))
+
+#define write_tsc(val1, val2) wrmsr(0x10, val1, val2)

#define write_rdtscp_aux(val) wrmsr(0xc0000103, val, 0)

-#define rdpmc(counter,low,high) \
- __asm__ __volatile__("rdpmc" \
- : "=a" (low), "=d" (high) \
- : "c" (counter))
+#define rdtsc(low, high) \
+do { \
+ unsigned long __val = read_tsc_reg(); \
+ low = (u32)__val; \
+ high = (u32)(__val >> 32); \
+} while (0)
+
+#define rdtscl(low) (low) = (u32)read_tsc_reg()
+
+#define rdtscll(val) (val) = read_tsc_reg()
+
+#define rdtscp(low, high, aux) \
+do { \
+ int __aux; \
+ unsigned long __val = read_tscp(&__aux); \
+ (low) = (u32)__val; \
+ (high) = (u32)(__val >> 32); \
+ (aux) = __aux; \
+} while (0)
+
+#define rdtscpll(val, aux) \
+do { \
+ unsigned long __aux; \
+ val = read_tscp(&__aux); \
+ (aux) = __aux; \
+} while (0)
+
+#define rdpmc(counter, low, high) \
+do { \
+ unsigned long __val = read_pmc(counter); \
+ (low) = (u32)(__val); \
+ (high) = (u32)(__val >> 32); \
+} while (0)

static inline void cpuid(int op, unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx)
{
- __asm__("cpuid"
- : "=a" (*eax),
- "=b" (*ebx),
- "=c" (*ecx),
- "=d" (*edx)
- : "0" (op));
+ *eax = op;
+ *ecx = 0;
+ __cpuid(eax, ebx, ecx, edx);
}

/* Some CPUID calls want 'count' to be placed in ecx */
static inline void cpuid_count(int op, int count, int *eax, int *ebx, int *ecx,
int *edx)
{
- __asm__("cpuid"
- : "=a" (*eax),
- "=b" (*ebx),
- "=c" (*ecx),
- "=d" (*edx)
- : "0" (op), "c" (count));
+ *eax = op;
+ *ecx = count;
+ __cpuid(eax, ebx, ecx, edx);
}

/*
@@ -121,42 +204,29 @@ static inline void cpuid_count(int op, int count, int *eax, int *ebx, int *ecx,
*/
static inline unsigned int cpuid_eax(unsigned int op)
{
- unsigned int eax;
-
- __asm__("cpuid"
- : "=a" (eax)
- : "0" (op)
- : "bx", "cx", "dx");
+ unsigned int eax, ebx, ecx, edx;
+ cpuid(op, &eax, &ebx, &ecx, &edx);
return eax;
}
+
static inline unsigned int cpuid_ebx(unsigned int op)
{
- unsigned int eax, ebx;
-
- __asm__("cpuid"
- : "=a" (eax), "=b" (ebx)
- : "0" (op)
- : "cx", "dx" );
+ unsigned int eax, ebx, ecx, edx;
+ cpuid(op, &eax, &ebx, &ecx, &edx);
return ebx;
}
+
static inline unsigned int cpuid_ecx(unsigned int op)
{
- unsigned int eax, ecx;
-
- __asm__("cpuid"
- : "=a" (eax), "=c" (ecx)
- : "0" (op)
- : "bx", "dx" );
+ unsigned int eax, ebx, ecx, edx;
+ cpuid(op, &eax, &ebx, &ecx, &edx);
return ecx;
}
+
static inline unsigned int cpuid_edx(unsigned int op)
{
- unsigned int eax, edx;
-
- __asm__("cpuid"
- : "=a" (eax), "=d" (edx)
- : "0" (op)
- : "bx", "cx");
+ unsigned int eax, ebx, ecx, edx;
+ cpuid(op, &eax, &ebx, &ecx, &edx);
return edx;
}

--
1.4.4.2

2007-08-08 07:18:06

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 17/25] [PATCH] turn page operations into native versions

This patch turns the page operations (set and make a page table)
into native_ versions. The operations themselves will later be
overridden by paravirt.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
include/asm-x86_64/page.h | 36 +++++++++++++++++++++++++++++++-----
1 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/include/asm-x86_64/page.h b/include/asm-x86_64/page.h
index 88adf1a..ec8b245 100644
--- a/include/asm-x86_64/page.h
+++ b/include/asm-x86_64/page.h
@@ -64,16 +64,42 @@ typedef struct { unsigned long pgprot; } pgprot_t;

extern unsigned long phys_base;

-#define pte_val(x) ((x).pte)
-#define pmd_val(x) ((x).pmd)
-#define pud_val(x) ((x).pud)
-#define pgd_val(x) ((x).pgd)
-#define pgprot_val(x) ((x).pgprot)
+static inline unsigned long native_pte_val(pte_t pte)
+{
+ return pte.pte;
+}
+
+static inline unsigned long native_pud_val(pud_t pud)
+{
+ return pud.pud;
+}
+
+
+static inline unsigned long native_pmd_val(pmd_t pmd)
+{
+ return pmd.pmd;
+}
+
+static inline unsigned long native_pgd_val(pgd_t pgd)
+{
+ return pgd.pgd;
+}
+
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define pte_val(x) native_pte_val(x)
+#define pmd_val(x) native_pmd_val(x)
+#define pud_val(x) native_pud_val(x)
+#define pgd_val(x) native_pgd_val(x)

#define __pte(x) ((pte_t) { (x) } )
#define __pmd(x) ((pmd_t) { (x) } )
#define __pud(x) ((pud_t) { (x) } )
#define __pgd(x) ((pgd_t) { (x) } )
+#endif /* CONFIG_PARAVIRT */
+
+#define pgprot_val(x) ((x).pgprot)
#define __pgprot(x) ((pgprot_t) { (x) } )

#endif /* !__ASSEMBLY__ */
--
1.4.4.2

2007-08-08 07:19:43

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

This patch adds paravirtualization hooks to the arch initialization
process. paravirt_arch_setup() lets the guest issue any specific
initialization routine, and skip all the rest if it sees fit, which
it signals by a proper return value.

In case the initialization continues, we hook at least memory_setup(),
so it can handle it in its own way.

The hypervisor can make its own ebda mapping visible by providing
its custom ebda_info function.

Signed-off-by: Glauber de Oliveira Costa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86_64/kernel/setup.c | 41 ++++++++++++++++++++++++++++++++++++-----
include/asm-x86_64/e820.h | 6 ++++++
2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
index af838f6..8e58a5d 100644
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -65,6 +65,12 @@
#include <asm/sections.h>
#include <asm/dmi.h>

+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
+#else
+#define paravirt_arch_setup() 0
+#endif
+
/*
* Machine setup..
*/
@@ -201,17 +207,28 @@ static inline void copy_edd(void)
unsigned __initdata ebda_addr;
unsigned __initdata ebda_size;

-static void discover_ebda(void)
+void native_ebda_info(unsigned *addr, unsigned *size)
{
/*
* there is a real-mode segmented pointer pointing to the
* 4K EBDA area at 0x40E
*/
- ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
- ebda_addr <<= 4;
+ *addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
+ *addr <<= 4;
+
+ *size = *(unsigned short *)__va(*addr);
+}

- ebda_size = *(unsigned short *)__va(ebda_addr);
+/* Overridden in paravirt.c if CONFIG_PARAVIRT */
+void __attribute__((weak)) ebda_info(unsigned *addr, unsigned *size)
+{
+ native_ebda_info(addr, size);
+}

+static void discover_ebda(void)
+{
+
+ ebda_info(&ebda_addr, &ebda_size);
/* Round EBDA up to pages */
if (ebda_size == 0)
ebda_size = 1;
@@ -221,6 +238,13 @@ static void discover_ebda(void)
ebda_size = 64*1024;
}

+/* Overridden in paravirt.c if CONFIG_PARAVIRT */
+void __attribute__((weak)) memory_setup(void)
+{
+ return setup_memory_region();
+}
+
+
void __init setup_arch(char **cmdline_p)
{
printk(KERN_INFO "Command line: %s\n", boot_command_line);
@@ -231,12 +255,19 @@ void __init setup_arch(char **cmdline_p)
saved_video_mode = SAVED_VIDEO_MODE;
bootloader_type = LOADER_TYPE;

+ /*
+ * By returning non-zero here, a paravirt impl can choose to
+ * skip the rest of the setup process
+ */
+ if (paravirt_arch_setup())
+ return;
+
#ifdef CONFIG_BLK_DEV_RAM
rd_image_start = RAMDISK_FLAGS & RAMDISK_IMAGE_START_MASK;
rd_prompt = ((RAMDISK_FLAGS & RAMDISK_PROMPT_FLAG) != 0);
rd_doload = ((RAMDISK_FLAGS & RAMDISK_LOAD_FLAG) != 0);
#endif
- setup_memory_region();
+ memory_setup();
copy_edd();

if (!MOUNT_ROOT_RDONLY)
diff --git a/include/asm-x86_64/e820.h b/include/asm-x86_64/e820.h
index 3486e70..2ced3ba 100644
--- a/include/asm-x86_64/e820.h
+++ b/include/asm-x86_64/e820.h
@@ -20,7 +20,12 @@
#define E820_ACPI 3
#define E820_NVS 4

+#define MAP_TYPE_STR "BIOS-e820"
+
#ifndef __ASSEMBLY__
+
+void native_ebda_info(unsigned *addr, unsigned *size);
+
struct e820entry {
u64 addr; /* start of memory segment */
u64 size; /* size of memory segment */
@@ -56,6 +61,7 @@ extern struct e820map e820;

extern unsigned ebda_addr, ebda_size;
extern unsigned long nodemap_addr, nodemap_size;
+
#endif/*!__ASSEMBLY__*/

#endif/*__E820_HEADER*/
--
1.4.4.2

2007-08-08 07:30:22

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 5/25] [PATCH] native versions for system.h functions

Okay, this one is obviously wrong, my fault (it doesn't do what it
says it does in the body of the e-mail). Resending...


--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."


Attachments:
(No filename) (258.00 B)
native-versions-for-system.h-functions.patch (213.00 B)

2007-08-08 10:02:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/25] [PATCH] Add debugreg/load_rsp native hooks


>
> @@ -264,13 +270,64 @@ struct thread_struct {
> set_fs(USER_DS); \
> } while(0)
>
> -#define get_debugreg(var, register) \
> - __asm__("movq %%db" #register ", %0" \
> - :"=r" (var))
> -#define set_debugreg(value, register) \
> - __asm__("movq %0,%%db" #register \
> - : /* no output */ \
> - :"r" (value))
> +static inline unsigned long native_get_debugreg(int regno)
> +{
> + unsigned long val;

It would be better to have separate functions for each debug register, I think

-Andi

2007-08-08 10:02:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization


> -static void discover_ebda(void)
> +void native_ebda_info(unsigned *addr, unsigned *size)

I guess it would be better to use the resources framework here.
Before checking the EBDA, check if it is already reserved. Then lguest/Xen
can reserve these areas and stop them from being used.
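
(A minimal sketch of that suggestion, for illustration only; the resource
name, its placement inside discover_ebda(), and the error handling are
assumptions, not code from this series:)

#include <linux/ioport.h>

/* Hypothetical: claim the EBDA range up front; if lguest/Xen already
 * reserved it, request_resource() fails and we skip discovery. */
static struct resource ebda_resource = {
        .name   = "EBDA",
        .flags  = IORESOURCE_MEM | IORESOURCE_BUSY,
};

static void discover_ebda(void)
{
        ebda_info(&ebda_addr, &ebda_size);

        ebda_resource.start = ebda_addr;
        ebda_resource.end   = ebda_addr + (ebda_size << 10) - 1;

        if (request_resource(&iomem_resource, &ebda_resource) < 0) {
                /* Already reserved by a hypervisor: leave it alone. */
                ebda_addr = ebda_size = 0;
                return;
        }
        /* ... existing rounding/clamping of ebda_size ... */
}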


> +/* Overridden in paravirt.c if CONFIG_PARAVIRT */
> +void __attribute__((weak)) memory_setup(void)
> +{
> + return setup_memory_region();
> +}
> +
> +
> void __init setup_arch(char **cmdline_p)
> {
> printk(KERN_INFO "Command line: %s\n", boot_command_line);
> @@ -231,12 +255,19 @@ void __init setup_arch(char **cmdline_p)
> saved_video_mode = SAVED_VIDEO_MODE;
> bootloader_type = LOADER_TYPE;
>
> + /*
> + * By returning non-zero here, a paravirt impl can choose to
> + * skip the rest of the setup process
> + */
> + if (paravirt_arch_setup())
> + return;

Sorry, but that's an extremely ugly and clumsy interface and will lead
to extensive code duplication in hypervisors because so much code
is disabled.

This needs to be solved in some better way.


-Andi

2007-08-08 10:02:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/25] [PATCH] irq_flags / halt routines


> +#ifdef CONFIG_PARAVIRT
> +#include <asm/paravirt.h>
> +# ifdef CONFIG_X86_VSMP
> +static inline int raw_irqs_disabled_flags(unsigned long flags)
> +{
> + return !(flags & X86_EFLAGS_IF) || (flags & X86_EFLAGS_AC);
> +}
> +# else
> +static inline int raw_irqs_disabled_flags(unsigned long flags)
> +{
> + return !(flags & X86_EFLAGS_IF);
> +}
> +# endif

You should really turn the vsmp special case into a paravirt client first
instead of complicating all this even more.

> +#ifndef CONFIG_PARAVIRT
> +#define raw_safe_halt native_raw_safe_halt
> +#define halt native_halt
> +#endif /* ! CONFIG_PARAVIRT */

This seems inconsistent

-Andi

2007-08-08 10:03:13

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64


> +config PARAVIRT
> + bool "Paravirtualization support (EXPERIMENTAL)"

This should be hidden and selected by the clients as needed
(I already did this change on i386)

Users know nothing about paravirt, they just know about Xen, lguest
etc.

Strictly you would at least need a !X86_VSMP dependency, but
with the vsmp change i requested that will be unnecessary

Is this really synced with the latest version of the i386 code?


> +#ifdef CONFIG_PARAVIRT
> +#include <asm/paravirt.h>
> +#endif


> +#include <linux/errno.h>
> +#include <linux/module.h>
> +#include <linux/efi.h>
> +#include <linux/bcd.h>
> +#include <linux/start_kernel.h>
> +
> +#include <asm/bug.h>
> +#include <asm/paravirt.h>
> +#include <asm/desc.h>
> +#include <asm/setup.h>
> +#include <asm/irq.h>
> +#include <asm/delay.h>
> +#include <asm/fixmap.h>
> +#include <asm/apic.h>
> +#include <asm/tlbflush.h>
> +#include <asm/msr.h>
> +#include <asm/page.h>
> +#include <asm/pgtable.h>
> +#include <asm/proto.h>
> +#include <asm/e820.h>
> +#include <asm/time.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/smp.h>
> +#include <asm/irqflags.h>


Are the includes really all needed?


> + if (opfunc == NULL)
> + /* If there's no function, patch it with a ud2a (BUG) */
> + ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);

This will actually give corrupted BUGs because you don't supply
the full inline BUG header. Perhaps another trap would be better.


> +EXPORT_SYMBOL(paravirt_ops);

Definitely _GPL at least.



> +extern struct paravirt_ops paravirt_ops;

Should be native_paravirt_ops I guess

> +
> + * This generates an indirect call based on the operation type number.

The macros here don't

> +static inline unsigned long read_msr(unsigned int msr)
> +{
> + int __err;

No need for __ in inlines

> +/* The paravirtualized I/O functions */
> +static inline void slow_down_io(void) {

I doubt this needs to be inline and it's large

> + __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)

No __*__ in new code please

> + : "=a"(f)
> + : paravirt_type(save_fl),
> + paravirt_clobber(CLBR_RAX)
> + : "memory", "cc");
> + return f;
> +}
> +
> +static inline void raw_local_irq_restore(unsigned long f)
> +{
> + __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)
> + :
> + : "D" (f),

Have you investigated if a different input register generates better/smaller
code? I would assume rdi to be usually used already for the caller's
arguments so it will produce spilling

Similar for the rax return in the other functions.



-Andi

2007-08-08 10:03:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 13/25] [PATCH] turn msr.h functions into native versions

On Wednesday 08 August 2007 06:19, Glauber de Oliveira Costa wrote:

> +static __always_inline long long vget_cycles_sync(void)

Why is there a copy of this function now? That seems wrong

> + native_read_tscp(&p);

The instruction is called rdtscp not read_tscp. Please follow that


> +#define rdtsc(low, high) \

This macro can probably be eliminated; no callers in the kernel


-Andi

2007-08-08 10:03:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


> +#define SYSRETQ \
> + movq %gs:pda_oldrsp,%rsp; \
> + swapgs; \
> + sysretq;

When the macro does more than sysret it should have a different
name


> */
> .globl int_ret_from_sys_call
> int_ret_from_sys_call:
> - cli
> + DISABLE_INTERRUPTS(CLBR_ANY)

ANY? There are certainly some registers alive at this point like rax

> retint_restore_args:
> - cli
> + DISABLE_INTERRUPTS(CLBR_ANY)

Similar.


> /*
> * The iretq could re-enable interrupts:
> */
> @@ -566,10 +587,14 @@ retint_restore_args:
> restore_args:
> RESTORE_ARGS 0,8,0
> iret_label:
> - iretq
> +#ifdef CONFIG_PARAVIRT
> + INTERRUPT_RETURN
> +ENTRY(native_iret)

ENTRY adds alignment. Why do you need that export anyways?

> +#endif
> +1: iretq
>
> .section __ex_table,"a"
> - .quad iret_label,bad_iret
> + .quad 1b, bad_iret

iret_label seems more expressive to me than 1

> + ENABLE_INTERRUPTS(CLBR_NONE)

In many of the CLBR_NONEs there are actually some registers free;
but it might be safer to keep it this way. But if some client can get
significantly better code with one or two free registers it might
be worthwhile to investigate.

> - swapgs
> + SWAPGS_NOSTACK

There's still stack here

> paranoid_restore\trace:
> RESTORE_ALL 8
> - iretq
> + INTERRUPT_RETURN

I suspect Xen will need much more changes anyways because of its
ring 3 guest. Are these changes sufficient for lguest?

-Andi

2007-08-08 11:34:34

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 21/25] [PATCH] export cpu_gdt_descr

On Wed, 2007-08-08 at 01:19 -0300, Glauber de Oliveira Costa wrote:
> With paravirtualization, hypervisors need to handle the gdt,
> which was up to this point only used by very early
> initialization code. Hypervisors are commonly modules, so make
> it an export
>

the GDT is so deeply internal that this really ought to be a _GPL
export..


2007-08-08 11:53:14

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


Hi Andi,

Thanks for all the comments, it's greatly appreciated.

On Wed, 8 Aug 2007, Andi Kleen wrote:

>
> > +#define SYSRETQ \
> > + movq %gs:pda_oldrsp,%rsp; \
> > + swapgs; \
> > + sysretq;
>
> When the macro does more than sysret it should have a different
> name

Noted. Do you have a better idea? Something like SETSTACK_SWAPGS_SYSRETQ?

>
>
> > */
> > .globl int_ret_from_sys_call
> > int_ret_from_sys_call:
> > - cli
> > + DISABLE_INTERRUPTS(CLBR_ANY)
>
> ANY? There are certainly some registers alive at this point like rax

Glauber will need to address this, this is his code ;-)

>
> > retint_restore_args:
> > - cli
> > + DISABLE_INTERRUPTS(CLBR_ANY)
>
> Similar.
>
>
> > /*
> > * The iretq could re-enable interrupts:
> > */
> > @@ -566,10 +587,14 @@ retint_restore_args:
> > restore_args:
> > RESTORE_ARGS 0,8,0
> > iret_label:
> > - iretq
> > +#ifdef CONFIG_PARAVIRT
> > + INTERRUPT_RETURN
> > +ENTRY(native_iret)
>
> ENTRY adds alignment. Why do you need that export anyways?

The paravirt ops struct points to it.

>
> > +#endif
> > +1: iretq
> >
> > .section __ex_table,"a"
> > - .quad iret_label,bad_iret
> > + .quad 1b, bad_iret
>
> iret_label seems more expressive to me than 1

The reason for this change is the added:
#ifdef CONFIG_PARAVIRT
INTERRUPT_RETURN
ENTRY(native_iret)
#endif

If we are using paravirt ops, we need the iretq in the exception table, not the
paravirt ops function call, since that function call may simply call
native_iretq, and if we take a fault at the iretq, it won't be in the
exception table.

>
> > + ENABLE_INTERRUPTS(CLBR_NONE)
>
> In many of the CLBR_NONEs there are actually some registers free;
> but it might be safer to keep it this way. But if some client can get
> significantly better code with one or two free registers it might
> be worthwhile to investigate.

Glauber, that comment is for you.

>
> > - swapgs
> > + SWAPGS_NOSTACK
>
> There's still stack here

OK, bad name then. How about SWAPGS_UNTRUSTED_STACK?

From earlier in the file where SWAPGS_NOSTACK is declared we have:


/* Currently paravirt can't handle swapgs nicely when we
* don't have a stack. So we either find a way around these
* or just fault and emulate if a guest tries to call swapgs
* directly.
*
* Either way, this is a good way to document that we don't
* have a reliable stack.
*/
#define SWAPGS_NOSTACK swapgs

>
> > paranoid_restore\trace:
> > RESTORE_ALL 8
> > - iretq
> > + INTERRUPT_RETURN
>
> I suspect Xen will need much more changes anyways because of its
> ring 3 guest. Are these changes sufficient for lguest?

Probably not, but this is a part of the code I don't fully understand. But just
doing this doesn't break lguest. Then again, perhaps it just hasn't broken it yet
;-)


Thanks,

-- Steve

2007-08-08 11:59:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64


On Wed, 8 Aug 2007, Andi Kleen wrote:

> Strictly you would at least need a !X86_VSMP dependency, but
> with the vsmp change i requested that will be unnecessary
>
> Is this really synced with the latest version of the i386 code?

Glauber started the paravirt ops 64 a second time around, from scratch
using the PVOPS of i386 as a base. But since we couldn't just take the
PVOPS patch from i386 and apply it to x86_64, this was mainly done by
looking at i386 code and massaging it for x86_64. Some things may have
slipped (and we may have been looking at different versions of PVOPS).
It's not easy trying to keep up with a moving target ;-)

-- Steve

2007-08-08 12:25:07

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


> Probably not, but this part of the code I don't fully understand.

I would suggest to defer all this until at least one example to test it
(except vsmp which is too simple) is around.

-Andi


2007-08-08 12:38:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


On Wed, 8 Aug 2007, Andi Kleen wrote:

>
> > Probably not, but this part of the code I don't fully understand.
>
> I would suggest to defer all this until at least one example to test it
> (except vsmp which is too simple) is around.

Who uses that code? NMIs and debug regs? Lguest only has the host handle
the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who
else uses that part of the code?

We can leave the "paranoid check" change out for this series.

-- Steve

2007-08-08 12:58:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

Steven Rostedt <[email protected]> writes:

> On Wed, 8 Aug 2007, Andi Kleen wrote:
>
> >
> > > Probably not, but this part of the code I don't fully understand.
> >
> > I would suggest to defer all this until at least one example to test it
> > (except vsmp which is too simple) is around.
>
> Who uses that code? NMIs and debug regs? Lguest only has the host handle
> the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who
> else uses that part of the code?

I'm not sure I understand your question. You're asking who uses entry.S?
The answer would be everybody. If you asked something else, please reformulate.

-Andi

2007-08-08 13:24:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


--
On Wed, 8 Aug 2007, Andi Kleen wrote:

> Steven Rostedt <[email protected]> writes:
>
> > On Wed, 8 Aug 2007, Andi Kleen wrote:
> >
> > >
> > > > Probably not, but this part of the code I don't fully understand.
> > >
> > > I would suggest to defer all this until at least one example to test it
> > > (except vsmp which is too simple) is around.
> >
> > Who uses that code? NMIs and debug regs? Lguest only has the host handle
> > the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who
> > else uses that part of the code?
>
> I'm not sure I understand your question. You're asking who uses entry.S?
> Answer would be everybody. If you asked something else please reformulate.

When I said "this part of the code I don't fully understand" I was not
talking about entry.S. I understand entry.S very well, but the comment
was originally on the paranoid_restore code, which I thought had to deal
with NMIs and such, so I didn't worry about it and simply did the
default.

>> paranoid_restore\trace:
>> RESTORE_ALL 8
>> - iretq
>> + INTERRUPT_RETURN
>
>I suspect Xen will need much more changes anyways because of its
>ring 3 guest. Are these changes sufficient for lguest?


The above was what I was replying to.

-- Steve

2007-08-08 13:28:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


On Wed, 8 Aug 2007, Steven Rostedt wrote:

> On Wed, 8 Aug 2007, Andi Kleen wrote:
>
> >> paranoid_restore\trace:
> >> RESTORE_ALL 8
> >> - iretq
> >> + INTERRUPT_RETURN
> >
> >I suspect Xen will need much more changes anyways because of its
> >ring 3 guest. Are these changes sufficient for lguest?
>
>
> The above was what I was replying to.

If you were talking about the general iretq => INTERRUPT_RETURN, then the
answer is "Yes, they are sufficient". The first version of lguest ran the
guest kernel in ring 3 (using dual page tables for guest kernel and guest
user). The current version I'm pushing runs lguest in ring 1, and the
entry.S code worked for both.

-- Steve

2007-08-08 13:29:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

> When I said "this part of the code I don't fully understand" I was not
> talking about entry.S. I understand entry.S very well, but the comment
> was originally on the paranoid_restore code. Which I thought had to deal
> with NMIs and such that I didn't worry about that I simply did the
> default.

The paranoid path is used for more than just NMIs; it's also used for MCEs,
stack faults, double faults or debug exceptions. Anything that might
happen with an invalid stack, an unknown GS state, or the system in some
other unknown state.

If you can guarantee your hypervisor never injects any of those it could be ignored;
but at least losing debug exceptions would probably not be nice.

>
> >> paranoid_restore\trace:
> >> RESTORE_ALL 8
> >> - iretq
> >> + INTERRUPT_RETURN
> >
> >I suspect Xen will need much more changes anyways because of its
> >ring 3 guest. Are these changes sufficient for lguest?

This was really a general comment not especially applying to the
paranoid path.

-Andi

2007-08-08 13:30:39

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

> If you were talking about the general iretq => INTERRUPT_RETURN, then the
> answer is "Yes, they are sufficient". The first version of lguest ran the
> guest kernel in ring 3 (using dual page tables for guest kernel and guest
> user). The current version I'm pushing runs lguest in ring 1, and the
> entry.S code worked for both.

How do you implement system calls then?

-Andi

2007-08-08 13:48:33

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


--
On Wed, 8 Aug 2007, Andi Kleen wrote:

> > If you were talking about the general iretq => INTERRUPT_RETURN, then the
> > answer is "Yes, they are sufficient". The first version of lguest ran the
> > guest kernel in ring 3 (using dual page tables for guest kernel and guest
> > user). The current version I'm pushing runs lguest in ring 1, and the
> > entry.S code worked for both.
>
> How do you implement system calls then?

/me working very hard to get lguest64 ready for public display

Here's a snippet from my version of core.c. I've been thinking of ways to
optimize it, but for now it works fine. This was done for both ring 3 and
ring 1 lguest versions (this is the host running):


/*
* Update the LSTAR to point to the HV syscall handler.
* Also update the fsbase if the guest uses one.
*/
wrmsrl(MSR_LSTAR, (unsigned long)HV_OFFSET(&lguest_syscall_trampoline));

[...]
asm volatile ("pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;"
/* The stack we pushed is off by 8, due to the
previous pushq */
"addq $8, %%rsp"
: "=D"(foo), "=a"(bar)
: "i" (__KERNEL_DS), "i" (__KERNEL_CS), "0"
(vcpu->vcpu),
"1"(get_idt_table()),
"r" (sw_guest)
: "memory", "cc");

[...]
/* restore old LSTAR */
wrmsrl(MSR_LSTAR, vcpu->host_syscall);



Also in the switcher.S (The Hypervisor):

.global lguest_syscall_trampoline
.type lguest_syscall_trampoline, @function
lguest_syscall_trampoline:
/*
* Tricky, we don't have much to choose from here.
* The only way to get to our stack is with swapgs.
* but we need to save the stack too, so we have to play
* very carefully.
*/
swapgs
/* now gs points to our VCPU Guest Data */

/* first save the stack! */
movq %rsp, %gs:LGUEST_GUEST_DATA_regs_rsp

/*
* x86 arch doesn't have an easy way to find out where
* gs is located. So we need to read the MSR. But first
* we need to save off the rcx, rax and rdx.
*/
movq %rax, %gs:LGUEST_GUEST_DATA_regs_rax
movq %rdx, %gs:LGUEST_GUEST_DATA_regs_rdx
movq %rcx, %gs:LGUEST_GUEST_DATA_regs_rcx

/* Need to read manual, does rdmsr clear
* the top 32 bits of rax? */
xor %rax, %rax

movl $MSR_GS_BASE,%ecx
rdmsr
shl $32, %rdx
orq %rax, %rdx

movq %rdx, %rsp

/* see if we need to disable interrupts */
testq $(1<<9), %gs:LGUEST_GUEST_DATA_SFMASK
jz 1f
movq $0, %gs:LGUEST_GUEST_DATA_irq_enabled
jmp 2f
1:
/* Still need to clear bit 10 (just in case) */
/* (see lguest_iretq) */
movq $(1<<10), %rax
not %rax
andq %rax, %gs:LGUEST_GUEST_DATA_irq_enabled
2:
/* put back the generic regs */
movq %gs:LGUEST_GUEST_DATA_regs_rdx, %rdx
movq %gs:LGUEST_GUEST_DATA_regs_rcx, %rcx
movq %gs:LGUEST_GUEST_DATA_regs_rax, %rax

/* Is this a hypercall? */
testq $1, %gs:LGUEST_GUEST_DATA_is_hc
jnz handle_hcall

/* We have 64 bytes to play with */
addq $LGUEST_GUEST_DATA_regs, %rsp

/* do the swapgs if possible */
testq $1, %gs:LGUEST_GUEST_DATA_do_swapgs
je 1f

/* We now have a stack to use */
/* go back to the guest's gs */
swapgs
DO_SWAPGS_USE_STACK
/* and then back to HV gs */
swapgs
1:
/*
* The stack has 64 bytes of playing room.
* Which is enough to do a jump to the guest kernel.
* We store the guest LSTAR register in the scatch pad
* because we don't care if the guest messes with it.
* If it is a bad address, we fault from the guest side
* and we kill the guest. No harm done to the host.
*/
pushq $(__KERNEL_DS | 1)
pushq %gs:LGUEST_GUEST_DATA_regs_rsp
pushfq
/* Make sure we have actual interrupts on */
orq $(1<<9), 0(%rsp)
pushq $(__KERNEL_CS | 1)
pushq %gs:LGUEST_GUEST_DATA_LSTAR

swapgs
iretq



NOTE: This is still under development, since I'm going with a new design
change to try to stay more in sync with lguest32.

-- Steve

2007-08-08 13:58:19

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

Thank you for the attention, Andi.

let's go:

On 8/8/07, Andi Kleen <[email protected]> wrote:
>
> > +#define SYSRETQ \
> > + movq %gs:pda_oldrsp,%rsp; \
> > + swapgs; \
> > + sysretq;
>
> When the macro does more than sysret it should have a different
> name
That's fair. Again, suggestions are welcome. Maybe SYSCALL_RETURN?

> > */
> > .globl int_ret_from_sys_call
> > int_ret_from_sys_call:
> > - cli
> > + DISABLE_INTERRUPTS(CLBR_ANY)
>
> ANY? There are certainly some registers alive at this point like rax
yes, this one is wrong. Thanks for the catch

> > retint_restore_args:
> > - cli
> > + DISABLE_INTERRUPTS(CLBR_ANY)
>
> Similar.
I don't think so. They are live here, but restore_args follows, so we
can safely clobber anything here. Right?

>
> > /*
> > * The iretq could re-enable interrupts:
> > */
> > @@ -566,10 +587,14 @@ retint_restore_args:
> > restore_args:
> > RESTORE_ARGS 0,8,0
> > iret_label:
> > - iretq
> > +#ifdef CONFIG_PARAVIRT
> > + INTERRUPT_RETURN
> > +ENTRY(native_iret)
>
> ENTRY adds alignment. Why do you need that export anyways?
Just went with the flow. Will change.

> > +#endif
> > +1: iretq
> >
> > .section __ex_table,"a"
> > - .quad iret_label,bad_iret
> > + .quad 1b, bad_iret
>
> iret_label seems more expressive to me than 1

fair.

> > + ENABLE_INTERRUPTS(CLBR_NONE)
>
> In many of the CLBR_NONEs there are actually some registers free;
> but it might be safer to keep it this way. But if some client can get
> significantly better code with one or two free registers it might
> be worthwhile to investigate.
That's exactly what I had in mind. I'd highly prefer to keep it this
way until it is merged, and we are sure all the rest is stable

> > - swapgs
> > + SWAPGS_NOSTACK
>
> There's still stack here

Yes, but it is not safe to use. I think Roasted addressed it later on.

> > paranoid_restore\trace:
> > RESTORE_ALL 8
> > - iretq
> > + INTERRUPT_RETURN
>
> I suspect Xen will need much more changes anyways because of its
> ring 3 guest. Are these changes sufficient for lguest?

Yes, they are sufficient for lguest.
Do any Xen folks have any comments?

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:00:38

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

> > ENTRY adds alignment. Why do you need that export anyways?
>
> The paravirt ops struct points to it.

But the paravirt_ops probably won't need it as an export. So I guess
Andi is right.


--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:03:05

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


--
On Wed, 8 Aug 2007, Andi Kleen wrote:

> > When I said "this part of the code I don't fully understand" I was not
> > talking about entry.S. I understand entry.S very well, but the comment
> > was originally on the paranoid_restore code. Which I thought had to deal
> > with NMIs and such that I didn't worry about that I simply did the
> > default.
>
> The paranoid path is used for more than just NMIs; it's also used for MCEs,
> stack faults, double faults or debug exceptions. Anything that might
> happen with a invalid stack or unknown GS state or system in other unknown
> state.
>
> If you can guarantee your hypervisor never injects any of those it could be ignored;
> but at least losing debug exceptions would be probably not nice.

For now it does ignore them. But that's something to work on
for later versions of lguest64. I want lguest out in public and
working for the general case. I don't plan on ignoring the paranoid
section forever, but it's more of an anomaly for now. You are right, I
want the debug stuff working. But I also need to walk before I can run ;-)


>
> >
> > >> paranoid_restore\trace:
> > >> RESTORE_ALL 8
> > >> - iretq
> > >> + INTERRUPT_RETURN
> > >
> > >I suspect Xen will need much more changes anyways because of its
> > >ring 3 guest. Are these changes sufficient for lguest?
>
> This was really a general comment not especially applying to the
> paranoid path.

Ah, OK, I assumed you were talking about just the previous code. But
rereading your post, I see it was more general. Sorry for the
miscommunication.

-- Steve

2007-08-08 14:03:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

On Wed, Aug 08, 2007 at 09:47:05AM -0400, Steven Rostedt wrote:
> /me working very hard to get lguest64 ready for public display
>
> Here's a snippet from my version of core.c. I've been thinking of ways to
> optimize it, but for now it works fine. This was done for both ring 3 and
> ring 1 lguest versions (this is the host running):
>
>
> /*
> * Update the LSTAR to point to the HV syscall handler.
> * Also update the fsbase if the guest uses one.
> */
> wrmsrl(MSR_LSTAR, (unsigned long)HV_OFFSET(&lguest_syscall_trampoline));
>
> [...]
> asm volatile ("pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;"
> /* The stack we pushed is off by 8, due to the
> previous pushq */
> "addq $8, %%rsp"

Weird stack imbalance, I'm surprised that works. %6 must be doing strange
things.

> /* Need to read manual, does rdmsr clear
> * the top 32 bits of rax? */
> xor %rax, %rax
>
> movl $MSR_GS_BASE,%ecx
> rdmsr


This seems incredibly slow. Since GS changes are controlled in
the kernel, why don't you just cache them in the per-cpu area?
The only case where the user can reload it is via segment
selector changes, which can also be handled.

Also, why do you need the guest gs value anyway?

You could just use your own stack and let the guest
switch to its own.

-Andi

2007-08-08 14:08:38

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

On 8/8/07, Andi Kleen <[email protected]> wrote:
>
> > -static void discover_ebda(void)
> > +void native_ebda_info(unsigned *addr, unsigned *size)
>
> I guess it would be better to use the resources frame work here.
> Before checking EBDA check if it is already reserved. Then lguest/Xen
> can reserve these areas and stop using it.
Let's make sure I understand: So you suggest skipping discover
altogether in case it is already reserved?

>
> > +/* Overridden in paravirt.c if CONFIG_PARAVIRT */
> > +void __attribute__((weak)) memory_setup(void)
> > +{
> > + return setup_memory_region();
> > +}
> > +
> > +
> > void __init setup_arch(char **cmdline_p)
> > {
> > printk(KERN_INFO "Command line: %s\n", boot_command_line);
> > @@ -231,12 +255,19 @@ void __init setup_arch(char **cmdline_p)
> > saved_video_mode = SAVED_VIDEO_MODE;
> > bootloader_type = LOADER_TYPE;
> >
> > + /*
> > + * By returning non-zero here, a paravirt impl can choose to
> > + * skip the rest of the setup process
> > + */
> > + if (paravirt_arch_setup())
> > + return;
>
> Sorry, but that's an extremely ugly and clumsy interface and will lead
> to extensive code duplication in hypervisors because so much code
> is disabled.

We can just wipe out the return value right now. Note that it was a
choice; it would only lead to code duplication if the hypervisor
wanted it. But yeah, I understand your concern. They may choose to
return 1 here just to change some tiny thing at the bottom.

I don't know exactly what other kinds of hooks we could put there.
lguest surely didn't need any. Are you okay with just turning it into
void for now?

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:10:53

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/25] [PATCH] irq_flags / halt routines

On 8/8/07, Andi Kleen <[email protected]> wrote:
>
> > +#ifdef CONFIG_PARAVIRT
> > +#include <asm/paravirt.h>
> > +# ifdef CONFIG_X86_VSMP
> > +static inline int raw_irqs_disabled_flags(unsigned long flags)
> > +{
> > + return !(flags & X86_EFLAGS_IF) || (flags & X86_EFLAGS_AC);
> > +}
> > +# else
> > +static inline int raw_irqs_disabled_flags(unsigned long flags)
> > +{
> > + return !(flags & X86_EFLAGS_IF);
> > +}
> > +# endif
>
> You should really turn the vsmp special case into a paravirt client first
> instead of complicating all this even more.

By "client" you mean a user of the paravirt interface?

> > +#ifndef CONFIG_PARAVIRT
> > +#define raw_safe_halt native_raw_safe_halt
> > +#define halt native_halt
> > +#endif /* ! CONFIG_PARAVIRT */
>
> This seems inconsistent
Sorry, Andi. I can't see why. Could you elaborate?


--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:11:44

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

On Wednesday 08 August 2007 15:58:06 Glauber de Oliveira Costa wrote:

> On 8/8/07, Andi Kleen <[email protected]> wrote:
> >
> > > +#define SYSRETQ \
> > > + movq %gs:pda_oldrsp,%rsp; \
> > > + swapgs; \
> > > + sysretq;
> >
> > When the macro does more than sysret it should have a different
> > name
> That's fair. Again, suggestions are welcome. Maybe SYSCALL_RETURN ?

Sounds reasonable.

> > > retint_restore_args:
> > > - cli
> > > + DISABLE_INTERRUPTS(CLBR_ANY)
> >
> > Similar.
> I don't think so. They are live here, but restore_args follows, so we
> can safely clobber anything here. Right?

The non argument registers cannot be clobbered.



-Andi

2007-08-08 14:14:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

On Wednesday 08 August 2007 16:08:25 Glauber de Oliveira Costa wrote:
> On 8/8/07, Andi Kleen <[email protected]> wrote:
> >
> > > -static void discover_ebda(void)
> > > +void native_ebda_info(unsigned *addr, unsigned *size)
> >
> > I guess it would be better to use the resources frame work here.
> > Before checking EBDA check if it is already reserved. Then lguest/Xen
> > can reserve these areas and stop using it.
> Let's make sure I understand: So you suggest skipping discover
> altogether in case it is already reserved?

Yes


> I don't know exactly what other kinds of hooks we could put there.
> lguest surely didn't need any. Are you okay with just turning it into
> void by now ?

Yes.

-Andi

2007-08-08 14:15:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/25] [PATCH] irq_flags / halt routines

On Wednesday 08 August 2007 16:10:28 Glauber de Oliveira Costa wrote:
> On 8/8/07, Andi Kleen <[email protected]> wrote:
> >
> > > +#ifdef CONFIG_PARAVIRT
> > > +#include <asm/paravirt.h>
> > > +# ifdef CONFIG_X86_VSMP
> > > +static inline int raw_irqs_disabled_flags(unsigned long flags)
> > > +{
> > > + return !(flags & X86_EFLAGS_IF) || (flags & X86_EFLAGS_AC);
> > > +}
> > > +# else
> > > +static inline int raw_irqs_disabled_flags(unsigned long flags)
> > > +{
> > > + return !(flags & X86_EFLAGS_IF);
> > > +}
> > > +# endif
> >
> > You should really turn the vsmp special case into a paravirt client first
> > instead of complicating all this even more.
>
> By "client" you mean a user of the paravirt interface?

Yes

>
> > > +#ifndef CONFIG_PARAVIRT
> > > +#define raw_safe_halt native_raw_safe_halt
> > > +#define halt native_halt
> > > +#endif /* ! CONFIG_PARAVIRT */
> >
> > This seems inconsistent
> Sorry andi. Can't see why. Could you elaborate?

Hmm, must have misread it. It's ok

-Andi



2007-08-08 14:20:42

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 13/25] [PATCH] turn msr.h functions into native versions

On 8/8/07, Andi Kleen <[email protected]> wrote:
> On Wednesday 08 August 2007 06:19, Glauber de Oliveira Costa wrote:
>
> > +static __always_inline long long vget_cycles_sync(void)
>
> Why is there a copy of this function now? That seems wrong

Yeah, the other one is in i386 headers, so we probably want to leave
it there. One option is to move get_cycles_sync to x86_64 headers, and
then #ifdef just the offending part.

> > + native_read_tscp(&p);
>
> The instruction is called rdtscp not read_tscp. Please follow that

Although the operation consists of reading tscp. I chose this to be
consistent with i386, but I have no special feelings about it. I'm
okay with changing it if you prefer.

> > +#define rdtsc(low, high) \
>
> This macro can be probably eliminated, no callers in kernel
>
>
Fine.

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:23:48

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/25] [PATCH] irq_flags / halt routines

On 8/8/07, Andi Kleen <[email protected]> wrote:
> On Wednesday 08 August 2007 16:10:28 Glauber de Oliveira Costa wrote:
> > On 8/8/07, Andi Kleen <[email protected]> wrote:
> > >
> > > > +#ifdef CONFIG_PARAVIRT
> > > > +#include <asm/paravirt.h>
> > > > +# ifdef CONFIG_X86_VSMP
> > > > +static inline int raw_irqs_disabled_flags(unsigned long flags)
> > > > +{
> > > > + return !(flags & X86_EFLAGS_IF) || (flags & X86_EFLAGS_AC);
> > > > +}
> > > > +# else
> > > > +static inline int raw_irqs_disabled_flags(unsigned long flags)
> > > > +{
> > > > + return !(flags & X86_EFLAGS_IF);
> > > > +}
> > > > +# endif
> > >
> > > You should really turn the vsmp special case into a paravirt client first
> > > instead of complicating all this even more.
> >
> > By "client" you mean a user of the paravirt interface?
>
> Yes

Ohhh, I see. You're talking about just the first piece of code inside
the #ifdef CONFIG_PARAVIRT. In this case yes, I agree with you. It can
be done.

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:24:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 13/25] [PATCH] turn msr.h functions into native versions

> The instruction is called rdtscp not read_tscp. Please follow that
>
> Although the operation consists in reading tscp.

There is no tscp register

-Andi

2007-08-08 14:25:22

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 21/25] [PATCH] export cpu_gdt_descr

On 8/8/07, Arjan van de Ven <[email protected]> wrote:
> On Wed, 2007-08-08 at 01:19 -0300, Glauber de Oliveira Costa wrote:
> > With paravirtualization, hypervisors need to handle the gdt,
> > which was up to this point only used by very early
> > initialization code. Hypervisors are commonly modules, so make
> > it an export
> >
>
> the GDT is so deeply internal that this really ought to be a _GPL
> export..

Yes, Arjan, I agree. Thanks for noticing it.


--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:49:38

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

On 8/8/07, Andi Kleen <[email protected]> wrote:
>
> Is this really synced with the latest version of the i386 code?
Roasted already commented on this. I will check it out and change it here.

>
> > +#ifdef CONFIG_PARAVIRT
> > +#include <asm/paravirt.h>
> > +#endif
>
>
> > +#include <linux/errno.h>
> > +#include <linux/module.h>
> > +#include <linux/efi.h>
> > +#include <linux/bcd.h>
> > +#include <linux/start_kernel.h>
> > +
> > +#include <asm/bug.h>
> > +#include <asm/paravirt.h>
> > +#include <asm/desc.h>
> > +#include <asm/setup.h>
> > +#include <asm/irq.h>
> > +#include <asm/delay.h>
> > +#include <asm/fixmap.h>
> > +#include <asm/apic.h>
> > +#include <asm/tlbflush.h>
> > +#include <asm/msr.h>
> > +#include <asm/page.h>
> > +#include <asm/pgtable.h>
> > +#include <asm/proto.h>
> > +#include <asm/e820.h>
> > +#include <asm/time.h>
> > +#include <asm/asm-offsets.h>
> > +#include <asm/smp.h>
> > +#include <asm/irqflags.h>
>
>
> Are the includes really all needed?
delay.h is not needed anymore. Most of them could maybe be moved to
paravirt.c, which is the one that really needs all the native_
things. Yeah, it will be better code this way; will change.

>
> > + if (opfunc == NULL)
> > + /* If there's no function, patch it with a ud2a (BUG) */
> > + ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);
>
> This will actually give corrupted BUGs because you don't supply
> the full inline BUG header. Perhaps another trap would be better.

You mean this:
> > +#include <asm/bug.h>
?

>
> > +EXPORT_SYMBOL(paravirt_ops);
>
> Definitely _GPL at least.
Sure.

>
> Should be native_paravirt_ops I guess

makes sense.

> > +
> > + * This generates an indirect call based on the operation type number.
>
> The macros here don't
>

> > +static inline unsigned long read_msr(unsigned int msr)
> > +{
> > + int __err;
>
> No need for __ in inlines
Right. Thanks.


> > +/* The paravirtualized I/O functions */
> > +static inline void slow_down_io(void) {
>
> I doubt this needs to be inline and it's large
On a second look, i386 has such a function in io.h because it needs
slow_down_io in a bunch of I/O instructions. It seems that we do not.
Could we just get rid of it, then?

> > + __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)
>
> No __*__ in new code please

Yup, will fix.

> > + : "=a"(f)
> > + : paravirt_type(save_fl),
> > + paravirt_clobber(CLBR_RAX)
> > + : "memory", "cc");
> > + return f;
> > +}
> > +
> > +static inline void raw_local_irq_restore(unsigned long f)
> > +{
> > + __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)
> > + :
> > + : "D" (f),
>
> Have you investigated if a different input register generates better/smaller
> code? I would assume rdi to be usually used already for the caller's
> arguments so it will produce spilling
>
> Similar for the rax return in the other functions.
I don't think we can do it differently. These functions can be patched, and
if that happens, they will put their return value in rax, so we'd better
expect it there.
The same goes for rdi, as they will expect the value to be there as an input.

I don't think it will spill in the normal case, as rdi already holds the
parameter, so the compiler will just leave it there, untouched.
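
A tiny hypothetical caller, just to illustrate the no-spill point (this is
not code from the series):

static void example_restore(unsigned long flags)
{
        /* 'flags' already arrives in %rdi per the x86_64 calling
         * convention, so the "D" input constraint on the paravirt
         * call just pins it there instead of forcing a spill. */
        raw_local_irq_restore(flags);
}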

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:53:55

by Nakajima, Jun

[permalink] [raw]
Subject: RE: Introducing paravirt_ops for x86_64

Glauber de Oliveira Costa wrote:
> Hi folks,
>
> After some time away from it, and a big rebase as a consequence, here
is
> the updated version of paravirt_ops for x86_64, heading to inclusion.
>
> Your criticism is of course, very welcome.
>
> Have fun

Do you assume that the kernel ought to use 2MB pages for its mappings
(e.g. initial text/data, direct mapping of physical memory) under your
paravirt_ops? As far as I have looked at the patches, I don't find one.

Jun
---
Intel Open Source Technology Center

2007-08-08 14:54:24

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

On 8/8/07, Andi Kleen <[email protected]> wrote:
> > > Similar.
> > I don't think so. They are live here, but restore_args follows, so we
> > can safely clobber anything here. Right?
>
> The non argument registers cannot be clobbered.
But they are not. Yeah, I omitted it in the changelog (it is in a
comment in paravirt.h); I should probably include it. The CLBR_ defines
only account for the caller-saved registers.


--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 14:58:25

by Glauber Costa

[permalink] [raw]
Subject: Re: Introducing paravirt_ops for x86_64

On 8/8/07, Nakajima, Jun <[email protected]> wrote:
> Glauber de Oliveira Costa wrote:
> > Hi folks,
> >
> > After some time away from it, and a big rebase as a consequence, here
> is
> > the updated version of paravirt_ops for x86_64, heading to inclusion.
> >
> > Your criticism is of course, very welcome.
> >
> > Have fun
>
> Do you assume that the kernel ougtht to use 2MB pages for its mappings
> (e.g. initilal text/data, direct mapping of physical memory) under your
> paravirt_ops? As far as I look at the patches, I don't find one.

I don't see how it could be relevant here. The lguest kernel does use
2MB pages, and it goes smoothly. For 2MB pages, we will update the page
tables in the very same way, and in the very same places we did before;
just that the operations can now be overridden.

So, unless I'm very wrong, it only makes sense to talk about not
supporting large pages at the guest level. But it is not a
paravirt_ops problem.


--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-08 17:09:07

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S


--
On Wed, 8 Aug 2007, Andi Kleen wrote:

> On Wed, Aug 08, 2007 at 09:47:05AM -0400, Steven Rostedt wrote:
> >
> > [...]
> > asm volatile ("pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;"
> > /* The stack we pushed is off by 8, due to the
> > previous pushq */
> > "addq $8, %%rsp"
>
> Weird stack inbalance, i'm surprised that works. %6 must be doing strange
> things.

Heh, no, it's very subtle. We do the "push %rsp" after the "push %2", so on
return from the call, the stack that was pushed is really 8 bytes off,
because we didn't save the stack pointer at the entry of the asm; we
saved it after doing that "push %2". :-)

We get back to this call via a retq from the HV switcher code. So that rsp
that was pushed will be the stack.


>
> > /* Need to read manual, does rdmsr clear
> > * the top 32 bits of rax? */
> > xor %rax, %rax
> >
> > movl $MSR_GS_BASE,%ecx
> > rdmsr
>
>
> This seems incredibly slow. Since GS changes are controlled in
> the kernel why don't you just cache them in the per cpu area?
> The only case where user can reload it is using segment
> selector changes, which can be also handled.
>
> Also why do you need the guest gs value anyways?

I did a swapgs here, so we have the HV GS value. I don't know of an easy
way to read the GS value to put it into the RSP.

>
> You could just use your own stack and let the guest
> switch to its own.

Well, it is quite complex, since we can't save anything in a writable
section without the guest also being able to write to it. Remember, we are
in ring 1 for the guest kernel, so we have no protection. All the
host-related stuff is in read-only sections when we are using the guest cr3.

What I would like to do is have the GS point to the read only section, and
load the RSP from a value there. But first we need to store the old value
of RSP. The problem is, where do you store it. Hmm, I guess I can create
my own "big" offset from the read only section :)

OK this gives me an idea, I *can* point the kernel GS to the vcpu read
only section, and have a way to store the value with the GS offset. The
read only section is always mapped before the read write section, so it
should be easy to calculate the diff!

Thanks for the comment, you guided me to this great idea!

-- Steve

2007-08-09 00:02:33

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

On Wed, 2007-08-08 at 11:49 -0300, Glauber de Oliveira Costa wrote:
> On 8/8/07, Andi Kleen <[email protected]> wrote:
> > > +EXPORT_SYMBOL(paravirt_ops);
> >
> > Definitely _GPL at least.
> Sure.

We ended up making it EXPORT_SYMBOL in i386 because every driver wants
to save and restore interrupt state.
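
(Hypothetical driver code, only to show where the reference comes from;
with CONFIG_PARAVIRT the irq-flags helpers expand to indirect calls
through paravirt_ops, so a module doing this ends up needing the export:)

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(my_lock);

static irqreturn_t my_isr(int irq, void *dev_id)
{
        unsigned long flags;

        /* spin_lock_irqsave() -> raw_local_irq_save() -> paravirt_ops */
        spin_lock_irqsave(&my_lock, flags);
        /* ... device work ... */
        spin_unlock_irqrestore(&my_lock, flags);

        return IRQ_HANDLED;
}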

But questionably-licensed drivers might be less of a concern on x86-64.

Cheers,
Rusty.


2007-08-09 00:24:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

On Thursday 09 August 2007 01:18:57 Rusty Russell wrote:
> On Wed, 2007-08-08 at 11:49 -0300, Glauber de Oliveira Costa wrote:
> > On 8/8/07, Andi Kleen <[email protected]> wrote:
> > > > +EXPORT_SYMBOL(paravirt_ops);
> > >
> > > Definitely _GPL at least.
> > Sure.
>
> We ended up making it EXPORT_SYMBOL in i386 because every driver wants
> to save and restore interrupt state.

Ah true.

> But questionably-licensed drivers might be less of a concern on x86-64.

Nvidia/ATI and other binary modules exist too, and users will probably be
unhappy if they cannot run them anymore.

But usually irq state changes should be patched in anyway and won't
need paravirt, I guess?

Hmm, actually, thinking about it, the module loader probably has no clue
that the relocation it linked will be overwritten, so it'll check
for the export anyway.

So the alternatives would be to add ugly hacks to the module loader,
split paravirt_ops into "common" and "low level system" areas, or
export it as a normal export.

Not sure what's best. Ok using a normal export is easiest and not
that big an issue.

-Andi

2007-08-09 00:29:34

by Nakajima, Jun

[permalink] [raw]
Subject: RE: Introducing paravirt_ops for x86_64

Glauber de Oliveira Costa wrote:
> On 8/8/07, Nakajima, Jun <[email protected]> wrote:
> > Glauber de Oliveira Costa wrote:
> > > Hi folks,
> > >
> > > After some time away from it, and a big rebase as a consequence,
here is
> > > the updated version of paravirt_ops for x86_64, heading to
inclusion.
> > >
> > > Your criticism is of course, very welcome.
> > >
> > > Have fun
> >
> > Do you assume that the kernel ougtht to use 2MB pages for its
mappings
> > (e.g. initilal text/data, direct mapping of physical memory) under
your
> > paravirt_ops? As far as I look at the patches, I don't find one.
>
> I don't think how it could be relevant here. lguest kernel does use
> 2MB pages, and it goes smootly. For 2MB pages, we will update the page
> tables in the very same way, and in the very places we did before.
> Just that the operations can now be overwritten.
>
> So, unless I'm very wrong, it only makes sense to talk about not
> supporting large pages in the guest level. But it is not a
> paravirt_ops problem.

Some MMU-related PV techniques (including Xen, and direct paging mode
for Xen/KVM) need to write-protect page tables, avoiding the use of 2MB
pages when mapping page tables. It looks like you did not, and that
explains why the patches are missing the relevant (many) paravirt_ops
in include/asm-x86_64/pgalloc.h, for example, compared with the i386
tree.

Jun
---
Intel Open Source Technology Center
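
For reference, the hooks Jun is talking about are the ones the i386 tree puts
around page-table page allocation. A sketch in that spirit (names and
signatures are from memory and may not match the i386 code exactly):

#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
#define paravirt_alloc_pt(pfn)          do { } while (0)
#define paravirt_release_pt(pfn)        do { } while (0)
#endif

/* The hypervisor is told about every page that becomes a page table, so
 * it can write-protect it (and unprotect it again on release). */
static inline void pmd_populate_kernel(struct mm_struct *mm,
                                        pmd_t *pmd, pte_t *pte)
{
        paravirt_alloc_pt(__pa(pte) >> PAGE_SHIFT);
        set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)));
}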

2007-08-09 00:31:30

by Glauber Costa

[permalink] [raw]
Subject: Re: Introducing paravirt_ops for x86_64

On 8/8/07, Nakajima, Jun <[email protected]> wrote:
> > So, unless I'm very wrong, it only makes sense to talk about not
> > supporting large pages at the guest level. But it is not a
> > paravirt_ops problem.
>
> Some MMU-related PV techniques (including Xen, and direct paging mode
> for Xen/KVM) need to write-protect page tables, avoiding the use of 2MB
> pages when mapping page tables. It looks like you did not, and that
> explains why the patches are missing the relevant (many) paravirt_ops
> in include/asm-x86_64/pgalloc.h, for example, compared with the i386
> tree.
I see.

I'll address this in the next version of the patch.


--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-09 00:45:45

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

On Thu, 2007-08-09 at 02:24 +0200, Andi Kleen wrote:
> Not sure what's best. Ok using a normal export is easiest and not
> that big an issue.

Yeah, there's also been talk of breaking up paravirt_ops into multiple
structs by area, which would lead naturally into finer-grained export
control.

Cheers,
Rusty.

2007-08-09 05:23:08

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn privileged operations into macros in entry.S

Steven Rostedt wrote:
> /*
> * x86 arch doesn't have an easy way to find out where
> * gs is located. So we need to read the MSR. But first
> * we need to save off the rcx, rax and rdx.
>
Why don't you store it in gs? movq %gs:my_gs_base, %rax?

J

2007-08-09 05:37:23

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Introducing paravirt_ops for x86_64

Glauber de Oliveira Costa wrote:
> On 8/8/07, Nakajima, Jun <[email protected]> wrote:
>
>> Glauber de Oliveira Costa wrote:
>>
>>> Hi folks,
>>>
>>> After some time away from it, and a big rebase as a consequence, here
>>> is the updated version of paravirt_ops for x86_64, heading to inclusion.
>>>
>>> Your criticism is of course, very welcome.
>>>
>>> Have fun
>>>
>> Do you assume that the kernel ought to use 2MB pages for its mappings
>> (e.g. initial text/data, direct mapping of physical memory) under your
>> paravirt_ops? As far as I look at the patches, I don't find one.
>>
>
> I don't see how it could be relevant here. lguest kernel does use
> 2MB pages, and it goes smoothly. For 2MB pages, we will update the page
> tables in the very same way, and in the very places we did before.
> Just that the operations can now be overwritten.
>
> So, unless I'm very wrong, it only makes sense to talk about not
> supporting large pages at the guest level. But it is not a
> paravirt_ops problem.
>

At the moment Xen can't support guests with 2M pages. In 32-bit this
isn't a huge problem, since the kernel doesn't assume it can map itself
with 2M pages. But I think the 64-bit kernel assumes 2M pages are
always available for mapping the kernel. I don't know how pervasive
this assumption is, but it would be nice to parameterize it in pv-ops.


J

2007-08-09 05:39:59

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

Rusty Russell wrote:
> On Thu, 2007-08-09 at 02:24 +0200, Andi Kleen wrote:
>
>> Not sure what's best. Ok using a normal export is easiest and not
>> that big an issue.
>>
>
> Yeah, there's also been talk of breaking up paravirt_ops into multiple
> structs by area, which would lead naturally into finer-grained export
> control.

Yeah, I should dust that patch off again.

J

2007-08-09 05:49:19

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 4/25] [PATCH] Add debugreg/load_rsp native hooks

Andi Kleen wrote:
>> @@ -264,13 +270,64 @@ struct thread_struct {
>> set_fs(USER_DS); \
>> } while(0)
>>
>> -#define get_debugreg(var, register) \
>> - __asm__("movq %%db" #register ", %0" \
>> - :"=r" (var))
>> -#define set_debugreg(value, register) \
>> - __asm__("movq %0,%%db" #register \
>> - : /* no output */ \
>> - :"r" (value))
>> +static inline unsigned long native_get_debugreg(int regno)
>> +{
>> + unsigned long val;
>>
>
> It would be better to have own functions for each debug register I think
>

? A separate pvop for each? Seems excessive. And surely this should
be identical to 32bit either way.

J

2007-08-09 05:49:58

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 4/25] [PATCH] Add debugreg/load_rsp native hooks

On 8/8/07, Andi Kleen <[email protected]> wrote:
>
> >
> > @@ -264,13 +270,64 @@ struct thread_struct {
> > set_fs(USER_DS); \
> > } while(0)
> >
> > -#define get_debugreg(var, register) \
> > - __asm__("movq %%db" #register ", %0" \
> > - :"=r" (var))
> > -#define set_debugreg(value, register) \
> > - __asm__("movq %0,%%db" #register \
> > - : /* no output */ \
> > - :"r" (value))
> > +static inline unsigned long native_get_debugreg(int regno)
> > +{
> > + unsigned long val;
>
> It would be better to have own functions for each debug register I think
>
Andi, you mean:
a) split the debugreg paravirt_ops into various
paravirt_ops.set/get_debugreg{X,Y,Z...}, and then join them together
in a set/get_debugreg(a,b) to keep the current interface, OR
b) keep one paravirt_ops for each set/get_debugreg, then split them into
various set/get_debugregX(a, b), changing the current interface, OR
c) split the debugreg paravirt_ops into various
paravirt_ops.set/get_debugreg{X,Y,Z...}, and give each its own
function set/get_debugregX(a, b), again, changing the current
interface, OR
d) None of the above?

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-09 05:54:13

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 7/25] [PATCH] interrupt related native paravirt functions.

Glauber de Oliveira Costa wrote:
> The interrupt initialization routine becomes native_init_IRQ and will
> be overriden later in case paravirt is on.
>
> The interrupt vector is made global, so paravirt guests can reference
> it in their initializations.

Why? And if so, wouldn't it be better to add an accessor function instead?

J
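
An accessor along the lines Jeremy suggests might look like this; the name and
exact type are assumptions, not something from the patch:

typedef void (*irq_entry_t)(void);

extern irq_entry_t interrupt[];         /* per-IRQ entry stubs in i8259.c */

irq_entry_t get_irq_entry(unsigned int irq)
{
        return interrupt[irq];
}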

2007-08-09 06:37:47

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

Glauber de Oliveira Costa wrote:
> +static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
> +{
> + const unsigned char *start, *end;
> + unsigned ret;
> +
> + switch(type) {
> +#define SITE(x) case PARAVIRT_PATCH(x): start = start_##x; end = end_##x; goto patch_site
> + SITE(irq_disable);
> + SITE(irq_enable);
> + SITE(restore_fl);
> + SITE(save_fl);
> + SITE(iret);
> + SITE(sysret);
> + SITE(swapgs);
> + SITE(read_cr2);
> + SITE(read_cr3);
> + SITE(write_cr3);
> + SITE(clts);
> + SITE(flush_tlb_single);
> + SITE(wbinvd);
> +#undef SITE
> +
> + patch_site:
> + ret = paravirt_patch_insns(insns, len, start, end);
> + break;
> +
> + case PARAVIRT_PATCH(make_pgd):
> + case PARAVIRT_PATCH(pgd_val):
> + case PARAVIRT_PATCH(make_pte):
> + case PARAVIRT_PATCH(pte_val):
> + case PARAVIRT_PATCH(make_pmd):
> + case PARAVIRT_PATCH(pmd_val):
> + case PARAVIRT_PATCH(make_pud):
> + case PARAVIRT_PATCH(pud_val):
> + /* These functions end up returning what
> + they're passed in the first argument */
>

Is this still true with 64-bit? Either way, I don't think it's worth
having this here. The damage to codegen around all those sites has
already happened, and the additional cost of a noop direct call is
pretty trivial. I think this is a nanooptimisation which risks more
problems than it could possibly be worth.

> + case PARAVIRT_PATCH(set_pte):
> + case PARAVIRT_PATCH(set_pmd):
> + case PARAVIRT_PATCH(set_pud):
> + case PARAVIRT_PATCH(set_pgd):
> + /* These functions end up storing the second
> + * argument in the location pointed by the first */
> + ret = paravirt_patch_store_reg(insns, len);
> + break;
>

Ditto, really. Do this in a later patch if it actually seems to help.

> +unsigned paravirt_patch_copy_reg(void *site, unsigned len)
> +{
> + unsigned char *mov = site;
> + if (len < 3)
> + return len;
> +
> + /* This is mov %rdi, %rax */
> + *mov++ = 0x48;
> + *mov++ = 0x89;
> + *mov = 0xf8;
> + return 3;
> +}
> +
> +unsigned paravirt_patch_store_reg(void *site, unsigned len)
> +{
> + unsigned char *mov = site;
> + if (len < 3)
> + return len;
> +
> + /* This is mov %rsi, (%rdi) */
> + *mov++ = 0x48;
> + *mov++ = 0x89;
> + *mov = 0x37;
> + return 3;
> +}
>

These seem excessively special-purpose. Are their only uses the ones I
commented on above?

> +/*
> + * integers must be use with care here. They can break the PARAVIRT_PATCH(x)
> + * macro, that divides the offset in the structure by 8, to get a number
> + * associated with the hook. Dividing by four would be a solution, but it
> + * would limit the future growth of the structure if needed.
>

Why not just stick them at the end of the structure?


J
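
The macro in question is essentially the i386 one, which maps each hook to its
number by dividing the member's offset by the pointer size; a sketch:

#include <linux/stddef.h>

/* A plain int in the middle of the struct shifts the following members'
 * offsets off the 8-byte grid, so their PARAVIRT_PATCH() numbers can
 * collide or go wrong -- hence the comment in the patch, and the
 * suggestion to keep non-pointer members together or at the end. */
#define PARAVIRT_PATCH(x)                                       \
        (offsetof(struct paravirt_ops, x) / sizeof(void *))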

2007-08-09 06:55:17

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

Andi Kleen wrote:
>> + if (opfunc == NULL)
>> + /* If there's no function, patch it with a ud2a (BUG) */
>> + ret = paravirt_patch_insns(site, len, start_ud2a, end_ud2a);
>>
>
> This will actually give corrupted BUGs because you don't supply
> the full inline BUG header. Perhaps another trap would be better.
>

The BUG handler will still report it as a normal illegal instruction.
It should never happen; the main thing is that it clearly points out
where the problem is (as opposed to jumping to a NULL pointer and
getting the unhelpful "oh, eip is zero" symptom).
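
The start_ud2a/end_ud2a markers presumably come from a helper along the lines
of the i386 DEF_NATIVE trick; a sketch (the exact form in the patch may
differ):

/* Emit a tiny native code snippet into .text and expose symbols marking
 * its start and end, so the patcher can memcpy it over a call site. */
#define DEF_NATIVE(name, code)                                  \
        extern const char start_##name[], end_##name[];         \
        asm("start_" #name ": " code "; end_" #name ":")

DEF_NATIVE(ud2a, "ud2a");       /* faults right at an unpatched NULL op */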

>> +EXPORT_SYMBOL(paravirt_ops);
>>
>
> Definitely _GPL at least.
>

No, for the same reason as i386.

>> +extern struct paravirt_ops paravirt_ops;
>>
>
> Should be native_paravirt_ops I guess
>
>

No, because it's the current set of pv_ops. It starts all native, but it
is either completely or partially overwritten by hypervisor-specific ops.

>> +
>> + * This generates an indirect call based on the operation type number.
>>
>
> The macros here don't
>

Yes, PARAVIRT_CALL does: "call *(paravirt_ops+%c[paravirt_typenum]*8);"
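
Putting that together with the wrapper quoted just below, the save_fl accessor
in the patch comes out roughly as follows (using the patch's
paravirt_alt/paravirt_type/paravirt_clobber helpers; the sketch glosses over
the .parainstructions bookkeeping):

static inline unsigned long __raw_local_save_flags(void)
{
        unsigned long f;

        /* Indirect call through the save_fl slot of paravirt_ops; the
         * site is recorded so it can later be patched with the native
         * "pushfq; popq %rax" sequence. */
        asm volatile(paravirt_alt("call *(paravirt_ops+%c[paravirt_typenum]*8);")
                     : "=a" (f)
                     : paravirt_type(save_fl),
                       paravirt_clobber(CLBR_RAX)
                     : "memory", "cc");
        return f;
}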


>> + : "=a"(f)
>> + : paravirt_type(save_fl),
>> + paravirt_clobber(CLBR_RAX)
>> + : "memory", "cc");
>> + return f;
>> +}
>> +
>> +static inline void raw_local_irq_restore(unsigned long f)
>> +{
>> + __asm__ __volatile__(paravirt_alt(PARAVIRT_CALL)
>> + :
>> + : "D" (f),
>>
>
> Have you investigated if a different input register generates better/smaller
> code? I would assume rdi to be usually used already for the caller's
> arguments so it will produce spilling
>
> Similar for the rax return in the other functions.
>

This has to match the normal C calling convention though, doesn't it?

J

2007-08-09 07:02:25

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

> > + case PARAVIRT_PATCH(make_pgd):
> > + case PARAVIRT_PATCH(pgd_val):
> > + case PARAVIRT_PATCH(make_pte):
> > + case PARAVIRT_PATCH(pte_val):
> > + case PARAVIRT_PATCH(make_pmd):
> > + case PARAVIRT_PATCH(pmd_val):
> > + case PARAVIRT_PATCH(make_pud):
> > + case PARAVIRT_PATCH(pud_val):
> > + /* These functions end up returning what
> > + they're passed in the first argument */
> >
>
> Is this still true with 64-bit? Either way, I don't think it's worth
> having this here. The damage to codegen around all those sites has
> already happened, and the additional cost of a noop direct call is
> pretty trivial. I think this is a nanooptimisation which risks more
> problems than it could possibly be worth.

No, it is not. It is just the comment that is broken (I forgot to
update it). The case here is that they put in rax what they receive
in rdi.

> > + case PARAVIRT_PATCH(set_pte):
> > + case PARAVIRT_PATCH(set_pmd):
> > + case PARAVIRT_PATCH(set_pud):
> > + case PARAVIRT_PATCH(set_pgd):
> > + /* These functions end up storing the second
> > + * argument in the location pointed by the first */
> > + ret = paravirt_patch_store_reg(insns, len);
> > + break;
> >
>
> Ditto, really. Do this in a later patch if it actually seems to help.

Okay, I can remove them both.

> > +/*
> > + * integers must be use with care here. They can break the PARAVIRT_PATCH(x)
> > + * macro, that divides the offset in the structure by 8, to get a number
> > + * associated with the hook. Dividing by four would be a solution, but it
> > + * would limit the future growth of the structure if needed.
> >
>
> Why not just stick them at the end of the structure?

Does it really matter?

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-09 07:04:37

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

Glauber de Oliveira Costa wrote:
>>> +/*
>>> + * integers must be use with care here. They can break the PARAVIRT_PATCH(x)
>>> + * macro, that divides the offset in the structure by 8, to get a number
>>> + * associated with the hook. Dividing by four would be a solution, but it
>>> + * would limit the future growth of the structure if needed.
>>>
>>>
>> Why not just stick them at the end of the structure?
>>
>
> Does it really matter?
>

Well, yes, if alignment is an issue.


J

2007-08-09 07:07:43

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

On 8/9/07, Jeremy Fitzhardinge <[email protected]> wrote:
> >
> > Does it really matter?
> >
>
> Well, yes, if alignment is an issue.
Of course. But the question arises from the context that they are both
together at the beginning, so they are not making anything misaligned.
Then the question: why would putting them at the end be different from
putting them _together_, aligned, at the beginning?

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-09 07:14:44

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

Glauber de Oliveira Costa wrote:
> On 8/9/07, Jeremy Fitzhardinge <[email protected]> wrote:
>
>>> Does it really matter?
>>>
>>>
>> Well, yes, if alignment is an issue.
>>
> Of course, But the question rises from the context that they are both
> together at the beginning. So they are not making anybody non-aligned.
> Then the question: Why would putting it in the end be different to
> putting them _together_, aligned at the beginning ?
>

Well, the point is that if you add new ones then alignment may be an
issue. Putting them at the end (with a comment explaining why they're
there) will make it more robust. Though splitting them into their own
sub-structure would probably be better.

Hm. So x86-64 doesn't make 64-bit pointers be 64-bit aligned?

J

2007-08-09 07:30:12

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/25] [PATCH] irq_flags / halt routines

On 8/8/07, Andi Kleen <[email protected]> wrote:
>
> > +#ifdef CONFIG_PARAVIRT
> > +#include <asm/paravirt.h>
> > +# ifdef CONFIG_X86_VSMP
> > +static inline int raw_irqs_disabled_flags(unsigned long flags)
> > +{
> > + return !(flags & X86_EFLAGS_IF) || (flags & X86_EFLAGS_AC);
> > +}
> > +# else
> > +static inline int raw_irqs_disabled_flags(unsigned long flags)
> > +{
> > + return !(flags & X86_EFLAGS_IF);
> > +}
> > +# endif
>
> You should really turn the vsmp special case into a paravirt client first
> instead of complicating all this even more.
Looking at it more carefully, it turns out that those functions are
not eligible for being paravirt clients. They do no privileged
operation at all. In fact, all they do is bit manipulation.
That said, the code got a little bit cleaner by moving them down, and so I did.

But later on, you voiced concern about making CONFIG_PARAVIRT depend
on !VSMP (and said it would be okay because these functions would be
paravirt clients: but they won't be). Given this updated picture, what's
your position on this?

Again, as they don't do anything besides bit manipulation, I don't
think they will stop VSMP from working with PARAVIRT.

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-09 12:18:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 18/25] [PATCH] turn privileged operations into macros in entry.S

--
On Wed, 8 Aug 2007, Jeremy Fitzhardinge wrote:

> Steven Rostedt wrote:
> > /*
> > * x86 arch doesn't have an easy way to find out where
> > * gs is located. So we need to read the MSR. But first
> > * we need to save off the rcx, rax and rdx.
> >
> Why don't you store it in gs? movq %gs:my_gs_base, %rax?

Because it can't be trusted. After the swapgs, we are pointing to the RW
section of the HV. But by running the guest kernel in ring 1, we have no
protection from the guest writing into that area too. So we can't put
anything in that section and expect it to be safe after jumping to guest
code. The only trusted values, is jumping in after getting there by an
interrupt, where the hardware places the values onto the stack.

But with Andi's comments, I realized I can point the gs pointer to the
RO area. And make a constant offset that will point up into the RW area,
so we could save the stack and replace it.

-- Steve

2007-08-09 12:32:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64


--
On Thu, 9 Aug 2007, Jeremy Fitzhardinge wrote:

> Glauber de Oliveira Costa wrote:
> > On 8/9/07, Jeremy Fitzhardinge <[email protected]> wrote:
> >
> >>> Does it really matter?
> >>>
> >>>
> >> Well, yes, if alignment is an issue.
> >>
> > Of course, But the question rises from the context that they are both
> > together at the beginning. So they are not making anybody non-aligned.
> > Then the question: Why would putting it in the end be different to
> > putting them _together_, aligned at the beginning ?
> >
>
> Well, the point is that if you add new ones then alignment may be an
> issue. Putting them at the end (with a comment explaining why they're
> there) will make it more robust. Though splitting them into their own
> sub-structure would probably be better.

Glauber,

I was thinking of putting them at the end too, and that would make it all
work better. But I didn't mention it because I was in the mindset of "well
i386 has that there, we should too" :-(

>
> Hm. So x86-64 doesn't make 64-bit pointers be 64-bit aligned?

yeah, it usually does. But it's one of those paranoid things, where you
want it to still work even if someone later on throws an
__attribute__((packed)) in on paravirt ops ;-)

-- Steve

2007-08-09 12:49:43

by Steven Rostedt

[permalink] [raw]
Subject: Re: [Lguest] Introducing paravirt_ops for x86_64


--
On Wed, 8 Aug 2007, Jeremy Fitzhardinge wrote:

>
> At the moment Xen can't support guests with 2M pages. In 32-bit this
> isn't a huge problem, since the kernel doesn't assume it can map itself
> with 2M pages. But I think the 64-bit kernel assumes 2M pages are
> always available for mapping the kernel. I don't know how pervasive
> this assumption is, but it would be nice to parameterize it in pv-ops.
>

I got 2M pages working in lguest64, but it was the cause of a lot of pain.
Removing support for 2M pages in guests might clean up a lot of the
lguest64 shadow paging as well.

The only advantage I see in 2M pages for a guest is that it saves the
guest from having extra page tables. The host still needs to map the 2M
pages to 4K, so it's not an advantage there. It doesn't speed anything up, so
I'd be all for scrapping 2M pages for guests.

-- Steve

2007-08-09 14:27:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64


> This has to match the normal C calling convention though, doesn't it?

Native cli/sti/save/restore_flags are all only assembly and can be easily
(in fact more easily than in C) written as pure assembler functions. Then
you can use whatever calling convention you want.

While some paravirt implementations may have more complicated versions,
I guess it's still a reasonable requirement to make them simple enough
to write in pure assembler. If not, they can use a trampoline, but that's
hopefully not needed.

-Andi
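
What Andi describes might look roughly like this, written as toplevel asm so
the sketch stays self-contained (a real version would more likely live in a
small .S file, as Steve suggests below):

/* Sketch: native irq-flag primitives with no C prologue/epilogue, so the
 * call sites are free to use whatever register convention suits them. */
asm(".text\n"
    ".globl native_irq_disable\n"
    "native_irq_disable:\n"
    "        cli\n"
    "        ret\n"
    ".globl native_irq_enable\n"
    "native_irq_enable:\n"
    "        sti\n"
    "        ret\n"
    ".globl native_save_fl\n"
    "native_save_fl:\n"
    "        pushfq\n"
    "        popq %rax\n"
    "        ret\n");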

2007-08-09 14:27:54

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64


> Hm. So x86-64 doesn't make 64-bit pointers be 64-bit aligned?

The ABI does of course, although the penalty of not doing it on current
CPUs is only minor.

-Andi

2007-08-09 14:38:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

--
On Thu, 9 Aug 2007, Andi Kleen wrote:
>
> > This has to match the normal C calling convention though, doesn't it?
>
> Native cli/sti/save/restore_flags are all only assembly and can be easily
> (in fact more easily than in C) written as pure assembler functions. Then
> you can use whatever calling convention you want.

I agree.
Should we make a paravirt_ops_asm.S file that can implement these native
functions, so we can get rid of the C functions that only do asm?

>
> While some paravirt implementations may have more complicated implementations
> i guess it's still a reasonable requirement to make them simple enough
> in pure assembler. If not they can use a trampoline, but that's hopefully
> not needed.

It works for lguest64. I'm sure it should be no problem with other HVs.

-- Steve

2007-08-09 15:11:19

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

Steven Rostedt wrote:
> --
> On Thu, 9 Aug 2007, Andi Kleen wrote:
>
>>> This has to match the normal C calling convention though, doesn't it?
>>>
>> Native cli/sti/save/restore_flags are all only assembly and can be easily
>> (in fact more easily than in C) written as pure assembler functions. Then
>> you can use whatever calling convention you want.
>>
>
> I agree.
> Should we make a paravirt_ops_asm.S file that can implement these native
> functions, so we can get rid of the C functions that only do asm?
>
>
>> While some paravirt implementations may have more complicated implementations
>> i guess it's still a reasonable requirement to make them simple enough
>> in pure assembler. If not they can use a trampoline, but that's hopefully
>> not needed.
>>
>
> It works for lguest64. I'm sure it should be no problem with other HVs.
>

Hm, I can't say the idea thrills me. Let's get the thing working first,
and then worry about having special per-pvop calling conventions.

J

2007-08-09 17:46:00

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

Andi Kleen wrote:
>> -static void discover_ebda(void)
>> +void native_ebda_info(unsigned *addr, unsigned *size)
>>
>
> I guess it would be better to use the resources frame work here.
> Before checking EBDA check if it is already reserved. Then lguest/Xen
> can reserve these areas and stop using it.
>

What's the EBDA actually used for? The only place which seems to use
ebda_addr is in the e820 code to avoid that area as RAM.

Seems to me that we can just arrange to have the early lguest/xen setup
code set the EBDA_ADDR pointer to NULL and make discover_ebda() special
case that to zero out ebda_addr/size.

>> +/* Overridden in paravirt.c if CONFIG_PARAVIRT */
>> +void __attribute__((weak)) memory_setup(void)
>> +{
>> + return setup_memory_region();
>> +}
>> +
>> +
>> void __init setup_arch(char **cmdline_p)
>> {
>> printk(KERN_INFO "Command line: %s\n", boot_command_line);
>> @@ -231,12 +255,19 @@ void __init setup_arch(char **cmdline_p)
>> saved_video_mode = SAVED_VIDEO_MODE;
>> bootloader_type = LOADER_TYPE;
>>
>> + /*
>> + * By returning non-zero here, a paravirt impl can choose to
>> + * skip the rest of the setup process
>> + */
>> + if (paravirt_arch_setup())
>> + return;
>>
>
> Sorry, but that's an extremely ugly and clumsy interface and will lead
> to extensive code duplication in hypervisors because so much code
> is disabled.
>
> This needs to be solved in some better way.
>

Yeah, it seems a bit hamfisted. Looks like it would be better to deal
with it by some combination of:

1. implement pv analogues of existing functions
2. try to neutralize functions we don't care about in pv-land
3. refactoring the setup() function to make the pv-friendly and
pv-hostile parts clearly distinct

and remember: native isn't a special case; it's just another pv driver.

J

2007-08-09 17:57:10

by Alan

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

> What's the EBDA actually used for? The only place which seems to use
> ebda_addr is in the e820 code to avoid that area as RAM.

It belongs to the firmware.

2007-08-10 18:08:47

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

On 8/9/07, Alan Cox <[email protected]> wrote:
> > What's the EBDA actually used for? The only place which seems to use
> > ebda_addr is in the e820 code to avoid that area as RAM.
>
> It belongs to the firmware.

Wouldn't it be better, then, to just skip this step unconditionally if
we are running a paravirtualized guest? What do we gain from doing it?

--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

2007-08-10 18:19:20

by Alan

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

On Fri, 10 Aug 2007 15:08:35 -0300
"Glauber de Oliveira Costa" <[email protected]> wrote:

> On 8/9/07, Alan Cox <[email protected]> wrote:
> > > What's the EBDA actually used for? The only place which seems to use
> > > ebda_addr is in the e820 code to avoid that area as RAM.
> >
> > It belongs to the firmware.
>
> Wouldn't it be better, then, to just skip this step unconditionally if
> we are running a paravirtualized guest? What do we gain from doing it?

An EBDA is an optional BIOS feature, so provided you virtualise the
relevant entry in the zero page (to avoid guest applications trying to
scan the EBDA for some system tables they may want to use, like DMI) as
a zero entry, you can ignore the EBDA.

That's probably the right thing to do, since the host EBDA won't match the
guest environment and the guest might migrate anyway.

2007-08-10 18:53:33

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

Glauber de Oliveira Costa wrote:
> On 8/9/07, Alan Cox <[email protected]> wrote:
>
>>> What's the EBDA actually used for? The only place which seems to use
>>> ebda_addr is in the e820 code to avoid that area as RAM.
>>>
>> It belongs to the firmware.
>>
>
> Wouldn't it be better, then, to just skip this step unconditionally if
> we are running a paravirtualized guest? What do we gain from doing it?
>

It's better to make discover_ebda() quietly cope with a missing ebda for
whatever reason. We could add an explicit interface to paravirt_ops to
handle this one little corner, but it isn't very important, not very
general, and really it's just clutter. It's much better to have things
cope with being virtualized quietly on their own rather than hit them
all with the pv_ops hammer. pv_ops is really for things where the
hypervisor-specific code really has to get actively involved.

For Xen-domU and lguest, there probably won't be an ebda, but it's quite
likely that it will be present in some form for vmware, kvm and
xen-dom0; in those cases the (presumably virtual) ebda will be put in
place as it would be in the native case, eliminating the need for any
special handling.

J

2007-08-10 19:17:50

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

Jeremy Fitzhardinge wrote:
> Glauber de Oliveira Costa wrote:
>> On 8/9/07, Alan Cox <[email protected]> wrote:
>>
>>>> What's the EBDA actually used for? The only place which seems to use
>>>> ebda_addr is in the e820 code to avoid that area as RAM.
>>>>
>>> It belongs to the firmware.
>>>
>> Wouldn't it be better, then, to just skip this step unconditionally if
>> we are running a paravirtualized guest? What do we gain from doing it?
>>
>
> It's better to make discover_ebda() quietly cope with a missing ebda for
> whatever reason. We could add an explicit interface to paravirt_ops to
> handle this one little corner, but it isn't very important, not very
> general and really its just clutter. Its much better to have things
> cope with being virtualized quietly on their own rather than hit them
> all with the pv_ops hammer. pv_ops is really for things where the
> hypervisor-specific code really has to get actively involved.
>
I think the idea you gave me earlier of using probe_kernel_address could
work. Xen/lguest/put_yours_here that won't use an ebda would then have
to unmap the page, to make sure a read would fault.


2007-08-10 20:03:35

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

Glauber de Oliveira Costa wrote:
> I think the idea you gave me earlier of using probe_kernel_address could
> work. Xen/lguest/put_yours_here that won't use an ebda would then have
> to unmap the page, to make sure a read would fault.

Hm, the memory might be mapped anyway, but we could make sure it's all
zero. discover_ebda should be able to deal with that OK.

J

2007-08-10 20:12:56

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

Jeremy Fitzhardinge wrote:
> Glauber de Oliveira Costa wrote:
>> I think the idea you gave me earlier of using probe_kernel_address could
>> work. Xen/lguest/put_yours_here that won't use an ebda would then have
>> to unmap the page, to make sure a read would fault.
>
> Hm, the memory might be mapped anyway, but we could make sure its all
> zero. discover_ebda should be able to deal with that OK.
>
> J
Indeed, as the EBDA_ADDR_POINTER is not aligned, this may work even better.

It seems to me safe to assume that if we read zero on that line:

ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);

We could just do ebda_size = 0 and go home happy, skipping the rest of
the process.

Andi, are you okay with it?

2007-08-10 20:33:06

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 23/25] [PATCH] paravirt hooks for arch initialization

Glauber de Oliveira Costa wrote:
> Indeed, as the EBDA_ADDR_POINTER is not aligned, this may work even
> better.
>
> It seems to me safe to assume that if we read zero on that line:
>
> ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
>
> We could just do ebda_size = 0 and go home happy, skipping the rest of
> the process.

Sure, but it should use probe_kernel_addr as well, just so that it will
be robust against having that page unmapped too.

J
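
Combining the two suggestions, a discover_ebda() that copes with a missing
EBDA might end up looking roughly like this (a sketch following the existing
setup.c code; details such as the size handling are illustrative):

static void __init discover_ebda(void)
{
        unsigned short ebda_seg;

        /* Bail out quietly if the pointer word is unmapped or zero. */
        if (probe_kernel_address((unsigned short *)__va(EBDA_ADDR_POINTER),
                                 ebda_seg) || ebda_seg == 0) {
                ebda_addr = 0;
                ebda_size = 0;
                return;
        }

        ebda_addr = ebda_seg << 4;              /* real-mode segment -> physical */
        ebda_size = *(unsigned short *)__va(ebda_addr); /* size in KB */
        ebda_size <<= 10;
        if (ebda_size > 64 * 1024)
                ebda_size = 64 * 1024;
}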